* [PATCH 0/6] pmem, dax: I/O path enhancements
@ 2015-08-06 17:43 Ross Zwisler
  2015-08-06 17:43 ` [PATCH 1/6] pmem: remove indirection layer arch_has_pmem_api() Ross Zwisler
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-08-06 17:43 UTC (permalink / raw)
  To: linux-kernel, linux-nvdimm, dan.j.williams
  Cc: Ross Zwisler, Alexander Viro, Borislav Petkov, H. Peter Anvin,
	Ingo Molnar, Juergen Gross, Len Brown, linux-acpi, linux-fsdevel,
	Luis R. Rodriguez, Matthew Wilcox, Rafael J. Wysocki,
	Thomas Gleixner, Toshi Kani, x86

Patch 5 adds support for the "read flush" _DSM flag, allowing us to change the
ND BLK aperture mapping from write-combining to write-back via memremap_pmem().

Patch 6 updates the DAX I/O path so that all operations that store data (I/O
writes, zeroing blocks, punching holes, etc.) properly synchronize the stores
to media using the PMEM API.  This ensures that the data DAX is writing is
durable on media before the operation completes.
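
As a quick sketch of the pattern patch 6 applies in fs/dax.c (the buffer and
length names are illustrative; the calls are the ones added by this series):

	memset(addr, 0, len);                     /* ordinary cached stores      */
	wb_cache_pmem((void __pmem *)addr, len);  /* write dirty lines to media  */
	wmb_pmem();                               /* make the write-back durable */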

Patches 1-4 are cleanup patches and additions to the PMEM API that make
patches 5 and 6 possible.

Regarding the choice to add both flush_cache_pmem() and wb_cache_pmem() to the
PMEM API, I had initially implemented flush_cache_pmem() as a generic function
flush_io_cache_range() in the spirit of flush_cache_range(), etc., in
cacheflush.h.  I eventually moved it into the PMEM API because a) it has a
common and consistent use of the __pmem annotation, b) it has a clear fallback
method for architectures that don't support it, as opposed to APIs in
cacheflush.h which would need to be added individually to all other
architectures.  It can be argued that the flush API could apply to other uses
beyond PMEM such as flushing cache lines associated with other types of
sliding MMIO windows.  At this point I'm inclined to have it as part of the
PMEM API, and then take on the effort of making it a general cache flushing API
if other users come along.

Ross Zwisler (6):
  pmem: remove indirection layer arch_has_pmem_api()
  x86: clean up conditional pmem includes
  x86: add clwb_cache_range()
  pmem: Add wb_cache_pmem() and flush_cache_pmem()
  nd_blk: add support for "read flush" DSM flag
  dax: update I/O path to do proper PMEM flushing

 arch/x86/include/asm/cacheflush.h | 24 +++++++++--------
 arch/x86/mm/pageattr.c            | 23 ++++++++++++++++
 drivers/acpi/nfit.c               | 18 ++++++-------
 drivers/acpi/nfit.h               |  6 ++++-
 fs/dax.c                          | 55 +++++++++++++++++++++++++++++++--------
 include/linux/pmem.h              | 36 ++++++++++++++++++-------
 6 files changed, 120 insertions(+), 42 deletions(-)

-- 
2.1.0



* [PATCH 1/6] pmem: remove indirection layer arch_has_pmem_api()
  2015-08-06 17:43 [PATCH 0/6] pmem, dax: I/O path enhancements Ross Zwisler
@ 2015-08-06 17:43 ` Ross Zwisler
  2015-08-07  6:38   ` Christoph Hellwig
  2015-08-07 16:14   ` Dan Williams
  2015-08-06 17:43 ` [PATCH 2/6] x86: clean up conditional pmem includes Ross Zwisler
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-08-06 17:43 UTC (permalink / raw)
  To: linux-kernel, linux-nvdimm, dan.j.williams; +Cc: Ross Zwisler

Prior to this change arch_has_wmb_pmem() was only called by
arch_has_pmem_api().  Both arch_has_wmb_pmem() and arch_has_pmem_api()
checked to make sure that CONFIG_ARCH_HAS_PMEM_API was enabled.

Instead, remove one extra layer of indirection and the redundant
CONFIG_ARCH_HAS_PMEM_API check, and just have arch_has_pmem_api()
call __arch_has_wmb_pmem() directly.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/pmem.h | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 03f9d73..fb62ce4 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -58,7 +58,7 @@ static inline void memunmap_pmem(void __pmem *addr)
 }
 
 /**
- * arch_has_wmb_pmem - true if wmb_pmem() ensures durability
+ * arch_has_pmem_api - true if wmb_pmem() ensures durability
  *
  * For a given cpu implementation within an architecture it is possible
  * that wmb_pmem() resolves to a nop.  In the case this returns
@@ -66,16 +66,9 @@ static inline void memunmap_pmem(void __pmem *addr)
  * fall back to a different data consistency model, or otherwise notify
  * the user.
  */
-static inline bool arch_has_wmb_pmem(void)
-{
-	if (IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API))
-		return __arch_has_wmb_pmem();
-	return false;
-}
-
 static inline bool arch_has_pmem_api(void)
 {
-	return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API) && arch_has_wmb_pmem();
+	return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API) && __arch_has_wmb_pmem();
 }
 
 /*
-- 
2.1.0



* [PATCH 2/6] x86: clean up conditional pmem includes
  2015-08-06 17:43 [PATCH 0/6] pmem, dax: I/O path enhancements Ross Zwisler
  2015-08-06 17:43 ` [PATCH 1/6] pmem: remove indirection layer arch_has_pmem_api() Ross Zwisler
@ 2015-08-06 17:43 ` Ross Zwisler
  2015-08-07  6:39   ` Christoph Hellwig
  2015-08-06 17:43 ` [PATCH 3/6] x86: add clwb_cache_range() Ross Zwisler
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Ross Zwisler @ 2015-08-06 17:43 UTC (permalink / raw)
  To: linux-kernel, linux-nvdimm, dan.j.williams
  Cc: Ross Zwisler, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, Toshi Kani, Juergen Gross

Prior to this change x86_64 used the pmem defines in
arch/x86/include/asm/cacheflush.h, and UM used the default ones at the
top of include/linux/pmem.h.  The inclusion or exclusion in pmem.h was
controlled by CONFIG_ARCH_HAS_PMEM_API, but the ones in cacheflush.h
were controlled by ARCH_HAS_NOCACHE_UACCESS.

Instead, control them both with CONFIG_ARCH_HAS_PMEM_API so that it's
clear that they are related and we don't run into the possibility where
they are both included or excluded.  Also remove a bunch of stale
function prototypes meant for UM in cacheflush.h - these just conflicted
with the inline defaults in pmem.h, and gave compile errors:

include/linux/pmem.h:21:91: error: static declaration of
	'arch_wmb_pmem' follows non-static declaration
static inline void arch_wmb_pmem(void)
In file included from include/linux/highmem.h:11:0,
				 from include/linux/pagemap.h:10,
				 from include/linux/blkdev.h:14,
				 from fs/dax.c:18:
./arch/x86/include/asm/cacheflush.h:185:13: note: previous declaration of
	'arch_wmb_pmem' was here
 extern void arch_wmb_pmem(void);

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 arch/x86/include/asm/cacheflush.h | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index 9bf3ea1..ae00766 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -109,8 +109,7 @@ static inline int rodata_test(void)
 }
 #endif
 
-#ifdef ARCH_HAS_NOCACHE_UACCESS
-
+#ifdef CONFIG_ARCH_HAS_PMEM_API
 /**
  * arch_memcpy_to_pmem - copy data to persistent memory
  * @dst: destination buffer for the copy
@@ -170,14 +169,6 @@ static inline bool __arch_has_wmb_pmem(void)
 	return false;
 #endif
 }
-#else /* ARCH_HAS_NOCACHE_UACCESS i.e. ARCH=um */
-extern void arch_memcpy_to_pmem(void __pmem *dst, const void *src, size_t n);
-extern void arch_wmb_pmem(void);
-
-static inline bool __arch_has_wmb_pmem(void)
-{
-	return false;
-}
-#endif
+#endif /* CONFIG_ARCH_HAS_PMEM_API */
 
 #endif /* _ASM_X86_CACHEFLUSH_H */
-- 
2.1.0



* [PATCH 3/6] x86: add clwb_cache_range()
  2015-08-06 17:43 [PATCH 0/6] pmem, dax: I/O path enhancements Ross Zwisler
  2015-08-06 17:43 ` [PATCH 1/6] pmem: remove indirection layer arch_has_pmem_api() Ross Zwisler
  2015-08-06 17:43 ` [PATCH 2/6] x86: clean up conditional pmem includes Ross Zwisler
@ 2015-08-06 17:43 ` Ross Zwisler
  2015-08-06 17:43 ` [PATCH 4/6] pmem: Add wb_cache_pmem() and flush_cache_pmem() Ross Zwisler
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-08-06 17:43 UTC (permalink / raw)
  To: linux-kernel, linux-nvdimm, dan.j.williams
  Cc: Ross Zwisler, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Juergen Gross, Borislav Petkov, Toshi Kani, Luis R. Rodriguez

Add support for writing back a cache range using CLWB instead of just
flushing it using CLFLUSH or CLFLUSHOPT.  This allows you to ensure that
your data has become durable on your DIMM, but potentially leaves a
clean version in the processor cache hierarchy for future loads.

This will be used in DAX to write back stores to persistent memory.

Details on CLWB can be found here:

https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
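
To illustrate the intended semantics (hypothetical caller, not part of this
patch; note that clwb_cache_range() issues the ordering wmb() internally, and
later patches in this series still follow it with wmb_pmem() for durability):

	memset(buf, 0, len);         /* dirty the cache lines           */
	clwb_cache_range(buf, len);  /* write them back toward the DIMM */
	/* later loads of buf may still hit clean copies in the CPU cache */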

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 arch/x86/include/asm/cacheflush.h |  1 +
 arch/x86/mm/pageattr.c            | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index ae00766..490b3d6 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -88,6 +88,7 @@ int set_pages_rw(struct page *page, int numpages);
 
 
 void clflush_cache_range(void *addr, unsigned int size);
+void clwb_cache_range(void *addr, size_t size);
 
 #ifdef CONFIG_DEBUG_RODATA
 void mark_rodata_ro(void);
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 727158c..ce84d05 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -144,6 +144,29 @@ void clflush_cache_range(void *vaddr, unsigned int size)
 }
 EXPORT_SYMBOL_GPL(clflush_cache_range);
 
+/**
+ * clwb_cache_range - write back a cache range with clwb
+ * @vaddr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * clwb is an unordered instruction which needs fencing with mfence or sfence
+ * to avoid ordering issues.
+ */
+void clwb_cache_range(void *vaddr, size_t size)
+{
+	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
+	unsigned long clflush_mask = x86_clflush_size - 1;
+	char *vend = (char *)vaddr + size;
+	char *p;
+
+	for (p = (char *)((unsigned long)vaddr & ~clflush_mask);
+	     p < vend; p += x86_clflush_size)
+		clwb(p);
+
+	wmb();
+}
+EXPORT_SYMBOL_GPL(clwb_cache_range);
+
 static void __cpa_flush_all(void *arg)
 {
 	unsigned long cache = (unsigned long)arg;
-- 
2.1.0



* [PATCH 4/6] pmem: Add wb_cache_pmem() and flush_cache_pmem()
  2015-08-06 17:43 [PATCH 0/6] pmem, dax: I/O path enhancements Ross Zwisler
                   ` (2 preceding siblings ...)
  2015-08-06 17:43 ` [PATCH 3/6] x86: add clwb_cache_range() Ross Zwisler
@ 2015-08-06 17:43 ` Ross Zwisler
  2015-08-06 17:43 ` [PATCH 5/6] nd_blk: add support for "read flush" DSM flag Ross Zwisler
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-08-06 17:43 UTC (permalink / raw)
  To: linux-kernel, linux-nvdimm, dan.j.williams
  Cc: Ross Zwisler, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Toshi Kani, Juergen Gross

Add support for two new PMEM APIs, wb_cache_pmem() and
flush_cache_pmem().  The first, wb_cache_pmem(), is used to write back
ranges of dirtied cache lines to media in order to make stores durable.
The contents of the now-clean cache lines can potentially still reside
in the cache after this write-back operation, allowing subsequent loads
to be serviced from the cache.

The second, flush_cache_pmem(), flushes the cache lines from the
processor cache and writes them to media if they were dirty.  This can
be used to write out data that we don't believe will be read again in
the near future, for example when punching holes of zeros in a DAX file.
It can also be used as a cache invalidate when the caller needs to be sure
that the cache lines are completely flushed from the processor cache
hierarchy.
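
A rough sketch of how the two calls are meant to be used (the pointer and
length names are placeholders, and both paths still end with wmb_pmem() as in
the DAX patch later in this series):

	/* data we may read again soon: write back, keep clean lines cached */
	memcpy(addr, src, len);
	wb_cache_pmem((void __pmem *)addr, len);

	/* data we don't expect to reload, e.g. zeroing for a hole punch */
	memset(addr, 0, len);
	flush_cache_pmem((void __pmem *)addr, len);

	wmb_pmem();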

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 arch/x86/include/asm/cacheflush.h | 10 ++++++++++
 include/linux/pmem.h              | 25 ++++++++++++++++++++++++-
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index 490b3d6..ab75bdb 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -158,6 +158,16 @@ static inline void arch_wmb_pmem(void)
 	pcommit_sfence();
 }
 
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	clwb_cache_range((void __force *) addr, size);
+}
+
+static inline void arch_flush_cache_pmem(void __pmem *addr, size_t size)
+{
+	clflush_cache_range((void __force *) addr, size);
+}
+
 static inline bool __arch_has_wmb_pmem(void)
 {
 #ifdef CONFIG_X86_64
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index fb62ce4..2172578 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -39,12 +39,23 @@ static inline void arch_memcpy_to_pmem(void __pmem *dst, const void *src,
 {
 	BUG();
 }
+
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	BUG();
+}
+
+static inline void arch_flush_cache_pmem(void __pmem *addr, size_t size)
+{
+	BUG();
+}
 #endif
 
 /*
  * Architectures that define ARCH_HAS_PMEM_API must provide
  * implementations for arch_memremap_pmem_flags(),
- * arch_memcpy_to_pmem(), arch_wmb_pmem(), and __arch_has_wmb_pmem().
+ * arch_memcpy_to_pmem(), arch_wmb_pmem(), arch_wb_cache_pmem(),
+ * arch_flush_cache_pmem() and __arch_has_wmb_pmem().
  */
 
 static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
@@ -146,4 +157,16 @@ static inline void wmb_pmem(void)
 	if (arch_has_pmem_api())
 		arch_wmb_pmem();
 }
+
+static inline void wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	if (arch_has_pmem_api())
+		arch_wb_cache_pmem(addr, size);
+}
+
+static inline void flush_cache_pmem(void __pmem *addr, size_t size)
+{
+	if (arch_has_pmem_api())
+		arch_flush_cache_pmem(addr, size);
+}
 #endif /* __PMEM_H__ */
-- 
2.1.0



* [PATCH 5/6] nd_blk: add support for "read flush" DSM flag
  2015-08-06 17:43 [PATCH 0/6] pmem, dax: I/O path enhancements Ross Zwisler
                   ` (3 preceding siblings ...)
  2015-08-06 17:43 ` [PATCH 4/6] pmem: Add wb_cache_pmem() and flush_cache_pmem() Ross Zwisler
@ 2015-08-06 17:43 ` Ross Zwisler
  2015-08-06 17:43 ` [PATCH 6/6] dax: update I/O path to do proper PMEM flushing Ross Zwisler
  2015-08-07 16:47 ` [PATCH 0/6] pmem, dax: I/O path enhancements Dan Williams
  6 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-08-06 17:43 UTC (permalink / raw)
  To: linux-kernel, linux-nvdimm, dan.j.williams
  Cc: Ross Zwisler, Rafael J. Wysocki, Len Brown, linux-acpi

Add support for the "read flush" _DSM flag, as outlined in the DSM spec:

http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

This flag tells the ND BLK driver that it needs to flush the cache lines
associated with the aperture after the aperture is moved but before any new
data is read.  This ensures that any stale cache lines from the previous
contents of the aperture will be discarded from the processor cache, and the
new data will be read properly from the DIMM.  We know that the cache lines
are clean and will be discarded without any writeback because either a) the
previous aperture operation was a read, and we never modified the contents of
the aperture, or b) the previous aperture operation was a write and we must
have written back the dirtied contents of the aperture to the DIMM
before the I/O was completed.

By supporting the "read flush" flag we can also change the ND BLK aperture
mapping from write-combining to write-back via memremap_pmem().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/acpi/nfit.c | 18 +++++++++---------
 drivers/acpi/nfit.h |  6 +++++-
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 7c2638f..5bd6819 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -1080,9 +1080,13 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 		if (rw)
 			memcpy_to_pmem(mmio->aperture + offset,
 					iobuf + copied, c);
-		else
+		else {
+			if (nfit_blk->dimm_flags & ND_BLK_READ_FLUSH)
+				flush_cache_pmem(mmio->aperture + offset, c);
+
 			memcpy_from_pmem(iobuf + copied,
 					mmio->aperture + offset, c);
+		}
 
 		copied += c;
 		len -= c;
@@ -1191,13 +1195,9 @@ static void __iomem *__nfit_spa_map(struct acpi_nfit_desc *acpi_desc,
 	if (!res)
 		goto err_mem;
 
-	if (type == SPA_MAP_APERTURE) {
-		/*
-		 * TODO: memremap_pmem() support, but that requires cache
-		 * flushing when the aperture is moved.
-		 */
-		spa_map->iomem = ioremap_wc(start, n);
-	} else
+	if (type == SPA_MAP_APERTURE)
+		spa_map->aperture = memremap_pmem(start, n);
+	else
 		spa_map->iomem = ioremap_nocache(start, n);
 
 	if (!spa_map->iomem)
@@ -1267,7 +1267,7 @@ static int acpi_nfit_blk_get_flags(struct nvdimm_bus_descriptor *nd_desc,
 		nfit_blk->dimm_flags = flags.flags;
 	else if (rc == -ENOTTY) {
 		/* fall back to a conservative default */
-		nfit_blk->dimm_flags = ND_BLK_DCR_LATCH;
+		nfit_blk->dimm_flags = ND_BLK_DCR_LATCH | ND_BLK_READ_FLUSH;
 		rc = 0;
 	} else
 		rc = -ENXIO;
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index f2c2bb7..7c6990e 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -41,6 +41,7 @@ enum nfit_uuids {
 };
 
 enum {
+	ND_BLK_READ_FLUSH = 1,
 	ND_BLK_DCR_LATCH = 2,
 };
 
@@ -149,7 +150,10 @@ struct nfit_spa_mapping {
 	struct acpi_nfit_system_address *spa;
 	struct list_head list;
 	struct kref kref;
-	void __iomem *iomem;
+	union {
+		void __iomem *iomem;
+		void __pmem  *aperture;
+	};
 };
 
 static inline struct nfit_spa_mapping *to_spa_map(struct kref *kref)
-- 
2.1.0



* [PATCH 6/6] dax: update I/O path to do proper PMEM flushing
  2015-08-06 17:43 [PATCH 0/6] pmem, dax: I/O path enhancements Ross Zwisler
                   ` (4 preceding siblings ...)
  2015-08-06 17:43 ` [PATCH 5/6] nd_blk: add support for "read flush" DSM flag Ross Zwisler
@ 2015-08-06 17:43 ` Ross Zwisler
  2015-08-06 21:04   ` Dave Chinner
  2015-08-06 21:26   ` Dan Williams
  2015-08-07 16:47 ` [PATCH 0/6] pmem, dax: I/O path enhancements Dan Williams
  6 siblings, 2 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-08-06 17:43 UTC (permalink / raw)
  To: linux-kernel, linux-nvdimm, dan.j.williams
  Cc: Ross Zwisler, Matthew Wilcox, Alexander Viro, linux-fsdevel

Update the DAX I/O path so that all operations that store data (I/O
writes, zeroing blocks, punching holes, etc.) properly synchronize the
stores to media using the PMEM API.  This ensures that the data DAX is
writing is durable on media before the operation completes.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 55 ++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 44 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 47c3323..e7595db 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -17,12 +17,14 @@
 #include <linux/atomic.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/genhd.h>
 #include <linux/highmem.h>
 #include <linux/memcontrol.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
+#include <linux/pmem.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
@@ -46,10 +48,13 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
 			if (pgsz > count)
 				pgsz = count;
-			if (pgsz < PAGE_SIZE)
+		if (pgsz < PAGE_SIZE) {
 				memset(addr, 0, pgsz);
-			else
+				wb_cache_pmem((void __pmem *)addr, pgsz);
+			} else {
 				clear_page(addr);
+				wb_cache_pmem((void __pmem *)addr, PAGE_SIZE);
+			}
 			addr += pgsz;
 			size -= pgsz;
 			count -= pgsz;
@@ -59,6 +64,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 		}
 	} while (size);
 
+	wmb_pmem();
 	return 0;
 }
 EXPORT_SYMBOL_GPL(dax_clear_blocks);
@@ -70,15 +76,24 @@ static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
 	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
 }
 
+/*
+ * This function's stores and flushes need to be synced to media by a
+ * wmb_pmem() in the caller. We flush the data instead of writing it back
+ * because we don't expect to read this newly zeroed data in the near future.
+ */
 static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
 			loff_t end)
 {
 	loff_t final = end - pos + first; /* The final byte of the buffer */
 
-	if (first > 0)
+	if (first > 0) {
 		memset(addr, 0, first);
-	if (final < size)
+		flush_cache_pmem((void __pmem *)addr, first);
+	}
+	if (final < size) {
 		memset(addr + final, 0, size - final);
+		flush_cache_pmem((void __pmem *)addr + final, size - final);
+	}
 }
 
 static bool buffer_written(struct buffer_head *bh)
@@ -108,6 +123,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 	loff_t bh_max = start;
 	void *addr;
 	bool hole = false;
+	bool need_wmb = false;
 
 	if (iov_iter_rw(iter) != WRITE)
 		end = min(end, i_size_read(inode));
@@ -145,18 +161,23 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				retval = dax_get_addr(bh, &addr, blkbits);
 				if (retval < 0)
 					break;
-				if (buffer_unwritten(bh) || buffer_new(bh))
+				if (buffer_unwritten(bh) || buffer_new(bh)) {
 					dax_new_buf(addr, retval, first, pos,
 									end);
+					need_wmb = true;
+				}
 				addr += first;
 				size = retval - first;
 			}
 			max = min(pos + size, end);
 		}
 
-		if (iov_iter_rw(iter) == WRITE)
+		if (iov_iter_rw(iter) == WRITE) {
 			len = copy_from_iter_nocache(addr, max - pos, iter);
-		else if (!hole)
+			if (!iter_is_iovec(iter))
+				wb_cache_pmem((void __pmem *)addr, max - pos);
+			need_wmb = true;
+		} else if (!hole)
 			len = copy_to_iter(addr, max - pos, iter);
 		else
 			len = iov_iter_zero(max - pos, iter);
@@ -168,6 +189,9 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		addr += len;
 	}
 
+	if (need_wmb)
+		wmb_pmem();
+
 	return (pos == start) ? retval : pos - start;
 }
 
@@ -300,8 +324,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		goto out;
 	}
 
-	if (buffer_unwritten(bh) || buffer_new(bh))
+	if (buffer_unwritten(bh) || buffer_new(bh)) {
 		clear_page(addr);
+		wb_cache_pmem((void __pmem *)addr, PAGE_SIZE);
+		wmb_pmem();
+	}
 
 	error = vm_insert_mixed(vma, vaddr, pfn);
 
@@ -504,7 +531,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	unsigned long pmd_addr = address & PMD_MASK;
 	bool write = flags & FAULT_FLAG_WRITE;
 	long length;
-	void *kaddr;
+	void *kaddr, *paddr;
 	pgoff_t size, pgoff;
 	sector_t block, sector;
 	unsigned long pfn;
@@ -593,8 +620,12 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
 			int i;
-			for (i = 0; i < PTRS_PER_PMD; i++)
-				clear_page(kaddr + i * PAGE_SIZE);
+			for (i = 0; i < PTRS_PER_PMD; i++) {
+				paddr = kaddr + i * PAGE_SIZE;
+				clear_page(paddr);
+				wb_cache_pmem((void __pmem *)paddr, PAGE_SIZE);
+			}
+			wmb_pmem();
 			count_vm_event(PGMAJFAULT);
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 			result |= VM_FAULT_MAJOR;
@@ -707,6 +738,8 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 		if (err < 0)
 			return err;
 		memset(addr + offset, 0, length);
+		flush_cache_pmem((void __pmem *)addr + offset, length);
+		wmb_pmem();
 	}
 
 	return 0;
-- 
2.1.0



* Re: [PATCH 6/6] dax: update I/O path to do proper PMEM flushing
  2015-08-06 17:43 ` [PATCH 6/6] dax: update I/O path to do proper PMEM flushing Ross Zwisler
@ 2015-08-06 21:04   ` Dave Chinner
  2015-08-07 19:08     ` Ross Zwisler
  2015-08-06 21:26   ` Dan Williams
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2015-08-06 21:04 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, linux-nvdimm, dan.j.williams, Matthew Wilcox,
	Alexander Viro, linux-fsdevel

On Thu, Aug 06, 2015 at 11:43:20AM -0600, Ross Zwisler wrote:
> Update the DAX I/O path so that all operations that store data (I/O
> writes, zeroing blocks, punching holes, etc.) properly synchronize the
> stores to media using the PMEM API.  This ensures that the data DAX is
> writing is durable on media before the operation completes.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
....
> +		if (pgsz < PAGE_SIZE) {
>  				memset(addr, 0, pgsz);
> -			else
> +				wb_cache_pmem((void __pmem *)addr, pgsz);
> +			} else {
>  				clear_page(addr);
> +				wb_cache_pmem((void __pmem *)addr, PAGE_SIZE);
> +			}

I'd much prefer to see these wrapped up in helper functions e.g.
clear_page_pmem() rather than scatter them around randomly.
Especially the barriers - the way they've been optimised is asking
for people to get it wrong in the future.  I'd much prefer to see
the operations paired properly in a helper first (i.e. obviously
correct) and then it can be optimised later if workloads start to
show the barrier as a bottleneck...
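
A rough sketch of the kind of helper being suggested (the name and the
barrier placement here are illustrative, not taken from the patch):

	static void clear_page_pmem(void __pmem *addr)
	{
		clear_page((void __force *)addr);  /* zero the page            */
		wb_cache_pmem(addr, PAGE_SIZE);    /* write it back to media   */
		wmb_pmem();                        /* pair the barrier with the
						    * stores so callers can't
						    * get it wrong */
	}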

> +/*
> + * This function's stores and flushes need to be synced to media by a
> + * wmb_pmem() in the caller. We flush the data instead of writing it back
> + * because we don't expect to read this newly zeroed data in the near future.
> + */

That seems suboptimal. dax_new_buf() is called on newly allocated or
unwritten buffers we are about to write to. Immediately after this
we write the new data to the page, so we are effectively writing
the whole page here.

So why wouldn't we simply commit the whole page during the write and
capture all this zeroing in the one flush/commit/barrier op?

>  static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
>  			loff_t end)
>  {
>  	loff_t final = end - pos + first; /* The final byte of the buffer */
>  
> -	if (first > 0)
> +	if (first > 0) {
>  		memset(addr, 0, first);
> -	if (final < size)
> +		flush_cache_pmem((void __pmem *)addr, first);
> +	}
> +	if (final < size) {
>  		memset(addr + final, 0, size - final);
> +		flush_cache_pmem((void __pmem *)addr + final, size - final);
> +	}
>  }
>  
>  static bool buffer_written(struct buffer_head *bh)
> @@ -108,6 +123,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>  	loff_t bh_max = start;
>  	void *addr;
>  	bool hole = false;
> +	bool need_wmb = false;
>  
>  	if (iov_iter_rw(iter) != WRITE)
>  		end = min(end, i_size_read(inode));
> @@ -145,18 +161,23 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>  				retval = dax_get_addr(bh, &addr, blkbits);
>  				if (retval < 0)
>  					break;
> -				if (buffer_unwritten(bh) || buffer_new(bh))
> +				if (buffer_unwritten(bh) || buffer_new(bh)) {
>  					dax_new_buf(addr, retval, first, pos,
>  									end);
> +					need_wmb = true;
> +				}
>  				addr += first;
>  				size = retval - first;
>  			}
>  			max = min(pos + size, end);
>  		}
>  
> -		if (iov_iter_rw(iter) == WRITE)
> +		if (iov_iter_rw(iter) == WRITE) {
>  			len = copy_from_iter_nocache(addr, max - pos, iter);
> -		else if (!hole)
> +			if (!iter_is_iovec(iter))
> +				wb_cache_pmem((void __pmem *)addr, max - pos);
> +			need_wmb = true;

Conditional pmem cache writeback after a "nocache" copy to the pmem?
Comments, please.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 6/6] dax: update I/O path to do proper PMEM flushing
  2015-08-06 17:43 ` [PATCH 6/6] dax: update I/O path to do proper PMEM flushing Ross Zwisler
  2015-08-06 21:04   ` Dave Chinner
@ 2015-08-06 21:26   ` Dan Williams
  1 sibling, 0 replies; 19+ messages in thread
From: Dan Williams @ 2015-08-06 21:26 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, linux-nvdimm@lists.01.org, Matthew Wilcox,
	Alexander Viro, linux-fsdevel

On Thu, Aug 6, 2015 at 10:43 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> Update the DAX I/O path so that all operations that store data (I/O
> writes, zeroing blocks, punching holes, etc.) properly synchronize the
> stores to media using the PMEM API.  This ensures that the data DAX is
> writing is durable on media before the operation completes.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/dax.c | 55 ++++++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 44 insertions(+), 11 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 47c3323..e7595db 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -17,12 +17,14 @@
>  #include <linux/atomic.h>
>  #include <linux/blkdev.h>
>  #include <linux/buffer_head.h>
> +#include <linux/dax.h>
>  #include <linux/fs.h>
>  #include <linux/genhd.h>
>  #include <linux/highmem.h>
>  #include <linux/memcontrol.h>
>  #include <linux/mm.h>
>  #include <linux/mutex.h>
> +#include <linux/pmem.h>
>  #include <linux/sched.h>
>  #include <linux/uio.h>
>  #include <linux/vmstat.h>
> @@ -46,10 +48,13 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>                         unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
>                         if (pgsz > count)
>                                 pgsz = count;
> -                       if (pgsz < PAGE_SIZE)
> +               if (pgsz < PAGE_SIZE) {
>                                 memset(addr, 0, pgsz);
> -                       else
> +                               wb_cache_pmem((void __pmem *)addr, pgsz);
> +                       } else {
>                                 clear_page(addr);
> +                               wb_cache_pmem((void __pmem *)addr, PAGE_SIZE);
> +                       }
>                         addr += pgsz;
>                         size -= pgsz;
>                         count -= pgsz;
> @@ -59,6 +64,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>                 }
>         } while (size);
>
> +       wmb_pmem();
>         return 0;
>  }
>  EXPORT_SYMBOL_GPL(dax_clear_blocks);
> @@ -70,15 +76,24 @@ static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
>         return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
>  }
>
> +/*
> + * This function's stores and flushes need to be synced to media by a
> + * wmb_pmem() in the caller. We flush the data instead of writing it back
> + * because we don't expect to read this newly zeroed data in the near future.
> + */
>  static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
>                         loff_t end)
>  {
>         loff_t final = end - pos + first; /* The final byte of the buffer */
>
> -       if (first > 0)
> +       if (first > 0) {
>                 memset(addr, 0, first);
> -       if (final < size)
> +               flush_cache_pmem((void __pmem *)addr, first);

Why are we invalidating vs just writing back?  Isn't there a
possibility that the cpu will read these zeroes, in which case why
force it to go to memory?  Let the cpu figure out when these writes are
evicted from the cache hierarchy or otherwise include some performance
numbers showing it is a win to force the eviction early.


* Re: [PATCH 1/6] pmem: remove indirection layer arch_has_pmem_api()
  2015-08-06 17:43 ` [PATCH 1/6] pmem: remove indirection layer arch_has_pmem_api() Ross Zwisler
@ 2015-08-07  6:38   ` Christoph Hellwig
  2015-08-07 14:07     ` Ross Zwisler
  2015-08-07 16:14   ` Dan Williams
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2015-08-07  6:38 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: linux-kernel, linux-nvdimm, dan.j.williams

On Thu, Aug 06, 2015 at 11:43:15AM -0600, Ross Zwisler wrote:
> Prior to this change arch_has_wmb_pmem() was only called by
> arch_has_pmem_api().  Both arch_has_wmb_pmem() and arch_has_pmem_api()
> checked to make sure that CONFIG_ARCH_HAS_PMEM_API was enabled.
> 
> Instead, remove one extra layer of indirection and the redundant
> CONFIG_ARCH_HAS_PMEM_API check, and just have arch_has_pmem_api()
> call __arch_has_wmb_pmem() directly.

The description here matches the code, but not the subject line.


Reviewed-by: Christoph Hellwig <hch@lst.de>

with a fixed up subject line.


* Re: [PATCH 2/6] x86: clean up conditional pmem includes
  2015-08-06 17:43 ` [PATCH 2/6] x86: clean up conditional pmem includes Ross Zwisler
@ 2015-08-07  6:39   ` Christoph Hellwig
  2015-08-07 14:08     ` Ross Zwisler
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2015-08-07  6:39 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, linux-nvdimm, dan.j.williams, Juergen Gross, x86,
	Ingo Molnar, H. Peter Anvin, Borislav Petkov, Thomas Gleixner

On Thu, Aug 06, 2015 at 11:43:16AM -0600, Ross Zwisler wrote:
> Prior to this change x86_64 used the pmem defines in
> arch/x86/include/asm/cacheflush.h, and UM used the default ones at the
> top of include/linux/pmem.h.  The inclusion or exclusion in pmem.h was
> controlled by CONFIG_ARCH_HAS_PMEM_API, but the ones in cacheflush.h
> were controlled by ARCH_HAS_NOCACHE_UACCESS.
> 
> Instead, control them both with CONFIG_ARCH_HAS_PMEM_API so that it's
> clear that they are related and we don't run into the possibility where
> they are both included or excluded.  Also remove a bunch of stale
> function prototypes meant for UM in cacheflush.h - these just conflicted
> with the inline defaults in pmem.h, and gave compile errors:

This looks reasonable, but can you use the opportunity to also move
the pmem arch inlines from asm/cacheflush.h to a new asm/pmem.h?


* Re: [PATCH 1/6] pmem: remove indirection layer arch_has_pmem_api()
  2015-08-07  6:38   ` Christoph Hellwig
@ 2015-08-07 14:07     ` Ross Zwisler
  0 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-08-07 14:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-nvdimm, dan.j.williams

On Thu, 2015-08-06 at 23:38 -0700, Christoph Hellwig wrote:
> On Thu, Aug 06, 2015 at 11:43:15AM -0600, Ross Zwisler wrote:
> > Prior to this change arch_has_wmb_pmem() was only called by
> > arch_has_pmem_api().  Both arch_has_wmb_pmem() and arch_has_pmem_api()
> > checked to make sure that CONFIG_ARCH_HAS_PMEM_API was enabled.
> > 
> > Instead, remove one extra layer of indirection and the redundant
> > CONFIG_ARCH_HAS_PMEM_API check, and just have arch_has_pmem_api()
> > call __arch_has_wmb_pmem() directly.
> 
> The description here matches the code, but not the subject line.
> 
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> with a fixed up subject line.

Ah, thank you, will fix.




* Re: [PATCH 2/6] x86: clean up conditional pmem includes
  2015-08-07  6:39   ` Christoph Hellwig
@ 2015-08-07 14:08     ` Ross Zwisler
  0 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-08-07 14:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-nvdimm, dan.j.williams, Juergen Gross, x86,
	Ingo Molnar, H. Peter Anvin, Borislav Petkov, Thomas Gleixner

On Thu, 2015-08-06 at 23:39 -0700, Christoph Hellwig wrote:
> On Thu, Aug 06, 2015 at 11:43:16AM -0600, Ross Zwisler wrote:
> > Prior to this change x86_64 used the pmem defines in
> > arch/x86/include/asm/cacheflush.h, and UM used the default ones at the
> > top of include/linux/pmem.h.  The inclusion or exclusion in pmem.h was
> > controlled by CONFIG_ARCH_HAS_PMEM_API, but the ones in cacheflush.h
> > were controlled by ARCH_HAS_NOCACHE_UACCESS.
> > 
> > Instead, control them both with CONFIG_ARCH_HAS_PMEM_API so that it's
> > clear that they are related and we don't run into the possibility where
> > they are both included or excluded.  Also remove a bunch of stale
> > function prototypes meant for UM in cacheflush.h - these just conflicted
> > with the inline defaults in pmem.h, and gave compile errors:
> 
> This looks reasonable, but can you use the opportunity to also move
> the pmem arch inlines from asm/cacheflush.h to a new asm/pmem.h?

Sure, that seems like a good idea.  Thanks.





* Re: [PATCH 1/6] pmem: remove indirection layer arch_has_pmem_api()
  2015-08-06 17:43 ` [PATCH 1/6] pmem: remove indirection layer arch_has_pmem_api() Ross Zwisler
  2015-08-07  6:38   ` Christoph Hellwig
@ 2015-08-07 16:14   ` Dan Williams
  2015-08-07 18:41     ` Ross Zwisler
  1 sibling, 1 reply; 19+ messages in thread
From: Dan Williams @ 2015-08-07 16:14 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: linux-kernel, linux-nvdimm@lists.01.org, Christoph Hellwig

On Thu, Aug 6, 2015 at 10:43 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> Prior to this change arch_has_wmb_pmem() was only called by
> arch_has_pmem_api().  Both arch_has_wmb_pmem() and arch_has_pmem_api()
> checked to make sure that CONFIG_ARCH_HAS_PMEM_API was enabled.
>
> Instead, remove one extra layer of indirection and the redundant
> CONFIG_ARCH_HAS_PMEM_API check, and just have arch_has_pmem_api()
> call __arch_has_wmb_pmem() directly.

So I think this patch takes us further away from where we want to go
in the near term which is a finer grained pmem api.  The class of
systems where (has_pmem_api() && !has_wmb_pmem()) is existing energy
backed nvdimm platforms.  I'm assuming those platforms will want to
assert persistence guarantees in the absence of a pcommit-like
instruction, and that we want to stop gating arch_has_pmem_api() on
the presence of wmb_pmem() capability.  In that case
arch_has_wmb_pmem() will be useful to have and that was the original
intent for including it; that intent did not seem to be comprehended in
the changelog.
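
For illustration only, the finer-grained distinction described above might
look something like this (a sketch reusing the existing helper names; nothing
in this series implements it):

	if (arch_has_pmem_api() && !arch_has_wmb_pmem()) {
		/* e.g. an energy-backed (ADR-like) NVDIMM platform: cache
		 * writeback suffices, no pcommit-style wmb_pmem() needed */
	}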


* Re: [PATCH 0/6] pmem, dax: I/O path enhancements
  2015-08-06 17:43 [PATCH 0/6] pmem, dax: I/O path enhancements Ross Zwisler
                   ` (5 preceding siblings ...)
  2015-08-06 17:43 ` [PATCH 6/6] dax: update I/O path to do proper PMEM flushing Ross Zwisler
@ 2015-08-07 16:47 ` Dan Williams
  2015-08-07 19:06   ` Ross Zwisler
  6 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2015-08-07 16:47 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, linux-nvdimm@lists.01.org, Alexander Viro,
	Borislav Petkov, H. Peter Anvin, Ingo Molnar, Juergen Gross,
	Len Brown, Linux ACPI, linux-fsdevel, Luis R. Rodriguez,
	Matthew Wilcox, Rafael J. Wysocki, Thomas Gleixner, Toshi Kani,
	X86 ML

On Thu, Aug 6, 2015 at 10:43 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> Patch 5 adds support for the "read flush" _DSM flag, allowing us to change the
> ND BLK aperture mapping from write-combining to write-back via memremap_pmem().
>
> Patch 6 updates the DAX I/O path so that all operations that store data (I/O
> writes, zeroing blocks, punching holes, etc.) properly synchronize the stores
> to media using the PMEM API.  This ensures that the data DAX is writing is
> durable on media before the operation completes.
>
> Patches 1-4 are cleanup patches and additions to the PMEM API that make
> patches 5 and 6 possible.
>
> Regarding the choice to add both flush_cache_pmem() and wb_cache_pmem() to the
> PMEM API, I had initially implemented flush_cache_pmem() as a generic function
> flush_io_cache_range() in the spirit of flush_cache_range(), etc., in
> cacheflush.h.  I eventually moved it into the PMEM API because a) it has a
> common and consistent use of the __pmem annotation, b) it has a clear fallback
> method for architectures that don't support it, as opposed to APIs in
> cacheflush.h which would need to be added individually to all other
> architectures.  It can be argued that the flush API could apply to other uses
> beyond PMEM such as flushing cache lines associated with other types of
> sliding MMIO windows.  At this point I'm inclined to have it as part of the
> PMEM API, and then take on the effort of making it a general cache flushing API
> if other users come along.

I'm not convinced.  There are already existing users for invalidating
a cpu cache and they currently jump through hoops to have cross-arch
flushing, see drm_clflush_pages().  What the NFIT-BLK driver brings to
the table is just one more instance where the cpu cache needs to be
invalidated, and for something so fundamental it is time we had a
cross arch generic helper.

The cache-writeback case is different.  To date we've only used
writeback for i/o-incoherent archs.  x86 now for the first time needs
(potentially) a writeback api specifically for guaranteeing
persistence.  I say "potentially" because all the cases where we need
to guarantee persistence could be handled with non-temporal stores.

The __pmem annotation is a separate issue that we need to tackle.  I
think Christoph is already on team "__pmem is a mistake", but I think
we should walk through what carrying it forward would look like.  The
__pfn_t patches allow for flags to be attached to the pfn(s) returned
from ->direct_access().  We could add a PFN_PMEM flag and teach
kmap_atomic_pfn_t() to only operate on !PFN_PMEM pfns.  A new
"kmap_atomic_pmem()" would be needed to map pfns from the pmem
driver's ->direct_access() and that would return "void __pmem *".  I
think this would force DAX to always be "__pmem clean" regardless of
whether we got the pfns from BRD or PMEM.  It becomes messy when we
consider carrying __pfn_t in a bio_vec.  But, I think it becomes messy
in precisely the right way in that drivers that want to setup
DMA-to-pmem should consciously be handling the __pmem annotation and
the resulting side effects.


* Re: [PATCH 1/6] pmem: remove indirection layer arch_has_pmem_api()
  2015-08-07 16:14   ` Dan Williams
@ 2015-08-07 18:41     ` Ross Zwisler
  2015-08-07 20:01       ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Ross Zwisler @ 2015-08-07 18:41 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-kernel, linux-nvdimm@lists.01.org, Christoph Hellwig

On Fri, 2015-08-07 at 09:14 -0700, Dan Williams wrote:
> On Thu, Aug 6, 2015 at 10:43 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > Prior to this change arch_has_wmb_pmem() was only called by
> > arch_has_pmem_api().  Both arch_has_wmb_pmem() and arch_has_pmem_api()
> > checked to make sure that CONFIG_ARCH_HAS_PMEM_API was enabled.
> >
> > Instead, remove one extra layer of indirection and the redundant
> > CONFIG_ARCH_HAS_PMEM_API check, and just have arch_has_pmem_api()
> > call __arch_has_wmb_pmem() directly.
> 
> So I think this patch takes us further away from where we want to go
> in the near term which is a finer grained pmem api.  The class of
> systems where (has_pmem_api() && !has_wmb_pmem()) is existing energy
> backed nvdimm platforms.  I'm assuming those platforms will want to
> assert persistence guarantees in the absence of a pcommit-like
> instruction, and that we want to stop gating arch_has_pmem_api() on
> the presence of wmb_pmem() capability.  In that case
> arch_has_wmb_pmem() will be useful to have and that was the original
> intent for including it; that intent did not seem to be comprehended in
> the changelog.

I think that we should only keep around functions that are actually used and
useful.  I agree that there could potentially be a use for a distinction like
the one you are talking about when we try and properly support ADR systems and
stop lumping them in with "non-PCOMMIT" systems like we are doing now.

Right now, though, arch_has_wmb_pmem() is just redundant.  If/when we add that
new support in, we'll have to properly update this code anyway - let's not
keep around unneeded code until then.




* Re: [PATCH 0/6] pmem, dax: I/O path enhancements
  2015-08-07 16:47 ` [PATCH 0/6] pmem, dax: I/O path enhancements Dan Williams
@ 2015-08-07 19:06   ` Ross Zwisler
  0 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-08-07 19:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, linux-nvdimm@lists.01.org, Alexander Viro,
	Borislav Petkov, H. Peter Anvin, Ingo Molnar, Juergen Gross,
	Len Brown, Linux ACPI, linux-fsdevel, Luis R. Rodriguez,
	Matthew Wilcox, Rafael J. Wysocki, Thomas Gleixner, Toshi Kani,
	X86 ML

On Fri, 2015-08-07 at 09:47 -0700, Dan Williams wrote:
> On Thu, Aug 6, 2015 at 10:43 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > Patch 5 adds support for the "read flush" _DSM flag, allowing us to change the
> > ND BLK aperture mapping from write-combining to write-back via memremap_pmem().
> >
> > Patch 6 updates the DAX I/O path so that all operations that store data (I/O
> > writes, zeroing blocks, punching holes, etc.) properly synchronize the stores
> > to media using the PMEM API.  This ensures that the data DAX is writing is
> > durable on media before the operation completes.
> >
> > Patches 1-4 are cleanup patches and additions to the PMEM API that make
> > patches 5 and 6 possible.
> >
> > Regarding the choice to add both flush_cache_pmem() and wb_cache_pmem() to the
> > PMEM API, I had initially implemented flush_cache_pmem() as a generic function
> > flush_io_cache_range() in the spirit of flush_cache_range(), etc., in
> > cacheflush.h.  I eventually moved it into the PMEM API because a) it has a
> > common and consistent use of the __pmem annotation, b) it has a clear fallback
> > method for architectures that don't support it, as opposed to APIs in
> > cacheflush.h which would need to be added individually to all other
> > architectures.  It can be argued that the flush API could apply to other uses
> > beyond PMEM such as flushing cache lines associated with other types of
> > sliding MMIO windows.  At this point I'm inclined to have it as part of the
> > PMEM API, and then take on the effort of making it a general cache flushing API
> > if other users come along.
> 
> I'm not convinced.  There are already existing users for invalidating
> a cpu cache and they currently jump through hoops to have cross-arch
> flushing, see drm_clflush_pages().  What the NFIT-BLK driver brings to
> the table is just one more instance where the cpu cache needs to be
> invalidated, and for something so fundamental it is time we had a
> cross arch generic helper.

Fair enough.  I'll move back to the flush_io_cache_range() solution.




* Re: [PATCH 6/6] dax: update I/O path to do proper PMEM flushing
  2015-08-06 21:04   ` Dave Chinner
@ 2015-08-07 19:08     ` Ross Zwisler
  0 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-08-07 19:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-nvdimm, dan.j.williams, Matthew Wilcox,
	Alexander Viro, linux-fsdevel

On Fri, 2015-08-07 at 07:04 +1000, Dave Chinner wrote:
> On Thu, Aug 06, 2015 at 11:43:20AM -0600, Ross Zwisler wrote:
> > Update the DAX I/O path so that all operations that store data (I/O
> > writes, zeroing blocks, punching holes, etc.) properly synchronize the
> > stores to media using the PMEM API.  This ensures that the data DAX is
> > writing is durable on media before the operation completes.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ....
> > +		if (pgsz < PAGE_SIZE) {
> >  				memset(addr, 0, pgsz);
> > -			else
> > +				wb_cache_pmem((void __pmem *)addr, pgsz);
> > +			} else {
> >  				clear_page(addr);
> > +				wb_cache_pmem((void __pmem *)addr, PAGE_SIZE);
> > +			}
> 
> I'd much prefer to see these wrapped up in helper fuctions e.g.
> clear_page_pmem() rather than scatter them around randomly.
> Especially the barriers - the way they've been optimised is asking
> for people to get it wrong in the future.  I'd much prefer to see
> the operations paired properly in a helper first (i.e. obviously
> correct) and then it can be optimised later if workloads start to
> show the barrier as a bottleneck...
> 
> > +/*
> > + * This function's stores and flushes need to be synced to media by a
> > + * wmb_pmem() in the caller. We flush the data instead of writing it back
> > + * because we don't expect to read this newly zeroed data in the near future.
> > + */
> 
> That seems suboptimal. dax_new_buf() is called on newly allocated or
> unwritten buffers we are about to write to. Immediately after this
> we write the new data to the page, so we are effectively writting
> the whole page here.
> 
> So why wouldn't we simply commit the whole page during the write and
> capture all this zeroing in the one flush/commit/barrier op?
> 
> >  static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
> >  			loff_t end)
> >  {
> >  	loff_t final = end - pos + first; /* The final byte of the buffer */
> >  
> > -	if (first > 0)
> > +	if (first > 0) {
> >  		memset(addr, 0, first);
> > -	if (final < size)
> > +		flush_cache_pmem((void __pmem *)addr, first);
> > +	}
> > +	if (final < size) {
> >  		memset(addr + final, 0, size - final);
> > +		flush_cache_pmem((void __pmem *)addr + final, size - final);
> > +	}
> >  }
> >  
> >  static bool buffer_written(struct buffer_head *bh)
> > @@ -108,6 +123,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
> >  	loff_t bh_max = start;
> >  	void *addr;
> >  	bool hole = false;
> > +	bool need_wmb = false;
> >  
> >  	if (iov_iter_rw(iter) != WRITE)
> >  		end = min(end, i_size_read(inode));
> > @@ -145,18 +161,23 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
> >  				retval = dax_get_addr(bh, &addr, blkbits);
> >  				if (retval < 0)
> >  					break;
> > -				if (buffer_unwritten(bh) || buffer_new(bh))
> > +				if (buffer_unwritten(bh) || buffer_new(bh)) {
> >  					dax_new_buf(addr, retval, first, pos,
> >  									end);
> > +					need_wmb = true;
> > +				}
> >  				addr += first;
> >  				size = retval - first;
> >  			}
> >  			max = min(pos + size, end);
> >  		}
> >  
> > -		if (iov_iter_rw(iter) == WRITE)
> > +		if (iov_iter_rw(iter) == WRITE) {
> >  			len = copy_from_iter_nocache(addr, max - pos, iter);
> > -		else if (!hole)
> > +			if (!iter_is_iovec(iter))
> > +				wb_cache_pmem((void __pmem *)addr, max - pos);
> > +			need_wmb = true;
> 
> Conditional pmem cache writeback after a "nocache" copy to the pmem?
> Comments, please.
> 
> Cheers,
> 
> Dave.

I agree with all your comments, and will address them in v2.  Thank you for
the feedback.




* Re: [PATCH 1/6] pmem: remove indirection layer arch_has_pmem_api()
  2015-08-07 18:41     ` Ross Zwisler
@ 2015-08-07 20:01       ` Dan Williams
  0 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2015-08-07 20:01 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: linux-kernel, linux-nvdimm@lists.01.org, Christoph Hellwig

On Fri, Aug 7, 2015 at 11:41 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, 2015-08-07 at 09:14 -0700, Dan Williams wrote:
>> On Thu, Aug 6, 2015 at 10:43 AM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> > Prior to this change arch_has_wmb_pmem() was only called by
>> > arch_has_pmem_api().  Both arch_has_wmb_pmem() and arch_has_pmem_api()
>> > checked to make sure that CONFIG_ARCH_HAS_PMEM_API was enabled.
>> >
>> > Instead, remove one extra layer of indirection and the redundant
>> > CONFIG_ARCH_HAS_PMEM_API check, and just have arch_has_pmem_api()
>> > call __arch_has_wmb_pmem() directly.
>>
>> So I think this patch takes us further away from where we want to go
>> in the near term which is a finer grained pmem api.  The class of
>> systems where (has_pmem_api() && !has_wmb_pmem()) is existing energy
>> backed nvdimm platforms.  I'm assuming those platforms will want to
>> assert persistence guarantees in the absence of a pcommit-like
>> instruction, and that we want to stop gating arch_has_pmem_api() on
>> the presence of wmb_pmem() capability.  In that case
>> arch_has_wmb_pmem() will be useful to have and that was the original
>> intent for including it, that intent did not seem to comprehended in
>> the changelog.
>
> I think that we should only keep around functions that are actually used and
> useful.  I agree that there could potentially be a use for a distinction like
> the one you are talking about when we try and properly support ADR systems and
> stop lumping them in with "non-PCOMMIT" systems like we are doing now.
>
> Right now, though, arch_has_wmb_pmem() is just redundant.  If/when we add that
> new support in, we'll have to properly update this code anyway - let's not
> keep around unneeded code until then.

I don't see the pain in carrying around one unused static inline,
especially if we'll have a use for it in the near term.  But if it
needs to go, then __arch_has_wmb_pmem() needs to be renamed to drop the
"__" since there is no longer a wrapper.

