linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6] I/O path improvements for ND_BLK and PMEM
@ 2015-05-28 22:35 Ross Zwisler
  2015-05-28 22:35 ` [PATCH 1/6] pmem: add force casts to avoid __iomem annotation Ross Zwisler
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-05-28 22:35 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-nvdimm; +Cc: Ross Zwisler

This series adds a new PMEM API consisting of three functions:
persistent_copy(), persistent_flush() and persistent_sync().

These three functions are then used in the I/O paths for both the ND_BLK driver
and the PMEM driver to ensure that writes actually make it to the DIMM and
become durable before the I/O operation completes.

The first two patches in the series are just cleanup and correctness patches.

Patch three provides a reasonable architecture-neutral default implementation
of these three functions for architectures that do not implement the PMEM API.
These defaults let all architectures mostly work by aliasing
persistent_copy() to memcpy() and making persistent_flush() and
persistent_sync() no-ops.  With this patch set the default implementation is
provided at the pmem.h level.

It's possible that other future consumers of the PMEM API (DAX, possibly
others) would prefer a different default behavior for architectures that don't
support the PMEM API.  If so, we could move the choice about what to do on
those architectures down into consumer-specific header files (nd.h for libnd,
for example).  If DAX and other consumers are fine with our defaults, it's
nicer to keep them common and in a global place.  Please let us know how other
future consumers of the PMEM API feel about this.
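To make the proposed semantics concrete, here is a minimal userspace sketch of
the fallback behavior described above plus a typical driver write sequence.
The function names mirror the proposed API, but this is an illustration of the
cover letter's description, not the kernel code itself:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Fallback behavior for !CONFIG_ARCH_HAS_PMEM_API, per the cover letter:
 * persistent_copy() aliases memcpy(); the other two are no-ops.
 */
static void persistent_copy(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
}

static void persistent_flush(void *vaddr, size_t size)
{
	(void)vaddr;
	(void)size;	/* no-op on architectures without the PMEM API */
}

static void persistent_sync(void)
{
	/* no-op on architectures without the PMEM API */
}

/* Typical driver write path: copy the payload, then sync once at the end. */
static int demo_write(char *pmem, const char *buf, size_t n)
{
	persistent_copy(pmem, buf, n);
	persistent_sync();
	return memcmp(pmem, buf, n);	/* 0 when the copy landed */
}
```

With the x86 implementations from patch 3, the same calling sequence gains
actual durability guarantees without the consumer changing.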

Patch 4 converts the PMEM and ND_BLK I/O paths to the new API, and patches 5
and 6 add support in the ND_BLK I/O path for flush hints and NVDIMM flags.

This series applies cleanly to Dan's "libnd-for-next" tree:

https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git/log/?h=libnd-for-next

One last note - I'm going to be unavailable soon, so patch feedback will most
likely be handled by Dan Williams.  Thanks, Dan. :)

Ross Zwisler (6):
  pmem: add force casts to avoid __iomem annotation
  nfit: Fix up address spaces, sparse warnings
  x86, pmem: add PMEM API for persistent memory
  pmem, nd_blk: update I/O paths to use PMEM API
  nd_blk: add support for flush hints
  nd_blk: add support for NVDIMM flags

 MAINTAINERS                       |  1 +
 arch/x86/Kconfig                  |  3 ++
 arch/x86/include/asm/cacheflush.h | 23 ++++++++++
 drivers/acpi/nfit.c               | 89 ++++++++++++++++++++++++++++++++++-----
 drivers/acpi/nfit.h               | 28 +++++++++++-
 drivers/block/nd/pmem.c           | 22 +++++++---
 include/linux/pmem.h              | 79 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/ndctl.h        |  5 +++
 8 files changed, 232 insertions(+), 18 deletions(-)
 create mode 100644 include/linux/pmem.h

-- 
1.9.3


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/6] pmem: add force casts to avoid __iomem annotation
  2015-05-28 22:35 [PATCH 0/6] I/O path improvements for ND_BLK and PMEM Ross Zwisler
@ 2015-05-28 22:35 ` Ross Zwisler
  2015-05-28 22:47   ` Dan Williams
  2015-05-28 22:35 ` [PATCH 2/6] nfit: Fix up address spaces, sparse warnings Ross Zwisler
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Ross Zwisler @ 2015-05-28 22:35 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-nvdimm; +Cc: Ross Zwisler, Dan Williams

Even though we use ioremap_nocache() to map our persistent memory in the
pmem driver, the memory we are mapping behaves like normal memory and
not I/O memory in that it can be accessed using regular memcpy()
operations and doesn't need to go through memcpy_toio() and
memcpy_fromio().  Force casting the pointers received from
ioremap_nocache() and given to iounmap() gives us the correct behavior
and makes sparse happy.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-nvdimm@lists.01.org
---
 drivers/block/nd/pmem.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
index 5e8c9c629f22..a8712e41e7f5 100644
--- a/drivers/block/nd/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -163,7 +163,8 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res,
 	 * of the CPU caches in case of a crash.
 	 */
 	err = -ENOMEM;
-	pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
+	pmem->virt_addr = (__force void *)ioremap_nocache(pmem->phys_addr,
+			pmem->size);
 	if (!pmem->virt_addr)
 		goto out_release_region;
 
@@ -195,7 +196,7 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res,
 out_free_queue:
 	blk_cleanup_queue(pmem->pmem_queue);
 out_unmap:
-	iounmap(pmem->virt_addr);
+	iounmap((__force void __iomem *)pmem->virt_addr);
 out_release_region:
 	release_mem_region(pmem->phys_addr, pmem->size);
 out_free_dev:
@@ -209,7 +210,7 @@ static void pmem_free(struct pmem_device *pmem)
 	del_gendisk(pmem->pmem_disk);
 	put_disk(pmem->pmem_disk);
 	blk_cleanup_queue(pmem->pmem_queue);
-	iounmap(pmem->virt_addr);
+	iounmap((__force void __iomem *)pmem->virt_addr);
 	release_mem_region(pmem->phys_addr, pmem->size);
 	kfree(pmem);
 }
-- 
1.9.3



* [PATCH 2/6] nfit: Fix up address spaces, sparse warnings
  2015-05-28 22:35 [PATCH 0/6] I/O path improvements for ND_BLK and PMEM Ross Zwisler
  2015-05-28 22:35 ` [PATCH 1/6] pmem: add force casts to avoid __iomem annotation Ross Zwisler
@ 2015-05-28 22:35 ` Ross Zwisler
  2015-05-28 22:40   ` Dan Williams
  2015-05-28 22:35 ` [PATCH 3/6] x86, pmem: add PMEM API for persistent memory Ross Zwisler
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Ross Zwisler @ 2015-05-28 22:35 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-nvdimm
  Cc: Ross Zwisler, Dan Williams, Rafael J. Wysocki

Fix a couple of sparse warnings (duplicate const, incorrect address
space), and go back to using memcpy() instead of memcpy_toio() and
memcpy_fromio() when talking to our block apertures.  Instead, include a
union to alias mmio->base and mmio->aperture so that we can continue to
reuse common code for ioremapping and deinterleaving.  mmio->base still
has the __iomem annotation and is used via readq() and writeq() for the
control and status registers.  mmio->aperture is used via normal
memcpy() for aperture I/O.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linux-nvdimm@lists.01.org
Cc: linux-acpi@vger.kernel.org
---
 drivers/acpi/nfit.c | 10 +++++-----
 drivers/acpi/nfit.h |  5 ++++-
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 34f879808ada..df14652ea13e 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -60,7 +60,7 @@ static int acpi_nfit_ctl(struct nd_bus_descriptor *nd_desc,
 		unsigned int buf_len)
 {
 	struct acpi_nfit_desc *acpi_desc = to_acpi_nfit_desc(nd_desc);
-	const struct nd_cmd_desc const *desc = NULL;
+	const struct nd_cmd_desc *desc = NULL;
 	union acpi_object in_obj, in_buf, *out_obj;
 	struct device *dev = acpi_desc->dev;
 	const char *cmd_name, *dimm_name;
@@ -895,7 +895,7 @@ static u64 to_interleave_offset(u64 offset, struct nfit_blk_mmio *mmio)
 
 static u64 read_blk_stat(struct nfit_blk *nfit_blk, unsigned int bw)
 {
-	struct nfit_blk_mmio __iomem *mmio = &nfit_blk->mmio[DCR];
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
 	u64 offset = nfit_blk->stat_offset + mmio->size * bw;
 
 	if (mmio->num_lines)
@@ -908,7 +908,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
 		resource_size_t dpa, unsigned int len, unsigned int write)
 {
 	u64 cmd, offset;
-	struct nfit_blk_mmio __iomem *mmio = &nfit_blk->mmio[DCR];
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
 
 	enum {
 		BCW_OFFSET_MASK = (1ULL << 48)-1,
@@ -959,9 +959,9 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk, void *iobuf,
 		}
 
 		if (write)
-			memcpy_fromio(mmio->base + offset, iobuf + copied, c);
+			memcpy(mmio->aperture + offset, iobuf + copied, c);
 		else
-			memcpy_toio(iobuf + copied, mmio->base + offset, c);
+			memcpy(iobuf + copied, mmio->aperture + offset, c);
 
 		copied += c;
 		len -= c;
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index e462214f94cd..b882a22ee7bb 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -115,7 +115,10 @@ enum nd_blk_mmio_selector {
 
 struct nfit_blk {
 	struct nfit_blk_mmio {
-		void __iomem *base;
+		union {
+			void __iomem *base;
+			void *aperture;
+		};
 		u64 size;
 		u64 base_offset;
 		u32 line_size;
-- 
1.9.3



* [PATCH 3/6] x86, pmem: add PMEM API for persistent memory
  2015-05-28 22:35 [PATCH 0/6] I/O path improvements for ND_BLK and PMEM Ross Zwisler
  2015-05-28 22:35 ` [PATCH 1/6] pmem: add force casts to avoid __iomem annotation Ross Zwisler
  2015-05-28 22:35 ` [PATCH 2/6] nfit: Fix up address spaces, sparse warnings Ross Zwisler
@ 2015-05-28 22:35 ` Ross Zwisler
  2015-05-28 23:20   ` H. Peter Anvin
  2015-05-28 22:35 ` [PATCH 4/6] pmem, nd_blk: update I/O paths to use PMEM API Ross Zwisler
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Ross Zwisler @ 2015-05-28 22:35 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-nvdimm
  Cc: Ross Zwisler, Dan Williams, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Add a new PMEM API to x86, and allow for architectures that do not
implement this API.  Architectures that implement the PMEM API should
define ARCH_HAS_PMEM_API in their kernel configuration and must provide
implementations for persistent_copy(), persistent_flush() and
persistent_sync().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-nvdimm@lists.01.org
---
 MAINTAINERS                       |  1 +
 arch/x86/Kconfig                  |  3 ++
 arch/x86/include/asm/cacheflush.h | 23 ++++++++++++
 include/linux/pmem.h              | 79 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 106 insertions(+)
 create mode 100644 include/linux/pmem.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 0448fec8e44a..ca1f3d99618d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5944,6 +5944,7 @@ L:	linux-nvdimm@lists.01.org
 Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
 S:	Supported
 F:	drivers/block/nd/pmem.c
+F:	include/linux/pmem.h
 
 LINUX FOR IBM pSERIES (RS/6000)
 M:	Paul Mackerras <paulus@au.ibm.com>
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 23c587938804..eb8f12e715af 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -215,6 +215,9 @@ config ARCH_HAS_CPU_RELAX
 config ARCH_HAS_CACHE_LINE_SIZE
 	def_bool y
 
+config ARCH_HAS_PMEM_API
+	def_bool y
+
 config HAVE_SETUP_PER_CPU_AREA
 	def_bool y
 
diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index 47c8e32f621a..ffd5ccdc86f0 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -4,6 +4,7 @@
 /* Caches aren't brain-dead on the intel. */
 #include <asm-generic/cacheflush.h>
 #include <asm/special_insns.h>
+#include <asm/uaccess.h>
 
 /*
  * The set_memory_* API can be used to change various attributes of a virtual
@@ -84,6 +85,28 @@ int set_pages_rw(struct page *page, int numpages);
 
 void clflush_cache_range(void *addr, unsigned int size);
 
+static inline void arch_persistent_copy(void *dst, const void *src, size_t n)
+{
+	/*
+	 * We are copying between two kernel buffers, so it should be
+	 * impossible for us to hit this BUG_ON() because we should never need
+	 * to take a page fault.
+	 */
+	BUG_ON(__copy_from_user_inatomic_nocache(dst,
+				(__user const void *)src, n));
+}
+
+static inline void arch_persistent_flush(void *vaddr, size_t size)
+{
+	clflush_cache_range(vaddr, size);
+}
+
+static inline void arch_persistent_sync(void)
+{
+	wmb();
+	pcommit_sfence();
+}
+
 #ifdef CONFIG_DEBUG_RODATA
 void mark_rodata_ro(void);
 extern const int rodata_test_data;
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
new file mode 100644
index 000000000000..88ade7376632
--- /dev/null
+++ b/include/linux/pmem.h
@@ -0,0 +1,79 @@
+/*
+ * Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __PMEM_H__
+#define __PMEM_H__
+
+#include <asm/cacheflush.h>
+
+/*
+ * Architectures that define ARCH_HAS_PMEM_API must provide implementations
+ * for persistent_copy(), persistent_flush() and persistent_sync().
+ */
+
+#ifdef CONFIG_ARCH_HAS_PMEM_API
+/**
+ * persistent_copy - copy data to persistent memory
+ * @dst: destination buffer for the copy
+ * @src: source buffer for the copy
+ * @n: length of the copy in bytes
+ *
+ * Perform a memory copy that results in the destination of the copy being
+ * evicted from the processor cache hierarchy ("accepted to memory").  This
+ * can typically be accomplished with non-temporal stores or by regular stores
+ * followed by cache flush commands via persistent_flush().
+ */
+static inline void persistent_copy(void *dst, const void *src, size_t n)
+{
+	arch_persistent_copy(dst, src, n);
+}
+
+/**
+ * persistent_flush - flush a memory range from the processor cache
+ * @vaddr: virtual address to begin flushing
+ * @size: number of bytes to flush
+ *
+ * This call needs to include fencing so that the flushing will be ordered
+ * with respect to both reads and writes.
+ */
+static inline void persistent_flush(void *vaddr, size_t size)
+{
+	arch_persistent_flush(vaddr, size);
+}
+
+/**
+ * persistent_sync - synchronize writes to persistent memory
+ *
+ * To be used after a series of copies and/or flushes, this should perform any
+ * necessary fencing to order writes/flushes and then ensure that writes that
+ * have been "accepted to memory", i.e. previously flushed from the cache
+ * hierarchy, are actually written to the appropriate DIMMs.  This typically
+ * involves making sure they are flushed from any platform buffers not covered
+ * by the cache flushes performed by persistent_flush().
+ */
+static inline void persistent_sync(void)
+{
+	arch_persistent_sync();
+}
+
+#else /* ! CONFIG_ARCH_HAS_PMEM_API */
+
+static inline void persistent_copy(void *dst, const void *src, size_t n)
+{
+	memcpy(dst, src, n);
+}
+static inline void persistent_flush(void *vaddr, size_t size) {}
+static inline void persistent_sync(void) {}
+
+#endif /* CONFIG_ARCH_HAS_PMEM_API */
+
+#endif /* __PMEM_H__ */
-- 
1.9.3



* [PATCH 4/6] pmem, nd_blk: update I/O paths to use PMEM API
  2015-05-28 22:35 [PATCH 0/6] I/O path improvements for ND_BLK and PMEM Ross Zwisler
                   ` (2 preceding siblings ...)
  2015-05-28 22:35 ` [PATCH 3/6] x86, pmem: add PMEM API for persistent memory Ross Zwisler
@ 2015-05-28 22:35 ` Ross Zwisler
  2015-05-29 14:11   ` Dan Williams
  2015-05-28 22:35 ` [PATCH 5/6] nd_blk: add support for flush hints Ross Zwisler
  2015-05-28 22:35 ` [PATCH 6/6] nd_blk: add support for NVDIMM flags Ross Zwisler
  5 siblings, 1 reply; 19+ messages in thread
From: Ross Zwisler @ 2015-05-28 22:35 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-nvdimm
  Cc: Ross Zwisler, Dan Williams, Rafael J. Wysocki

Update the PMEM and ND_BLK I/O paths to use the new PMEM API and to
adhere to the read/write flows outlined in the "NVDIMM Block Window
Driver Writer's Guide":

http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
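The block-window write flow from the guide (program the control register,
fence, move data through the aperture, fence again, then read back status) can
be sketched roughly as follows.  The bw_sim layout and field names are
simplified stand-ins for the driver structures, not the actual kernel code:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for one block window (hypothetical layout). */
struct bw_sim {
	uint64_t ctl;		/* command register: DPA, length, write bit */
	uint64_t stat;		/* status register: nonzero on error */
	char aperture[4096];	/* data window */
};

/*
 * Mirrors the flow in this patch: control-register write followed by a
 * sync, the data copy, another sync, then the status-register check.
 */
static int bw_sim_write(struct bw_sim *bw, uint64_t dpa,
		const void *buf, size_t len)
{
	bw->ctl = dpa | (1ULL << 63);	/* writeq(cmd, mmio->base + offset) */
	/* persistent_sync() here makes the ctl write visible to the DIMM */
	memcpy(bw->aperture, buf, len);	/* persistent_copy() in the real path */
	/* persistent_sync() again so the data is durable before status */
	return bw->stat ? -5 /* -EIO */ : 0;
}
```

In the real driver the two sync points are persistent_sync() calls (later
replaced by per-DIMM flush hints in patch 5), and the copy goes through the
interleave-aware aperture offset rather than a flat buffer.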

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-nvdimm@lists.01.org
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linux-acpi@vger.kernel.org
---
 drivers/acpi/nfit.c     | 10 ++++++++--
 drivers/block/nd/pmem.c | 15 +++++++++++----
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index df14652ea13e..22df61968c1c 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -18,6 +18,7 @@
 #include <linux/list.h>
 #include <linux/acpi.h>
 #include <linux/sort.h>
+#include <linux/pmem.h>
 #include <linux/io.h>
 #include "nfit.h"
 
@@ -926,7 +927,9 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
 	if (mmio->num_lines)
 		offset = to_interleave_offset(offset, mmio);
 
+	/* mmio->base must be mapped uncacheable */
 	writeq(cmd, mmio->base + offset);
+	persistent_sync();
 	/* FIXME: conditionally perform read-back if mandated by firmware */
 }
 
@@ -940,7 +943,6 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk, void *iobuf,
 	int rc;
 
 	base_offset = nfit_blk->bdw_offset + dpa % L1_CACHE_BYTES + bw * mmio->size;
-	/* TODO: non-temporal access, flush hints, cache management etc... */
 	write_blk_ctl(nfit_blk, bw, dpa, len, write);
 	while (len) {
 		unsigned int c;
@@ -959,13 +961,17 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk, void *iobuf,
 		}
 
 		if (write)
-			memcpy(mmio->aperture + offset, iobuf + copied, c);
+			persistent_copy(mmio->aperture + offset, iobuf + copied, c);
 		else
 			memcpy(iobuf + copied, mmio->aperture + offset, c);
 
 		copied += c;
 		len -= c;
 	}
+
+	if (write)
+		persistent_sync();
+
 	rc = read_blk_stat(nfit_blk, bw) ? -EIO : 0;
 	return rc;
 }
diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
index a8712e41e7f5..1a447c351aef 100644
--- a/drivers/block/nd/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -15,15 +15,16 @@
  * more details.
  */
 
-#include <asm/cacheflush.h>
 #include <linux/blkdev.h>
 #include <linux/hdreg.h>
 #include <linux/init.h>
 #include <linux/platform_device.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
+#include <linux/pmem.h>
 #include <linux/slab.h>
 #include <linux/nd.h>
+#include <asm/cacheflush.h>
 #include "nd.h"
 
 struct pmem_device {
@@ -51,7 +52,7 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 		flush_dcache_page(page);
 	} else {
 		flush_dcache_page(page);
-		memcpy(pmem->virt_addr + pmem_off, mem + off, len);
+		persistent_copy(pmem->virt_addr + pmem_off, mem + off, len);
 	}
 
 	kunmap_atomic(mem);
@@ -82,6 +83,8 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
 		sector += bvec.bv_len >> 9;
 	}
 
+	if (rw & WRITE)
+		persistent_sync();
 out:
 	bio_endio(bio, err);
 }
@@ -92,6 +95,8 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
 
 	pmem_do_bvec(pmem, page, PAGE_CACHE_SIZE, 0, rw, sector);
+	if (rw & WRITE)
+		persistent_sync();
 	page_endio(page, rw & WRITE, 0);
 
 	return 0;
@@ -111,8 +116,10 @@ static int pmem_rw_bytes(struct nd_io *ndio, void *buf, size_t offset,
 
 	if (rw == READ)
 		memcpy(buf, pmem->virt_addr + offset, n);
-	else
-		memcpy(pmem->virt_addr + offset, buf, n);
+	else {
+		persistent_copy(pmem->virt_addr + offset, buf, n);
+		persistent_sync();
+	}
 
 	return 0;
 }
-- 
1.9.3



* [PATCH 5/6] nd_blk: add support for flush hints
  2015-05-28 22:35 [PATCH 0/6] I/O path improvements for ND_BLK and PMEM Ross Zwisler
                   ` (3 preceding siblings ...)
  2015-05-28 22:35 ` [PATCH 4/6] pmem, nd_blk: update I/O paths to use PMEM API Ross Zwisler
@ 2015-05-28 22:35 ` Ross Zwisler
  2015-05-28 22:35 ` [PATCH 6/6] nd_blk: add support for NVDIMM flags Ross Zwisler
  5 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-05-28 22:35 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-nvdimm
  Cc: Ross Zwisler, Dan Williams, Rafael J. Wysocki

Add support for flush hints and use them in the nd_blk I/O path instead
of the global persistent_sync() whenever we can.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linux-nvdimm@lists.01.org
Cc: linux-acpi@vger.kernel.org
---
 drivers/acpi/nfit.c | 49 +++++++++++++++++++++++++++++++++++++++++++++----
 drivers/acpi/nfit.h | 22 ++++++++++++++++++++++
 2 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 22df61968c1c..6cca50d1c690 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -286,9 +286,20 @@ static void *add_table(struct acpi_nfit_desc *acpi_desc, void *table, const void
 				idt->interleave_index, idt->line_count);
 		break;
 	}
-	case ACPI_NFIT_TYPE_FLUSH_ADDRESS:
-		dev_dbg(dev, "%s: flush\n", __func__);
+	case ACPI_NFIT_TYPE_FLUSH_ADDRESS: {
+		struct nfit_flush *nfit_flush = devm_kzalloc(dev,
+				sizeof(*nfit_flush), GFP_KERNEL);
+		struct acpi_nfit_flush_address *flush = table;
+
+		if (!nfit_flush)
+			return err;
+		INIT_LIST_HEAD(&nfit_flush->list);
+		nfit_flush->flush = flush;
+		list_add_tail(&nfit_flush->list, &acpi_desc->flushes);
+		dev_dbg(dev, "%s: flush_hint handle: %d hint_count: %d\n",
+			__func__, flush->device_handle, flush->hint_count);
 		break;
+	}
 	case ACPI_NFIT_TYPE_SMBIOS:
 		dev_dbg(dev, "%s: smbios\n", __func__);
 		break;
@@ -338,6 +349,7 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
 {
 	u16 dcr_index = __to_nfit_memdev(nfit_mem)->region_index;
 	struct nfit_memdev *nfit_memdev;
+	struct nfit_flush *nfit_flush;
 	struct nfit_dcr *nfit_dcr;
 	struct nfit_bdw *nfit_bdw;
 	struct nfit_idt *nfit_idt;
@@ -384,6 +396,7 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
 				nfit_memdev->memdev->region_index != dcr_index)
 			continue;
 		nfit_mem->memdev_bdw = nfit_memdev->memdev;
+
 		idt_index = nfit_memdev->memdev->interleave_index;
 		list_for_each_entry(nfit_idt, &acpi_desc->idts, list) {
 			if (nfit_idt->idt->interleave_index != idt_index)
@@ -391,6 +404,14 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
 			nfit_mem->idt_bdw = nfit_idt->idt;
 			break;
 		}
+
+		list_for_each_entry(nfit_flush, &acpi_desc->flushes, list) {
+			if (nfit_flush->flush->device_handle !=
+					nfit_memdev->memdev->device_handle)
+				continue;
+			nfit_mem->nfit_flush = nfit_flush;
+			break;
+		}
 		break;
 	}
 
@@ -929,7 +950,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
 
 	/* mmio->base must be mapped uncacheable */
 	writeq(cmd, mmio->base + offset);
-	persistent_sync();
+	nfit_blk->psync(nfit_blk);
 	/* FIXME: conditionally perform read-back if mandated by firmware */
 }
 
@@ -970,7 +991,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk, void *iobuf,
 	}
 
 	if (write)
-		persistent_sync();
+		nfit_blk->psync(nfit_blk);
 
 	rc = read_blk_stat(nfit_blk, bw) ? -EIO : 0;
 	return rc;
@@ -1130,6 +1151,7 @@ static int acpi_nfit_blk_region_enable(struct nd_bus *nd_bus, struct device *dev
 	struct nd_bus_descriptor *nd_desc = to_nd_desc(nd_bus);
 	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
 	struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+	struct nfit_flush *nfit_flush;
 	struct nfit_blk_mmio *mmio;
 	struct nfit_blk *nfit_blk;
 	struct nfit_mem *nfit_mem;
@@ -1195,6 +1217,24 @@ static int acpi_nfit_blk_region_enable(struct nd_bus *nd_bus, struct device *dev
 		return rc;
 	}
 
+	nfit_flush = nfit_mem->nfit_flush;
+	if (nfit_flush && nfit_flush->flush->hint_count != 0) {
+		struct acpi_nfit_flush_address *flush = nfit_flush->flush;
+		struct resource res;
+
+		res.start = flush->hint_address[0];
+		res.end = flush->hint_address[0] + sizeof(u64) - 1;
+		res.name = dev_name(dev);
+		res.flags = IORESOURCE_MEM;
+
+		/* only need a single hint address.  map uncacheable */
+		nfit_blk->flush_hint = devm_ioremap_resource(dev, &res);
+		if (IS_ERR(nfit_blk->flush_hint))
+			return -ENOMEM;
+		nfit_blk->psync = directed_psync;
+	} else
+		nfit_blk->psync = global_psync;
+
 	if (mmio->line_size == 0)
 		return 0;
 
@@ -1349,6 +1389,7 @@ int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 	INIT_LIST_HEAD(&acpi_desc->dcrs);
 	INIT_LIST_HEAD(&acpi_desc->bdws);
 	INIT_LIST_HEAD(&acpi_desc->idts);
+	INIT_LIST_HEAD(&acpi_desc->flushes);
 	INIT_LIST_HEAD(&acpi_desc->memdevs);
 	INIT_LIST_HEAD(&acpi_desc->dimms);
 	mutex_init(&acpi_desc->spa_map_mutex);
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index b882a22ee7bb..858719017bef 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -18,6 +18,7 @@
 #include <linux/libnd.h>
 #include <linux/uuid.h>
 #include <linux/acpi.h>
+#include <linux/pmem.h>
 #include <acpi/acuuid.h>
 
 #define UUID_NFIT_BUS "2f10e7a4-9e91-11e4-89d3-123b93f75cba"
@@ -66,6 +67,11 @@ struct nfit_idt {
 	struct list_head list;
 };
 
+struct nfit_flush {
+	struct acpi_nfit_flush_address *flush;
+	struct list_head list;
+};
+
 struct nfit_memdev {
 	struct acpi_nfit_memory_map *memdev;
 	struct list_head list;
@@ -83,6 +89,7 @@ struct nfit_mem {
 	struct acpi_nfit_system_address *spa_bdw;
 	struct acpi_nfit_interleave *idt_dcr;
 	struct acpi_nfit_interleave *idt_bdw;
+	struct nfit_flush *nfit_flush;
 	struct list_head list;
 	struct acpi_device *adev;
 	unsigned long dsm_mask;
@@ -94,6 +101,7 @@ struct acpi_nfit_desc {
 	struct mutex spa_map_mutex;
 	struct list_head spa_maps;
 	struct list_head memdevs;
+	struct list_head flushes;
 	struct list_head dimms;
 	struct list_head spas;
 	struct list_head dcrs;
@@ -131,6 +139,8 @@ struct nfit_blk {
 	u64 bdw_offset; /* post interleave offset */
 	u64 stat_offset;
 	u64 cmd_offset;
+	void __iomem *flush_hint;
+	void (*psync)(struct nfit_blk *nfit_blk);
 };
 
 struct nfit_spa_mapping {
@@ -158,6 +168,18 @@ static inline struct acpi_nfit_desc *to_acpi_desc(struct nd_bus_descriptor *nd_d
 	return container_of(nd_desc, struct acpi_nfit_desc, nd_desc);
 }
 
+static inline void directed_psync(struct nfit_blk *nfit_blk)
+{
+	wmb(); /* order previous writes */
+	writeq(1, nfit_blk->flush_hint); /* flush_hint must be mapped UC */
+	wmb(); /* order the write to the flush_hint */
+}
+
+static inline void global_psync(struct nfit_blk *nfit_blk)
+{
+	persistent_sync();
+}
+
 const u8 *to_nfit_uuid(enum nfit_uuids id);
 int acpi_nfit_init(struct acpi_nfit_desc *nfit, acpi_size sz);
 #endif /* __NFIT_H__ */
-- 
1.9.3



* [PATCH 6/6] nd_blk: add support for NVDIMM flags
  2015-05-28 22:35 [PATCH 0/6] I/O path improvements for ND_BLK and PMEM Ross Zwisler
                   ` (4 preceding siblings ...)
  2015-05-28 22:35 ` [PATCH 5/6] nd_blk: add support for flush hints Ross Zwisler
@ 2015-05-28 22:35 ` Ross Zwisler
  5 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-05-28 22:35 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-nvdimm
  Cc: Ross Zwisler, Dan Williams, Rafael J. Wysocki, linux-api

Add support in the ND_BLK I/O path for the "latch" and "flush" flags
defined in the "Get Block NVDIMM Flags" _DSM function:

http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linux-nvdimm@lists.01.org
Cc: linux-acpi@vger.kernel.org
Cc: linux-api@vger.kernel.org
---
 drivers/acpi/nfit.c        | 26 ++++++++++++++++++++++++--
 drivers/acpi/nfit.h        |  1 +
 include/uapi/linux/ndctl.h |  5 +++++
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 6cca50d1c690..0da84e067643 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -951,7 +951,9 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
 	/* mmio->base must be mapped uncacheable */
 	writeq(cmd, mmio->base + offset);
 	nfit_blk->psync(nfit_blk);
-	/* FIXME: conditionally perform read-back if mandated by firmware */
+
+	if (nfit_blk->dimm_flags & ND_BLK_DCR_LATCH)
+		readq(mmio->base + offset);
 }
 
 static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk, void *iobuf,
@@ -983,8 +985,12 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk, void *iobuf,
 
 		if (write)
 			persistent_copy(mmio->aperture + offset, iobuf + copied, c);
-		else
+		else {
+			if (nfit_blk->dimm_flags & ND_BLK_FLUSH_REQUIRED)
+				persistent_flush(mmio->aperture + offset, c);
+
 			memcpy(iobuf + copied, mmio->aperture + offset, c);
+		}
 
 		copied += c;
 		len -= c;
@@ -1146,6 +1152,20 @@ static int nfit_blk_init_interleave(struct nfit_blk_mmio *mmio,
 	return 0;
 }
 
+static void nd_acpi_blk_get_flags(struct nd_bus_descriptor *nd_desc,
+		struct nd_dimm *nd_dimm, struct nfit_blk *nfit_blk)
+{
+	struct nd_cmd_dimm_flags flags;
+	int rc;
+
+	memset(&flags, 0, sizeof(flags));
+	rc = nd_desc->ndctl(nd_desc, nd_dimm, ND_CMD_DIMM_FLAGS, &flags,
+			sizeof(flags));
+
+	if (rc >= 0 && !flags.status)
+		nfit_blk->dimm_flags = flags.flags;
+}
+
 static int acpi_nfit_blk_region_enable(struct nd_bus *nd_bus, struct device *dev)
 {
 	struct nd_bus_descriptor *nd_desc = to_nd_desc(nd_bus);
@@ -1217,6 +1237,8 @@ static int acpi_nfit_blk_region_enable(struct nd_bus *nd_bus, struct device *dev
 		return rc;
 	}
 
+	nd_acpi_blk_get_flags(nd_desc, nd_dimm, nfit_blk);
+
 	nfit_flush = nfit_mem->nfit_flush;
 	if (nfit_flush && nfit_flush->flush->hint_count != 0) {
 		struct acpi_nfit_flush_address *flush = nfit_flush->flush;
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index 858719017bef..d22b04ac3d2c 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -139,6 +139,7 @@ struct nfit_blk {
 	u64 bdw_offset; /* post interleave offset */
 	u64 stat_offset;
 	u64 cmd_offset;
+	u32 dimm_flags;
 	void __iomem *flush_hint;
 	void (*psync)(struct nfit_blk *nfit_blk);
 };
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index e595751c613d..bb0fa1c611e0 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -30,6 +30,11 @@ struct nd_cmd_dimm_flags {
 	__u32 flags;
 } __packed;
 
+enum {
+	ND_BLK_FLUSH_REQUIRED = 1,
+	ND_BLK_DCR_LATCH = 2,
+};
+
 struct nd_cmd_get_config_size {
 	__u32 status;
 	__u32 config_size;
-- 
1.9.3



* Re: [PATCH 2/6] nfit: Fix up address spaces, sparse warnings
  2015-05-28 22:35 ` [PATCH 2/6] nfit: Fix up address spaces, sparse warnings Ross Zwisler
@ 2015-05-28 22:40   ` Dan Williams
  0 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2015-05-28 22:40 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Linux ACPI, linux-nvdimm@lists.01.org, Rafael J. Wysocki

On Thu, May 28, 2015 at 3:35 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> Fix a couple of sparse warnings (duplicate const, incorrect address
> space), and go back to using memcpy() instead of memcpy_toio() and
> memcpy_fromio() when talking to our block apertures.  Instead, include a
> union to alias mmio->base and mmio->aperture so that we can continue to
> reuse common code for ioremapping and deinterleaving.  mmio->base still
> has the __iomem annotation and is used via readq() and writeq() for the
> control and status registers.  mmio->aperture is used via normal
> memcpy() for aperture I/O.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-acpi@vger.kernel.org

Folded into libnd-for-next.

Thanks Ross!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/6] pmem: add force casts to avoid __iomem annotation
  2015-05-28 22:35 ` [PATCH 1/6] pmem: add force casts to avoid __iomem annotation Ross Zwisler
@ 2015-05-28 22:47   ` Dan Williams
  2015-05-29 11:39     ` Ross Zwisler
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2015-05-28 22:47 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: linux-kernel, Linux ACPI, linux-nvdimm@lists.01.org

On Thu, May 28, 2015 at 3:35 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> Even though we use ioremap_nocache() to map our persistent memory in the
> pmem driver, the memory we are mapping behaves like normal memory and
> not I/O memory in that it can be accessed using regular memcpy()
> operations and doesn't need to go through memcpy_toio() and
> memcpy_fromio().  Force casting the pointers received from
> ioremap_nocache() and given to iounmap() gives us the correct behavior
> and makes sparse happy.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-nvdimm@lists.01.org
> ---
>  drivers/block/nd/pmem.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
> index 5e8c9c629f22..a8712e41e7f5 100644
> --- a/drivers/block/nd/pmem.c
> +++ b/drivers/block/nd/pmem.c
> @@ -163,7 +163,8 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res,
>          * of the CPU caches in case of a crash.
>          */
>         err = -ENOMEM;
> -       pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
> +       pmem->virt_addr = (__force void *)ioremap_nocache(pmem->phys_addr,
> +                       pmem->size);

I think I'd rather see casting when ->virt_addr is used (the
__io_virt() helper can be used to make this a tad cleaner), or provide
ioremap apis that don't attach __iomem to their return value.  Because
in this and other cases ioremap() is being used on non "i/o" memory.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] x86, pmem: add PMEM API for persistent memory
  2015-05-28 22:35 ` [PATCH 3/6] x86, pmem: add PMEM API for persistent memory Ross Zwisler
@ 2015-05-28 23:20   ` H. Peter Anvin
  2015-05-29  0:02     ` Dan Williams
  2015-05-29 12:07     ` Ross Zwisler
  0 siblings, 2 replies; 19+ messages in thread
From: H. Peter Anvin @ 2015-05-28 23:20 UTC (permalink / raw)
  To: Ross Zwisler, linux-kernel, linux-acpi, linux-nvdimm
  Cc: Dan Williams, Thomas Gleixner, Ingo Molnar, x86

On 05/28/2015 03:35 PM, Ross Zwisler wrote:
> Add a new PMEM API to x86, and allow for architectures that do not
> implement this API.  Architectures that implement the PMEM API should
> define ARCH_HAS_PMEM_API in their kernel configuration and must provide
> implementations for persistent_copy(), persistent_flush() and
> persistent_sync().

>  
>  void clflush_cache_range(void *addr, unsigned int size);
>  

No, no, no, no, no.  Include the proper header file.

> +static inline void arch_persistent_flush(void *vaddr, size_t size)
> +{
> +	clflush_cache_range(vaddr, size);
> +}

Shouldn't this really be using clwb() -- we really need a
clwb_cache_range() I guess?

Incidentally, clflush_cache_range() seems to have the same flaw as the
proposed use case for clwb() had... if the buffer is aligned it will
needlessly flush the last line twice.  It should really look something
like this (which would be a good standalone patch):

void clflush_cache_range(void *vaddr, unsigned int size)
{
        void *vend = vaddr + size - 1;

        mb();

        vaddr = (void *)
                ((unsigned long)vaddr
                 & ~(boot_cpu_data.x86_clflush_size - 1));

        for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
                clflushopt(vaddr);

        mb();
}
EXPORT_SYMBOL_GPL(clflush_cache_range);

I also note that with your implementation we have a wmb() in
arch_persistent_sync() and an mb() in arch_persistent_flush()... surely
one is redundant?

	-hpa



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] x86, pmem: add PMEM API for persistent memory
  2015-05-28 23:20   ` H. Peter Anvin
@ 2015-05-29  0:02     ` Dan Williams
  2015-05-29  4:19       ` H. Peter Anvin
  2015-05-29 12:07     ` Ross Zwisler
  1 sibling, 1 reply; 19+ messages in thread
From: Dan Williams @ 2015-05-29  0:02 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ross Zwisler, linux-kernel, Linux ACPI, linux-nvdimm,
	Thomas Gleixner, Ingo Molnar, X86 ML

On Thu, May 28, 2015 at 4:20 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 05/28/2015 03:35 PM, Ross Zwisler wrote:
>> Add a new PMEM API to x86, and allow for architectures that do not
>> implement this API.  Architectures that implement the PMEM API should
>> define ARCH_HAS_PMEM_API in their kernel configuration and must provide
>> implementations for persistent_copy(), persistent_flush() and
>> persistent_sync().
>
>>
>>  void clflush_cache_range(void *addr, unsigned int size);
>>
>
> No, no, no, no, no.  Include the proper header file.
>
>> +static inline void arch_persistent_flush(void *vaddr, size_t size)
>> +{
>> +     clflush_cache_range(vaddr, size);
>> +}
>
> Shouldn't this really be using clwb() -- we really need a
> clwb_cache_range() I guess?
>
> Incidentally, clflush_cache_range() seems to have the same flaw as the
> proposed use case for clwb() had... if the buffer is aligned it will
> needlessly flush the last line twice.  It should really look something
> like this (which would be a good standalone patch):
>
> void clflush_cache_range(void *vaddr, unsigned int size)
> {
>         void *vend = vaddr + size - 1;
>
>         mb();
>
>         vaddr = (void *)
>                 ((unsigned long)vaddr
>                  & ~(boot_cpu_data.x86_clflush_size - 1));
>
>         for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
>                 clflushopt(vaddr);
>
>         mb();
> }
> EXPORT_SYMBOL_GPL(clflush_cache_range);
>
> I also note that with your implementation we have a wmb() in
> arch_persistent_sync() and an mb() in arch_persistent_flush()... surely
> one is redundant?

Hmm, yes, but I believe Ross (on vacation now) was following the
precedent set by commit cd8ddf1a2800 "x86: clflush_page_range needs
mfence" whereby the api handles all necessary fencing internally.
Shall we introduce something like __unordered_clflush_cache_range()
for arch_persistent_flush() to use with the understanding it will be
following up with the wmb() in arch_persistent_sync()?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] x86, pmem: add PMEM API for persistent memory
  2015-05-29  0:02     ` Dan Williams
@ 2015-05-29  4:19       ` H. Peter Anvin
  2015-05-29 12:11         ` Ross Zwisler
  0 siblings, 1 reply; 19+ messages in thread
From: H. Peter Anvin @ 2015-05-29  4:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, linux-kernel, Linux ACPI, linux-nvdimm,
	Thomas Gleixner, Ingo Molnar, X86 ML

On 05/28/2015 05:02 PM, Dan Williams wrote:
> 
> Hmm, yes, but I believe Ross (on vacation now) was following the
> precedent set by commit cd8ddf1a2800 "x86: clflush_page_range needs
> mfence" whereby the api handles all necessary fencing internally.
> Shall we introduce something like __unordered_clflush_cache_range()
> for arch_persistent_flush() to use with the understanding it will be
> following up with the wmb() in arch_persistent_sync()?
> 

Are we ever going to have arch_persistent_sync() without
arch_persistent_flush()?

However, thinking about it, it would be more efficient to do all flushes
first and then have a single barrier.

	-hpa


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/6] pmem: add force casts to avoid __iomem annotation
  2015-05-28 22:47   ` Dan Williams
@ 2015-05-29 11:39     ` Ross Zwisler
  2015-05-29 12:53       ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Ross Zwisler @ 2015-05-29 11:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, Linux ACPI, linux-kernel, linux-nvdimm@lists.01.org

On Thu, 2015-05-28 at 15:47 -0700, Dan Williams wrote:
> On Thu, May 28, 2015 at 3:35 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > Even though we use ioremap_nocache() to map our persistent memory in the
> > pmem driver, the memory we are mapping behaves like normal memory and
> > not I/O memory in that it can be accessed using regular memcpy()
> > operations and doesn't need to go through memcpy_toio() and
> > memcpy_fromio().  Force casting the pointers received from
> > ioremap_nocache() and given to iounmap() gives us the correct behavior
> > and makes sparse happy.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: linux-nvdimm@lists.01.org
> > ---
> >  drivers/block/nd/pmem.c | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
> > index 5e8c9c629f22..a8712e41e7f5 100644
> > --- a/drivers/block/nd/pmem.c
> > +++ b/drivers/block/nd/pmem.c
> > @@ -163,7 +163,8 @@ static struct pmem_device *pmem_alloc(struct 
> > device *dev, struct resource *res,
> >          * of the CPU caches in case of a crash.
> >          */
> >         err = -ENOMEM;
> > -       pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem
> > ->size);
> > +       pmem->virt_addr = (__force void *)ioremap_nocache(pmem
> > ->phys_addr,
> > +                       pmem->size);
> 
> I think I'd rather see casting when ->virt_addr is used (the
> __io_virt() helper can be used to make this a tad cleaner), or provide
> ioremap apis that don't attach __iomem to their return value.  Because
> in this and other cases ioremap() is being used on non "i/o" memory.

The reason that I thought this was cleaner was that now when you look
at the pmem->virt_addr definition it is just a clean void* with no
annotations.  This correctly describes the memory to the user (it's
usable as regular memory, it's in the kernel address space, etc.).

Having the pointer itself annotated with __iomem feels weird to me
because a random well meaning user could incorrectly try to use it as
I/O memory.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] x86, pmem: add PMEM API for persistent memory
  2015-05-28 23:20   ` H. Peter Anvin
  2015-05-29  0:02     ` Dan Williams
@ 2015-05-29 12:07     ` Ross Zwisler
  2015-05-29 15:48       ` Dan Williams
  1 sibling, 1 reply; 19+ messages in thread
From: Ross Zwisler @ 2015-05-29 12:07 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ross Zwisler, linux-kernel, linux-acpi, linux-nvdimm, x86,
	Ingo Molnar, Thomas Gleixner

On Thu, 2015-05-28 at 16:20 -0700, H. Peter Anvin wrote:
> On 05/28/2015 03:35 PM, Ross Zwisler wrote:
> > Add a new PMEM API to x86, and allow for architectures that do not
> > implement this API.  Architectures that implement the PMEM API should
> > define ARCH_HAS_PMEM_API in their kernel configuration and must provide
> > implementations for persistent_copy(), persistent_flush() and
> > persistent_sync().
> 
> >  
> >  void clflush_cache_range(void *addr, unsigned int size);
> >  
> 
> No, no, no, no, no.  Include the proper header file.

I'm confused - I did include <asm/cacheflush.h> in pmem.h?  The line
you're quoting above was an unmodified line from asm/cacheflush.h - I
didn't redefine the prototype for clflush_cache_range() anywhere.

Or does this comment mean that you think we shouldn't have an
architecture agnostic PMEM API, and that you think the PMEM and ND_BLK
drivers should just directly include asm/cacheflush.h and use the x86
APIs directly?

> > +static inline void arch_persistent_flush(void *vaddr, size_t size)
> > +{
> > +   clflush_cache_range(vaddr, size);
> > +}
> 
> Shouldn't this really be using clwb() -- we really need a
> clwb_cache_range() I guess?

I think we will need a clwb_cache_range() for DAX, for when it responds
to a msync() or fsync() call and needs to rip through a bunch of
memory, writing it back to the DIMMs.  I just didn't add it yet because
I didn't have a consumer.

It turns out that for the block aperture I/O case we really do need a
flush instead of a writeback, though, so clflush_cache_range() is
perfect.  Here's the flow, which is a read from a block window
aperture:

1) The nd_blk driver gets a read request, and selects a block window to
use.  It's entirely possible that this block window's aperture has
clean cache lines associated with it in the processor cache hierarchy. 
 It shouldn't be possible that it has dirty cache lines - we either
just did a read, or we did a write and would have used NT stores.

2) Write a new address into the block window control register.  The
memory backing the aperture moves to the new address.  Any clean lines
held in the processor cache are now out of sync.

3) Flush the cache lines associated with the aperture.  The lines are
guaranteed to be clean, so the flush will just discard them and no
writebacks will occur.

4) Read the contents of the block aperture, servicing the read.

This is the read flow outlined in the "driver writer's guide":

http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf  

> Incidentally, clflush_cache_range() seems to have the same flaw as the
> proposed use case for clwb() had... if the buffer is aligned it will
> needlessly flush the last line twice.  It should really look something
> like this (which would be a good standalone patch):
> 
> void clflush_cache_range(void *vaddr, unsigned int size)
> {
>         void *vend = vaddr + size - 1;
> 
>         mb();
> 
>       vaddr = (void *)
>               ((unsigned long)vaddr
>                & ~(boot_cpu_data.x86_clflush_size - 1));
> 
>         for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
>                 clflushopt(vaddr);
> 
>         mb();
> }
> EXPORT_SYMBOL_GPL(clflush_cache_range);

Ah, yep, I saw the same thing and already submitted patches to fix.  I
think this change should be in the TIP tree:

https://lkml.org/lkml/2015/5/11/336


> I also note that with your implementation we have a wmb() in
> arch_persistent_sync() and an mb() in arch_persistent_flush()... 
> surely one is redundant?

Actually, the way that we need to use arch_persistent_flush() for our
block window read case, the fencing works out so that nothing is
redundant.  We never actually use both a persistent_sync() call and a
persistent_flush() call during the same I/O.  Reads use
persistent_flush() to invalidate obsolete lines in the cache before
reading real data from the aperture of the DIMM, and writes use a bunch
of NT stores followed by a persistent_sync().

The PMEM driver doesn't use persistent_flush() at all - this API is
only needed for the block window read case.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] x86, pmem: add PMEM API for persistent memory
  2015-05-29  4:19       ` H. Peter Anvin
@ 2015-05-29 12:11         ` Ross Zwisler
  0 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-05-29 12:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Dan Williams, linux-nvdimm, X86 ML, linux-kernel, Linux ACPI,
	Ingo Molnar, Thomas Gleixner

On Thu, 2015-05-28 at 21:19 -0700, H. Peter Anvin wrote:
> On 05/28/2015 05:02 PM, Dan Williams wrote:
> > 
> > Hmm, yes, but I believe Ross (on vacation now) was following the
> > precedent set by commit cd8ddf1a2800 "x86: clflush_page_range needs
> > mfence" whereby the api handles all necessary fencing internally.
> > Shall we introduce something like __unordered_clflush_cache_range()
> > for arch_persistent_flush() to use with the understanding it will be
> > following up with the wmb() in arch_persistent_sync()?
> > 
> 
> Are we ever going to have arch_persistent_sync() without
> arch_persistent_flush()?
> 
> However, thinking about it, it would be more efficient to do all flushes
> first and then have a single barrier.

Yep, we have arch_persistent_sync() without arch_persistent_flush() in
both our PMEM and ND_BLK write paths.  These use arch_persistent_copy()
to get NT stores, so they don't need to manually flush or write back
before doing a persistent_sync().

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/6] pmem: add force casts to avoid __iomem annotation
  2015-05-29 11:39     ` Ross Zwisler
@ 2015-05-29 12:53       ` Dan Williams
  2015-05-29 13:22         ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2015-05-29 12:53 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Ross Zwisler, Linux ACPI, linux-kernel, linux-nvdimm@lists.01.org

On Fri, May 29, 2015 at 4:39 AM, Ross Zwisler <zwisler@gmail.com> wrote:
> On Thu, 2015-05-28 at 15:47 -0700, Dan Williams wrote:
>> On Thu, May 28, 2015 at 3:35 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> > Even though we use ioremap_nocache() to map our persistent memory
>> > in the
>> > pmem driver, the memory we are mapping behaves like normal memory
>> > and
>> > not I/O memory in that it can be accessed using regular memcpy()
>> > operations and doesn't need to go through memcpy_toio() and
>> > memcpy_fromio().  Force casting the pointers received from
>> > ioremap_nocache() and given to iounmap() gives us the correct
>> > behavior
>> > and makes sparse happy.
>> >
>> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> > Cc: Dan Williams <dan.j.williams@intel.com>
>> > Cc: linux-nvdimm@lists.01.org
>> > ---
>> >  drivers/block/nd/pmem.c | 7 ++++---
>> >  1 file changed, 4 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
>> > index 5e8c9c629f22..a8712e41e7f5 100644
>> > --- a/drivers/block/nd/pmem.c
>> > +++ b/drivers/block/nd/pmem.c
>> > @@ -163,7 +163,8 @@ static struct pmem_device *pmem_alloc(struct
>> > device *dev, struct resource *res,
>> >          * of the CPU caches in case of a crash.
>> >          */
>> >         err = -ENOMEM;
>> > -       pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem
>> > ->size);
>> > +       pmem->virt_addr = (__force void *)ioremap_nocache(pmem
>> > ->phys_addr,
>> > +                       pmem->size);
>>
>> I think I'd rather see casting when ->virt_addr is used (the
>> __io_virt() helper can be used to make this a tad cleaner), or
>> provide
>> ioremap apis that don't attach __iomem to their return value.
>>  Because
>> in this and other cases ioremap() is being used on non "i/o" memory.
>
> The reason that I thought this was cleaner was that now when you look
> at the pmem->virt_addr definition it is just a clean void* with no
> annotations.  This correctly describes the memory to the user (it's
> usable as regular memory, it's in the kernel address space, etc.).
>
> Having the pointer itself annotated with __iomem feels weird to me
> because a random well meaning user could incorrectly try to use it as
> I/O memory.

pmem->virt_addr does not leak outside the driver to random well
meaning users.  I think we have two options, provide physical address
remap helpers from the outset (memremap()?) that don't attach __iomem,
or put the casts on the non-iomem usages.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/6] pmem: add force casts to avoid __iomem annotation
  2015-05-29 12:53       ` Dan Williams
@ 2015-05-29 13:22         ` Dan Williams
  0 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2015-05-29 13:22 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Ross Zwisler, Linux ACPI, linux-kernel, linux-nvdimm@lists.01.org

On Fri, May 29, 2015 at 5:53 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, May 29, 2015 at 4:39 AM, Ross Zwisler <zwisler@gmail.com> wrote:
>> On Thu, 2015-05-28 at 15:47 -0700, Dan Williams wrote:
>>> On Thu, May 28, 2015 at 3:35 PM, Ross Zwisler
>>> <ross.zwisler@linux.intel.com> wrote:
>>> > Even though we use ioremap_nocache() to map our persistent memory
>>> > in the
>>> > pmem driver, the memory we are mapping behaves like normal memory
>>> > and
>>> > not I/O memory in that it can be accessed using regular memcpy()
>>> > operations and doesn't need to go through memcpy_toio() and
>>> > memcpy_fromio().  Force casting the pointers received from
>>> > ioremap_nocache() and given to iounmap() gives us the correct
>>> > behavior
>>> > and makes sparse happy.
>>> >
>>> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>>> > Cc: Dan Williams <dan.j.williams@intel.com>
>>> > Cc: linux-nvdimm@lists.01.org
>>> > ---
>>> >  drivers/block/nd/pmem.c | 7 ++++---
>>> >  1 file changed, 4 insertions(+), 3 deletions(-)
>>> >
>>> > diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
>>> > index 5e8c9c629f22..a8712e41e7f5 100644
>>> > --- a/drivers/block/nd/pmem.c
>>> > +++ b/drivers/block/nd/pmem.c
>>> > @@ -163,7 +163,8 @@ static struct pmem_device *pmem_alloc(struct
>>> > device *dev, struct resource *res,
>>> >          * of the CPU caches in case of a crash.
>>> >          */
>>> >         err = -ENOMEM;
>>> > -       pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem
>>> > ->size);
>>> > +       pmem->virt_addr = (__force void *)ioremap_nocache(pmem
>>> > ->phys_addr,
>>> > +                       pmem->size);
>>>
>>> I think I'd rather see casting when ->virt_addr is used (the
>>> __io_virt() helper can be used to make this a tad cleaner), or
>>> provide
>>> ioremap apis that don't attach __iomem to their return value.
>>>  Because
>>> in this and other cases ioremap() is being used on non "i/o" memory.
>>
>> The reason that I thought this was cleaner was that now when you look
>> at the pmem->virt_addr definition it is just a clean void* with no
>> annotations.  This correctly describes the memory to the user (it's
>> usable as regular memory, it's in the kernel address space, etc.).
>>
>> Having the pointer itself annotated with __iomem feels weird to me
>> because a random well meaning user could incorrectly try to use it as
>> I/O memory.
>
> pmem->virt_addr does not leak outside the driver to random well
> meaning users.  I think we have two options, provide physical address
> remap helpers from the outset (memremap()?) that don't attach __iomem,
> or put the casts on the non-iomem usages.

In other words, I don't like the fact that we apply a coarse / private
hack to fix a general kernel mis-annotation problem.  Either fix it
for everybody with something like memremap() or use __io_virt() to
document the non-iomem usages of pmem->virt_addr like the other
iomem-to-undecorated pointer helper routines in the kernel.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] pmem, nd_blk: update I/O paths to use PMEM API
  2015-05-28 22:35 ` [PATCH 4/6] pmem, nd_blk: update I/O paths to use PMEM API Ross Zwisler
@ 2015-05-29 14:11   ` Dan Williams
  0 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2015-05-29 14:11 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Linux ACPI, linux-nvdimm@lists.01.org, Rafael J. Wysocki

On Thu, May 28, 2015 at 3:35 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> Update the PMEM and ND_BLK I/O paths to use the new PMEM API and to
> adhere to the read/write flows outlined in the "NVDIMM Block Window
> Driver Writer's Guide":
>
> http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: linux-acpi@vger.kernel.org
> ---
>  drivers/acpi/nfit.c     | 10 ++++++++--
>  drivers/block/nd/pmem.c | 15 +++++++++++----
>  2 files changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
> index df14652ea13e..22df61968c1c 100644
> --- a/drivers/acpi/nfit.c
> +++ b/drivers/acpi/nfit.c
> @@ -18,6 +18,7 @@
>  #include <linux/list.h>
>  #include <linux/acpi.h>
>  #include <linux/sort.h>
> +#include <linux/pmem.h>
>  #include <linux/io.h>
>  #include "nfit.h"
>
> @@ -926,7 +927,9 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
>         if (mmio->num_lines)
>                 offset = to_interleave_offset(offset, mmio);
>
> +       /* mmio->base must be mapped uncacheable */
>         writeq(cmd, mmio->base + offset);
> +       persistent_sync();
>         /* FIXME: conditionally perform read-back if mandated by firmware */
>  }
>
> @@ -940,7 +943,6 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk, void *iobuf,
>         int rc;
>
>         base_offset = nfit_blk->bdw_offset + dpa % L1_CACHE_BYTES + bw * mmio->size;
> -       /* TODO: non-temporal access, flush hints, cache management etc... */
>         write_blk_ctl(nfit_blk, bw, dpa, len, write);
>         while (len) {
>                 unsigned int c;
> @@ -959,13 +961,17 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk, void *iobuf,
>                 }
>
>                 if (write)
> -                       memcpy(mmio->aperture + offset, iobuf + copied, c);
> +                       persistent_copy(mmio->aperture + offset, iobuf + copied, c);
>                 else
>                         memcpy(iobuf + copied, mmio->aperture + offset, c);
>
>                 copied += c;
>                 len -= c;
>         }
> +
> +       if (write)
> +               persistent_sync();
> +
>         rc = read_blk_stat(nfit_blk, bw) ? -EIO : 0;
>         return rc;
>  }
> diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
> index a8712e41e7f5..1a447c351aef 100644
> --- a/drivers/block/nd/pmem.c
> +++ b/drivers/block/nd/pmem.c
> @@ -15,15 +15,16 @@
>   * more details.
>   */
>
> -#include <asm/cacheflush.h>
>  #include <linux/blkdev.h>
>  #include <linux/hdreg.h>
>  #include <linux/init.h>
>  #include <linux/platform_device.h>
>  #include <linux/module.h>
>  #include <linux/moduleparam.h>
> +#include <linux/pmem.h>
>  #include <linux/slab.h>
>  #include <linux/nd.h>
> +#include <asm/cacheflush.h>
>  #include "nd.h"
>
>  struct pmem_device {
> @@ -51,7 +52,7 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
>                 flush_dcache_page(page);
>         } else {
>                 flush_dcache_page(page);
> -               memcpy(pmem->virt_addr + pmem_off, mem + off, len);
> +               persistent_copy(pmem->virt_addr + pmem_off, mem + off, len);
>         }
>
>         kunmap_atomic(mem);
> @@ -82,6 +83,8 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
>                 sector += bvec.bv_len >> 9;
>         }
>
> +       if (rw & WRITE)
> +               persistent_sync();
>  out:
>         bio_endio(bio, err);
>  }
> @@ -92,6 +95,8 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
>         struct pmem_device *pmem = bdev->bd_disk->private_data;
>
>         pmem_do_bvec(pmem, page, PAGE_CACHE_SIZE, 0, rw, sector);
> +       if (rw & WRITE)
> +               persistent_sync();
>         page_endio(page, rw & WRITE, 0);
>
>         return 0;
> @@ -111,8 +116,10 @@ static int pmem_rw_bytes(struct nd_io *ndio, void *buf, size_t offset,
>
>         if (rw == READ)
>                 memcpy(buf, pmem->virt_addr + offset, n);
> -       else
> -               memcpy(pmem->virt_addr + offset, buf, n);
> +       else {
> +               persistent_copy(pmem->virt_addr + offset, buf, n);
> +               persistent_sync();
> +       }
>

It bothers me that persistent_sync() is a lie for almost every system
on the planet.  Even though uncached mappings are not sufficient for
guaranteeing persistence the potential damage is minimized.   To me it
seems the driver should have separate I/O paths for the persistent and
non-persistent case determined and notified at init time.  I.e.
arrange to never call persistent_sync() when it is known to be a nop
and tell the user "btw, your data is at risk".

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] x86, pmem: add PMEM API for persistent memory
  2015-05-29 12:07     ` Ross Zwisler
@ 2015-05-29 15:48       ` Dan Williams
  0 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2015-05-29 15:48 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: H. Peter Anvin, linux-nvdimm, X86 ML, linux-kernel, Linux ACPI,
	Ingo Molnar, Thomas Gleixner

On Fri, May 29, 2015 at 5:07 AM, Ross Zwisler <zwisler@gmail.com> wrote:
> On Thu, 2015-05-28 at 16:20 -0700, H. Peter Anvin wrote:
>> On 05/28/2015 03:35 PM, Ross Zwisler wrote:
>> > Add a new PMEM API to x86, and allow for architectures that do not
>> > implement this API.  Architectures that implement the PMEM API
>> > should
>> > define ARCH_HAS_PMEM_API in their kernel configuration and must
>> > provide
>> > implementations for persistent_copy(), persistent_flush() and
>> > persistent_sync().
>>
>> >
>> >  void clflush_cache_range(void *addr, unsigned int size);
>> >
>>
>> No, no, no, no, no.  Include the proper header file.
>
> I'm confused - I did inlcude <asm/cacheflush.h> in pmem.h?  The line
> you're quoting above was an unmodified line from asm/cacheflush.h - I
> didn't redefine the prototype for clflush_cache_range() anywhere.
>
> Or does this comment mean that you think we shouldn't have an
> architecture agnostic PMEM API, and that you think the PMEM and ND_BLK
> drivers should just directly include asm/cacheflush.h and use the x86
> APIs directly?
>
>> > +static inline void arch_persistent_flush(void *vaddr, size_t size)
>> > +{
>> > +   clflush_cache_range(vaddr, size);
>> > +}
>>
>> Shouldn't this really be using clwb() -- we really need a
>> clwb_cache_range() I guess?
>
> I think we will need a clwb_cache_range() for DAX, for when it responds
> to a msync() or fsync() call and needs to rip through a bunch of
> memory, writing it back to the DIMMs.  I just didn't add it yet because
> I didn't have a consumer.
>
> It turns out that for the block aperture I/O case we really do need a
> flush instead of a writeback, though, so clflush_cache_range() is
> perfect.  Here's the flow, which is a read from a block window
> aperture:
>
> 1) The nd_blk driver gets a read request, and selects a block window to
> use.  It's entirely possible that this block window's aperture has
> clean cache lines associated with it in the processor cache hierarchy.
>  It shouldn't be possible that it has dirty cache lines - we either
> just did a read, or we did a write and would have used NT stores.
>
> 2) Write a new address into the block window control register.  The
> memory backing the aperture moves to the new address.  Any clean lines
> held in the processor cache are now out of sync.
>
> 3) Flush the cache lines associated with the aperture.  The lines are
> guaranteed to be clean, so the flush will just discard them and no
> writebacks will occur.
>
> 4) Read the contents of the block aperture, servicing the read.
>
> This is the read flow outlined in the "driver writer's guide":
>
> http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
>
>> Incidentally, clflush_cache_range() seems to have the same flaw as
>> the
>> proposed use case for clwb() had... if the buffer is aligned it will
>> needlessly flush the last line twice.  It should really look
>> something
>> like this (which would be a good standalone patch):
>>
>> void clflush_cache_range(void *vaddr, unsigned int size)
>> {
>>         void *vend = vaddr + size - 1;
>>
>>         mb();
>>
>>         vaddr = (void *)
>>                 ((unsigned long)vaddr
>>                  & ~(boot_cpu_data.x86_clflush_size - 1));
>>
>>         for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
>>                 clflushopt(vaddr);
>>
>>         mb();
>> }
>> EXPORT_SYMBOL_GPL(clflush_cache_range);
>
> Ah, yep, I saw the same thing and already submitted patches to fix.  I
> think this change should be in the TIP tree:
>
> https://lkml.org/lkml/2015/5/11/336
>
>
>> I also note that with your implementation we have a wmb() in
>> arch_persistent_sync() and an mb() in arch_persistent_flush()...
>> surely one is redundant?
>
> Actually, the way that we need to use arch_persistent_flush() for our
> block window read case, the fencing works out so that nothing is
> redundant.  We never actually use both a persistent_sync() call and a
> persistent_flush() call during the same I/O.  Reads use
> persistent_flush() to invalidate obsolete lines in the cache before
> reading real data from the aperture of the DIMM, and writes use a bunch
> of NT stores followed by a persistent_sync().
>
> The PMEM driver doesn't use persistent_flush() at all - this API is
> only needed for the block window read case.

Then that's not a "persistence flush", that's a shootdown of the
previous mmio block window setting.  If it's only for BLK reads I
think we need to use clflush_cache_range() directly.  Given that BLK
mode already depends on ACPI I think it's fine for now to make BLK
mode depend on x86.  Otherwise, we need a new cross-arch generic cache
flush primitive like io_flush_cache_range() and have BLK mode depend
on ARCH_HAS_IO_FLUSH_CACHE_RANGE.



Thread overview: 19+ messages
2015-05-28 22:35 [PATCH 0/6] I/O path improvements for ND_BLK and PMEM Ross Zwisler
2015-05-28 22:35 ` [PATCH 1/6] pmem: add force casts to avoid __iomem annotation Ross Zwisler
2015-05-28 22:47   ` Dan Williams
2015-05-29 11:39     ` Ross Zwisler
2015-05-29 12:53       ` Dan Williams
2015-05-29 13:22         ` Dan Williams
2015-05-28 22:35 ` [PATCH 2/6] nfit: Fix up address spaces, sparse warnings Ross Zwisler
2015-05-28 22:40   ` Dan Williams
2015-05-28 22:35 ` [PATCH 3/6] x86, pmem: add PMEM API for persistent memory Ross Zwisler
2015-05-28 23:20   ` H. Peter Anvin
2015-05-29  0:02     ` Dan Williams
2015-05-29  4:19       ` H. Peter Anvin
2015-05-29 12:11         ` Ross Zwisler
2015-05-29 12:07     ` Ross Zwisler
2015-05-29 15:48       ` Dan Williams
2015-05-28 22:35 ` [PATCH 4/6] pmem, nd_blk: update I/O paths to use PMEM API Ross Zwisler
2015-05-29 14:11   ` Dan Williams
2015-05-28 22:35 ` [PATCH 5/6] nd_blk: add support for flush hints Ross Zwisler
2015-05-28 22:35 ` [PATCH 6/6] nd_blk: add support for NVDIMM flags Ross Zwisler
