[NAK] copy_from_iter

All of lore.kernel.org
 help / color / mirror / Atom feed

* [NAK] copy_from_iter_ops()
@ 2017-04-25  1:22 Al Viro
  2017-04-25  2:35 ` Dan Williams
  2017-04-26 21:56   ` Dan Williams
  0 siblings, 2 replies; 53+ messages in thread
From: Al Viro @ 2017-04-25  1:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Matthew Wilcox, linux-nvdimm, Linus Torvalds,
	Christoph Hellwig

	I should have looked and commented earlier, but I hadn't spotted
that thing until -next conflicts had shown up.  As the matter of fact,
I don't have this series in my mailbox - it had been Cc'd my way, apparently,
but it looks like it never made it there, so I'm posting from scratch instead
of replying.  Sorry.

	The following "primitive" is complete crap

+#ifdef CONFIG_COPY_FROM_ITER_OPS
+size_t copy_from_iter_ops(void *addr, size_t bytes, struct iov_iter *i,
+               int (*user)(void *, const void __user *, unsigned),
+               void (*page)(char *, struct page *, size_t, size_t),
+               void (*copy)(void *, void *, unsigned))
+{
+       char *to = addr;
+
+       if (unlikely(i->type & ITER_PIPE)) {
+               WARN_ON(1);
+               return 0;
+       }
+       iterate_and_advance(i, bytes, v,
+               user((to += v.iov_len) - v.iov_len, v.iov_base,
+                                v.iov_len),
+               page((to += v.bv_len) - v.bv_len, v.bv_page, v.bv_offset,
+                               v.bv_len),
+               copy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+       )
+
+       return bytes;
+}
+EXPORT_SYMBOL_GPL(copy_from_iter_ops);
+#endif

	1) Every time we get a new copy-from flavour of iov_iter, you will
need an extra argument and every caller will need to be updated.

	2) If it's a general-purpose primitive, it should *not* be
behind a CONFIG_<whatever> to be selected by callers.  If it isn't,
it shouldn't be there at all, period.  And no, EXPORT_SYMBOL_GPL doesn't
make it any better.

	3) The caller makes very little sense.  Is that thing meant to
be x86-only?  What are the requirements regarding writeback?  Is that thing
just go-fast stripes, or...?  Basically, all questions asked back in Decemeber
thread (memcpy_nocache()) still apply.

	I strongly object to that interface.  Let's figure out what's
really needed for your copy_from_iter_pmem() and bloody put the
iterator-related part (without the callbacks, etc.) into lib/iov_iter.c
With memcpy_to_pmem() and pmem_from_user() used by it.

	Incidentally, your fallback for memcpy_to_pmem() is... odd.
It used to be "just use memcpy()" and now it's "just do nothing".  What
the hell?  If it's really "you should not use that if you don't have
arch-specific variant", let it at least BUG(), if not fail to link.

	On the uaccess side, should pmem_from_user() zero what it had failed
to copy?  And for !@#!@# sake, comments like this
+        * On x86_64 __copy_from_user_nocache() uses non-temporal stores
+        * for the bulk of the transfer, but we need to manually flush
+        * if the transfer is unaligned. A cached memory copy is used
+        * when destination or size is not naturally aligned. That is:
+        *   - Require 8-byte alignment when size is 8 bytes or larger.
+        *   - Require 4-byte alignment when size is 4 bytes.
mean only one thing: this should live in arch/x86/lib/usercopy_64.c,
right next to the actual function that does copying.  NOT in
drivers/nvdimm/x86.c.  At the very least it needs a comment in usercopy_64.c
with dire warnings along the lines of "don't touch that code without
looking into <filename>:pmem_from_user()"...
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [NAK] copy_from_iter_ops()
  2017-04-25  1:22 [NAK] copy_from_iter_ops() Al Viro
@ 2017-04-25  2:35 ` Dan Williams
  2017-04-26 21:56   ` Dan Williams
  1 sibling, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-04-25  2:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Matthew Wilcox, linux-nvdimm, Linus Torvalds,
	Christoph Hellwig

On Mon, Apr 24, 2017 at 6:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>         I should have looked and commented earlier, but I hadn't spotted
> that thing until -next conflicts had shown up.  As the matter of fact,
> I don't have this series in my mailbox - it had been Cc'd my way, apparently,
> but it looks like it never made it there, so I'm posting from scratch instead
> of replying.  Sorry.
>
>         The following "primitive" is complete crap
>
> +#ifdef CONFIG_COPY_FROM_ITER_OPS
> +size_t copy_from_iter_ops(void *addr, size_t bytes, struct iov_iter *i,
> +               int (*user)(void *, const void __user *, unsigned),
> +               void (*page)(char *, struct page *, size_t, size_t),
> +               void (*copy)(void *, void *, unsigned))
> +{
> +       char *to = addr;
> +
> +       if (unlikely(i->type & ITER_PIPE)) {
> +               WARN_ON(1);
> +               return 0;
> +       }
> +       iterate_and_advance(i, bytes, v,
> +               user((to += v.iov_len) - v.iov_len, v.iov_base,
> +                                v.iov_len),
> +               page((to += v.bv_len) - v.bv_len, v.bv_page, v.bv_offset,
> +                               v.bv_len),
> +               copy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
> +       )
> +
> +       return bytes;
> +}
> +EXPORT_SYMBOL_GPL(copy_from_iter_ops);
> +#endif
>
>         1) Every time we get a new copy-from flavour of iov_iter, you will
> need an extra argument and every caller will need to be updated.
>
>         2) If it's a general-purpose primitive, it should *not* be
> behind a CONFIG_<whatever> to be selected by callers.  If it isn't,
> it shouldn't be there at all, period.  And no, EXPORT_SYMBOL_GPL doesn't
> make it any better.

Ok, that was only there to appease the config-tiny crowd that wouldn't
want to lib/iov_iter.c get bigger if pmem is turned off.

>
>         3) The caller makes very little sense.  Is that thing meant to
> be x86-only?  What are the requirements regarding writeback?  Is that thing
> just go-fast stripes, or...?  Basically, all questions asked back in Decemeber
> thread (memcpy_nocache()) still apply.

The caller is meant to be x86_64-only just like the "pmem api" it is
replacing, but other architectures can add pmem support in the same
template.  All of the _to_pmem() operations are expected to have all
data flushed out of the cache at completion. They can still be pending
in the cpu store buffer awaiting a future sfence which is wired to
take place when the block layer issues a REQ_FUA or REQ_FLUSH request.

>         I strongly object to that interface.  Let's figure out what's
> really needed for your copy_from_iter_pmem() and bloody put the
> iterator-related part (without the callbacks, etc.) into lib/iov_iter.c
> With memcpy_to_pmem() and pmem_from_user() used by it.

So this is the opposite of what I understood from Linus' comments, see below...

>         Incidentally, your fallback for memcpy_to_pmem() is... odd.
> It used to be "just use memcpy()" and now it's "just do nothing".  What
> the hell?  If it's really "you should not use that if you don't have
> arch-specific variant", let it at least BUG(), if not fail to link.

No, just a terrible oversight on my part. Should just be plain memcpy,
and I'll go add non-x86_64 testing to my regression suite now.

>
>         On the uaccess side, should pmem_from_user() zero what it had failed
> to copy?  And for !@#!@# sake, comments like this
> +        * On x86_64 __copy_from_user_nocache() uses non-temporal stores
> +        * for the bulk of the transfer, but we need to manually flush
> +        * if the transfer is unaligned. A cached memory copy is used
> +        * when destination or size is not naturally aligned. That is:
> +        *   - Require 8-byte alignment when size is 8 bytes or larger.
> +        *   - Require 4-byte alignment when size is 4 bytes.
> mean only one thing: this should live in arch/x86/lib/usercopy_64.c,
> right next to the actual function that does copying.  NOT in
> drivers/nvdimm/x86.c.  At the very least it needs a comment in usercopy_64.c
> with dire warnings along the lines of "don't touch that code without
> looking into <filename>:pmem_from_user()"...

So pushing this all into drivers/nvdimm/x86.c was my interpretation of
this from Linus the last time we talked about this.

   "Quite frankly, the whole 'memcpy_nocache()' idea or (ab-)using
    copy_user_nocache() just needs to die. It's idiotic.

    As you point out, it's also fundamentally buggy crap.

    Throw it away. There is no possible way this is ever valid or
    portable. We're not going to lie and claim that it is.

    If some driver ends up using 'movnt' by hand, that is up to that
    *driver*. But no way in hell should we care about this one whit in
    the sense of <linux/uaccess.h>."
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem
  2017-04-25  1:22 [NAK] copy_from_iter_ops() Al Viro
  2017-04-25  2:35 ` Dan Williams
@ 2017-04-26 21:56   ` Dan Williams
  1 sibling, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-04-26 21:56 UTC (permalink / raw)
  To: viro
  Cc: Jan Kara, Matthew Wilcox, x86, linux-kernel, linux-block,
	linux-nvdimm, Ingo Molnar, H. Peter Anvin, linux-fsdevel,
	Thomas Gleixner, hch

The pmem driver has a need to transfer data with a persistent memory
destination and be able to rely on the fact that the destination writes
are not cached. It is sufficient for the writes to be flushed to a
cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
userspace to call fsync() to ensure data-writes have reached a
power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
REQ_FLUSH to the pmem driver which will turn around and fence previous
writes with an "sfence".

Implement a __copy_from_user_inatomic_writethru, memcpy_page_writethru,
and memcpy_writethru, that guarantee that the destination buffer is not
dirty in the cpu cache on completion. The new copy_from_iter_writethru
and sub-routines will be used to replace the "pmem api"
(include/linux/pmem.h + arch/x86/include/asm/pmem.h). The availability
of copy_from_iter_writethru() and memcpy_writethru() are gated by the
CONFIG_ARCH_HAS_UACCESS_WRITETHRU config symbol, and fallback to
copy_from_iter_nocache() and plain memcpy() otherwise.

This is meant to satisfy the concern from Linus that if a driver wants
to do something beyond the normal nocache semantics it should be
something private to that driver [1], and Al's concern that anything
uaccess related belongs with the rest of the uaccess code [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html

Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

This patch is based on a merge of vfs.git/for-next and
nvdimm.git/libnvdimm-for-next.

 arch/x86/Kconfig                  |    1 
 arch/x86/include/asm/string_64.h  |    5 +
 arch/x86/include/asm/uaccess_64.h |   13 ++++
 arch/x86/lib/usercopy_64.c        |  128 +++++++++++++++++++++++++++++++++++++
 drivers/acpi/nfit/core.c          |    2 -
 drivers/nvdimm/claim.c            |    2 -
 drivers/nvdimm/pmem.c             |   13 +++-
 drivers/nvdimm/region_devs.c      |    2 -
 include/linux/dax.h               |    3 +
 include/linux/string.h            |    6 ++
 include/linux/uio.h               |   15 ++++
 lib/Kconfig                       |    3 +
 lib/iov_iter.c                    |   22 ++++++
 13 files changed, 210 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d50fdff77ee..bd3ff407d707 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MMIO_FLUSH
 	select ARCH_HAS_PMEM_API		if X86_64
+	select ARCH_HAS_UACCESS_WRITETHRU	if X86_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 733bae07fb29..60173bc51603 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -109,6 +109,11 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
 	return 0;
 }
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+#define __HAVE_ARCH_MEMCPY_WRITETHRU 1
+void memcpy_writethru(void *dst, const void *src, size_t cnt);
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c5504b9a472e..748e8a50e4b3 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -171,6 +171,11 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
 extern long __copy_user_nocache(void *dst, const void __user *src,
 				unsigned size, int zerorest);
 
+extern long __copy_user_writethru(void *dst, const void __user *src,
+				  unsigned size);
+extern void memcpy_page_writethru(char *to, struct page *page, size_t offset,
+				  size_t len);
+
 static inline int
 __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 				  unsigned size)
@@ -179,6 +184,14 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 	return __copy_user_nocache(dst, src, size, 0);
 }
 
+static inline int
+__copy_from_user_inatomic_writethru(void *dst, const void __user *src,
+				  unsigned size)
+{
+	kasan_check_write(dst, size);
+	return __copy_user_writethru(dst, src, size);
+}
+
 unsigned long
 copy_user_handle_tail(char *to, char *from, unsigned len);
 
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 3b7c40a2e3e1..144cb5e59193 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -7,6 +7,7 @@
  */
 #include <linux/export.h>
 #include <linux/uaccess.h>
+#include <linux/highmem.h>
 
 /*
  * Zero Userspace
@@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
 	clac();
 	return len;
 }
+
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+/**
+ * clean_cache_range - write back a cache range with CLWB
+ * @vaddr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back a cache range using the CLWB (cache line write back)
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
+ */
+static void clean_cache_range(void *addr, size_t size)
+{
+	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
+	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vend = addr + size;
+	void *p;
+
+	for (p = (void *)((unsigned long)addr & ~clflush_mask);
+	     p < vend; p += x86_clflush_size)
+		clwb(p);
+}
+
+long __copy_user_writethru(void *dst, const void __user *src, unsigned size)
+{
+	unsigned long flushed, dest = (unsigned long) dst;
+	long rc = __copy_user_nocache(dst, src, size, 0);
+
+	/*
+	 * __copy_user_nocache() uses non-temporal stores for the bulk
+	 * of the transfer, but we need to manually flush if the
+	 * transfer is unaligned. A cached memory copy is used when
+	 * destination or size is not naturally aligned. That is:
+	 *   - Require 8-byte alignment when size is 8 bytes or larger.
+	 *   - Require 4-byte alignment when size is 4 bytes.
+	 */
+	if (size < 8) {
+		if (!IS_ALIGNED(dest, 4) || size != 4)
+			clean_cache_range(dst, 1);
+	} else {
+		if (!IS_ALIGNED(dest, 8)) {
+			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+			clean_cache_range(dst, 1);
+		}
+
+		flushed = dest - (unsigned long) dst;
+		if (size > flushed && !IS_ALIGNED(size - flushed, 8))
+			clean_cache_range(dst + size - 1, 1);
+	}
+
+	return rc;
+}
+
+void memcpy_writethru(void *_dst, const void *_src, size_t size)
+{
+	unsigned long dest = (unsigned long) _dst;
+	unsigned long source = (unsigned long) _src;
+
+	/* cache copy and flush to align dest */
+	if (!IS_ALIGNED(dest, 8)) {
+		unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
+
+		memcpy((void *) dest, (void *) source, len);
+		clean_cache_range((void *) dest, len);
+		dest += len;
+		source += len;
+		size -= len;
+		if (!size)
+			return;
+	}
+
+	/* 4x8 movnti loop */
+	while (size >= 32) {
+		asm("movq    (%0), %%r8\n"
+		    "movq   8(%0), %%r9\n"
+		    "movq  16(%0), %%r10\n"
+		    "movq  24(%0), %%r11\n"
+		    "movnti  %%r8,   (%1)\n"
+		    "movnti  %%r9,  8(%1)\n"
+		    "movnti %%r10, 16(%1)\n"
+		    "movnti %%r11, 24(%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8", "r9", "r10", "r11");
+		dest += 32;
+		source += 32;
+		size -= 32;
+	}
+
+	/* 1x8 movnti loop */
+	while (size >= 8) {
+		asm("movq    (%0), %%r8\n"
+		    "movnti  %%r8,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 8;
+		source += 8;
+		size -= 8;
+	}
+
+	/* 1x4 movnti loop */
+	while (size >= 4) {
+		asm("movl    (%0), %%r8d\n"
+		    "movnti  %%r8d,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 4;
+		source += 4;
+		size -= 4;
+	}
+
+	/* cache copy for remaining bytes */
+	if (size) {
+		memcpy((void *) dest, (void *) source, size);
+		clean_cache_range((void *) dest, size);
+	}
+}
+EXPORT_SYMBOL_GPL(memcpy_writethru);
+
+void memcpy_page_writethru(char *to, struct page *page, size_t offset,
+		size_t len)
+{
+	char *from = kmap_atomic(page);
+
+	memcpy_writethru(to, from + offset, len);
+	kunmap_atomic(from);
+}
+#endif
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index d0c07b2344e4..c84e242f91ed 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1776,7 +1776,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 		}
 
 		if (rw)
-			memcpy_to_pmem(mmio->addr.aperture + offset,
+			memcpy_writethru(mmio->addr.aperture + offset,
 					iobuf + copied, c);
 		else {
 			if (nfit_blk->dimm_flags & NFIT_BLK_READ_FLUSH)
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 3a35e8028b9c..38822f6fa49f 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -266,7 +266,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 			rc = -EIO;
 	}
 
-	memcpy_to_pmem(nsio->addr + offset, buf, size);
+	memcpy_writethru(nsio->addr + offset, buf, size);
 	nvdimm_flush(to_nd_region(ndns->dev.parent));
 
 	return rc;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..28dc82a595a5 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
 #include <linux/pfn_t.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
+#include <linux/uio.h>
 #include <linux/dax.h>
 #include <linux/nd.h>
 #include "pmem.h"
@@ -79,7 +80,7 @@ static void write_pmem(void *pmem_addr, struct page *page,
 {
 	void *mem = kmap_atomic(page);
 
-	memcpy_to_pmem(pmem_addr, mem + off, len);
+	memcpy_writethru(pmem_addr, mem + off, len);
 	kunmap_atomic(mem);
 }
 
@@ -234,8 +235,15 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	return copy_from_iter_writethru(addr, bytes, i);
+}
+
 static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
+	.copy_from_iter = pmem_copy_from_iter,
 };
 
 static void pmem_release_queue(void *q)
@@ -288,7 +296,8 @@ static int pmem_attach_disk(struct device *dev,
 	dev_set_drvdata(dev, pmem);
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
-	if (nvdimm_has_flush(nd_region) < 0)
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WRITETHRU)
+			|| nvdimm_has_flush(nd_region) < 0)
 		dev_warn(dev, "unable to guarantee persistence of writes\n");
 
 	if (!devm_request_mem_region(dev, res->start, resource_size(res),
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b7cb5066d961..b668ba455c39 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -947,7 +947,7 @@ void nvdimm_flush(struct nd_region *nd_region)
 	 * The first wmb() is needed to 'sfence' all previous writes
 	 * such that they are architecturally visible for the platform
 	 * buffer flush.  Note that we've already arranged for pmem
-	 * writes to avoid the cache via arch_memcpy_to_pmem().  The
+	 * writes to avoid the cache via memcpy_writethru().  The
 	 * final wmb() ensures ordering for the NVDIMM flush write.
 	 */
 	wmb();
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..156f067d4db5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
 	 */
 	long (*direct_access)(struct dax_device *, pgoff_t, long,
 			void **, pfn_t *);
+	/* copy_from_iter: dax-driver override for default copy_from_iter */
+	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+			struct iov_iter *);
 };
 
 int dax_read_lock(void);
diff --git a/include/linux/string.h b/include/linux/string.h
index 9d6f189157e2..f4e166d88e2a 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -122,6 +122,12 @@ static inline __must_check int memcpy_mcsafe(void *dst, const void *src,
 	return 0;
 }
 #endif
+#ifndef __HAVE_ARCH_MEMCPY_WRITETHRU
+static inline void memcpy_writethru(void *dst, const void *src, size_t cnt)
+{
+	memcpy(dst, src, cnt);
+}
+#endif
 void *memchr_inv(const void *s, int c, size_t n);
 char *strreplace(char *s, char old, char new);
 
diff --git a/include/linux/uio.h b/include/linux/uio.h
index f2d36a3d3005..d284cb5e89fa 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -95,6 +95,21 @@ size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
 bool copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+/*
+ * Note, users like pmem that depend on the stricter semantics of
+ * copy_from_iter_writethru() than copy_from_iter_nocache() must check
+ * for IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WRITETHROUGH) before assuming
+ * that the destination is flushed from the cache on return.
+ */
+size_t copy_from_iter_writethru(void *addr, size_t bytes, struct iov_iter *i);
+#else
+static inline size_t copy_from_iter_writethru(void *addr, size_t bytes,
+		struct iov_iter *i)
+{
+	return copy_from_iter_nocache(addr, bytes, i);
+}
+#endif
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i);
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..db31bc186df2 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -548,6 +548,9 @@ config ARCH_HAS_SG_CHAIN
 config ARCH_HAS_PMEM_API
 	bool
 
+config ARCH_HAS_UACCESS_WRITETHRU
+	bool
+
 config ARCH_HAS_MMIO_FLUSH
 	bool
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f7c93568ec99..afc3dc75346c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -615,6 +615,28 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(copy_from_iter_nocache);
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+size_t copy_from_iter_writethru(void *addr, size_t bytes, struct iov_iter *i)
+{
+	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
+	iterate_and_advance(i, bytes, v,
+		__copy_from_user_inatomic_writethru((to += v.iov_len) - v.iov_len,
+					 v.iov_base, v.iov_len),
+		memcpy_page_writethru((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len),
+		memcpy_writethru((to += v.iov_len) - v.iov_len, v.iov_base,
+			v.iov_len)
+	)
+
+	return bytes;
+}
+EXPORT_SYMBOL_GPL(copy_from_iter_writethru);
+#endif
+
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem
@ 2017-04-26 21:56   ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-04-26 21:56 UTC (permalink / raw)
  To: viro
  Cc: Jan Kara, Matthew Wilcox, x86, linux-kernel, hch, linux-block,
	linux-nvdimm, Jeff Moyer, Ingo Molnar, H. Peter Anvin,
	linux-fsdevel, Thomas Gleixner, Ross Zwisler

The pmem driver has a need to transfer data with a persistent memory
destination and be able to rely on the fact that the destination writes
are not cached. It is sufficient for the writes to be flushed to a
cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
userspace to call fsync() to ensure data-writes have reached a
power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
REQ_FLUSH to the pmem driver which will turn around and fence previous
writes with an "sfence".

Implement a __copy_from_user_inatomic_writethru, memcpy_page_writethru,
and memcpy_writethru, that guarantee that the destination buffer is not
dirty in the cpu cache on completion. The new copy_from_iter_writethru
and sub-routines will be used to replace the "pmem api"
(include/linux/pmem.h + arch/x86/include/asm/pmem.h). The availability
of copy_from_iter_writethru() and memcpy_writethru() are gated by the
CONFIG_ARCH_HAS_UACCESS_WRITETHRU config symbol, and fallback to
copy_from_iter_nocache() and plain memcpy() otherwise.

This is meant to satisfy the concern from Linus that if a driver wants
to do something beyond the normal nocache semantics it should be
something private to that driver [1], and Al's concern that anything
uaccess related belongs with the rest of the uaccess code [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html

Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

This patch is based on a merge of vfs.git/for-next and
nvdimm.git/libnvdimm-for-next.

 arch/x86/Kconfig                  |    1 
 arch/x86/include/asm/string_64.h  |    5 +
 arch/x86/include/asm/uaccess_64.h |   13 ++++
 arch/x86/lib/usercopy_64.c        |  128 +++++++++++++++++++++++++++++++++++++
 drivers/acpi/nfit/core.c          |    2 -
 drivers/nvdimm/claim.c            |    2 -
 drivers/nvdimm/pmem.c             |   13 +++-
 drivers/nvdimm/region_devs.c      |    2 -
 include/linux/dax.h               |    3 +
 include/linux/string.h            |    6 ++
 include/linux/uio.h               |   15 ++++
 lib/Kconfig                       |    3 +
 lib/iov_iter.c                    |   22 ++++++
 13 files changed, 210 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d50fdff77ee..bd3ff407d707 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MMIO_FLUSH
 	select ARCH_HAS_PMEM_API		if X86_64
+	select ARCH_HAS_UACCESS_WRITETHRU	if X86_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 733bae07fb29..60173bc51603 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -109,6 +109,11 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
 	return 0;
 }
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+#define __HAVE_ARCH_MEMCPY_WRITETHRU 1
+void memcpy_writethru(void *dst, const void *src, size_t cnt);
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c5504b9a472e..748e8a50e4b3 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -171,6 +171,11 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
 extern long __copy_user_nocache(void *dst, const void __user *src,
 				unsigned size, int zerorest);
 
+extern long __copy_user_writethru(void *dst, const void __user *src,
+				  unsigned size);
+extern void memcpy_page_writethru(char *to, struct page *page, size_t offset,
+				  size_t len);
+
 static inline int
 __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 				  unsigned size)
@@ -179,6 +184,14 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 	return __copy_user_nocache(dst, src, size, 0);
 }
 
+static inline int
+__copy_from_user_inatomic_writethru(void *dst, const void __user *src,
+				  unsigned size)
+{
+	kasan_check_write(dst, size);
+	return __copy_user_writethru(dst, src, size);
+}
+
 unsigned long
 copy_user_handle_tail(char *to, char *from, unsigned len);
 
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 3b7c40a2e3e1..144cb5e59193 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -7,6 +7,7 @@
  */
 #include <linux/export.h>
 #include <linux/uaccess.h>
+#include <linux/highmem.h>
 
 /*
  * Zero Userspace
@@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
 	clac();
 	return len;
 }
+
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+/**
+ * clean_cache_range - write back a cache range with CLWB
+ * @vaddr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back a cache range using the CLWB (cache line write back)
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
+ */
+static void clean_cache_range(void *addr, size_t size)
+{
+	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
+	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vend = addr + size;
+	void *p;
+
+	for (p = (void *)((unsigned long)addr & ~clflush_mask);
+	     p < vend; p += x86_clflush_size)
+		clwb(p);
+}
+
+long __copy_user_writethru(void *dst, const void __user *src, unsigned size)
+{
+	unsigned long flushed, dest = (unsigned long) dst;
+	long rc = __copy_user_nocache(dst, src, size, 0);
+
+	/*
+	 * __copy_user_nocache() uses non-temporal stores for the bulk
+	 * of the transfer, but we need to manually flush if the
+	 * transfer is unaligned. A cached memory copy is used when
+	 * destination or size is not naturally aligned. That is:
+	 *   - Require 8-byte alignment when size is 8 bytes or larger.
+	 *   - Require 4-byte alignment when size is 4 bytes.
+	 */
+	if (size < 8) {
+		if (!IS_ALIGNED(dest, 4) || size != 4)
+			clean_cache_range(dst, 1);
+	} else {
+		if (!IS_ALIGNED(dest, 8)) {
+			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+			clean_cache_range(dst, 1);
+		}
+
+		flushed = dest - (unsigned long) dst;
+		if (size > flushed && !IS_ALIGNED(size - flushed, 8))
+			clean_cache_range(dst + size - 1, 1);
+	}
+
+	return rc;
+}
+
+void memcpy_writethru(void *_dst, const void *_src, size_t size)
+{
+	unsigned long dest = (unsigned long) _dst;
+	unsigned long source = (unsigned long) _src;
+
+	/* cache copy and flush to align dest */
+	if (!IS_ALIGNED(dest, 8)) {
+		unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
+
+		memcpy((void *) dest, (void *) source, len);
+		clean_cache_range((void *) dest, len);
+		dest += len;
+		source += len;
+		size -= len;
+		if (!size)
+			return;
+	}
+
+	/* 4x8 movnti loop */
+	while (size >= 32) {
+		asm("movq    (%0), %%r8\n"
+		    "movq   8(%0), %%r9\n"
+		    "movq  16(%0), %%r10\n"
+		    "movq  24(%0), %%r11\n"
+		    "movnti  %%r8,   (%1)\n"
+		    "movnti  %%r9,  8(%1)\n"
+		    "movnti %%r10, 16(%1)\n"
+		    "movnti %%r11, 24(%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8", "r9", "r10", "r11");
+		dest += 32;
+		source += 32;
+		size -= 32;
+	}
+
+	/* 1x8 movnti loop */
+	while (size >= 8) {
+		asm("movq    (%0), %%r8\n"
+		    "movnti  %%r8,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 8;
+		source += 8;
+		size -= 8;
+	}
+
+	/* 1x4 movnti loop */
+	while (size >= 4) {
+		asm("movl    (%0), %%r8d\n"
+		    "movnti  %%r8d,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 4;
+		source += 4;
+		size -= 4;
+	}
+
+	/* cache copy for remaining bytes */
+	if (size) {
+		memcpy((void *) dest, (void *) source, size);
+		clean_cache_range((void *) dest, size);
+	}
+}
+EXPORT_SYMBOL_GPL(memcpy_writethru);
+
+void memcpy_page_writethru(char *to, struct page *page, size_t offset,
+		size_t len)
+{
+	char *from = kmap_atomic(page);
+
+	memcpy_writethru(to, from + offset, len);
+	kunmap_atomic(from);
+}
+#endif
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index d0c07b2344e4..c84e242f91ed 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1776,7 +1776,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 		}
 
 		if (rw)
-			memcpy_to_pmem(mmio->addr.aperture + offset,
+			memcpy_writethru(mmio->addr.aperture + offset,
 					iobuf + copied, c);
 		else {
 			if (nfit_blk->dimm_flags & NFIT_BLK_READ_FLUSH)
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 3a35e8028b9c..38822f6fa49f 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -266,7 +266,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 			rc = -EIO;
 	}
 
-	memcpy_to_pmem(nsio->addr + offset, buf, size);
+	memcpy_writethru(nsio->addr + offset, buf, size);
 	nvdimm_flush(to_nd_region(ndns->dev.parent));
 
 	return rc;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..28dc82a595a5 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
 #include <linux/pfn_t.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
+#include <linux/uio.h>
 #include <linux/dax.h>
 #include <linux/nd.h>
 #include "pmem.h"
@@ -79,7 +80,7 @@ static void write_pmem(void *pmem_addr, struct page *page,
 {
 	void *mem = kmap_atomic(page);
 
-	memcpy_to_pmem(pmem_addr, mem + off, len);
+	memcpy_writethru(pmem_addr, mem + off, len);
 	kunmap_atomic(mem);
 }
 
@@ -234,8 +235,15 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	return copy_from_iter_writethru(addr, bytes, i);
+}
+
 static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
+	.copy_from_iter = pmem_copy_from_iter,
 };
 
 static void pmem_release_queue(void *q)
@@ -288,7 +296,8 @@ static int pmem_attach_disk(struct device *dev,
 	dev_set_drvdata(dev, pmem);
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
-	if (nvdimm_has_flush(nd_region) < 0)
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WRITETHRU)
+			|| nvdimm_has_flush(nd_region) < 0)
 		dev_warn(dev, "unable to guarantee persistence of writes\n");
 
 	if (!devm_request_mem_region(dev, res->start, resource_size(res),
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b7cb5066d961..b668ba455c39 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -947,7 +947,7 @@ void nvdimm_flush(struct nd_region *nd_region)
 	 * The first wmb() is needed to 'sfence' all previous writes
 	 * such that they are architecturally visible for the platform
 	 * buffer flush.  Note that we've already arranged for pmem
-	 * writes to avoid the cache via arch_memcpy_to_pmem().  The
+	 * writes to avoid the cache via memcpy_writethru().  The
 	 * final wmb() ensures ordering for the NVDIMM flush write.
 	 */
 	wmb();
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..156f067d4db5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
 	 */
 	long (*direct_access)(struct dax_device *, pgoff_t, long,
 			void **, pfn_t *);
+	/* copy_from_iter: dax-driver override for default copy_from_iter */
+	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+			struct iov_iter *);
 };
 
 int dax_read_lock(void);
diff --git a/include/linux/string.h b/include/linux/string.h
index 9d6f189157e2..f4e166d88e2a 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -122,6 +122,12 @@ static inline __must_check int memcpy_mcsafe(void *dst, const void *src,
 	return 0;
 }
 #endif
+#ifndef __HAVE_ARCH_MEMCPY_WRITETHRU
+static inline void memcpy_writethru(void *dst, const void *src, size_t cnt)
+{
+	memcpy(dst, src, cnt);
+}
+#endif
 void *memchr_inv(const void *s, int c, size_t n);
 char *strreplace(char *s, char old, char new);
 
diff --git a/include/linux/uio.h b/include/linux/uio.h
index f2d36a3d3005..d284cb5e89fa 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -95,6 +95,21 @@ size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
 bool copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+/*
+ * Note, users like pmem that depend on the stricter semantics of
+ * copy_from_iter_writethru() than copy_from_iter_nocache() must check
+ * for IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WRITETHROUGH) before assuming
+ * that the destination is flushed from the cache on return.
+ */
+size_t copy_from_iter_writethru(void *addr, size_t bytes, struct iov_iter *i);
+#else
+static inline size_t copy_from_iter_writethru(void *addr, size_t bytes,
+		struct iov_iter *i)
+{
+	return copy_from_iter_nocache(addr, bytes, i);
+}
+#endif
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i);
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..db31bc186df2 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -548,6 +548,9 @@ config ARCH_HAS_SG_CHAIN
 config ARCH_HAS_PMEM_API
 	bool
 
+config ARCH_HAS_UACCESS_WRITETHRU
+	bool
+
 config ARCH_HAS_MMIO_FLUSH
 	bool
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f7c93568ec99..afc3dc75346c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -615,6 +615,28 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(copy_from_iter_nocache);
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+size_t copy_from_iter_writethru(void *addr, size_t bytes, struct iov_iter *i)
+{
+	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
+	iterate_and_advance(i, bytes, v,
+		__copy_from_user_inatomic_writethru((to += v.iov_len) - v.iov_len,
+					 v.iov_base, v.iov_len),
+		memcpy_page_writethru((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len),
+		memcpy_writethru((to += v.iov_len) - v.iov_len, v.iov_base,
+			v.iov_len)
+	)
+
+	return bytes;
+}
+EXPORT_SYMBOL_GPL(copy_from_iter_writethru);
+#endif
+
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem
@ 2017-04-26 21:56   ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-04-26 21:56 UTC (permalink / raw)
  To: viro
  Cc: Jan Kara, Matthew Wilcox, x86, linux-kernel, hch, linux-block,
	linux-nvdimm, Jeff Moyer, Ingo Molnar, H. Peter Anvin,
	linux-fsdevel, Thomas Gleixner, Ross Zwisler

The pmem driver has a need to transfer data with a persistent memory
destination and be able to rely on the fact that the destination writes
are not cached. It is sufficient for the writes to be flushed to a
cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
userspace to call fsync() to ensure data-writes have reached a
power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
REQ_FLUSH to the pmem driver which will turn around and fence previous
writes with an "sfence".

Implement a __copy_from_user_inatomic_writethru, memcpy_page_writethru,
and memcpy_writethru, that guarantee that the destination buffer is not
dirty in the cpu cache on completion. The new copy_from_iter_writethru
and sub-routines will be used to replace the "pmem api"
(include/linux/pmem.h + arch/x86/include/asm/pmem.h). The availability
of copy_from_iter_writethru() and memcpy_writethru() are gated by the
CONFIG_ARCH_HAS_UACCESS_WRITETHRU config symbol, and fallback to
copy_from_iter_nocache() and plain memcpy() otherwise.

This is meant to satisfy the concern from Linus that if a driver wants
to do something beyond the normal nocache semantics it should be
something private to that driver [1], and Al's concern that anything
uaccess related belongs with the rest of the uaccess code [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html

Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

This patch is based on a merge of vfs.git/for-next and
nvdimm.git/libnvdimm-for-next.

 arch/x86/Kconfig                  |    1 
 arch/x86/include/asm/string_64.h  |    5 +
 arch/x86/include/asm/uaccess_64.h |   13 ++++
 arch/x86/lib/usercopy_64.c        |  128 +++++++++++++++++++++++++++++++++++++
 drivers/acpi/nfit/core.c          |    2 -
 drivers/nvdimm/claim.c            |    2 -
 drivers/nvdimm/pmem.c             |   13 +++-
 drivers/nvdimm/region_devs.c      |    2 -
 include/linux/dax.h               |    3 +
 include/linux/string.h            |    6 ++
 include/linux/uio.h               |   15 ++++
 lib/Kconfig                       |    3 +
 lib/iov_iter.c                    |   22 ++++++
 13 files changed, 210 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d50fdff77ee..bd3ff407d707 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MMIO_FLUSH
 	select ARCH_HAS_PMEM_API		if X86_64
+	select ARCH_HAS_UACCESS_WRITETHRU	if X86_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 733bae07fb29..60173bc51603 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -109,6 +109,11 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
 	return 0;
 }
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+#define __HAVE_ARCH_MEMCPY_WRITETHRU 1
+void memcpy_writethru(void *dst, const void *src, size_t cnt);
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c5504b9a472e..748e8a50e4b3 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -171,6 +171,11 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
 extern long __copy_user_nocache(void *dst, const void __user *src,
 				unsigned size, int zerorest);
 
+extern long __copy_user_writethru(void *dst, const void __user *src,
+				  unsigned size);
+extern void memcpy_page_writethru(char *to, struct page *page, size_t offset,
+				  size_t len);
+
 static inline int
 __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 				  unsigned size)
@@ -179,6 +184,14 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 	return __copy_user_nocache(dst, src, size, 0);
 }
 
+static inline int
+__copy_from_user_inatomic_writethru(void *dst, const void __user *src,
+				  unsigned size)
+{
+	kasan_check_write(dst, size);
+	return __copy_user_writethru(dst, src, size);
+}
+
 unsigned long
 copy_user_handle_tail(char *to, char *from, unsigned len);
 
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 3b7c40a2e3e1..144cb5e59193 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -7,6 +7,7 @@
  */
 #include <linux/export.h>
 #include <linux/uaccess.h>
+#include <linux/highmem.h>
 
 /*
  * Zero Userspace
@@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
 	clac();
 	return len;
 }
+
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+/**
+ * clean_cache_range - write back a cache range with CLWB
+ * @vaddr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back a cache range using the CLWB (cache line write back)
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
+ */
+static void clean_cache_range(void *addr, size_t size)
+{
+	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
+	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vend = addr + size;
+	void *p;
+
+	for (p = (void *)((unsigned long)addr & ~clflush_mask);
+	     p < vend; p += x86_clflush_size)
+		clwb(p);
+}
+
+long __copy_user_writethru(void *dst, const void __user *src, unsigned size)
+{
+	unsigned long flushed, dest = (unsigned long) dst;
+	long rc = __copy_user_nocache(dst, src, size, 0);
+
+	/*
+	 * __copy_user_nocache() uses non-temporal stores for the bulk
+	 * of the transfer, but we need to manually flush if the
+	 * transfer is unaligned. A cached memory copy is used when
+	 * destination or size is not naturally aligned. That is:
+	 *   - Require 8-byte alignment when size is 8 bytes or larger.
+	 *   - Require 4-byte alignment when size is 4 bytes.
+	 */
+	if (size < 8) {
+		if (!IS_ALIGNED(dest, 4) || size != 4)
+			clean_cache_range(dst, 1);
+	} else {
+		if (!IS_ALIGNED(dest, 8)) {
+			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+			clean_cache_range(dst, 1);
+		}
+
+		flushed = dest - (unsigned long) dst;
+		if (size > flushed && !IS_ALIGNED(size - flushed, 8))
+			clean_cache_range(dst + size - 1, 1);
+	}
+
+	return rc;
+}
+
+void memcpy_writethru(void *_dst, const void *_src, size_t size)
+{
+	unsigned long dest = (unsigned long) _dst;
+	unsigned long source = (unsigned long) _src;
+
+	/* cache copy and flush to align dest */
+	if (!IS_ALIGNED(dest, 8)) {
+		unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
+
+		memcpy((void *) dest, (void *) source, len);
+		clean_cache_range((void *) dest, len);
+		dest += len;
+		source += len;
+		size -= len;
+		if (!size)
+			return;
+	}
+
+	/* 4x8 movnti loop */
+	while (size >= 32) {
+		asm("movq    (%0), %%r8\n"
+		    "movq   8(%0), %%r9\n"
+		    "movq  16(%0), %%r10\n"
+		    "movq  24(%0), %%r11\n"
+		    "movnti  %%r8,   (%1)\n"
+		    "movnti  %%r9,  8(%1)\n"
+		    "movnti %%r10, 16(%1)\n"
+		    "movnti %%r11, 24(%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8", "r9", "r10", "r11");
+		dest += 32;
+		source += 32;
+		size -= 32;
+	}
+
+	/* 1x8 movnti loop */
+	while (size >= 8) {
+		asm("movq    (%0), %%r8\n"
+		    "movnti  %%r8,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 8;
+		source += 8;
+		size -= 8;
+	}
+
+	/* 1x4 movnti loop */
+	while (size >= 4) {
+		asm("movl    (%0), %%r8d\n"
+		    "movnti  %%r8d,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 4;
+		source += 4;
+		size -= 4;
+	}
+
+	/* cache copy for remaining bytes */
+	if (size) {
+		memcpy((void *) dest, (void *) source, size);
+		clean_cache_range((void *) dest, size);
+	}
+}
+EXPORT_SYMBOL_GPL(memcpy_writethru);
+
+void memcpy_page_writethru(char *to, struct page *page, size_t offset,
+		size_t len)
+{
+	char *from = kmap_atomic(page);
+
+	memcpy_writethru(to, from + offset, len);
+	kunmap_atomic(from);
+}
+#endif
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index d0c07b2344e4..c84e242f91ed 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1776,7 +1776,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 		}
 
 		if (rw)
-			memcpy_to_pmem(mmio->addr.aperture + offset,
+			memcpy_writethru(mmio->addr.aperture + offset,
 					iobuf + copied, c);
 		else {
 			if (nfit_blk->dimm_flags & NFIT_BLK_READ_FLUSH)
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 3a35e8028b9c..38822f6fa49f 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -266,7 +266,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 			rc = -EIO;
 	}
 
-	memcpy_to_pmem(nsio->addr + offset, buf, size);
+	memcpy_writethru(nsio->addr + offset, buf, size);
 	nvdimm_flush(to_nd_region(ndns->dev.parent));
 
 	return rc;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..28dc82a595a5 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
 #include <linux/pfn_t.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
+#include <linux/uio.h>
 #include <linux/dax.h>
 #include <linux/nd.h>
 #include "pmem.h"
@@ -79,7 +80,7 @@ static void write_pmem(void *pmem_addr, struct page *page,
 {
 	void *mem = kmap_atomic(page);
 
-	memcpy_to_pmem(pmem_addr, mem + off, len);
+	memcpy_writethru(pmem_addr, mem + off, len);
 	kunmap_atomic(mem);
 }
 
@@ -234,8 +235,15 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	return copy_from_iter_writethru(addr, bytes, i);
+}
+
 static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
+	.copy_from_iter = pmem_copy_from_iter,
 };
 
 static void pmem_release_queue(void *q)
@@ -288,7 +296,8 @@ static int pmem_attach_disk(struct device *dev,
 	dev_set_drvdata(dev, pmem);
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
-	if (nvdimm_has_flush(nd_region) < 0)
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WRITETHRU)
+			|| nvdimm_has_flush(nd_region) < 0)
 		dev_warn(dev, "unable to guarantee persistence of writes\n");
 
 	if (!devm_request_mem_region(dev, res->start, resource_size(res),
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b7cb5066d961..b668ba455c39 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -947,7 +947,7 @@ void nvdimm_flush(struct nd_region *nd_region)
 	 * The first wmb() is needed to 'sfence' all previous writes
 	 * such that they are architecturally visible for the platform
 	 * buffer flush.  Note that we've already arranged for pmem
-	 * writes to avoid the cache via arch_memcpy_to_pmem().  The
+	 * writes to avoid the cache via memcpy_writethru().  The
 	 * final wmb() ensures ordering for the NVDIMM flush write.
 	 */
 	wmb();
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..156f067d4db5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
 	 */
 	long (*direct_access)(struct dax_device *, pgoff_t, long,
 			void **, pfn_t *);
+	/* copy_from_iter: dax-driver override for default copy_from_iter */
+	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+			struct iov_iter *);
 };
 
 int dax_read_lock(void);
diff --git a/include/linux/string.h b/include/linux/string.h
index 9d6f189157e2..f4e166d88e2a 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -122,6 +122,12 @@ static inline __must_check int memcpy_mcsafe(void *dst, const void *src,
 	return 0;
 }
 #endif
+#ifndef __HAVE_ARCH_MEMCPY_WRITETHRU
+static inline void memcpy_writethru(void *dst, const void *src, size_t cnt)
+{
+	memcpy(dst, src, cnt);
+}
+#endif
 void *memchr_inv(const void *s, int c, size_t n);
 char *strreplace(char *s, char old, char new);
 
diff --git a/include/linux/uio.h b/include/linux/uio.h
index f2d36a3d3005..d284cb5e89fa 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -95,6 +95,21 @@ size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
 bool copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+/*
+ * Note, users like pmem that depend on the stricter semantics of
+ * copy_from_iter_writethru() than copy_from_iter_nocache() must check
+ * for IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WRITETHROUGH) before assuming
+ * that the destination is flushed from the cache on return.
+ */
+size_t copy_from_iter_writethru(void *addr, size_t bytes, struct iov_iter *i);
+#else
+static inline size_t copy_from_iter_writethru(void *addr, size_t bytes,
+		struct iov_iter *i)
+{
+	return copy_from_iter_nocache(addr, bytes, i);
+}
+#endif
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i);
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..db31bc186df2 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -548,6 +548,9 @@ config ARCH_HAS_SG_CHAIN
 config ARCH_HAS_PMEM_API
 	bool
 
+config ARCH_HAS_UACCESS_WRITETHRU
+	bool
+
 config ARCH_HAS_MMIO_FLUSH
 	bool
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f7c93568ec99..afc3dc75346c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -615,6 +615,28 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(copy_from_iter_nocache);
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+size_t copy_from_iter_writethru(void *addr, size_t bytes, struct iov_iter *i)
+{
+	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
+	iterate_and_advance(i, bytes, v,
+		__copy_from_user_inatomic_writethru((to += v.iov_len) - v.iov_len,
+					 v.iov_base, v.iov_len),
+		memcpy_page_writethru((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len),
+		memcpy_writethru((to += v.iov_len) - v.iov_len, v.iov_base,
+			v.iov_len)
+	)
+
+	return bytes;
+}
+EXPORT_SYMBOL_GPL(copy_from_iter_writethru);
+#endif
+
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem
  2017-04-26 21:56   ` Dan Williams
  (?)
@ 2017-04-27  6:30     ` Ingo Molnar
  -1 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2017-04-27  6:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Matthew Wilcox, x86, linux-kernel, linux-block,
	linux-nvdimm, Ingo Molnar, viro, H. Peter Anvin, linux-fsdevel,
	Thomas Gleixner, hch


* Dan Williams <dan.j.williams@intel.com> wrote:

> +#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
> +#define __HAVE_ARCH_MEMCPY_WRITETHRU 1
> +void memcpy_writethru(void *dst, const void *src, size_t cnt);
> +#endif

This should be named memcpy_wt(), which is the well-known postfix for 
write-through.

We already have ioremap_wt(), set_memory_wt(), etc. - no need to introduce a 
longer variant with uncommon spelling.

Thanks,

	Ingo
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem
@ 2017-04-27  6:30     ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2017-04-27  6:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: viro, Jan Kara, Matthew Wilcox, x86, linux-kernel, hch,
	linux-block, linux-nvdimm, Jeff Moyer, Ingo Molnar,
	H. Peter Anvin, linux-fsdevel, Thomas Gleixner, Ross Zwisler


* Dan Williams <dan.j.williams@intel.com> wrote:

> +#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
> +#define __HAVE_ARCH_MEMCPY_WRITETHRU 1
> +void memcpy_writethru(void *dst, const void *src, size_t cnt);
> +#endif

This should be named memcpy_wt(), which is the well-known postfix for 
write-through.

We already have ioremap_wt(), set_memory_wt(), etc. - no need to introduce a 
longer variant with uncommon spelling.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem
@ 2017-04-27  6:30     ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2017-04-27  6:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: viro, Jan Kara, Matthew Wilcox, x86, linux-kernel, hch,
	linux-block, linux-nvdimm, Jeff Moyer, Ingo Molnar,
	H. Peter Anvin, linux-fsdevel, Thomas Gleixner, Ross Zwisler


* Dan Williams <dan.j.williams@intel.com> wrote:

> +#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
> +#define __HAVE_ARCH_MEMCPY_WRITETHRU 1
> +void memcpy_writethru(void *dst, const void *src, size_t cnt);
> +#endif

This should be named memcpy_wt(), which is the well-known postfix for 
write-through.

We already have ioremap_wt(), set_memory_wt(), etc. - no need to introduce a 
longer variant with uncommon spelling.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-04-27  6:30     ` Ingo Molnar
  (?)
@ 2017-04-28 19:39       ` Dan Williams
  -1 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-04-28 19:39 UTC (permalink / raw)
  To: viro
  Cc: Jan Kara, Matthew Wilcox, x86, linux-kernel, linux-block,
	linux-nvdimm, Ingo Molnar, H. Peter Anvin, linux-fsdevel,
	Thomas Gleixner, hch

The pmem driver has a need to transfer data with a persistent memory
destination and be able to rely on the fact that the destination writes
are not cached. It is sufficient for the writes to be flushed to a
cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
userspace to call fsync() to ensure data-writes have reached a
power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
REQ_FLUSH to the pmem driver which will turn around and fence previous
writes with an "sfence".

Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
that guarantee that the destination buffer is not dirty in the cpu cache
on completion. The new copy_from_iter_wt and sub-routines will be used
to replace the "pmem api" (include/linux/pmem.h +
arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
otherwise.

This is meant to satisfy the concern from Linus that if a driver wants
to do something beyond the normal nocache semantics it should be
something private to that driver [1], and Al's concern that anything
uaccess related belongs with the rest of the uaccess code [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html

Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since the initial RFC:
* s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
  etc. (Ingo)

 arch/x86/Kconfig                  |    1 
 arch/x86/include/asm/string_64.h  |    5 +
 arch/x86/include/asm/uaccess_64.h |   11 +++
 arch/x86/lib/usercopy_64.c        |  128 +++++++++++++++++++++++++++++++++++++
 drivers/acpi/nfit/core.c          |    3 -
 drivers/nvdimm/claim.c            |    2 -
 drivers/nvdimm/pmem.c             |   13 +++-
 drivers/nvdimm/region_devs.c      |    4 +
 include/linux/dax.h               |    3 +
 include/linux/string.h            |    6 ++
 include/linux/uio.h               |   15 ++++
 lib/Kconfig                       |    3 +
 lib/iov_iter.c                    |   21 ++++++
 13 files changed, 208 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d50fdff77ee..398117923b1c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MMIO_FLUSH
 	select ARCH_HAS_PMEM_API		if X86_64
+	select ARCH_HAS_UACCESS_WT		if X86_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 733bae07fb29..dfbd66b11c72 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -109,6 +109,11 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
 	return 0;
 }
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+#define __HAVE_ARCH_MEMCPY_WT 1
+void memcpy_wt(void *dst, const void *src, size_t cnt);
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c5504b9a472e..07ded30c7e89 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -171,6 +171,10 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
 extern long __copy_user_nocache(void *dst, const void __user *src,
 				unsigned size, int zerorest);
 
+extern long __copy_user_wt(void *dst, const void __user *src, unsigned size);
+extern void memcpy_page_wt(char *to, struct page *page, size_t offset,
+			   size_t len);
+
 static inline int
 __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 				  unsigned size)
@@ -179,6 +183,13 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 	return __copy_user_nocache(dst, src, size, 0);
 }
 
+static inline int
+__copy_from_user_inatomic_wt(void *dst, const void __user *src, unsigned size)
+{
+	kasan_check_write(dst, size);
+	return __copy_user_wt(dst, src, size);
+}
+
 unsigned long
 copy_user_handle_tail(char *to, char *from, unsigned len);
 
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 3b7c40a2e3e1..0aeff66a022f 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -7,6 +7,7 @@
  */
 #include <linux/export.h>
 #include <linux/uaccess.h>
+#include <linux/highmem.h>
 
 /*
  * Zero Userspace
@@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
 	clac();
 	return len;
 }
+
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+/**
+ * clean_cache_range - write back a cache range with CLWB
+ * @vaddr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back a cache range using the CLWB (cache line write back)
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
+ */
+static void clean_cache_range(void *addr, size_t size)
+{
+	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
+	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vend = addr + size;
+	void *p;
+
+	for (p = (void *)((unsigned long)addr & ~clflush_mask);
+	     p < vend; p += x86_clflush_size)
+		clwb(p);
+}
+
+long __copy_user_wt(void *dst, const void __user *src, unsigned size)
+{
+	unsigned long flushed, dest = (unsigned long) dst;
+	long rc = __copy_user_nocache(dst, src, size, 0);
+
+	/*
+	 * __copy_user_nocache() uses non-temporal stores for the bulk
+	 * of the transfer, but we need to manually flush if the
+	 * transfer is unaligned. A cached memory copy is used when
+	 * destination or size is not naturally aligned. That is:
+	 *   - Require 8-byte alignment when size is 8 bytes or larger.
+	 *   - Require 4-byte alignment when size is 4 bytes.
+	 */
+	if (size < 8) {
+		if (!IS_ALIGNED(dest, 4) || size != 4)
+			clean_cache_range(dst, 1);
+	} else {
+		if (!IS_ALIGNED(dest, 8)) {
+			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+			clean_cache_range(dst, 1);
+		}
+
+		flushed = dest - (unsigned long) dst;
+		if (size > flushed && !IS_ALIGNED(size - flushed, 8))
+			clean_cache_range(dst + size - 1, 1);
+	}
+
+	return rc;
+}
+
+void memcpy_wt(void *_dst, const void *_src, size_t size)
+{
+	unsigned long dest = (unsigned long) _dst;
+	unsigned long source = (unsigned long) _src;
+
+	/* cache copy and flush to align dest */
+	if (!IS_ALIGNED(dest, 8)) {
+		unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
+
+		memcpy((void *) dest, (void *) source, len);
+		clean_cache_range((void *) dest, len);
+		dest += len;
+		source += len;
+		size -= len;
+		if (!size)
+			return;
+	}
+
+	/* 4x8 movnti loop */
+	while (size >= 32) {
+		asm("movq    (%0), %%r8\n"
+		    "movq   8(%0), %%r9\n"
+		    "movq  16(%0), %%r10\n"
+		    "movq  24(%0), %%r11\n"
+		    "movnti  %%r8,   (%1)\n"
+		    "movnti  %%r9,  8(%1)\n"
+		    "movnti %%r10, 16(%1)\n"
+		    "movnti %%r11, 24(%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8", "r9", "r10", "r11");
+		dest += 32;
+		source += 32;
+		size -= 32;
+	}
+
+	/* 1x8 movnti loop */
+	while (size >= 8) {
+		asm("movq    (%0), %%r8\n"
+		    "movnti  %%r8,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 8;
+		source += 8;
+		size -= 8;
+	}
+
+	/* 1x4 movnti loop */
+	while (size >= 4) {
+		asm("movl    (%0), %%r8d\n"
+		    "movnti  %%r8d,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 4;
+		source += 4;
+		size -= 4;
+	}
+
+	/* cache copy for remaining bytes */
+	if (size) {
+		memcpy((void *) dest, (void *) source, size);
+		clean_cache_range((void *) dest, size);
+	}
+}
+EXPORT_SYMBOL_GPL(memcpy_wt);
+
+void memcpy_page_wt(char *to, struct page *page, size_t offset,
+		size_t len)
+{
+	char *from = kmap_atomic(page);
+
+	memcpy_wt(to, from + offset, len);
+	kunmap_atomic(from);
+}
+#endif
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index d0c07b2344e4..be9bba609f26 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1776,8 +1776,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 		}
 
 		if (rw)
-			memcpy_to_pmem(mmio->addr.aperture + offset,
-					iobuf + copied, c);
+			memcpy_wt(mmio->addr.aperture + offset, iobuf + copied, c);
 		else {
 			if (nfit_blk->dimm_flags & NFIT_BLK_READ_FLUSH)
 				mmio_flush_range((void __force *)
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 3a35e8028b9c..864ed42baaf0 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -266,7 +266,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 			rc = -EIO;
 	}
 
-	memcpy_to_pmem(nsio->addr + offset, buf, size);
+	memcpy_wt(nsio->addr + offset, buf, size);
 	nvdimm_flush(to_nd_region(ndns->dev.parent));
 
 	return rc;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..4be8f30de9b3 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
 #include <linux/pfn_t.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
+#include <linux/uio.h>
 #include <linux/dax.h>
 #include <linux/nd.h>
 #include "pmem.h"
@@ -79,7 +80,7 @@ static void write_pmem(void *pmem_addr, struct page *page,
 {
 	void *mem = kmap_atomic(page);
 
-	memcpy_to_pmem(pmem_addr, mem + off, len);
+	memcpy_wt(pmem_addr, mem + off, len);
 	kunmap_atomic(mem);
 }
 
@@ -234,8 +235,15 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	return copy_from_iter_wt(addr, bytes, i);
+}
+
 static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
+	.copy_from_iter = pmem_copy_from_iter,
 };
 
 static void pmem_release_queue(void *q)
@@ -288,7 +296,8 @@ static int pmem_attach_disk(struct device *dev,
 	dev_set_drvdata(dev, pmem);
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
-	if (nvdimm_has_flush(nd_region) < 0)
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WT)
+			|| nvdimm_has_flush(nd_region) < 0)
 		dev_warn(dev, "unable to guarantee persistence of writes\n");
 
 	if (!devm_request_mem_region(dev, res->start, resource_size(res),
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b7cb5066d961..016af2a6694d 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -947,8 +947,8 @@ void nvdimm_flush(struct nd_region *nd_region)
 	 * The first wmb() is needed to 'sfence' all previous writes
 	 * such that they are architecturally visible for the platform
 	 * buffer flush.  Note that we've already arranged for pmem
-	 * writes to avoid the cache via arch_memcpy_to_pmem().  The
-	 * final wmb() ensures ordering for the NVDIMM flush write.
+	 * writes to avoid the cache via memcpy_wt().  The final wmb()
+	 * ensures ordering for the NVDIMM flush write.
 	 */
 	wmb();
 	for (i = 0; i < nd_region->ndr_mappings; i++)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..156f067d4db5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
 	 */
 	long (*direct_access)(struct dax_device *, pgoff_t, long,
 			void **, pfn_t *);
+	/* copy_from_iter: dax-driver override for default copy_from_iter */
+	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+			struct iov_iter *);
 };
 
 int dax_read_lock(void);
diff --git a/include/linux/string.h b/include/linux/string.h
index 9d6f189157e2..245e0a29b7e5 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -122,6 +122,12 @@ static inline __must_check int memcpy_mcsafe(void *dst, const void *src,
 	return 0;
 }
 #endif
+#ifndef __HAVE_ARCH_MEMCPY_WT
+static inline void memcpy_wt(void *dst, const void *src, size_t cnt)
+{
+	memcpy(dst, src, cnt);
+}
+#endif
 void *memchr_inv(const void *s, int c, size_t n);
 char *strreplace(char *s, char old, char new);
 
diff --git a/include/linux/uio.h b/include/linux/uio.h
index f2d36a3d3005..30c43aa371b5 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -95,6 +95,21 @@ size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
 bool copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+/*
+ * Note, users like pmem that depend on the stricter semantics of
+ * copy_from_iter_wt() than copy_from_iter_nocache() must check
+ * for IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WRITETHROUGH) before assuming
+ * that the destination is flushed from the cache on return.
+ */
+size_t copy_from_iter_wt(void *addr, size_t bytes, struct iov_iter *i);
+#else
+static inline size_t copy_from_iter_wt(void *addr, size_t bytes,
+				       struct iov_iter *i)
+{
+	return copy_from_iter_nocache(addr, bytes, i);
+}
+#endif
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i);
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..f0752a7a9001 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -548,6 +548,9 @@ config ARCH_HAS_SG_CHAIN
 config ARCH_HAS_PMEM_API
 	bool
 
+config ARCH_HAS_UACCESS_WT
+	bool
+
 config ARCH_HAS_MMIO_FLUSH
 	bool
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f7c93568ec99..19ab9af091f9 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -615,6 +615,27 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(copy_from_iter_nocache);
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+size_t copy_from_iter_wt(void *addr, size_t bytes, struct iov_iter *i)
+{
+	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
+	iterate_and_advance(i, bytes, v,
+		__copy_from_user_inatomic_wt((to += v.iov_len) - v.iov_len,
+					 v.iov_base, v.iov_len),
+		memcpy_page_wt((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len),
+		memcpy_wt((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+	)
+
+	return bytes;
+}
+EXPORT_SYMBOL_GPL(copy_from_iter_wt);
+#endif
+
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-04-28 19:39       ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-04-28 19:39 UTC (permalink / raw)
  To: viro
  Cc: Jan Kara, Matthew Wilcox, x86, linux-kernel, hch, linux-block,
	linux-nvdimm, jmoyer, Ingo Molnar, H. Peter Anvin, linux-fsdevel,
	Thomas Gleixner, ross.zwisler

The pmem driver has a need to transfer data with a persistent memory
destination and be able to rely on the fact that the destination writes
are not cached. It is sufficient for the writes to be flushed to a
cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
userspace to call fsync() to ensure data-writes have reached a
power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
REQ_FLUSH to the pmem driver which will turn around and fence previous
writes with an "sfence".

Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
that guarantee that the destination buffer is not dirty in the cpu cache
on completion. The new copy_from_iter_wt and sub-routines will be used
to replace the "pmem api" (include/linux/pmem.h +
arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
otherwise.

This is meant to satisfy the concern from Linus that if a driver wants
to do something beyond the normal nocache semantics it should be
something private to that driver [1], and Al's concern that anything
uaccess related belongs with the rest of the uaccess code [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html

Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since the initial RFC:
* s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
  etc. (Ingo)

 arch/x86/Kconfig                  |    1 
 arch/x86/include/asm/string_64.h  |    5 +
 arch/x86/include/asm/uaccess_64.h |   11 +++
 arch/x86/lib/usercopy_64.c        |  128 +++++++++++++++++++++++++++++++++++++
 drivers/acpi/nfit/core.c          |    3 -
 drivers/nvdimm/claim.c            |    2 -
 drivers/nvdimm/pmem.c             |   13 +++-
 drivers/nvdimm/region_devs.c      |    4 +
 include/linux/dax.h               |    3 +
 include/linux/string.h            |    6 ++
 include/linux/uio.h               |   15 ++++
 lib/Kconfig                       |    3 +
 lib/iov_iter.c                    |   21 ++++++
 13 files changed, 208 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d50fdff77ee..398117923b1c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MMIO_FLUSH
 	select ARCH_HAS_PMEM_API		if X86_64
+	select ARCH_HAS_UACCESS_WT		if X86_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 733bae07fb29..dfbd66b11c72 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -109,6 +109,11 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
 	return 0;
 }
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+#define __HAVE_ARCH_MEMCPY_WT 1
+void memcpy_wt(void *dst, const void *src, size_t cnt);
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c5504b9a472e..07ded30c7e89 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -171,6 +171,10 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
 extern long __copy_user_nocache(void *dst, const void __user *src,
 				unsigned size, int zerorest);
 
+extern long __copy_user_wt(void *dst, const void __user *src, unsigned size);
+extern void memcpy_page_wt(char *to, struct page *page, size_t offset,
+			   size_t len);
+
 static inline int
 __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 				  unsigned size)
@@ -179,6 +183,13 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 	return __copy_user_nocache(dst, src, size, 0);
 }
 
+static inline int
+__copy_from_user_inatomic_wt(void *dst, const void __user *src, unsigned size)
+{
+	kasan_check_write(dst, size);
+	return __copy_user_wt(dst, src, size);
+}
+
 unsigned long
 copy_user_handle_tail(char *to, char *from, unsigned len);
 
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 3b7c40a2e3e1..0aeff66a022f 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -7,6 +7,7 @@
  */
 #include <linux/export.h>
 #include <linux/uaccess.h>
+#include <linux/highmem.h>
 
 /*
  * Zero Userspace
@@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
 	clac();
 	return len;
 }
+
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+/**
+ * clean_cache_range - write back a cache range with CLWB
+ * @vaddr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back a cache range using the CLWB (cache line write back)
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
+ */
+static void clean_cache_range(void *addr, size_t size)
+{
+	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
+	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vend = addr + size;
+	void *p;
+
+	for (p = (void *)((unsigned long)addr & ~clflush_mask);
+	     p < vend; p += x86_clflush_size)
+		clwb(p);
+}
+
+long __copy_user_wt(void *dst, const void __user *src, unsigned size)
+{
+	unsigned long flushed, dest = (unsigned long) dst;
+	long rc = __copy_user_nocache(dst, src, size, 0);
+
+	/*
+	 * __copy_user_nocache() uses non-temporal stores for the bulk
+	 * of the transfer, but we need to manually flush if the
+	 * transfer is unaligned. A cached memory copy is used when
+	 * destination or size is not naturally aligned. That is:
+	 *   - Require 8-byte alignment when size is 8 bytes or larger.
+	 *   - Require 4-byte alignment when size is 4 bytes.
+	 */
+	if (size < 8) {
+		if (!IS_ALIGNED(dest, 4) || size != 4)
+			clean_cache_range(dst, 1);
+	} else {
+		if (!IS_ALIGNED(dest, 8)) {
+			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+			clean_cache_range(dst, 1);
+		}
+
+		flushed = dest - (unsigned long) dst;
+		if (size > flushed && !IS_ALIGNED(size - flushed, 8))
+			clean_cache_range(dst + size - 1, 1);
+	}
+
+	return rc;
+}
+
+void memcpy_wt(void *_dst, const void *_src, size_t size)
+{
+	unsigned long dest = (unsigned long) _dst;
+	unsigned long source = (unsigned long) _src;
+
+	/* cache copy and flush to align dest */
+	if (!IS_ALIGNED(dest, 8)) {
+		unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
+
+		memcpy((void *) dest, (void *) source, len);
+		clean_cache_range((void *) dest, len);
+		dest += len;
+		source += len;
+		size -= len;
+		if (!size)
+			return;
+	}
+
+	/* 4x8 movnti loop */
+	while (size >= 32) {
+		asm("movq    (%0), %%r8\n"
+		    "movq   8(%0), %%r9\n"
+		    "movq  16(%0), %%r10\n"
+		    "movq  24(%0), %%r11\n"
+		    "movnti  %%r8,   (%1)\n"
+		    "movnti  %%r9,  8(%1)\n"
+		    "movnti %%r10, 16(%1)\n"
+		    "movnti %%r11, 24(%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8", "r9", "r10", "r11");
+		dest += 32;
+		source += 32;
+		size -= 32;
+	}
+
+	/* 1x8 movnti loop */
+	while (size >= 8) {
+		asm("movq    (%0), %%r8\n"
+		    "movnti  %%r8,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 8;
+		source += 8;
+		size -= 8;
+	}
+
+	/* 1x4 movnti loop */
+	while (size >= 4) {
+		asm("movl    (%0), %%r8d\n"
+		    "movnti  %%r8d,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 4;
+		source += 4;
+		size -= 4;
+	}
+
+	/* cache copy for remaining bytes */
+	if (size) {
+		memcpy((void *) dest, (void *) source, size);
+		clean_cache_range((void *) dest, size);
+	}
+}
+EXPORT_SYMBOL_GPL(memcpy_wt);
+
+void memcpy_page_wt(char *to, struct page *page, size_t offset,
+		size_t len)
+{
+	char *from = kmap_atomic(page);
+
+	memcpy_wt(to, from + offset, len);
+	kunmap_atomic(from);
+}
+#endif
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index d0c07b2344e4..be9bba609f26 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1776,8 +1776,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 		}
 
 		if (rw)
-			memcpy_to_pmem(mmio->addr.aperture + offset,
-					iobuf + copied, c);
+			memcpy_wt(mmio->addr.aperture + offset, iobuf + copied, c);
 		else {
 			if (nfit_blk->dimm_flags & NFIT_BLK_READ_FLUSH)
 				mmio_flush_range((void __force *)
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 3a35e8028b9c..864ed42baaf0 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -266,7 +266,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 			rc = -EIO;
 	}
 
-	memcpy_to_pmem(nsio->addr + offset, buf, size);
+	memcpy_wt(nsio->addr + offset, buf, size);
 	nvdimm_flush(to_nd_region(ndns->dev.parent));
 
 	return rc;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..4be8f30de9b3 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
 #include <linux/pfn_t.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
+#include <linux/uio.h>
 #include <linux/dax.h>
 #include <linux/nd.h>
 #include "pmem.h"
@@ -79,7 +80,7 @@ static void write_pmem(void *pmem_addr, struct page *page,
 {
 	void *mem = kmap_atomic(page);
 
-	memcpy_to_pmem(pmem_addr, mem + off, len);
+	memcpy_wt(pmem_addr, mem + off, len);
 	kunmap_atomic(mem);
 }
 
@@ -234,8 +235,15 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	return copy_from_iter_wt(addr, bytes, i);
+}
+
 static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
+	.copy_from_iter = pmem_copy_from_iter,
 };
 
 static void pmem_release_queue(void *q)
@@ -288,7 +296,8 @@ static int pmem_attach_disk(struct device *dev,
 	dev_set_drvdata(dev, pmem);
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
-	if (nvdimm_has_flush(nd_region) < 0)
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WT)
+			|| nvdimm_has_flush(nd_region) < 0)
 		dev_warn(dev, "unable to guarantee persistence of writes\n");
 
 	if (!devm_request_mem_region(dev, res->start, resource_size(res),
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b7cb5066d961..016af2a6694d 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -947,8 +947,8 @@ void nvdimm_flush(struct nd_region *nd_region)
 	 * The first wmb() is needed to 'sfence' all previous writes
 	 * such that they are architecturally visible for the platform
 	 * buffer flush.  Note that we've already arranged for pmem
-	 * writes to avoid the cache via arch_memcpy_to_pmem().  The
-	 * final wmb() ensures ordering for the NVDIMM flush write.
+	 * writes to avoid the cache via memcpy_wt().  The final wmb()
+	 * ensures ordering for the NVDIMM flush write.
 	 */
 	wmb();
 	for (i = 0; i < nd_region->ndr_mappings; i++)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..156f067d4db5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
 	 */
 	long (*direct_access)(struct dax_device *, pgoff_t, long,
 			void **, pfn_t *);
+	/* copy_from_iter: dax-driver override for default copy_from_iter */
+	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+			struct iov_iter *);
 };
 
 int dax_read_lock(void);
diff --git a/include/linux/string.h b/include/linux/string.h
index 9d6f189157e2..245e0a29b7e5 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -122,6 +122,12 @@ static inline __must_check int memcpy_mcsafe(void *dst, const void *src,
 	return 0;
 }
 #endif
+#ifndef __HAVE_ARCH_MEMCPY_WT
+static inline void memcpy_wt(void *dst, const void *src, size_t cnt)
+{
+	memcpy(dst, src, cnt);
+}
+#endif
 void *memchr_inv(const void *s, int c, size_t n);
 char *strreplace(char *s, char old, char new);
 
diff --git a/include/linux/uio.h b/include/linux/uio.h
index f2d36a3d3005..30c43aa371b5 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -95,6 +95,21 @@ size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
 bool copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+/*
+ * Note, users like pmem that depend on the stricter semantics of
+ * copy_from_iter_wt() than copy_from_iter_nocache() must check
+ * for IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WRITETHROUGH) before assuming
+ * that the destination is flushed from the cache on return.
+ */
+size_t copy_from_iter_wt(void *addr, size_t bytes, struct iov_iter *i);
+#else
+static inline size_t copy_from_iter_wt(void *addr, size_t bytes,
+				       struct iov_iter *i)
+{
+	return copy_from_iter_nocache(addr, bytes, i);
+}
+#endif
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i);
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..f0752a7a9001 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -548,6 +548,9 @@ config ARCH_HAS_SG_CHAIN
 config ARCH_HAS_PMEM_API
 	bool
 
+config ARCH_HAS_UACCESS_WT
+	bool
+
 config ARCH_HAS_MMIO_FLUSH
 	bool
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f7c93568ec99..19ab9af091f9 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -615,6 +615,27 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(copy_from_iter_nocache);
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+size_t copy_from_iter_wt(void *addr, size_t bytes, struct iov_iter *i)
+{
+	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
+	iterate_and_advance(i, bytes, v,
+		__copy_from_user_inatomic_wt((to += v.iov_len) - v.iov_len,
+					 v.iov_base, v.iov_len),
+		memcpy_page_wt((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len),
+		memcpy_wt((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+	)
+
+	return bytes;
+}
+EXPORT_SYMBOL_GPL(copy_from_iter_wt);
+#endif
+
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-04-28 19:39       ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-04-28 19:39 UTC (permalink / raw)
  To: viro
  Cc: Jan Kara, Matthew Wilcox, x86, linux-kernel, hch, linux-block,
	linux-nvdimm, jmoyer, Ingo Molnar, H. Peter Anvin, linux-fsdevel,
	Thomas Gleixner, ross.zwisler

The pmem driver has a need to transfer data with a persistent memory
destination and be able to rely on the fact that the destination writes
are not cached. It is sufficient for the writes to be flushed to a
cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
userspace to call fsync() to ensure data-writes have reached a
power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
REQ_FLUSH to the pmem driver which will turn around and fence previous
writes with an "sfence".

Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
that guarantee that the destination buffer is not dirty in the cpu cache
on completion. The new copy_from_iter_wt and sub-routines will be used
to replace the "pmem api" (include/linux/pmem.h +
arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
otherwise.

This is meant to satisfy the concern from Linus that if a driver wants
to do something beyond the normal nocache semantics it should be
something private to that driver [1], and Al's concern that anything
uaccess related belongs with the rest of the uaccess code [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html

Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since the initial RFC:
* s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
  etc. (Ingo)

 arch/x86/Kconfig                  |    1 
 arch/x86/include/asm/string_64.h  |    5 +
 arch/x86/include/asm/uaccess_64.h |   11 +++
 arch/x86/lib/usercopy_64.c        |  128 +++++++++++++++++++++++++++++++++++++
 drivers/acpi/nfit/core.c          |    3 -
 drivers/nvdimm/claim.c            |    2 -
 drivers/nvdimm/pmem.c             |   13 +++-
 drivers/nvdimm/region_devs.c      |    4 +
 include/linux/dax.h               |    3 +
 include/linux/string.h            |    6 ++
 include/linux/uio.h               |   15 ++++
 lib/Kconfig                       |    3 +
 lib/iov_iter.c                    |   21 ++++++
 13 files changed, 208 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d50fdff77ee..398117923b1c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MMIO_FLUSH
 	select ARCH_HAS_PMEM_API		if X86_64
+	select ARCH_HAS_UACCESS_WT		if X86_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 733bae07fb29..dfbd66b11c72 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -109,6 +109,11 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
 	return 0;
 }
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+#define __HAVE_ARCH_MEMCPY_WT 1
+void memcpy_wt(void *dst, const void *src, size_t cnt);
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c5504b9a472e..07ded30c7e89 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -171,6 +171,10 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
 extern long __copy_user_nocache(void *dst, const void __user *src,
 				unsigned size, int zerorest);
 
+extern long __copy_user_wt(void *dst, const void __user *src, unsigned size);
+extern void memcpy_page_wt(char *to, struct page *page, size_t offset,
+			   size_t len);
+
 static inline int
 __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 				  unsigned size)
@@ -179,6 +183,13 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 	return __copy_user_nocache(dst, src, size, 0);
 }
 
+static inline int
+__copy_from_user_inatomic_wt(void *dst, const void __user *src, unsigned size)
+{
+	kasan_check_write(dst, size);
+	return __copy_user_wt(dst, src, size);
+}
+
 unsigned long
 copy_user_handle_tail(char *to, char *from, unsigned len);
 
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 3b7c40a2e3e1..0aeff66a022f 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -7,6 +7,7 @@
  */
 #include <linux/export.h>
 #include <linux/uaccess.h>
+#include <linux/highmem.h>
 
 /*
  * Zero Userspace
@@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
 	clac();
 	return len;
 }
+
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+/**
+ * clean_cache_range - write back a cache range with CLWB
+ * @vaddr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back a cache range using the CLWB (cache line write back)
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
+ */
+static void clean_cache_range(void *addr, size_t size)
+{
+	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
+	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vend = addr + size;
+	void *p;
+
+	for (p = (void *)((unsigned long)addr & ~clflush_mask);
+	     p < vend; p += x86_clflush_size)
+		clwb(p);
+}
+
+long __copy_user_wt(void *dst, const void __user *src, unsigned size)
+{
+	unsigned long flushed, dest = (unsigned long) dst;
+	long rc = __copy_user_nocache(dst, src, size, 0);
+
+	/*
+	 * __copy_user_nocache() uses non-temporal stores for the bulk
+	 * of the transfer, but we need to manually flush if the
+	 * transfer is unaligned. A cached memory copy is used when
+	 * destination or size is not naturally aligned. That is:
+	 *   - Require 8-byte alignment when size is 8 bytes or larger.
+	 *   - Require 4-byte alignment when size is 4 bytes.
+	 */
+	if (size < 8) {
+		if (!IS_ALIGNED(dest, 4) || size != 4)
+			clean_cache_range(dst, 1);
+	} else {
+		if (!IS_ALIGNED(dest, 8)) {
+			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+			clean_cache_range(dst, 1);
+		}
+
+		flushed = dest - (unsigned long) dst;
+		if (size > flushed && !IS_ALIGNED(size - flushed, 8))
+			clean_cache_range(dst + size - 1, 1);
+	}
+
+	return rc;
+}
+
+void memcpy_wt(void *_dst, const void *_src, size_t size)
+{
+	unsigned long dest = (unsigned long) _dst;
+	unsigned long source = (unsigned long) _src;
+
+	/* cache copy and flush to align dest */
+	if (!IS_ALIGNED(dest, 8)) {
+		unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
+
+		memcpy((void *) dest, (void *) source, len);
+		clean_cache_range((void *) dest, len);
+		dest += len;
+		source += len;
+		size -= len;
+		if (!size)
+			return;
+	}
+
+	/* 4x8 movnti loop */
+	while (size >= 32) {
+		asm("movq    (%0), %%r8\n"
+		    "movq   8(%0), %%r9\n"
+		    "movq  16(%0), %%r10\n"
+		    "movq  24(%0), %%r11\n"
+		    "movnti  %%r8,   (%1)\n"
+		    "movnti  %%r9,  8(%1)\n"
+		    "movnti %%r10, 16(%1)\n"
+		    "movnti %%r11, 24(%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8", "r9", "r10", "r11");
+		dest += 32;
+		source += 32;
+		size -= 32;
+	}
+
+	/* 1x8 movnti loop */
+	while (size >= 8) {
+		asm("movq    (%0), %%r8\n"
+		    "movnti  %%r8,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 8;
+		source += 8;
+		size -= 8;
+	}
+
+	/* 1x4 movnti loop */
+	while (size >= 4) {
+		asm("movl    (%0), %%r8d\n"
+		    "movnti  %%r8d,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 4;
+		source += 4;
+		size -= 4;
+	}
+
+	/* cache copy for remaining bytes */
+	if (size) {
+		memcpy((void *) dest, (void *) source, size);
+		clean_cache_range((void *) dest, size);
+	}
+}
+EXPORT_SYMBOL_GPL(memcpy_wt);
+
+void memcpy_page_wt(char *to, struct page *page, size_t offset,
+		size_t len)
+{
+	char *from = kmap_atomic(page);
+
+	memcpy_wt(to, from + offset, len);
+	kunmap_atomic(from);
+}
+#endif
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index d0c07b2344e4..be9bba609f26 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1776,8 +1776,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 		}
 
 		if (rw)
-			memcpy_to_pmem(mmio->addr.aperture + offset,
-					iobuf + copied, c);
+			memcpy_wt(mmio->addr.aperture + offset, iobuf + copied, c);
 		else {
 			if (nfit_blk->dimm_flags & NFIT_BLK_READ_FLUSH)
 				mmio_flush_range((void __force *)
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 3a35e8028b9c..864ed42baaf0 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -266,7 +266,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 			rc = -EIO;
 	}
 
-	memcpy_to_pmem(nsio->addr + offset, buf, size);
+	memcpy_wt(nsio->addr + offset, buf, size);
 	nvdimm_flush(to_nd_region(ndns->dev.parent));
 
 	return rc;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..4be8f30de9b3 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
 #include <linux/pfn_t.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
+#include <linux/uio.h>
 #include <linux/dax.h>
 #include <linux/nd.h>
 #include "pmem.h"
@@ -79,7 +80,7 @@ static void write_pmem(void *pmem_addr, struct page *page,
 {
 	void *mem = kmap_atomic(page);
 
-	memcpy_to_pmem(pmem_addr, mem + off, len);
+	memcpy_wt(pmem_addr, mem + off, len);
 	kunmap_atomic(mem);
 }
 
@@ -234,8 +235,15 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	return copy_from_iter_wt(addr, bytes, i);
+}
+
 static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
+	.copy_from_iter = pmem_copy_from_iter,
 };
 
 static void pmem_release_queue(void *q)
@@ -288,7 +296,8 @@ static int pmem_attach_disk(struct device *dev,
 	dev_set_drvdata(dev, pmem);
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
-	if (nvdimm_has_flush(nd_region) < 0)
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WT)
+			|| nvdimm_has_flush(nd_region) < 0)
 		dev_warn(dev, "unable to guarantee persistence of writes\n");
 
 	if (!devm_request_mem_region(dev, res->start, resource_size(res),
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b7cb5066d961..016af2a6694d 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -947,8 +947,8 @@ void nvdimm_flush(struct nd_region *nd_region)
 	 * The first wmb() is needed to 'sfence' all previous writes
 	 * such that they are architecturally visible for the platform
 	 * buffer flush.  Note that we've already arranged for pmem
-	 * writes to avoid the cache via arch_memcpy_to_pmem().  The
-	 * final wmb() ensures ordering for the NVDIMM flush write.
+	 * writes to avoid the cache via memcpy_wt().  The final wmb()
+	 * ensures ordering for the NVDIMM flush write.
 	 */
 	wmb();
 	for (i = 0; i < nd_region->ndr_mappings; i++)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..156f067d4db5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
 	 */
 	long (*direct_access)(struct dax_device *, pgoff_t, long,
 			void **, pfn_t *);
+	/* copy_from_iter: dax-driver override for default copy_from_iter */
+	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+			struct iov_iter *);
 };
 
 int dax_read_lock(void);
diff --git a/include/linux/string.h b/include/linux/string.h
index 9d6f189157e2..245e0a29b7e5 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -122,6 +122,12 @@ static inline __must_check int memcpy_mcsafe(void *dst, const void *src,
 	return 0;
 }
 #endif
+#ifndef __HAVE_ARCH_MEMCPY_WT
+static inline void memcpy_wt(void *dst, const void *src, size_t cnt)
+{
+	memcpy(dst, src, cnt);
+}
+#endif
 void *memchr_inv(const void *s, int c, size_t n);
 char *strreplace(char *s, char old, char new);
 
diff --git a/include/linux/uio.h b/include/linux/uio.h
index f2d36a3d3005..30c43aa371b5 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -95,6 +95,21 @@ size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
 bool copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+/*
+ * Note, users like pmem that depend on the stricter semantics of
+ * copy_from_iter_wt() than copy_from_iter_nocache() must check
+ * for IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WRITETHROUGH) before assuming
+ * that the destination is flushed from the cache on return.
+ */
+size_t copy_from_iter_wt(void *addr, size_t bytes, struct iov_iter *i);
+#else
+static inline size_t copy_from_iter_wt(void *addr, size_t bytes,
+				       struct iov_iter *i)
+{
+	return copy_from_iter_nocache(addr, bytes, i);
+}
+#endif
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i);
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..f0752a7a9001 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -548,6 +548,9 @@ config ARCH_HAS_SG_CHAIN
 config ARCH_HAS_PMEM_API
 	bool
 
+config ARCH_HAS_UACCESS_WT
+	bool
+
 config ARCH_HAS_MMIO_FLUSH
 	bool
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f7c93568ec99..19ab9af091f9 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -615,6 +615,27 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(copy_from_iter_nocache);
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+size_t copy_from_iter_wt(void *addr, size_t bytes, struct iov_iter *i)
+{
+	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
+	iterate_and_advance(i, bytes, v,
+		__copy_from_user_inatomic_wt((to += v.iov_len) - v.iov_len,
+					 v.iov_base, v.iov_len),
+		memcpy_page_wt((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len),
+		memcpy_wt((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+	)
+
+	return bytes;
+}
+EXPORT_SYMBOL_GPL(copy_from_iter_wt);
+#endif
+
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-04-28 19:39       ` Dan Williams
  (?)
@ 2017-05-05  6:54         ` Ingo Molnar
  -1 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2017-05-05  6:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Matthew Wilcox, x86, linux-kernel, linux-block,
	linux-nvdimm, Ingo Molnar, viro, H. Peter Anvin, linux-fsdevel,
	Thomas Gleixner, hch


* Dan Williams <dan.j.williams@intel.com> wrote:

> The pmem driver has a need to transfer data with a persistent memory
> destination and be able to rely on the fact that the destination writes
> are not cached. It is sufficient for the writes to be flushed to a
> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
> userspace to call fsync() to ensure data-writes have reached a
> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
> REQ_FLUSH to the pmem driver which will turn around and fence previous
> writes with an "sfence".
> 
> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
> that guarantee that the destination buffer is not dirty in the cpu cache
> on completion. The new copy_from_iter_wt and sub-routines will be used
> to replace the "pmem api" (include/linux/pmem.h +
> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
> otherwise.
> 
> This is meant to satisfy the concern from Linus that if a driver wants
> to do something beyond the normal nocache semantics it should be
> something private to that driver [1], and Al's concern that anything
> uaccess related belongs with the rest of the uaccess code [2].
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
> 
> Cc: <x86@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Changes since the initial RFC:
> * s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
>   etc. (Ingo)

Looks good to me. I suspect you'd like to carry this in the nvdimm tree?

Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-05  6:54         ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2017-05-05  6:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: viro, Jan Kara, Matthew Wilcox, x86, linux-kernel, hch,
	linux-block, linux-nvdimm, jmoyer, Ingo Molnar, H. Peter Anvin,
	linux-fsdevel, Thomas Gleixner, ross.zwisler


* Dan Williams <dan.j.williams@intel.com> wrote:

> The pmem driver has a need to transfer data with a persistent memory
> destination and be able to rely on the fact that the destination writes
> are not cached. It is sufficient for the writes to be flushed to a
> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
> userspace to call fsync() to ensure data-writes have reached a
> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
> REQ_FLUSH to the pmem driver which will turn around and fence previous
> writes with an "sfence".
> 
> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
> that guarantee that the destination buffer is not dirty in the cpu cache
> on completion. The new copy_from_iter_wt and sub-routines will be used
> to replace the "pmem api" (include/linux/pmem.h +
> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
> otherwise.
> 
> This is meant to satisfy the concern from Linus that if a driver wants
> to do something beyond the normal nocache semantics it should be
> something private to that driver [1], and Al's concern that anything
> uaccess related belongs with the rest of the uaccess code [2].
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
> 
> Cc: <x86@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Changes since the initial RFC:
> * s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
>   etc. (Ingo)

Looks good to me. I suspect you'd like to carry this in the nvdimm tree?

Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-05  6:54         ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2017-05-05  6:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: viro, Jan Kara, Matthew Wilcox, x86, linux-kernel, hch,
	linux-block, linux-nvdimm, jmoyer, Ingo Molnar, H. Peter Anvin,
	linux-fsdevel, Thomas Gleixner, ross.zwisler


* Dan Williams <dan.j.williams@intel.com> wrote:

> The pmem driver has a need to transfer data with a persistent memory
> destination and be able to rely on the fact that the destination writes
> are not cached. It is sufficient for the writes to be flushed to a
> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
> userspace to call fsync() to ensure data-writes have reached a
> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
> REQ_FLUSH to the pmem driver which will turn around and fence previous
> writes with an "sfence".
> 
> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
> that guarantee that the destination buffer is not dirty in the cpu cache
> on completion. The new copy_from_iter_wt and sub-routines will be used
> to replace the "pmem api" (include/linux/pmem.h +
> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
> otherwise.
> 
> This is meant to satisfy the concern from Linus that if a driver wants
> to do something beyond the normal nocache semantics it should be
> something private to that driver [1], and Al's concern that anything
> uaccess related belongs with the rest of the uaccess code [2].
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
> 
> Cc: <x86@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Changes since the initial RFC:
> * s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
>   etc. (Ingo)

Looks good to me. I suspect you'd like to carry this in the nvdimm tree?

Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-05  6:54         ` Ingo Molnar
  (?)
@ 2017-05-05 14:12           ` Dan Williams
  -1 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-05 14:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Kara, Matthew Wilcox, X86 ML, linux-kernel, linux-block,
	linux-nvdimm, Ingo Molnar, Al Viro, H. Peter Anvin,
	linux-fsdevel, Thomas Gleixner, Christoph Hellwig

On Thu, May 4, 2017 at 11:54 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Dan Williams <dan.j.williams@intel.com> wrote:
>
>> The pmem driver has a need to transfer data with a persistent memory
>> destination and be able to rely on the fact that the destination writes
>> are not cached. It is sufficient for the writes to be flushed to a
>> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
>> userspace to call fsync() to ensure data-writes have reached a
>> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
>> REQ_FLUSH to the pmem driver which will turn around and fence previous
>> writes with an "sfence".
>>
>> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
>> that guarantee that the destination buffer is not dirty in the cpu cache
>> on completion. The new copy_from_iter_wt and sub-routines will be used
>> to replace the "pmem api" (include/linux/pmem.h +
>> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
>> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
>> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
>> otherwise.
>>
>> This is meant to satisfy the concern from Linus that if a driver wants
>> to do something beyond the normal nocache semantics it should be
>> something private to that driver [1], and Al's concern that anything
>> uaccess related belongs with the rest of the uaccess code [2].
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
>>
>> Cc: <x86@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>> Changes since the initial RFC:
>> * s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
>>   etc. (Ingo)
>
> Looks good to me. I suspect you'd like to carry this in the nvdimm tree?
>
> Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks, Ingo!. Yes, I'll carry it in nvdimm.git for 4.13.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-05 14:12           ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-05 14:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Al Viro, Jan Kara, Matthew Wilcox, X86 ML, linux-kernel,
	Christoph Hellwig, linux-block, linux-nvdimm, jmoyer,
	Ingo Molnar, H. Peter Anvin, linux-fsdevel, Thomas Gleixner,
	Ross Zwisler

On Thu, May 4, 2017 at 11:54 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Dan Williams <dan.j.williams@intel.com> wrote:
>
>> The pmem driver has a need to transfer data with a persistent memory
>> destination and be able to rely on the fact that the destination writes
>> are not cached. It is sufficient for the writes to be flushed to a
>> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
>> userspace to call fsync() to ensure data-writes have reached a
>> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
>> REQ_FLUSH to the pmem driver which will turn around and fence previous
>> writes with an "sfence".
>>
>> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
>> that guarantee that the destination buffer is not dirty in the cpu cache
>> on completion. The new copy_from_iter_wt and sub-routines will be used
>> to replace the "pmem api" (include/linux/pmem.h +
>> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
>> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
>> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
>> otherwise.
>>
>> This is meant to satisfy the concern from Linus that if a driver wants
>> to do something beyond the normal nocache semantics it should be
>> something private to that driver [1], and Al's concern that anything
>> uaccess related belongs with the rest of the uaccess code [2].
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
>>
>> Cc: <x86@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>> Changes since the initial RFC:
>> * s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
>>   etc. (Ingo)
>
> Looks good to me. I suspect you'd like to carry this in the nvdimm tree?
>
> Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks, Ingo!. Yes, I'll carry it in nvdimm.git for 4.13.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-05 14:12           ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-05 14:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Al Viro, Jan Kara, Matthew Wilcox, X86 ML, linux-kernel,
	Christoph Hellwig, linux-block, linux-nvdimm@lists.01.org,
	jmoyer, Ingo Molnar, H. Peter Anvin, linux-fsdevel,
	Thomas Gleixner, Ross Zwisler

On Thu, May 4, 2017 at 11:54 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Dan Williams <dan.j.williams@intel.com> wrote:
>
>> The pmem driver has a need to transfer data with a persistent memory
>> destination and be able to rely on the fact that the destination writes
>> are not cached. It is sufficient for the writes to be flushed to a
>> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
>> userspace to call fsync() to ensure data-writes have reached a
>> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
>> REQ_FLUSH to the pmem driver which will turn around and fence previous
>> writes with an "sfence".
>>
>> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
>> that guarantee that the destination buffer is not dirty in the cpu cache
>> on completion. The new copy_from_iter_wt and sub-routines will be used
>> to replace the "pmem api" (include/linux/pmem.h +
>> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
>> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
>> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
>> otherwise.
>>
>> This is meant to satisfy the concern from Linus that if a driver wants
>> to do something beyond the normal nocache semantics it should be
>> something private to that driver [1], and Al's concern that anything
>> uaccess related belongs with the rest of the uaccess code [2].
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
>>
>> Cc: <x86@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>> Changes since the initial RFC:
>> * s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
>>   etc. (Ingo)
>
> Looks good to me. I suspect you'd like to carry this in the nvdimm tree?
>
> Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks, Ingo!. Yes, I'll carry it in nvdimm.git for 4.13.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-04-28 19:39       ` Dan Williams
  (?)
  (?)
@ 2017-05-05 20:39         ` Kani, Toshimitsu
  -1 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-05 20:39 UTC (permalink / raw)
  To: dan.j.williams, viro
  Cc: jack, mawilcox, x86, linux-kernel, linux-block, linux-nvdimm,
	mingo, hpa, linux-fsdevel, tglx, hch

On Fri, 2017-04-28 at 12:39 -0700, Dan Williams wrote:
> The pmem driver has a need to transfer data with a persistent memory
> destination and be able to rely on the fact that the destination
> writes are not cached. It is sufficient for the writes to be flushed
> to a cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we
> expect userspace to call fsync() to ensure data-writes have reached a
> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA
> or REQ_FLUSH to the pmem driver which will turn around and fence
> previous writes with an "sfence".
> 
> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and
> memcpy_wt, that guarantee that the destination buffer is not dirty in
> the cpu cache on completion. The new copy_from_iter_wt and sub-
> routines will be used to replace the "pmem api" (include/linux/pmem.h
> + arch/x86/include/asm/pmem.h). The availability of
> copy_from_iter_wt() and memcpy_wt() are gated by the
> CONFIG_ARCH_HAS_UACCESS_WT config symbol, and fallback to
> copy_from_iter_nocache() and plain memcpy() otherwise.
> 
> This is meant to satisfy the concern from Linus that if a driver
> wants to do something beyond the normal nocache semantics it should
> be something private to that driver [1], and Al's concern that
> anything uaccess related belongs with the rest of the uaccess code
> [2].
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.
> html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.ht
> ml
> 
> Cc: <x86@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Changes since the initial RFC:
> * s/writethru/wt/ since we already have ioremap_wt(),
> set_memory_wt(), etc. (Ingo)

Sorry I should have said earlier, but I think the term "wt" is
misleading.  Non-temporal stores used in memcpy_wt() provide WC
semantics, not WT semantics.  How about using "nocache" as it's been
used in __copy_user_nocache()?

Thanks,
-Toshi
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-05 20:39         ` Kani, Toshimitsu
  0 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-05 20:39 UTC (permalink / raw)
  To: dan.j.williams, viro
  Cc: linux-kernel, linux-block, jmoyer, tglx, hch, x86, mawilcox, hpa,
	linux-nvdimm, mingo, linux-fsdevel, ross.zwisler, jack

T24gRnJpLCAyMDE3LTA0LTI4IGF0IDEyOjM5IC0wNzAwLCBEYW4gV2lsbGlhbXMgd3JvdGU6DQo+
IFRoZSBwbWVtIGRyaXZlciBoYXMgYSBuZWVkIHRvIHRyYW5zZmVyIGRhdGEgd2l0aCBhIHBlcnNp
c3RlbnQgbWVtb3J5DQo+IGRlc3RpbmF0aW9uIGFuZCBiZSBhYmxlIHRvIHJlbHkgb24gdGhlIGZh
Y3QgdGhhdCB0aGUgZGVzdGluYXRpb24NCj4gd3JpdGVzIGFyZSBub3QgY2FjaGVkLiBJdCBpcyBz
dWZmaWNpZW50IGZvciB0aGUgd3JpdGVzIHRvIGJlIGZsdXNoZWQNCj4gdG8gYSBjcHUtc3RvcmUt
YnVmZmVyIChub24tdGVtcG9yYWwgLyAibW92bnQiIGluIHg4NiB0ZXJtcyksIGFzIHdlDQo+IGV4
cGVjdCB1c2Vyc3BhY2UgdG8gY2FsbCBmc3luYygpIHRvIGVuc3VyZSBkYXRhLXdyaXRlcyBoYXZl
IHJlYWNoZWQgYQ0KPiBwb3dlci1mYWlsLXNhZmUgem9uZSBpbiB0aGUgcGxhdGZvcm0uIFRoZSBm
c3luYygpIHRyaWdnZXJzIGEgUkVRX0ZVQQ0KPiBvciBSRVFfRkxVU0ggdG8gdGhlIHBtZW0gZHJp
dmVyIHdoaWNoIHdpbGwgdHVybiBhcm91bmQgYW5kIGZlbmNlDQo+IHByZXZpb3VzIHdyaXRlcyB3
aXRoIGFuICJzZmVuY2UiLg0KPiANCj4gSW1wbGVtZW50IGEgX19jb3B5X2Zyb21fdXNlcl9pbmF0
b21pY193dCwgbWVtY3B5X3BhZ2Vfd3QsIGFuZA0KPiBtZW1jcHlfd3QsIHRoYXQgZ3VhcmFudGVl
IHRoYXQgdGhlIGRlc3RpbmF0aW9uIGJ1ZmZlciBpcyBub3QgZGlydHkgaW4NCj4gdGhlIGNwdSBj
YWNoZSBvbiBjb21wbGV0aW9uLiBUaGUgbmV3IGNvcHlfZnJvbV9pdGVyX3d0IGFuZCBzdWItDQo+
IHJvdXRpbmVzIHdpbGwgYmUgdXNlZCB0byByZXBsYWNlIHRoZSAicG1lbSBhcGkiIChpbmNsdWRl
L2xpbnV4L3BtZW0uaA0KPiArIGFyY2gveDg2L2luY2x1ZGUvYXNtL3BtZW0uaCkuIFRoZSBhdmFp
bGFiaWxpdHkgb2YNCj4gY29weV9mcm9tX2l0ZXJfd3QoKSBhbmQgbWVtY3B5X3d0KCkgYXJlIGdh
dGVkIGJ5IHRoZQ0KPiBDT05GSUdfQVJDSF9IQVNfVUFDQ0VTU19XVCBjb25maWcgc3ltYm9sLCBh
bmQgZmFsbGJhY2sgdG8NCj4gY29weV9mcm9tX2l0ZXJfbm9jYWNoZSgpIGFuZCBwbGFpbiBtZW1j
cHkoKSBvdGhlcndpc2UuDQo+IA0KPiBUaGlzIGlzIG1lYW50IHRvIHNhdGlzZnkgdGhlIGNvbmNl
cm4gZnJvbSBMaW51cyB0aGF0IGlmIGEgZHJpdmVyDQo+IHdhbnRzIHRvIGRvIHNvbWV0aGluZyBi
ZXlvbmQgdGhlIG5vcm1hbCBub2NhY2hlIHNlbWFudGljcyBpdCBzaG91bGQNCj4gYmUgc29tZXRo
aW5nIHByaXZhdGUgdG8gdGhhdCBkcml2ZXIgWzFdLCBhbmQgQWwncyBjb25jZXJuIHRoYXQNCj4g
YW55dGhpbmcgdWFjY2VzcyByZWxhdGVkIGJlbG9uZ3Mgd2l0aCB0aGUgcmVzdCBvZiB0aGUgdWFj
Y2VzcyBjb2RlDQo+IFsyXS4NCj4gDQo+IFsxXTogaHR0cHM6Ly9saXN0cy4wMS5vcmcvcGlwZXJt
YWlsL2xpbnV4LW52ZGltbS8yMDE3LUphbnVhcnkvMDA4MzY0Lg0KPiBodG1sDQo+IFsyXTogaHR0
cHM6Ly9saXN0cy4wMS5vcmcvcGlwZXJtYWlsL2xpbnV4LW52ZGltbS8yMDE3LUFwcmlsLzAwOTk0
Mi5odA0KPiBtbA0KPiANCj4gQ2M6IDx4ODZAa2VybmVsLm9yZz4NCj4gQ2M6IEphbiBLYXJhIDxq
YWNrQHN1c2UuY3o+DQo+IENjOiBKZWZmIE1veWVyIDxqbW95ZXJAcmVkaGF0LmNvbT4NCj4gQ2M6
IEluZ28gTW9sbmFyIDxtaW5nb0ByZWRoYXQuY29tPg0KPiBDYzogQ2hyaXN0b3BoIEhlbGx3aWcg
PGhjaEBsc3QuZGU+DQo+IENjOiAiSC4gUGV0ZXIgQW52aW4iIDxocGFAenl0b3IuY29tPg0KPiBD
YzogQWwgVmlybyA8dmlyb0B6ZW5pdi5saW51eC5vcmcudWs+DQo+IENjOiBUaG9tYXMgR2xlaXhu
ZXIgPHRnbHhAbGludXRyb25peC5kZT4NCj4gQ2M6IE1hdHRoZXcgV2lsY294IDxtYXdpbGNveEBt
aWNyb3NvZnQuY29tPg0KPiBDYzogUm9zcyBad2lzbGVyIDxyb3NzLnp3aXNsZXJAbGludXguaW50
ZWwuY29tPg0KPiBTaWduZWQtb2ZmLWJ5OiBEYW4gV2lsbGlhbXMgPGRhbi5qLndpbGxpYW1zQGlu
dGVsLmNvbT4NCj4gLS0tDQo+IENoYW5nZXMgc2luY2UgdGhlIGluaXRpYWwgUkZDOg0KPiAqIHMv
d3JpdGV0aHJ1L3d0LyBzaW5jZSB3ZSBhbHJlYWR5IGhhdmUgaW9yZW1hcF93dCgpLA0KPiBzZXRf
bWVtb3J5X3d0KCksIGV0Yy4gKEluZ28pDQoNClNvcnJ5IEkgc2hvdWxkIGhhdmUgc2FpZCBlYXJs
aWVyLCBidXQgSSB0aGluayB0aGUgdGVybSAid3QiIGlzDQptaXNsZWFkaW5nLiAgTm9uLXRlbXBv
cmFsIHN0b3JlcyB1c2VkIGluIG1lbWNweV93dCgpIHByb3ZpZGUgV0MNCnNlbWFudGljcywgbm90
IFdUIHNlbWFudGljcy4gIEhvdyBhYm91dCB1c2luZyAibm9jYWNoZSIgYXMgaXQncyBiZWVuDQp1
c2VkIGluIF9fY29weV91c2VyX25vY2FjaGUoKT8NCg0KVGhhbmtzLA0KLVRvc2hpDQo=

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-05 20:39         ` Kani, Toshimitsu
  0 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-05 20:39 UTC (permalink / raw)
  To: dan.j.williams, viro
  Cc: linux-kernel, linux-block, jmoyer, tglx, hch, x86, mawilcox, hpa,
	linux-nvdimm@lists.01.org, mingo, linux-fsdevel, ross.zwisler,
	jack

On Fri, 2017-04-28 at 12:39 -0700, Dan Williams wrote:
> The pmem driver has a need to transfer data with a persistent memory
> destination and be able to rely on the fact that the destination
> writes are not cached. It is sufficient for the writes to be flushed
> to a cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we
> expect userspace to call fsync() to ensure data-writes have reached a
> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA
> or REQ_FLUSH to the pmem driver which will turn around and fence
> previous writes with an "sfence".
> 
> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and
> memcpy_wt, that guarantee that the destination buffer is not dirty in
> the cpu cache on completion. The new copy_from_iter_wt and sub-
> routines will be used to replace the "pmem api" (include/linux/pmem.h
> + arch/x86/include/asm/pmem.h). The availability of
> copy_from_iter_wt() and memcpy_wt() are gated by the
> CONFIG_ARCH_HAS_UACCESS_WT config symbol, and fallback to
> copy_from_iter_nocache() and plain memcpy() otherwise.
> 
> This is meant to satisfy the concern from Linus that if a driver
> wants to do something beyond the normal nocache semantics it should
> be something private to that driver [1], and Al's concern that
> anything uaccess related belongs with the rest of the uaccess code
> [2].
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.
> html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.ht
> ml
> 
> Cc: <x86@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Changes since the initial RFC:
> * s/writethru/wt/ since we already have ioremap_wt(),
> set_memory_wt(), etc. (Ingo)

Sorry I should have said earlier, but I think the term "wt" is
misleading.  Non-temporal stores used in memcpy_wt() provide WC
semantics, not WT semantics.  How about using "nocache" as it's been
used in __copy_user_nocache()?

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-05 20:39         ` Kani, Toshimitsu
  0 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-05 20:39 UTC (permalink / raw)
  To: dan.j.williams, viro
  Cc: linux-kernel, linux-block, jmoyer, tglx, hch, x86, mawilcox, hpa,
	linux-nvdimm, mingo, linux-fsdevel, ross.zwisler, jack

On Fri, 2017-04-28 at 12:39 -0700, Dan Williams wrote:
> The pmem driver has a need to transfer data with a persistent memory
> destination and be able to rely on the fact that the destination
> writes are not cached. It is sufficient for the writes to be flushed
> to a cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we
> expect userspace to call fsync() to ensure data-writes have reached a
> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA
> or REQ_FLUSH to the pmem driver which will turn around and fence
> previous writes with an "sfence".
> 
> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and
> memcpy_wt, that guarantee that the destination buffer is not dirty in
> the cpu cache on completion. The new copy_from_iter_wt and sub-
> routines will be used to replace the "pmem api" (include/linux/pmem.h
> + arch/x86/include/asm/pmem.h). The availability of
> copy_from_iter_wt() and memcpy_wt() are gated by the
> CONFIG_ARCH_HAS_UACCESS_WT config symbol, and fallback to
> copy_from_iter_nocache() and plain memcpy() otherwise.
> 
> This is meant to satisfy the concern from Linus that if a driver
> wants to do something beyond the normal nocache semantics it should
> be something private to that driver [1], and Al's concern that
> anything uaccess related belongs with the rest of the uaccess code
> [2].
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.
> html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.ht
> ml
> 
> Cc: <x86@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Changes since the initial RFC:
> * s/writethru/wt/ since we already have ioremap_wt(),
> set_memory_wt(), etc. (Ingo)

Sorry I should have said earlier, but I think the term "wt" is
misleading.  Non-temporal stores used in memcpy_wt() provide WC
semantics, not WT semantics.  How about using "nocache" as it's been
used in __copy_user_nocache()?

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-05 20:39         ` Kani, Toshimitsu
  (?)
@ 2017-05-05 22:25           ` Dan Williams
  -1 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-05 22:25 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: jack, mawilcox, x86, linux-kernel, linux-block, linux-nvdimm,
	mingo, viro, hpa, linux-fsdevel, tglx, hch

On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Fri, 2017-04-28 at 12:39 -0700, Dan Williams wrote:
>> The pmem driver has a need to transfer data with a persistent memory
>> destination and be able to rely on the fact that the destination
>> writes are not cached. It is sufficient for the writes to be flushed
>> to a cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we
>> expect userspace to call fsync() to ensure data-writes have reached a
>> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA
>> or REQ_FLUSH to the pmem driver which will turn around and fence
>> previous writes with an "sfence".
>>
>> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and
>> memcpy_wt, that guarantee that the destination buffer is not dirty in
>> the cpu cache on completion. The new copy_from_iter_wt and sub-
>> routines will be used to replace the "pmem api" (include/linux/pmem.h
>> + arch/x86/include/asm/pmem.h). The availability of
>> copy_from_iter_wt() and memcpy_wt() are gated by the
>> CONFIG_ARCH_HAS_UACCESS_WT config symbol, and fallback to
>> copy_from_iter_nocache() and plain memcpy() otherwise.
>>
>> This is meant to satisfy the concern from Linus that if a driver
>> wants to do something beyond the normal nocache semantics it should
>> be something private to that driver [1], and Al's concern that
>> anything uaccess related belongs with the rest of the uaccess code
>> [2].
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.
>> html
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.ht
>> ml
>>
>> Cc: <x86@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>> Changes since the initial RFC:
>> * s/writethru/wt/ since we already have ioremap_wt(),
>> set_memory_wt(), etc. (Ingo)
>
> Sorry I should have said earlier, but I think the term "wt" is
> misleading.  Non-temporal stores used in memcpy_wt() provide WC
> semantics, not WT semantics.

The non-temporal stores do, but memcpy_wt() is using a combination of
non-temporal stores and explicit cache flushing.

> How about using "nocache" as it's been
> used in __copy_user_nocache()?

The difference in my mind is that the "_nocache" suffix indicates
opportunistic / optional cache pollution avoidance whereas "_wt"
strictly arranges for caches not to contain dirty data upon completion
of the routine. For example, non-temporal stores on older x86 cpus
could potentially leave dirty data in the cache, so memcpy_wt on those
cpus would need to use explicit cache flushing.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-05 22:25           ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-05 22:25 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: viro, linux-kernel, linux-block, jmoyer, tglx, hch, x86,
	mawilcox, hpa, linux-nvdimm, mingo, linux-fsdevel, ross.zwisler,
	jack

On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Fri, 2017-04-28 at 12:39 -0700, Dan Williams wrote:
>> The pmem driver has a need to transfer data with a persistent memory
>> destination and be able to rely on the fact that the destination
>> writes are not cached. It is sufficient for the writes to be flushed
>> to a cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we
>> expect userspace to call fsync() to ensure data-writes have reached a
>> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA
>> or REQ_FLUSH to the pmem driver which will turn around and fence
>> previous writes with an "sfence".
>>
>> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and
>> memcpy_wt, that guarantee that the destination buffer is not dirty in
>> the cpu cache on completion. The new copy_from_iter_wt and sub-
>> routines will be used to replace the "pmem api" (include/linux/pmem.h
>> + arch/x86/include/asm/pmem.h). The availability of
>> copy_from_iter_wt() and memcpy_wt() are gated by the
>> CONFIG_ARCH_HAS_UACCESS_WT config symbol, and fallback to
>> copy_from_iter_nocache() and plain memcpy() otherwise.
>>
>> This is meant to satisfy the concern from Linus that if a driver
>> wants to do something beyond the normal nocache semantics it should
>> be something private to that driver [1], and Al's concern that
>> anything uaccess related belongs with the rest of the uaccess code
>> [2].
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.
>> html
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.ht
>> ml
>>
>> Cc: <x86@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>> Changes since the initial RFC:
>> * s/writethru/wt/ since we already have ioremap_wt(),
>> set_memory_wt(), etc. (Ingo)
>
> Sorry I should have said earlier, but I think the term "wt" is
> misleading.  Non-temporal stores used in memcpy_wt() provide WC
> semantics, not WT semantics.

The non-temporal stores do, but memcpy_wt() is using a combination of
non-temporal stores and explicit cache flushing.

> How about using "nocache" as it's been
> used in __copy_user_nocache()?

The difference in my mind is that the "_nocache" suffix indicates
opportunistic / optional cache pollution avoidance whereas "_wt"
strictly arranges for caches not to contain dirty data upon completion
of the routine. For example, non-temporal stores on older x86 cpus
could potentially leave dirty data in the cache, so memcpy_wt on those
cpus would need to use explicit cache flushing.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-05 22:25           ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-05 22:25 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: viro, linux-kernel, linux-block, jmoyer, tglx, hch, x86,
	mawilcox, hpa, linux-nvdimm@lists.01.org, mingo, linux-fsdevel,
	ross.zwisler, jack

On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Fri, 2017-04-28 at 12:39 -0700, Dan Williams wrote:
>> The pmem driver has a need to transfer data with a persistent memory
>> destination and be able to rely on the fact that the destination
>> writes are not cached. It is sufficient for the writes to be flushed
>> to a cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we
>> expect userspace to call fsync() to ensure data-writes have reached a
>> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA
>> or REQ_FLUSH to the pmem driver which will turn around and fence
>> previous writes with an "sfence".
>>
>> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and
>> memcpy_wt, that guarantee that the destination buffer is not dirty in
>> the cpu cache on completion. The new copy_from_iter_wt and sub-
>> routines will be used to replace the "pmem api" (include/linux/pmem.h
>> + arch/x86/include/asm/pmem.h). The availability of
>> copy_from_iter_wt() and memcpy_wt() are gated by the
>> CONFIG_ARCH_HAS_UACCESS_WT config symbol, and fallback to
>> copy_from_iter_nocache() and plain memcpy() otherwise.
>>
>> This is meant to satisfy the concern from Linus that if a driver
>> wants to do something beyond the normal nocache semantics it should
>> be something private to that driver [1], and Al's concern that
>> anything uaccess related belongs with the rest of the uaccess code
>> [2].
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.
>> html
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.ht
>> ml
>>
>> Cc: <x86@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>> Changes since the initial RFC:
>> * s/writethru/wt/ since we already have ioremap_wt(),
>> set_memory_wt(), etc. (Ingo)
>
> Sorry I should have said earlier, but I think the term "wt" is
> misleading.  Non-temporal stores used in memcpy_wt() provide WC
> semantics, not WT semantics.

The non-temporal stores do, but memcpy_wt() is using a combination of
non-temporal stores and explicit cache flushing.

> How about using "nocache" as it's been
> used in __copy_user_nocache()?

The difference in my mind is that the "_nocache" suffix indicates
opportunistic / optional cache pollution avoidance whereas "_wt"
strictly arranges for caches not to contain dirty data upon completion
of the routine. For example, non-temporal stores on older x86 cpus
could potentially leave dirty data in the cache, so memcpy_wt on those
cpus would need to use explicit cache flushing.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-05 22:25           ` Dan Williams
  (?)
  (?)
@ 2017-05-05 22:44             ` Kani, Toshimitsu
  -1 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-05 22:44 UTC (permalink / raw)
  To: dan.j.williams
  Cc: jack, mawilcox, x86, linux-kernel, linux-block, linux-nvdimm,
	mingo, viro, hpa, linux-fsdevel, tglx, hch

On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
 :
> > > ---
> > > Changes since the initial RFC:
> > > * s/writethru/wt/ since we already have ioremap_wt(),
> > > set_memory_wt(), etc. (Ingo)
> > 
> > Sorry I should have said earlier, but I think the term "wt" is
> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> > semantics, not WT semantics.
> 
> The non-temporal stores do, but memcpy_wt() is using a combination of
> non-temporal stores and explicit cache flushing.
> 
> > How about using "nocache" as it's been
> > used in __copy_user_nocache()?
> 
> The difference in my mind is that the "_nocache" suffix indicates
> opportunistic / optional cache pollution avoidance whereas "_wt"
> strictly arranges for caches not to contain dirty data upon
> completion of the routine. For example, non-temporal stores on older
> x86 cpus could potentially leave dirty data in the cache, so
> memcpy_wt on those cpus would need to use explicit cache flushing.

I see.  I agree that its behavior is different from the existing one
with "_nocache".   That said, I think "wt" or "write-through" generally
means that writes allocate cachelines and keep them clean by writing to
memory.  So, subsequent reads to the destination will hit the
cachelines.  This is not the case with this interface.

Thanks,
-Toshi
 
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-05 22:44             ` Kani, Toshimitsu
  0 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-05 22:44 UTC (permalink / raw)
  To: dan.j.williams
  Cc: linux-kernel, linux-block, jmoyer, tglx, hch, viro, x86,
	mawilcox, hpa, linux-nvdimm, mingo, linux-fsdevel, ross.zwisler,
	jack

T24gRnJpLCAyMDE3LTA1LTA1IGF0IDE1OjI1IC0wNzAwLCBEYW4gV2lsbGlhbXMgd3JvdGU6DQo+
IE9uIEZyaSwgTWF5IDUsIDIwMTcgYXQgMTozOSBQTSwgS2FuaSwgVG9zaGltaXRzdSA8dG9zaGku
a2FuaUBocGUuY29tPg0KPiB3cm90ZToNCiA6DQo+ID4gPiAtLS0NCj4gPiA+IENoYW5nZXMgc2lu
Y2UgdGhlIGluaXRpYWwgUkZDOg0KPiA+ID4gKiBzL3dyaXRldGhydS93dC8gc2luY2Ugd2UgYWxy
ZWFkeSBoYXZlIGlvcmVtYXBfd3QoKSwNCj4gPiA+IHNldF9tZW1vcnlfd3QoKSwgZXRjLiAoSW5n
bykNCj4gPiANCj4gPiBTb3JyeSBJIHNob3VsZCBoYXZlIHNhaWQgZWFybGllciwgYnV0IEkgdGhp
bmsgdGhlIHRlcm0gInd0IiBpcw0KPiA+IG1pc2xlYWRpbmcuwqDCoE5vbi10ZW1wb3JhbCBzdG9y
ZXMgdXNlZCBpbiBtZW1jcHlfd3QoKSBwcm92aWRlIFdDDQo+ID4gc2VtYW50aWNzLCBub3QgV1Qg
c2VtYW50aWNzLg0KPiANCj4gVGhlIG5vbi10ZW1wb3JhbCBzdG9yZXMgZG8sIGJ1dCBtZW1jcHlf
d3QoKSBpcyB1c2luZyBhIGNvbWJpbmF0aW9uIG9mDQo+IG5vbi10ZW1wb3JhbCBzdG9yZXMgYW5k
IGV4cGxpY2l0IGNhY2hlIGZsdXNoaW5nLg0KPiANCj4gPiBIb3cgYWJvdXQgdXNpbmcgIm5vY2Fj
aGUiIGFzIGl0J3MgYmVlbg0KPiA+IHVzZWQgaW4gX19jb3B5X3VzZXJfbm9jYWNoZSgpPw0KPiAN
Cj4gVGhlIGRpZmZlcmVuY2UgaW4gbXkgbWluZCBpcyB0aGF0IHRoZSAiX25vY2FjaGUiIHN1ZmZp
eCBpbmRpY2F0ZXMNCj4gb3Bwb3J0dW5pc3RpYyAvIG9wdGlvbmFsIGNhY2hlIHBvbGx1dGlvbiBh
dm9pZGFuY2Ugd2hlcmVhcyAiX3d0Ig0KPiBzdHJpY3RseSBhcnJhbmdlcyBmb3IgY2FjaGVzIG5v
dCB0byBjb250YWluIGRpcnR5IGRhdGEgdXBvbg0KPiBjb21wbGV0aW9uIG9mIHRoZSByb3V0aW5l
LiBGb3IgZXhhbXBsZSwgbm9uLXRlbXBvcmFsIHN0b3JlcyBvbiBvbGRlcg0KPiB4ODYgY3B1cyBj
b3VsZCBwb3RlbnRpYWxseSBsZWF2ZSBkaXJ0eSBkYXRhIGluIHRoZSBjYWNoZSwgc28NCj4gbWVt
Y3B5X3d0IG9uIHRob3NlIGNwdXMgd291bGQgbmVlZCB0byB1c2UgZXhwbGljaXQgY2FjaGUgZmx1
c2hpbmcuDQoNCkkgc2VlLiAgSSBhZ3JlZSB0aGF0IGl0cyBiZWhhdmlvciBpcyBkaWZmZXJlbnQg
ZnJvbSB0aGUgZXhpc3Rpbmcgb25lDQp3aXRoICJfbm9jYWNoZSIuICAgVGhhdCBzYWlkLCBJIHRo
aW5rICJ3dCIgb3IgIndyaXRlLXRocm91Z2giIGdlbmVyYWxseQ0KbWVhbnMgdGhhdCB3cml0ZXMg
YWxsb2NhdGUgY2FjaGVsaW5lcyBhbmQga2VlcCB0aGVtIGNsZWFuIGJ5IHdyaXRpbmcgdG8NCm1l
bW9yeS4gIFNvLCBzdWJzZXF1ZW50IHJlYWRzIHRvIHRoZSBkZXN0aW5hdGlvbiB3aWxsIGhpdCB0
aGUNCmNhY2hlbGluZXMuICBUaGlzIGlzIG5vdCB0aGUgY2FzZSB3aXRoIHRoaXMgaW50ZXJmYWNl
Lg0KDQpUaGFua3MsDQotVG9zaGkNCiA=

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-05 22:44             ` Kani, Toshimitsu
  0 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-05 22:44 UTC (permalink / raw)
  To: dan.j.williams
  Cc: linux-kernel, linux-block, jmoyer, tglx, hch, viro, x86,
	mawilcox, hpa, linux-nvdimm@lists.01.org, mingo, linux-fsdevel,
	ross.zwisler, jack

On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
 :
> > > ---
> > > Changes since the initial RFC:
> > > * s/writethru/wt/ since we already have ioremap_wt(),
> > > set_memory_wt(), etc. (Ingo)
> > 
> > Sorry I should have said earlier, but I think the term "wt" is
> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> > semantics, not WT semantics.
> 
> The non-temporal stores do, but memcpy_wt() is using a combination of
> non-temporal stores and explicit cache flushing.
> 
> > How about using "nocache" as it's been
> > used in __copy_user_nocache()?
> 
> The difference in my mind is that the "_nocache" suffix indicates
> opportunistic / optional cache pollution avoidance whereas "_wt"
> strictly arranges for caches not to contain dirty data upon
> completion of the routine. For example, non-temporal stores on older
> x86 cpus could potentially leave dirty data in the cache, so
> memcpy_wt on those cpus would need to use explicit cache flushing.

I see.  I agree that its behavior is different from the existing one
with "_nocache".   That said, I think "wt" or "write-through" generally
means that writes allocate cachelines and keep them clean by writing to
memory.  So, subsequent reads to the destination will hit the
cachelines.  This is not the case with this interface.

Thanks,
-Toshi
 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-05 22:44             ` Kani, Toshimitsu
  0 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-05 22:44 UTC (permalink / raw)
  To: dan.j.williams
  Cc: linux-kernel, linux-block, jmoyer, tglx, hch, viro, x86,
	mawilcox, hpa, linux-nvdimm, mingo, linux-fsdevel, ross.zwisler,
	jack

On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
 :
> > > ---
> > > Changes since the initial RFC:
> > > * s/writethru/wt/ since we already have ioremap_wt(),
> > > set_memory_wt(), etc. (Ingo)
> > 
> > Sorry I should have said earlier, but I think the term "wt" is
> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> > semantics, not WT semantics.
> 
> The non-temporal stores do, but memcpy_wt() is using a combination of
> non-temporal stores and explicit cache flushing.
> 
> > How about using "nocache" as it's been
> > used in __copy_user_nocache()?
> 
> The difference in my mind is that the "_nocache" suffix indicates
> opportunistic / optional cache pollution avoidance whereas "_wt"
> strictly arranges for caches not to contain dirty data upon
> completion of the routine. For example, non-temporal stores on older
> x86 cpus could potentially leave dirty data in the cache, so
> memcpy_wt on those cpus would need to use explicit cache flushing.

I see.  I agree that its behavior is different from the existing one
with "_nocache".   That said, I think "wt" or "write-through" generally
means that writes allocate cachelines and keep them clean by writing to
memory.  So, subsequent reads to the destination will hit the
cachelines.  This is not the case with this interface.

Thanks,
-Toshi
 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-05 22:44             ` Kani, Toshimitsu
  (?)
@ 2017-05-06  2:15               ` Dan Williams
  -1 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-06  2:15 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: jack, mawilcox, x86, linux-kernel, linux-block, linux-nvdimm,
	mingo, viro, hpa, linux-fsdevel, tglx, hch

On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
>> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
>> wrote:
>  :
>> > > ---
>> > > Changes since the initial RFC:
>> > > * s/writethru/wt/ since we already have ioremap_wt(),
>> > > set_memory_wt(), etc. (Ingo)
>> >
>> > Sorry I should have said earlier, but I think the term "wt" is
>> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
>> > semantics, not WT semantics.
>>
>> The non-temporal stores do, but memcpy_wt() is using a combination of
>> non-temporal stores and explicit cache flushing.
>>
>> > How about using "nocache" as it's been
>> > used in __copy_user_nocache()?
>>
>> The difference in my mind is that the "_nocache" suffix indicates
>> opportunistic / optional cache pollution avoidance whereas "_wt"
>> strictly arranges for caches not to contain dirty data upon
>> completion of the routine. For example, non-temporal stores on older
>> x86 cpus could potentially leave dirty data in the cache, so
>> memcpy_wt on those cpus would need to use explicit cache flushing.
>
> I see.  I agree that its behavior is different from the existing one
> with "_nocache".   That said, I think "wt" or "write-through" generally
> means that writes allocate cachelines and keep them clean by writing to
> memory.  So, subsequent reads to the destination will hit the
> cachelines.  This is not the case with this interface.

True... maybe _nocache_strict()? Or, leave it _wt() until someone
comes along and is surprised that the cache is not warm for reads
after memcpy_wt(), at which point we can ask "why not just use plain
memcpy then?", or set the page-attributes to WT.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-06  2:15               ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-06  2:15 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: linux-kernel, linux-block, jmoyer, tglx, hch, viro, x86,
	mawilcox, hpa, linux-nvdimm, mingo, linux-fsdevel, ross.zwisler,
	jack

On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
>> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
>> wrote:
>  :
>> > > ---
>> > > Changes since the initial RFC:
>> > > * s/writethru/wt/ since we already have ioremap_wt(),
>> > > set_memory_wt(), etc. (Ingo)
>> >
>> > Sorry I should have said earlier, but I think the term "wt" is
>> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
>> > semantics, not WT semantics.
>>
>> The non-temporal stores do, but memcpy_wt() is using a combination of
>> non-temporal stores and explicit cache flushing.
>>
>> > How about using "nocache" as it's been
>> > used in __copy_user_nocache()?
>>
>> The difference in my mind is that the "_nocache" suffix indicates
>> opportunistic / optional cache pollution avoidance whereas "_wt"
>> strictly arranges for caches not to contain dirty data upon
>> completion of the routine. For example, non-temporal stores on older
>> x86 cpus could potentially leave dirty data in the cache, so
>> memcpy_wt on those cpus would need to use explicit cache flushing.
>
> I see.  I agree that its behavior is different from the existing one
> with "_nocache".   That said, I think "wt" or "write-through" generally
> means that writes allocate cachelines and keep them clean by writing to
> memory.  So, subsequent reads to the destination will hit the
> cachelines.  This is not the case with this interface.

True... maybe _nocache_strict()? Or, leave it _wt() until someone
comes along and is surprised that the cache is not warm for reads
after memcpy_wt(), at which point we can ask "why not just use plain
memcpy then?", or set the page-attributes to WT.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-06  2:15               ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-06  2:15 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: linux-kernel, linux-block, jmoyer, tglx, hch, viro, x86,
	mawilcox, hpa, linux-nvdimm@lists.01.org, mingo, linux-fsdevel,
	ross.zwisler, jack

On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
>> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
>> wrote:
>  :
>> > > ---
>> > > Changes since the initial RFC:
>> > > * s/writethru/wt/ since we already have ioremap_wt(),
>> > > set_memory_wt(), etc. (Ingo)
>> >
>> > Sorry I should have said earlier, but I think the term "wt" is
>> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
>> > semantics, not WT semantics.
>>
>> The non-temporal stores do, but memcpy_wt() is using a combination of
>> non-temporal stores and explicit cache flushing.
>>
>> > How about using "nocache" as it's been
>> > used in __copy_user_nocache()?
>>
>> The difference in my mind is that the "_nocache" suffix indicates
>> opportunistic / optional cache pollution avoidance whereas "_wt"
>> strictly arranges for caches not to contain dirty data upon
>> completion of the routine. For example, non-temporal stores on older
>> x86 cpus could potentially leave dirty data in the cache, so
>> memcpy_wt on those cpus would need to use explicit cache flushing.
>
> I see.  I agree that its behavior is different from the existing one
> with "_nocache".   That said, I think "wt" or "write-through" generally
> means that writes allocate cachelines and keep them clean by writing to
> memory.  So, subsequent reads to the destination will hit the
> cachelines.  This is not the case with this interface.

True... maybe _nocache_strict()? Or, leave it _wt() until someone
comes along and is surprised that the cache is not warm for reads
after memcpy_wt(), at which point we can ask "why not just use plain
memcpy then?", or set the page-attributes to WT.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-06  2:15               ` Dan Williams
  (?)
  (?)
@ 2017-05-06  3:17                 ` Kani, Toshimitsu
  -1 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-06  3:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, mawilcox, x86, linux-kernel, linux-block, linux-nvdimm,
	mingo, viro, hpa, linux-fsdevel, tglx, hch

> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> >> wrote:
> >  :
> >> > > ---
> >> > > Changes since the initial RFC:
> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> >> > > set_memory_wt(), etc. (Ingo)
> >> >
> >> > Sorry I should have said earlier, but I think the term "wt" is
> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> >> > semantics, not WT semantics.
> >>
> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> >> non-temporal stores and explicit cache flushing.
> >>
> >> > How about using "nocache" as it's been
> >> > used in __copy_user_nocache()?
> >>
> >> The difference in my mind is that the "_nocache" suffix indicates
> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> >> strictly arranges for caches not to contain dirty data upon
> >> completion of the routine. For example, non-temporal stores on older
> >> x86 cpus could potentially leave dirty data in the cache, so
> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> >
> > I see.  I agree that its behavior is different from the existing one
> > with "_nocache".   That said, I think "wt" or "write-through" generally
> > means that writes allocate cachelines and keep them clean by writing to
> > memory.  So, subsequent reads to the destination will hit the
> > cachelines.  This is not the case with this interface.
> 
> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> comes along and is surprised that the cache is not warm for reads
> after memcpy_wt(), at which point we can ask "why not just use plain
> memcpy then?", or set the page-attributes to WT.

I prefer _nocache_strict(), if it's not too long, since it avoids any
confusion.  If other arches actually implement it with WT semantics,
we might become the one to change it, instead of the caller.

Thanks,
-Toshi

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-06  3:17                 ` Kani, Toshimitsu
  0 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-06  3:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, linux-block, jmoyer, tglx, hch, viro, x86,
	mawilcox, hpa, linux-nvdimm, mingo, linux-fsdevel, ross.zwisler,
	jack

PiBPbiBGcmksIE1heSA1LCAyMDE3IGF0IDM6NDQgUE0sIEthbmksIFRvc2hpbWl0c3UgPHRvc2hp
LmthbmlAaHBlLmNvbT4NCj4gd3JvdGU6DQo+ID4gT24gRnJpLCAyMDE3LTA1LTA1IGF0IDE1OjI1
IC0wNzAwLCBEYW4gV2lsbGlhbXMgd3JvdGU6DQo+ID4+IE9uIEZyaSwgTWF5IDUsIDIwMTcgYXQg
MTozOSBQTSwgS2FuaSwgVG9zaGltaXRzdSA8dG9zaGkua2FuaUBocGUuY29tPg0KPiA+PiB3cm90
ZToNCj4gPiAgOg0KPiA+PiA+ID4gLS0tDQo+ID4+ID4gPiBDaGFuZ2VzIHNpbmNlIHRoZSBpbml0
aWFsIFJGQzoNCj4gPj4gPiA+ICogcy93cml0ZXRocnUvd3QvIHNpbmNlIHdlIGFscmVhZHkgaGF2
ZSBpb3JlbWFwX3d0KCksDQo+ID4+ID4gPiBzZXRfbWVtb3J5X3d0KCksIGV0Yy4gKEluZ28pDQo+
ID4+ID4NCj4gPj4gPiBTb3JyeSBJIHNob3VsZCBoYXZlIHNhaWQgZWFybGllciwgYnV0IEkgdGhp
bmsgdGhlIHRlcm0gInd0IiBpcw0KPiA+PiA+IG1pc2xlYWRpbmcuICBOb24tdGVtcG9yYWwgc3Rv
cmVzIHVzZWQgaW4gbWVtY3B5X3d0KCkgcHJvdmlkZSBXQw0KPiA+PiA+IHNlbWFudGljcywgbm90
IFdUIHNlbWFudGljcy4NCj4gPj4NCj4gPj4gVGhlIG5vbi10ZW1wb3JhbCBzdG9yZXMgZG8sIGJ1
dCBtZW1jcHlfd3QoKSBpcyB1c2luZyBhIGNvbWJpbmF0aW9uIG9mDQo+ID4+IG5vbi10ZW1wb3Jh
bCBzdG9yZXMgYW5kIGV4cGxpY2l0IGNhY2hlIGZsdXNoaW5nLg0KPiA+Pg0KPiA+PiA+IEhvdyBh
Ym91dCB1c2luZyAibm9jYWNoZSIgYXMgaXQncyBiZWVuDQo+ID4+ID4gdXNlZCBpbiBfX2NvcHlf
dXNlcl9ub2NhY2hlKCk/DQo+ID4+DQo+ID4+IFRoZSBkaWZmZXJlbmNlIGluIG15IG1pbmQgaXMg
dGhhdCB0aGUgIl9ub2NhY2hlIiBzdWZmaXggaW5kaWNhdGVzDQo+ID4+IG9wcG9ydHVuaXN0aWMg
LyBvcHRpb25hbCBjYWNoZSBwb2xsdXRpb24gYXZvaWRhbmNlIHdoZXJlYXMgIl93dCINCj4gPj4g
c3RyaWN0bHkgYXJyYW5nZXMgZm9yIGNhY2hlcyBub3QgdG8gY29udGFpbiBkaXJ0eSBkYXRhIHVw
b24NCj4gPj4gY29tcGxldGlvbiBvZiB0aGUgcm91dGluZS4gRm9yIGV4YW1wbGUsIG5vbi10ZW1w
b3JhbCBzdG9yZXMgb24gb2xkZXINCj4gPj4geDg2IGNwdXMgY291bGQgcG90ZW50aWFsbHkgbGVh
dmUgZGlydHkgZGF0YSBpbiB0aGUgY2FjaGUsIHNvDQo+ID4+IG1lbWNweV93dCBvbiB0aG9zZSBj
cHVzIHdvdWxkIG5lZWQgdG8gdXNlIGV4cGxpY2l0IGNhY2hlIGZsdXNoaW5nLg0KPiA+DQo+ID4g
SSBzZWUuICBJIGFncmVlIHRoYXQgaXRzIGJlaGF2aW9yIGlzIGRpZmZlcmVudCBmcm9tIHRoZSBl
eGlzdGluZyBvbmUNCj4gPiB3aXRoICJfbm9jYWNoZSIuICAgVGhhdCBzYWlkLCBJIHRoaW5rICJ3
dCIgb3IgIndyaXRlLXRocm91Z2giIGdlbmVyYWxseQ0KPiA+IG1lYW5zIHRoYXQgd3JpdGVzIGFs
bG9jYXRlIGNhY2hlbGluZXMgYW5kIGtlZXAgdGhlbSBjbGVhbiBieSB3cml0aW5nIHRvDQo+ID4g
bWVtb3J5LiAgU28sIHN1YnNlcXVlbnQgcmVhZHMgdG8gdGhlIGRlc3RpbmF0aW9uIHdpbGwgaGl0
IHRoZQ0KPiA+IGNhY2hlbGluZXMuICBUaGlzIGlzIG5vdCB0aGUgY2FzZSB3aXRoIHRoaXMgaW50
ZXJmYWNlLg0KPiANCj4gVHJ1ZS4uLiBtYXliZSBfbm9jYWNoZV9zdHJpY3QoKT8gT3IsIGxlYXZl
IGl0IF93dCgpIHVudGlsIHNvbWVvbmUNCj4gY29tZXMgYWxvbmcgYW5kIGlzIHN1cnByaXNlZCB0
aGF0IHRoZSBjYWNoZSBpcyBub3Qgd2FybSBmb3IgcmVhZHMNCj4gYWZ0ZXIgbWVtY3B5X3d0KCks
IGF0IHdoaWNoIHBvaW50IHdlIGNhbiBhc2sgIndoeSBub3QganVzdCB1c2UgcGxhaW4NCj4gbWVt
Y3B5IHRoZW4/Iiwgb3Igc2V0IHRoZSBwYWdlLWF0dHJpYnV0ZXMgdG8gV1QuDQoNCkkgcHJlZmVy
IF9ub2NhY2hlX3N0cmljdCgpLCBpZiBpdCdzIG5vdCB0b28gbG9uZywgc2luY2UgaXQgYXZvaWRz
IGFueQ0KY29uZnVzaW9uLiAgSWYgb3RoZXIgYXJjaGVzIGFjdHVhbGx5IGltcGxlbWVudCBpdCB3
aXRoIFdUIHNlbWFudGljcywNCndlIG1pZ2h0IGJlY29tZSB0aGUgb25lIHRvIGNoYW5nZSBpdCwg
aW5zdGVhZCBvZiB0aGUgY2FsbGVyLg0KDQpUaGFua3MsDQotVG9zaGkNCg0K

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-06  3:17                 ` Kani, Toshimitsu
  0 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-06  3:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, linux-block, jmoyer, tglx, hch, viro, x86,
	mawilcox, hpa, linux-nvdimm@lists.01.org, mingo, linux-fsdevel,
	ross.zwisler, jack

> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> >> wrote:
> >  :
> >> > > ---
> >> > > Changes since the initial RFC:
> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> >> > > set_memory_wt(), etc. (Ingo)
> >> >
> >> > Sorry I should have said earlier, but I think the term "wt" is
> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> >> > semantics, not WT semantics.
> >>
> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> >> non-temporal stores and explicit cache flushing.
> >>
> >> > How about using "nocache" as it's been
> >> > used in __copy_user_nocache()?
> >>
> >> The difference in my mind is that the "_nocache" suffix indicates
> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> >> strictly arranges for caches not to contain dirty data upon
> >> completion of the routine. For example, non-temporal stores on older
> >> x86 cpus could potentially leave dirty data in the cache, so
> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> >
> > I see.  I agree that its behavior is different from the existing one
> > with "_nocache".   That said, I think "wt" or "write-through" generally
> > means that writes allocate cachelines and keep them clean by writing to
> > memory.  So, subsequent reads to the destination will hit the
> > cachelines.  This is not the case with this interface.
> 
> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> comes along and is surprised that the cache is not warm for reads
> after memcpy_wt(), at which point we can ask "why not just use plain
> memcpy then?", or set the page-attributes to WT.

I prefer _nocache_strict(), if it's not too long, since it avoids any
confusion.  If other arches actually implement it with WT semantics,
we might become the one to change it, instead of the caller.

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-06  3:17                 ` Kani, Toshimitsu
  0 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-06  3:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, linux-block, jmoyer, tglx, hch, viro, x86,
	mawilcox, hpa, linux-nvdimm, mingo, linux-fsdevel, ross.zwisler,
	jack

> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> >> wrote:
> >  :
> >> > > ---
> >> > > Changes since the initial RFC:
> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> >> > > set_memory_wt(), etc. (Ingo)
> >> >
> >> > Sorry I should have said earlier, but I think the term "wt" is
> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> >> > semantics, not WT semantics.
> >>
> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> >> non-temporal stores and explicit cache flushing.
> >>
> >> > How about using "nocache" as it's been
> >> > used in __copy_user_nocache()?
> >>
> >> The difference in my mind is that the "_nocache" suffix indicates
> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> >> strictly arranges for caches not to contain dirty data upon
> >> completion of the routine. For example, non-temporal stores on older
> >> x86 cpus could potentially leave dirty data in the cache, so
> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> >
> > I see.  I agree that its behavior is different from the existing one
> > with "_nocache".   That said, I think "wt" or "write-through" generally
> > means that writes allocate cachelines and keep them clean by writing to
> > memory.  So, subsequent reads to the destination will hit the
> > cachelines.  This is not the case with this interface.
> 
> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> comes along and is surprised that the cache is not warm for reads
> after memcpy_wt(), at which point we can ask "why not just use plain
> memcpy then?", or set the page-attributes to WT.

I prefer _nocache_strict(), if it's not too long, since it avoids any
confusion.  If other arches actually implement it with WT semantics,
we might become the one to change it, instead of the caller.

Thanks,
-Toshi


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-06  2:15               ` Dan Williams
  (?)
@ 2017-05-06  9:46                 ` Ingo Molnar
  -1 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2017-05-06  9:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, mawilcox, x86, linux-kernel, linux-block, linux-nvdimm,
	mingo, viro, hpa, linux-fsdevel, tglx, hch


* Dan Williams <dan.j.williams@intel.com> wrote:

> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> >> wrote:
> >  :
> >> > > ---
> >> > > Changes since the initial RFC:
> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> >> > > set_memory_wt(), etc. (Ingo)
> >> >
> >> > Sorry I should have said earlier, but I think the term "wt" is
> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> >> > semantics, not WT semantics.
> >>
> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> >> non-temporal stores and explicit cache flushing.
> >>
> >> > How about using "nocache" as it's been
> >> > used in __copy_user_nocache()?
> >>
> >> The difference in my mind is that the "_nocache" suffix indicates
> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> >> strictly arranges for caches not to contain dirty data upon
> >> completion of the routine. For example, non-temporal stores on older
> >> x86 cpus could potentially leave dirty data in the cache, so
> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> >
> > I see.  I agree that its behavior is different from the existing one
> > with "_nocache".   That said, I think "wt" or "write-through" generally
> > means that writes allocate cachelines and keep them clean by writing to
> > memory.  So, subsequent reads to the destination will hit the
> > cachelines.  This is not the case with this interface.
> 
> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> comes along and is surprised that the cache is not warm for reads
> after memcpy_wt(), at which point we can ask "why not just use plain
> memcpy then?", or set the page-attributes to WT.

Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that 
no cache line is left around afterwards (dirty or clean)?

Thanks,

	Ingo
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-06  9:46                 ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2017-05-06  9:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kani, Toshimitsu, linux-kernel, linux-block, jmoyer, tglx, hch,
	viro, x86, mawilcox, hpa, linux-nvdimm, mingo, linux-fsdevel,
	ross.zwisler, jack


* Dan Williams <dan.j.williams@intel.com> wrote:

> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> >> wrote:
> >  :
> >> > > ---
> >> > > Changes since the initial RFC:
> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> >> > > set_memory_wt(), etc. (Ingo)
> >> >
> >> > Sorry I should have said earlier, but I think the term "wt" is
> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> >> > semantics, not WT semantics.
> >>
> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> >> non-temporal stores and explicit cache flushing.
> >>
> >> > How about using "nocache" as it's been
> >> > used in __copy_user_nocache()?
> >>
> >> The difference in my mind is that the "_nocache" suffix indicates
> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> >> strictly arranges for caches not to contain dirty data upon
> >> completion of the routine. For example, non-temporal stores on older
> >> x86 cpus could potentially leave dirty data in the cache, so
> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> >
> > I see.  I agree that its behavior is different from the existing one
> > with "_nocache".   That said, I think "wt" or "write-through" generally
> > means that writes allocate cachelines and keep them clean by writing to
> > memory.  So, subsequent reads to the destination will hit the
> > cachelines.  This is not the case with this interface.
> 
> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> comes along and is surprised that the cache is not warm for reads
> after memcpy_wt(), at which point we can ask "why not just use plain
> memcpy then?", or set the page-attributes to WT.

Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that 
no cache line is left around afterwards (dirty or clean)?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-06  9:46                 ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2017-05-06  9:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kani, Toshimitsu, linux-kernel, linux-block, jmoyer, tglx, hch,
	viro, x86, mawilcox, hpa, linux-nvdimm@lists.01.org, mingo,
	linux-fsdevel, ross.zwisler, jack


* Dan Williams <dan.j.williams@intel.com> wrote:

> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> >> wrote:
> >  :
> >> > > ---
> >> > > Changes since the initial RFC:
> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> >> > > set_memory_wt(), etc. (Ingo)
> >> >
> >> > Sorry I should have said earlier, but I think the term "wt" is
> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> >> > semantics, not WT semantics.
> >>
> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> >> non-temporal stores and explicit cache flushing.
> >>
> >> > How about using "nocache" as it's been
> >> > used in __copy_user_nocache()?
> >>
> >> The difference in my mind is that the "_nocache" suffix indicates
> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> >> strictly arranges for caches not to contain dirty data upon
> >> completion of the routine. For example, non-temporal stores on older
> >> x86 cpus could potentially leave dirty data in the cache, so
> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> >
> > I see.  I agree that its behavior is different from the existing one
> > with "_nocache".   That said, I think "wt" or "write-through" generally
> > means that writes allocate cachelines and keep them clean by writing to
> > memory.  So, subsequent reads to the destination will hit the
> > cachelines.  This is not the case with this interface.
> 
> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> comes along and is surprised that the cache is not warm for reads
> after memcpy_wt(), at which point we can ask "why not just use plain
> memcpy then?", or set the page-attributes to WT.

Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that 
no cache line is left around afterwards (dirty or clean)?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-06  9:46                 ` Ingo Molnar
  (?)
@ 2017-05-06 13:57                   ` Dan Williams
  -1 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-06 13:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: jack, mawilcox, x86, linux-kernel, linux-block, linux-nvdimm,
	mingo, viro, hpa, linux-fsdevel, tglx, hch

On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Dan Williams <dan.j.williams@intel.com> wrote:
>
>> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
>> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
>> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
>> >> wrote:
>> >  :
>> >> > > ---
>> >> > > Changes since the initial RFC:
>> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
>> >> > > set_memory_wt(), etc. (Ingo)
>> >> >
>> >> > Sorry I should have said earlier, but I think the term "wt" is
>> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
>> >> > semantics, not WT semantics.
>> >>
>> >> The non-temporal stores do, but memcpy_wt() is using a combination of
>> >> non-temporal stores and explicit cache flushing.
>> >>
>> >> > How about using "nocache" as it's been
>> >> > used in __copy_user_nocache()?
>> >>
>> >> The difference in my mind is that the "_nocache" suffix indicates
>> >> opportunistic / optional cache pollution avoidance whereas "_wt"
>> >> strictly arranges for caches not to contain dirty data upon
>> >> completion of the routine. For example, non-temporal stores on older
>> >> x86 cpus could potentially leave dirty data in the cache, so
>> >> memcpy_wt on those cpus would need to use explicit cache flushing.
>> >
>> > I see.  I agree that its behavior is different from the existing one
>> > with "_nocache".   That said, I think "wt" or "write-through" generally
>> > means that writes allocate cachelines and keep them clean by writing to
>> > memory.  So, subsequent reads to the destination will hit the
>> > cachelines.  This is not the case with this interface.
>>
>> True... maybe _nocache_strict()? Or, leave it _wt() until someone
>> comes along and is surprised that the cache is not warm for reads
>> after memcpy_wt(), at which point we can ask "why not just use plain
>> memcpy then?", or set the page-attributes to WT.
>
> Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that
> no cache line is left around afterwards (dirty or clean)?

Yes, I think "flush" belongs in the name, and to make it easily
grep-able separate from _nocache we can call it _flushcache? An
efficient implementation will use _nocache / non-temporal stores
internally, but external consumers just care about the state of the
cache after the call.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-06 13:57                   ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-06 13:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kani, Toshimitsu, linux-kernel, linux-block, jmoyer, tglx, hch,
	viro, x86, mawilcox, hpa, linux-nvdimm, mingo, linux-fsdevel,
	ross.zwisler, jack

On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Dan Williams <dan.j.williams@intel.com> wrote:
>
>> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
>> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
>> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
>> >> wrote:
>> >  :
>> >> > > ---
>> >> > > Changes since the initial RFC:
>> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
>> >> > > set_memory_wt(), etc. (Ingo)
>> >> >
>> >> > Sorry I should have said earlier, but I think the term "wt" is
>> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
>> >> > semantics, not WT semantics.
>> >>
>> >> The non-temporal stores do, but memcpy_wt() is using a combination of
>> >> non-temporal stores and explicit cache flushing.
>> >>
>> >> > How about using "nocache" as it's been
>> >> > used in __copy_user_nocache()?
>> >>
>> >> The difference in my mind is that the "_nocache" suffix indicates
>> >> opportunistic / optional cache pollution avoidance whereas "_wt"
>> >> strictly arranges for caches not to contain dirty data upon
>> >> completion of the routine. For example, non-temporal stores on older
>> >> x86 cpus could potentially leave dirty data in the cache, so
>> >> memcpy_wt on those cpus would need to use explicit cache flushing.
>> >
>> > I see.  I agree that its behavior is different from the existing one
>> > with "_nocache".   That said, I think "wt" or "write-through" generally
>> > means that writes allocate cachelines and keep them clean by writing to
>> > memory.  So, subsequent reads to the destination will hit the
>> > cachelines.  This is not the case with this interface.
>>
>> True... maybe _nocache_strict()? Or, leave it _wt() until someone
>> comes along and is surprised that the cache is not warm for reads
>> after memcpy_wt(), at which point we can ask "why not just use plain
>> memcpy then?", or set the page-attributes to WT.
>
> Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that
> no cache line is left around afterwards (dirty or clean)?

Yes, I think "flush" belongs in the name, and to make it easily
grep-able separate from _nocache we can call it _flushcache? An
efficient implementation will use _nocache / non-temporal stores
internally, but external consumers just care about the state of the
cache after the call.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-06 13:57                   ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-06 13:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kani, Toshimitsu, linux-kernel, linux-block, jmoyer, tglx, hch,
	viro, x86, mawilcox, hpa, linux-nvdimm@lists.01.org, mingo,
	linux-fsdevel, ross.zwisler, jack

On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Dan Williams <dan.j.williams@intel.com> wrote:
>
>> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
>> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
>> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
>> >> wrote:
>> >  :
>> >> > > ---
>> >> > > Changes since the initial RFC:
>> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
>> >> > > set_memory_wt(), etc. (Ingo)
>> >> >
>> >> > Sorry I should have said earlier, but I think the term "wt" is
>> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
>> >> > semantics, not WT semantics.
>> >>
>> >> The non-temporal stores do, but memcpy_wt() is using a combination of
>> >> non-temporal stores and explicit cache flushing.
>> >>
>> >> > How about using "nocache" as it's been
>> >> > used in __copy_user_nocache()?
>> >>
>> >> The difference in my mind is that the "_nocache" suffix indicates
>> >> opportunistic / optional cache pollution avoidance whereas "_wt"
>> >> strictly arranges for caches not to contain dirty data upon
>> >> completion of the routine. For example, non-temporal stores on older
>> >> x86 cpus could potentially leave dirty data in the cache, so
>> >> memcpy_wt on those cpus would need to use explicit cache flushing.
>> >
>> > I see.  I agree that its behavior is different from the existing one
>> > with "_nocache".   That said, I think "wt" or "write-through" generally
>> > means that writes allocate cachelines and keep them clean by writing to
>> > memory.  So, subsequent reads to the destination will hit the
>> > cachelines.  This is not the case with this interface.
>>
>> True... maybe _nocache_strict()? Or, leave it _wt() until someone
>> comes along and is surprised that the cache is not warm for reads
>> after memcpy_wt(), at which point we can ask "why not just use plain
>> memcpy then?", or set the page-attributes to WT.
>
> Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that
> no cache line is left around afterwards (dirty or clean)?

Yes, I think "flush" belongs in the name, and to make it easily
grep-able separate from _nocache we can call it _flushcache? An
efficient implementation will use _nocache / non-temporal stores
internally, but external consumers just care about the state of the
cache after the call.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-06 13:57                   ` Dan Williams
  (?)
@ 2017-05-07  8:57                     ` Ingo Molnar
  -1 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2017-05-07  8:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, mawilcox, x86, linux-kernel, linux-block, linux-nvdimm,
	mingo, viro, hpa, linux-fsdevel, tglx, hch


* Dan Williams <dan.j.williams@intel.com> wrote:

> On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> >> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> >> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> >> >> wrote:
> >> >  :
> >> >> > > ---
> >> >> > > Changes since the initial RFC:
> >> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> >> >> > > set_memory_wt(), etc. (Ingo)
> >> >> >
> >> >> > Sorry I should have said earlier, but I think the term "wt" is
> >> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> >> >> > semantics, not WT semantics.
> >> >>
> >> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> >> >> non-temporal stores and explicit cache flushing.
> >> >>
> >> >> > How about using "nocache" as it's been
> >> >> > used in __copy_user_nocache()?
> >> >>
> >> >> The difference in my mind is that the "_nocache" suffix indicates
> >> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> >> >> strictly arranges for caches not to contain dirty data upon
> >> >> completion of the routine. For example, non-temporal stores on older
> >> >> x86 cpus could potentially leave dirty data in the cache, so
> >> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> >> >
> >> > I see.  I agree that its behavior is different from the existing one
> >> > with "_nocache".   That said, I think "wt" or "write-through" generally
> >> > means that writes allocate cachelines and keep them clean by writing to
> >> > memory.  So, subsequent reads to the destination will hit the
> >> > cachelines.  This is not the case with this interface.
> >>
> >> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> >> comes along and is surprised that the cache is not warm for reads
> >> after memcpy_wt(), at which point we can ask "why not just use plain
> >> memcpy then?", or set the page-attributes to WT.
> >
> > Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that
> > no cache line is left around afterwards (dirty or clean)?
> 
> Yes, I think "flush" belongs in the name, and to make it easily
> grep-able separate from _nocache we can call it _flushcache? An
> efficient implementation will use _nocache / non-temporal stores
> internally, but external consumers just care about the state of the
> cache after the call.

_flushcache() works for me too.

Thanks,

	Ingo
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-07  8:57                     ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2017-05-07  8:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kani, Toshimitsu, linux-kernel, linux-block, jmoyer, tglx, hch,
	viro, x86, mawilcox, hpa, linux-nvdimm, mingo, linux-fsdevel,
	ross.zwisler, jack


* Dan Williams <dan.j.williams@intel.com> wrote:

> On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> >> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> >> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> >> >> wrote:
> >> >  :
> >> >> > > ---
> >> >> > > Changes since the initial RFC:
> >> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> >> >> > > set_memory_wt(), etc. (Ingo)
> >> >> >
> >> >> > Sorry I should have said earlier, but I think the term "wt" is
> >> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> >> >> > semantics, not WT semantics.
> >> >>
> >> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> >> >> non-temporal stores and explicit cache flushing.
> >> >>
> >> >> > How about using "nocache" as it's been
> >> >> > used in __copy_user_nocache()?
> >> >>
> >> >> The difference in my mind is that the "_nocache" suffix indicates
> >> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> >> >> strictly arranges for caches not to contain dirty data upon
> >> >> completion of the routine. For example, non-temporal stores on older
> >> >> x86 cpus could potentially leave dirty data in the cache, so
> >> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> >> >
> >> > I see.  I agree that its behavior is different from the existing one
> >> > with "_nocache".   That said, I think "wt" or "write-through" generally
> >> > means that writes allocate cachelines and keep them clean by writing to
> >> > memory.  So, subsequent reads to the destination will hit the
> >> > cachelines.  This is not the case with this interface.
> >>
> >> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> >> comes along and is surprised that the cache is not warm for reads
> >> after memcpy_wt(), at which point we can ask "why not just use plain
> >> memcpy then?", or set the page-attributes to WT.
> >
> > Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that
> > no cache line is left around afterwards (dirty or clean)?
> 
> Yes, I think "flush" belongs in the name, and to make it easily
> grep-able separate from _nocache we can call it _flushcache? An
> efficient implementation will use _nocache / non-temporal stores
> internally, but external consumers just care about the state of the
> cache after the call.

_flushcache() works for me too.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-07  8:57                     ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2017-05-07  8:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kani, Toshimitsu, linux-kernel, linux-block, jmoyer, tglx, hch,
	viro, x86, mawilcox, hpa, linux-nvdimm@lists.01.org, mingo,
	linux-fsdevel, ross.zwisler, jack


* Dan Williams <dan.j.williams@intel.com> wrote:

> On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> >> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> >> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> >> >> wrote:
> >> >  :
> >> >> > > ---
> >> >> > > Changes since the initial RFC:
> >> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> >> >> > > set_memory_wt(), etc. (Ingo)
> >> >> >
> >> >> > Sorry I should have said earlier, but I think the term "wt" is
> >> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> >> >> > semantics, not WT semantics.
> >> >>
> >> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> >> >> non-temporal stores and explicit cache flushing.
> >> >>
> >> >> > How about using "nocache" as it's been
> >> >> > used in __copy_user_nocache()?
> >> >>
> >> >> The difference in my mind is that the "_nocache" suffix indicates
> >> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> >> >> strictly arranges for caches not to contain dirty data upon
> >> >> completion of the routine. For example, non-temporal stores on older
> >> >> x86 cpus could potentially leave dirty data in the cache, so
> >> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> >> >
> >> > I see.  I agree that its behavior is different from the existing one
> >> > with "_nocache".   That said, I think "wt" or "write-through" generally
> >> > means that writes allocate cachelines and keep them clean by writing to
> >> > memory.  So, subsequent reads to the destination will hit the
> >> > cachelines.  This is not the case with this interface.
> >>
> >> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> >> comes along and is surprised that the cache is not warm for reads
> >> after memcpy_wt(), at which point we can ask "why not just use plain
> >> memcpy then?", or set the page-attributes to WT.
> >
> > Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that
> > no cache line is left around afterwards (dirty or clean)?
> 
> Yes, I think "flush" belongs in the name, and to make it easily
> grep-able separate from _nocache we can call it _flushcache? An
> efficient implementation will use _nocache / non-temporal stores
> internally, but external consumers just care about the state of the
> cache after the call.

_flushcache() works for me too.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-07  8:57                     ` Ingo Molnar
  (?)
@ 2017-05-08  3:01                       ` Kani, Toshimitsu
  -1 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-08  3:01 UTC (permalink / raw)
  To: Ingo Molnar, Dan Williams
  Cc: jack, mawilcox, x86, linux-kernel, linux-block, linux-nvdimm,
	mingo, viro, hpa, linux-fsdevel, tglx, hch

> * Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > > * Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > >> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
> > >> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> > >> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu
> <toshi.kani@hpe.com>
> > >> >> wrote:
> > >> >  :
> > >> >> > > ---
> > >> >> > > Changes since the initial RFC:
> > >> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> > >> >> > > set_memory_wt(), etc. (Ingo)
> > >> >> >
> > >> >> > Sorry I should have said earlier, but I think the term "wt" is
> > >> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> > >> >> > semantics, not WT semantics.
> > >> >>
> > >> >> The non-temporal stores do, but memcpy_wt() is using a combination
> of
> > >> >> non-temporal stores and explicit cache flushing.
> > >> >>
> > >> >> > How about using "nocache" as it's been
> > >> >> > used in __copy_user_nocache()?
> > >> >>
> > >> >> The difference in my mind is that the "_nocache" suffix indicates
> > >> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> > >> >> strictly arranges for caches not to contain dirty data upon
> > >> >> completion of the routine. For example, non-temporal stores on older
> > >> >> x86 cpus could potentially leave dirty data in the cache, so
> > >> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> > >> >
> > >> > I see.  I agree that its behavior is different from the existing one
> > >> > with "_nocache".   That said, I think "wt" or "write-through" generally
> > >> > means that writes allocate cachelines and keep them clean by writing
> to
> > >> > memory.  So, subsequent reads to the destination will hit the
> > >> > cachelines.  This is not the case with this interface.
> > >>
> > >> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> > >> comes along and is surprised that the cache is not warm for reads
> > >> after memcpy_wt(), at which point we can ask "why not just use plain
> > >> memcpy then?", or set the page-attributes to WT.
> > >
> > > Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal
> and that
> > > no cache line is left around afterwards (dirty or clean)?
> >
> > Yes, I think "flush" belongs in the name, and to make it easily
> > grep-able separate from _nocache we can call it _flushcache? An
> > efficient implementation will use _nocache / non-temporal stores
> > internally, but external consumers just care about the state of the
> > cache after the call.
> 
> _flushcache() works for me too.
> 

Works for me too.
Thanks,
-Toshi

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-08  3:01                       ` Kani, Toshimitsu
  0 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-08  3:01 UTC (permalink / raw)
  To: Ingo Molnar, Dan Williams
  Cc: linux-kernel, linux-block, jmoyer, tglx, hch, viro, x86,
	mawilcox, hpa, linux-nvdimm, mingo, linux-fsdevel, ross.zwisler,
	jack

> * Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > > * Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > >> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
> > >> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> > >> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu
> <toshi.kani@hpe.com>
> > >> >> wrote:
> > >> >  :
> > >> >> > > ---
> > >> >> > > Changes since the initial RFC:
> > >> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> > >> >> > > set_memory_wt(), etc. (Ingo)
> > >> >> >
> > >> >> > Sorry I should have said earlier, but I think the term "wt" is
> > >> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> > >> >> > semantics, not WT semantics.
> > >> >>
> > >> >> The non-temporal stores do, but memcpy_wt() is using a combination
> of
> > >> >> non-temporal stores and explicit cache flushing.
> > >> >>
> > >> >> > How about using "nocache" as it's been
> > >> >> > used in __copy_user_nocache()?
> > >> >>
> > >> >> The difference in my mind is that the "_nocache" suffix indicates
> > >> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> > >> >> strictly arranges for caches not to contain dirty data upon
> > >> >> completion of the routine. For example, non-temporal stores on older
> > >> >> x86 cpus could potentially leave dirty data in the cache, so
> > >> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> > >> >
> > >> > I see.  I agree that its behavior is different from the existing one
> > >> > with "_nocache".   That said, I think "wt" or "write-through" generally
> > >> > means that writes allocate cachelines and keep them clean by writing
> to
> > >> > memory.  So, subsequent reads to the destination will hit the
> > >> > cachelines.  This is not the case with this interface.
> > >>
> > >> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> > >> comes along and is surprised that the cache is not warm for reads
> > >> after memcpy_wt(), at which point we can ask "why not just use plain
> > >> memcpy then?", or set the page-attributes to WT.
> > >
> > > Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal
> and that
> > > no cache line is left around afterwards (dirty or clean)?
> >
> > Yes, I think "flush" belongs in the name, and to make it easily
> > grep-able separate from _nocache we can call it _flushcache? An
> > efficient implementation will use _nocache / non-temporal stores
> > internally, but external consumers just care about the state of the
> > cache after the call.
> 
> _flushcache() works for me too.
> 

Works for me too.
Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-08  3:01                       ` Kani, Toshimitsu
  0 siblings, 0 replies; 53+ messages in thread
From: Kani, Toshimitsu @ 2017-05-08  3:01 UTC (permalink / raw)
  To: Ingo Molnar, Dan Williams
  Cc: linux-kernel, linux-block, jmoyer, tglx, hch, viro, x86,
	mawilcox, hpa, linux-nvdimm@lists.01.org, mingo, linux-fsdevel,
	ross.zwisler, jack

> * Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > > * Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > >> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
> > >> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> > >> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu
> <toshi.kani@hpe.com>
> > >> >> wrote:
> > >> >  :
> > >> >> > > ---
> > >> >> > > Changes since the initial RFC:
> > >> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> > >> >> > > set_memory_wt(), etc. (Ingo)
> > >> >> >
> > >> >> > Sorry I should have said earlier, but I think the term "wt" is
> > >> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> > >> >> > semantics, not WT semantics.
> > >> >>
> > >> >> The non-temporal stores do, but memcpy_wt() is using a combination
> of
> > >> >> non-temporal stores and explicit cache flushing.
> > >> >>
> > >> >> > How about using "nocache" as it's been
> > >> >> > used in __copy_user_nocache()?
> > >> >>
> > >> >> The difference in my mind is that the "_nocache" suffix indicates
> > >> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> > >> >> strictly arranges for caches not to contain dirty data upon
> > >> >> completion of the routine. For example, non-temporal stores on older
> > >> >> x86 cpus could potentially leave dirty data in the cache, so
> > >> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> > >> >
> > >> > I see.  I agree that its behavior is different from the existing one
> > >> > with "_nocache".   That said, I think "wt" or "write-through" generally
> > >> > means that writes allocate cachelines and keep them clean by writing
> to
> > >> > memory.  So, subsequent reads to the destination will hit the
> > >> > cachelines.  This is not the case with this interface.
> > >>
> > >> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> > >> comes along and is surprised that the cache is not warm for reads
> > >> after memcpy_wt(), at which point we can ask "why not just use plain
> > >> memcpy then?", or set the page-attributes to WT.
> > >
> > > Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal
> and that
> > > no cache line is left around afterwards (dirty or clean)?
> >
> > Yes, I think "flush" belongs in the name, and to make it easily
> > grep-able separate from _nocache we can call it _flushcache? An
> > efficient implementation will use _nocache / non-temporal stores
> > internally, but external consumers just care about the state of the
> > cache after the call.
> 
> _flushcache() works for me too.
> 

Works for me too.
Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-04-28 19:39       ` Dan Williams
  (?)
@ 2017-05-08 20:32         ` Ross Zwisler
  -1 siblings, 0 replies; 53+ messages in thread
From: Ross Zwisler @ 2017-05-08 20:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Matthew Wilcox, x86, linux-kernel, linux-block,
	linux-nvdimm, Ingo Molnar, viro, H. Peter Anvin, linux-fsdevel,
	Thomas Gleixner, hch

On Fri, Apr 28, 2017 at 12:39:12PM -0700, Dan Williams wrote:
> The pmem driver has a need to transfer data with a persistent memory
> destination and be able to rely on the fact that the destination writes
> are not cached. It is sufficient for the writes to be flushed to a
> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
> userspace to call fsync() to ensure data-writes have reached a
> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
> REQ_FLUSH to the pmem driver which will turn around and fence previous
> writes with an "sfence".
> 
> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
> that guarantee that the destination buffer is not dirty in the cpu cache
> on completion. The new copy_from_iter_wt and sub-routines will be used
> to replace the "pmem api" (include/linux/pmem.h +
> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
> otherwise.
> 
> This is meant to satisfy the concern from Linus that if a driver wants
> to do something beyond the normal nocache semantics it should be
> something private to that driver [1], and Al's concern that anything
> uaccess related belongs with the rest of the uaccess code [2].
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
> 
> Cc: <x86@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
<>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index c5504b9a472e..07ded30c7e89 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -171,6 +171,10 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
>  extern long __copy_user_nocache(void *dst, const void __user *src,
>  				unsigned size, int zerorest);
>  
> +extern long __copy_user_wt(void *dst, const void __user *src, unsigned size);
> +extern void memcpy_page_wt(char *to, struct page *page, size_t offset,
> +			   size_t len);
> +
>  static inline int
>  __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
>  				  unsigned size)
> @@ -179,6 +183,13 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
>  	return __copy_user_nocache(dst, src, size, 0);
>  }
>  
> +static inline int
> +__copy_from_user_inatomic_wt(void *dst, const void __user *src, unsigned size)
> +{
> +	kasan_check_write(dst, size);
> +	return __copy_user_wt(dst, src, size);
> +}
> +
>  unsigned long
>  copy_user_handle_tail(char *to, char *from, unsigned len);
>  
> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index 3b7c40a2e3e1..0aeff66a022f 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -7,6 +7,7 @@
>   */
>  #include <linux/export.h>
>  #include <linux/uaccess.h>
> +#include <linux/highmem.h>
>  
>  /*
>   * Zero Userspace
> @@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
>  	clac();
>  	return len;
>  }
> +
> +#ifdef CONFIG_ARCH_HAS_UACCESS_WT
> +/**
> + * clean_cache_range - write back a cache range with CLWB
> + * @vaddr:	virtual start address
> + * @size:	number of bytes to write back
> + *
> + * Write back a cache range using the CLWB (cache line write back)
> + * instruction. Note that @size is internally rounded up to be cache
> + * line size aligned.
> + */
> +static void clean_cache_range(void *addr, size_t size)
> +{
> +	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
> +	unsigned long clflush_mask = x86_clflush_size - 1;
> +	void *vend = addr + size;
> +	void *p;
> +
> +	for (p = (void *)((unsigned long)addr & ~clflush_mask);
> +	     p < vend; p += x86_clflush_size)
> +		clwb(p);
> +}
> +
> +long __copy_user_wt(void *dst, const void __user *src, unsigned size)
> +{
> +	unsigned long flushed, dest = (unsigned long) dst;
> +	long rc = __copy_user_nocache(dst, src, size, 0);
> +
> +	/*
> +	 * __copy_user_nocache() uses non-temporal stores for the bulk
> +	 * of the transfer, but we need to manually flush if the
> +	 * transfer is unaligned. A cached memory copy is used when
> +	 * destination or size is not naturally aligned. That is:
> +	 *   - Require 8-byte alignment when size is 8 bytes or larger.
> +	 *   - Require 4-byte alignment when size is 4 bytes.
> +	 */
> +	if (size < 8) {
> +		if (!IS_ALIGNED(dest, 4) || size != 4)
> +			clean_cache_range(dst, 1);
> +	} else {
> +		if (!IS_ALIGNED(dest, 8)) {
> +			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
> +			clean_cache_range(dst, 1);
> +		}
> +
> +		flushed = dest - (unsigned long) dst;
> +		if (size > flushed && !IS_ALIGNED(size - flushed, 8))
> +			clean_cache_range(dst + size - 1, 1);
> +	}
> +
> +	return rc;
> +}
> +
> +void memcpy_wt(void *_dst, const void *_src, size_t size)
> +{
> +	unsigned long dest = (unsigned long) _dst;
> +	unsigned long source = (unsigned long) _src;
> +
> +	/* cache copy and flush to align dest */
> +	if (!IS_ALIGNED(dest, 8)) {
> +		unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
> +
> +		memcpy((void *) dest, (void *) source, len);
> +		clean_cache_range((void *) dest, len);
> +		dest += len;
> +		source += len;
> +		size -= len;
> +		if (!size)
> +			return;
> +	}
> +
> +	/* 4x8 movnti loop */
> +	while (size >= 32) {
> +		asm("movq    (%0), %%r8\n"
> +		    "movq   8(%0), %%r9\n"
> +		    "movq  16(%0), %%r10\n"
> +		    "movq  24(%0), %%r11\n"
> +		    "movnti  %%r8,   (%1)\n"
> +		    "movnti  %%r9,  8(%1)\n"
> +		    "movnti %%r10, 16(%1)\n"
> +		    "movnti %%r11, 24(%1)\n"
> +		    :: "r" (source), "r" (dest)
> +		    : "memory", "r8", "r9", "r10", "r11");
> +		dest += 32;
> +		source += 32;
> +		size -= 32;
> +	}
> +
> +	/* 1x8 movnti loop */
> +	while (size >= 8) {
> +		asm("movq    (%0), %%r8\n"
> +		    "movnti  %%r8,   (%1)\n"
> +		    :: "r" (source), "r" (dest)
> +		    : "memory", "r8");
> +		dest += 8;
> +		source += 8;
> +		size -= 8;
> +	}
> +
> +	/* 1x4 movnti loop */
> +	while (size >= 4) {
> +		asm("movl    (%0), %%r8d\n"
> +		    "movnti  %%r8d,   (%1)\n"
> +		    :: "r" (source), "r" (dest)
> +		    : "memory", "r8");
> +		dest += 4;
> +		source += 4;
> +		size -= 4;
> +	}
> +
> +	/* cache copy for remaining bytes */
> +	if (size) {
> +		memcpy((void *) dest, (void *) source, size);
> +		clean_cache_range((void *) dest, size);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(memcpy_wt);

I took a pretty hard look at the changes in arch/x86/lib/usercopy_64.c, and
they look correct to me.  The inline assembly for non-temporal copies mixed
with C for loop control is IMHO much easier to follow than the pure assembly
of __copy_user_nocache().

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-08 20:32         ` Ross Zwisler
  0 siblings, 0 replies; 53+ messages in thread
From: Ross Zwisler @ 2017-05-08 20:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: viro, Jan Kara, Matthew Wilcox, x86, linux-kernel, hch,
	linux-block, linux-nvdimm, jmoyer, Ingo Molnar, H. Peter Anvin,
	linux-fsdevel, Thomas Gleixner, ross.zwisler

On Fri, Apr 28, 2017 at 12:39:12PM -0700, Dan Williams wrote:
> The pmem driver has a need to transfer data with a persistent memory
> destination and be able to rely on the fact that the destination writes
> are not cached. It is sufficient for the writes to be flushed to a
> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
> userspace to call fsync() to ensure data-writes have reached a
> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
> REQ_FLUSH to the pmem driver which will turn around and fence previous
> writes with an "sfence".
> 
> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
> that guarantee that the destination buffer is not dirty in the cpu cache
> on completion. The new copy_from_iter_wt and sub-routines will be used
> to replace the "pmem api" (include/linux/pmem.h +
> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
> otherwise.
> 
> This is meant to satisfy the concern from Linus that if a driver wants
> to do something beyond the normal nocache semantics it should be
> something private to that driver [1], and Al's concern that anything
> uaccess related belongs with the rest of the uaccess code [2].
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
> 
> Cc: <x86@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
<>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index c5504b9a472e..07ded30c7e89 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -171,6 +171,10 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
>  extern long __copy_user_nocache(void *dst, const void __user *src,
>  				unsigned size, int zerorest);
>  
> +extern long __copy_user_wt(void *dst, const void __user *src, unsigned size);
> +extern void memcpy_page_wt(char *to, struct page *page, size_t offset,
> +			   size_t len);
> +
>  static inline int
>  __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
>  				  unsigned size)
> @@ -179,6 +183,13 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
>  	return __copy_user_nocache(dst, src, size, 0);
>  }
>  
> +static inline int
> +__copy_from_user_inatomic_wt(void *dst, const void __user *src, unsigned size)
> +{
> +	kasan_check_write(dst, size);
> +	return __copy_user_wt(dst, src, size);
> +}
> +
>  unsigned long
>  copy_user_handle_tail(char *to, char *from, unsigned len);
>  
> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index 3b7c40a2e3e1..0aeff66a022f 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -7,6 +7,7 @@
>   */
>  #include <linux/export.h>
>  #include <linux/uaccess.h>
> +#include <linux/highmem.h>
>  
>  /*
>   * Zero Userspace
> @@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
>  	clac();
>  	return len;
>  }
> +
> +#ifdef CONFIG_ARCH_HAS_UACCESS_WT
> +/**
> + * clean_cache_range - write back a cache range with CLWB
> + * @vaddr:	virtual start address
> + * @size:	number of bytes to write back
> + *
> + * Write back a cache range using the CLWB (cache line write back)
> + * instruction. Note that @size is internally rounded up to be cache
> + * line size aligned.
> + */
> +static void clean_cache_range(void *addr, size_t size)
> +{
> +	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
> +	unsigned long clflush_mask = x86_clflush_size - 1;
> +	void *vend = addr + size;
> +	void *p;
> +
> +	for (p = (void *)((unsigned long)addr & ~clflush_mask);
> +	     p < vend; p += x86_clflush_size)
> +		clwb(p);
> +}
> +
> +long __copy_user_wt(void *dst, const void __user *src, unsigned size)
> +{
> +	unsigned long flushed, dest = (unsigned long) dst;
> +	long rc = __copy_user_nocache(dst, src, size, 0);
> +
> +	/*
> +	 * __copy_user_nocache() uses non-temporal stores for the bulk
> +	 * of the transfer, but we need to manually flush if the
> +	 * transfer is unaligned. A cached memory copy is used when
> +	 * destination or size is not naturally aligned. That is:
> +	 *   - Require 8-byte alignment when size is 8 bytes or larger.
> +	 *   - Require 4-byte alignment when size is 4 bytes.
> +	 */
> +	if (size < 8) {
> +		if (!IS_ALIGNED(dest, 4) || size != 4)
> +			clean_cache_range(dst, 1);
> +	} else {
> +		if (!IS_ALIGNED(dest, 8)) {
> +			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
> +			clean_cache_range(dst, 1);
> +		}
> +
> +		flushed = dest - (unsigned long) dst;
> +		if (size > flushed && !IS_ALIGNED(size - flushed, 8))
> +			clean_cache_range(dst + size - 1, 1);
> +	}
> +
> +	return rc;
> +}
> +
> +void memcpy_wt(void *_dst, const void *_src, size_t size)
> +{
> +	unsigned long dest = (unsigned long) _dst;
> +	unsigned long source = (unsigned long) _src;
> +
> +	/* cache copy and flush to align dest */
> +	if (!IS_ALIGNED(dest, 8)) {
> +		unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
> +
> +		memcpy((void *) dest, (void *) source, len);
> +		clean_cache_range((void *) dest, len);
> +		dest += len;
> +		source += len;
> +		size -= len;
> +		if (!size)
> +			return;
> +	}
> +
> +	/* 4x8 movnti loop */
> +	while (size >= 32) {
> +		asm("movq    (%0), %%r8\n"
> +		    "movq   8(%0), %%r9\n"
> +		    "movq  16(%0), %%r10\n"
> +		    "movq  24(%0), %%r11\n"
> +		    "movnti  %%r8,   (%1)\n"
> +		    "movnti  %%r9,  8(%1)\n"
> +		    "movnti %%r10, 16(%1)\n"
> +		    "movnti %%r11, 24(%1)\n"
> +		    :: "r" (source), "r" (dest)
> +		    : "memory", "r8", "r9", "r10", "r11");
> +		dest += 32;
> +		source += 32;
> +		size -= 32;
> +	}
> +
> +	/* 1x8 movnti loop */
> +	while (size >= 8) {
> +		asm("movq    (%0), %%r8\n"
> +		    "movnti  %%r8,   (%1)\n"
> +		    :: "r" (source), "r" (dest)
> +		    : "memory", "r8");
> +		dest += 8;
> +		source += 8;
> +		size -= 8;
> +	}
> +
> +	/* 1x4 movnti loop */
> +	while (size >= 4) {
> +		asm("movl    (%0), %%r8d\n"
> +		    "movnti  %%r8d,   (%1)\n"
> +		    :: "r" (source), "r" (dest)
> +		    : "memory", "r8");
> +		dest += 4;
> +		source += 4;
> +		size -= 4;
> +	}
> +
> +	/* cache copy for remaining bytes */
> +	if (size) {
> +		memcpy((void *) dest, (void *) source, size);
> +		clean_cache_range((void *) dest, size);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(memcpy_wt);

I took a pretty hard look at the changes in arch/x86/lib/usercopy_64.c, and
they look correct to me.  The inline assembly for non-temporal copies mixed
with C for loop control is IMHO much easier to follow than the pure assembly
of __copy_user_nocache().

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-08 20:32         ` Ross Zwisler
  0 siblings, 0 replies; 53+ messages in thread
From: Ross Zwisler @ 2017-05-08 20:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: viro, Jan Kara, Matthew Wilcox, x86, linux-kernel, hch,
	linux-block, linux-nvdimm, jmoyer, Ingo Molnar, H. Peter Anvin,
	linux-fsdevel, Thomas Gleixner, ross.zwisler

On Fri, Apr 28, 2017 at 12:39:12PM -0700, Dan Williams wrote:
> The pmem driver has a need to transfer data with a persistent memory
> destination and be able to rely on the fact that the destination writes
> are not cached. It is sufficient for the writes to be flushed to a
> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
> userspace to call fsync() to ensure data-writes have reached a
> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
> REQ_FLUSH to the pmem driver which will turn around and fence previous
> writes with an "sfence".
> 
> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
> that guarantee that the destination buffer is not dirty in the cpu cache
> on completion. The new copy_from_iter_wt and sub-routines will be used
> to replace the "pmem api" (include/linux/pmem.h +
> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
> otherwise.
> 
> This is meant to satisfy the concern from Linus that if a driver wants
> to do something beyond the normal nocache semantics it should be
> something private to that driver [1], and Al's concern that anything
> uaccess related belongs with the rest of the uaccess code [2].
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
> 
> Cc: <x86@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
<>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index c5504b9a472e..07ded30c7e89 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -171,6 +171,10 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
>  extern long __copy_user_nocache(void *dst, const void __user *src,
>  				unsigned size, int zerorest);
>  
> +extern long __copy_user_wt(void *dst, const void __user *src, unsigned size);
> +extern void memcpy_page_wt(char *to, struct page *page, size_t offset,
> +			   size_t len);
> +
>  static inline int
>  __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
>  				  unsigned size)
> @@ -179,6 +183,13 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
>  	return __copy_user_nocache(dst, src, size, 0);
>  }
>  
> +static inline int
> +__copy_from_user_inatomic_wt(void *dst, const void __user *src, unsigned size)
> +{
> +	kasan_check_write(dst, size);
> +	return __copy_user_wt(dst, src, size);
> +}
> +
>  unsigned long
>  copy_user_handle_tail(char *to, char *from, unsigned len);
>  
> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index 3b7c40a2e3e1..0aeff66a022f 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -7,6 +7,7 @@
>   */
>  #include <linux/export.h>
>  #include <linux/uaccess.h>
> +#include <linux/highmem.h>
>  
>  /*
>   * Zero Userspace
> @@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
>  	clac();
>  	return len;
>  }
> +
> +#ifdef CONFIG_ARCH_HAS_UACCESS_WT
> +/**
> + * clean_cache_range - write back a cache range with CLWB
> + * @vaddr:	virtual start address
> + * @size:	number of bytes to write back
> + *
> + * Write back a cache range using the CLWB (cache line write back)
> + * instruction. Note that @size is internally rounded up to be cache
> + * line size aligned.
> + */
> +static void clean_cache_range(void *addr, size_t size)
> +{
> +	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
> +	unsigned long clflush_mask = x86_clflush_size - 1;
> +	void *vend = addr + size;
> +	void *p;
> +
> +	for (p = (void *)((unsigned long)addr & ~clflush_mask);
> +	     p < vend; p += x86_clflush_size)
> +		clwb(p);
> +}
> +
> +long __copy_user_wt(void *dst, const void __user *src, unsigned size)
> +{
> +	unsigned long flushed, dest = (unsigned long) dst;
> +	long rc = __copy_user_nocache(dst, src, size, 0);
> +
> +	/*
> +	 * __copy_user_nocache() uses non-temporal stores for the bulk
> +	 * of the transfer, but we need to manually flush if the
> +	 * transfer is unaligned. A cached memory copy is used when
> +	 * destination or size is not naturally aligned. That is:
> +	 *   - Require 8-byte alignment when size is 8 bytes or larger.
> +	 *   - Require 4-byte alignment when size is 4 bytes.
> +	 */
> +	if (size < 8) {
> +		if (!IS_ALIGNED(dest, 4) || size != 4)
> +			clean_cache_range(dst, 1);
> +	} else {
> +		if (!IS_ALIGNED(dest, 8)) {
> +			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
> +			clean_cache_range(dst, 1);
> +		}
> +
> +		flushed = dest - (unsigned long) dst;
> +		if (size > flushed && !IS_ALIGNED(size - flushed, 8))
> +			clean_cache_range(dst + size - 1, 1);
> +	}
> +
> +	return rc;
> +}
> +
> +void memcpy_wt(void *_dst, const void *_src, size_t size)
> +{
> +	unsigned long dest = (unsigned long) _dst;
> +	unsigned long source = (unsigned long) _src;
> +
> +	/* cache copy and flush to align dest */
> +	if (!IS_ALIGNED(dest, 8)) {
> +		unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
> +
> +		memcpy((void *) dest, (void *) source, len);
> +		clean_cache_range((void *) dest, len);
> +		dest += len;
> +		source += len;
> +		size -= len;
> +		if (!size)
> +			return;
> +	}
> +
> +	/* 4x8 movnti loop */
> +	while (size >= 32) {
> +		asm("movq    (%0), %%r8\n"
> +		    "movq   8(%0), %%r9\n"
> +		    "movq  16(%0), %%r10\n"
> +		    "movq  24(%0), %%r11\n"
> +		    "movnti  %%r8,   (%1)\n"
> +		    "movnti  %%r9,  8(%1)\n"
> +		    "movnti %%r10, 16(%1)\n"
> +		    "movnti %%r11, 24(%1)\n"
> +		    :: "r" (source), "r" (dest)
> +		    : "memory", "r8", "r9", "r10", "r11");
> +		dest += 32;
> +		source += 32;
> +		size -= 32;
> +	}
> +
> +	/* 1x8 movnti loop */
> +	while (size >= 8) {
> +		asm("movq    (%0), %%r8\n"
> +		    "movnti  %%r8,   (%1)\n"
> +		    :: "r" (source), "r" (dest)
> +		    : "memory", "r8");
> +		dest += 8;
> +		source += 8;
> +		size -= 8;
> +	}
> +
> +	/* 1x4 movnti loop */
> +	while (size >= 4) {
> +		asm("movl    (%0), %%r8d\n"
> +		    "movnti  %%r8d,   (%1)\n"
> +		    :: "r" (source), "r" (dest)
> +		    : "memory", "r8");
> +		dest += 4;
> +		source += 4;
> +		size -= 4;
> +	}
> +
> +	/* cache copy for remaining bytes */
> +	if (size) {
> +		memcpy((void *) dest, (void *) source, size);
> +		clean_cache_range((void *) dest, size);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(memcpy_wt);

I took a pretty hard look at the changes in arch/x86/lib/usercopy_64.c, and
they look correct to me.  The inline assembly for non-temporal copies mixed
with C for loop control is IMHO much easier to follow than the pure assembly
of __copy_user_nocache().

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-08 20:32         ` Ross Zwisler
  (?)
@ 2017-05-08 20:40           ` Dan Williams
  -1 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-08 20:40 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, Al Viro, Jan Kara, Matthew Wilcox,
	X86 ML, linux-kernel, Christoph Hellwig, linux-block,
	linux-nvdimm, jmoyer, Ingo Molnar, H. Peter Anvin, linux-fsdevel,
	Thomas Gleixner

On Mon, May 8, 2017 at 1:32 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Apr 28, 2017 at 12:39:12PM -0700, Dan Williams wrote:
>> The pmem driver has a need to transfer data with a persistent memory
>> destination and be able to rely on the fact that the destination writes
>> are not cached. It is sufficient for the writes to be flushed to a
>> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
>> userspace to call fsync() to ensure data-writes have reached a
>> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
>> REQ_FLUSH to the pmem driver which will turn around and fence previous
>> writes with an "sfence".
>>
>> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
>> that guarantee that the destination buffer is not dirty in the cpu cache
>> on completion. The new copy_from_iter_wt and sub-routines will be used
>> to replace the "pmem api" (include/linux/pmem.h +
>> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
>> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
>> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
>> otherwise.
>>
>> This is meant to satisfy the concern from Linus that if a driver wants
>> to do something beyond the normal nocache semantics it should be
>> something private to that driver [1], and Al's concern that anything
>> uaccess related belongs with the rest of the uaccess code [2].
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
>>
>> Cc: <x86@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[..]
> I took a pretty hard look at the changes in arch/x86/lib/usercopy_64.c, and
> they look correct to me.  The inline assembly for non-temporal copies mixed
> with C for loop control is IMHO much easier to follow than the pure assembly
> of __copy_user_nocache().
>
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Thanks Ross, I appreciate it.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-08 20:40           ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-08 20:40 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, Al Viro, Jan Kara, Matthew Wilcox,
	X86 ML, linux-kernel, Christoph Hellwig, linux-block,
	linux-nvdimm, jmoyer, Ingo Molnar, H. Peter Anvin, linux-fsdevel,
	Thomas Gleixner

On Mon, May 8, 2017 at 1:32 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Apr 28, 2017 at 12:39:12PM -0700, Dan Williams wrote:
>> The pmem driver has a need to transfer data with a persistent memory
>> destination and be able to rely on the fact that the destination writes
>> are not cached. It is sufficient for the writes to be flushed to a
>> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
>> userspace to call fsync() to ensure data-writes have reached a
>> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
>> REQ_FLUSH to the pmem driver which will turn around and fence previous
>> writes with an "sfence".
>>
>> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
>> that guarantee that the destination buffer is not dirty in the cpu cache
>> on completion. The new copy_from_iter_wt and sub-routines will be used
>> to replace the "pmem api" (include/linux/pmem.h +
>> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
>> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
>> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
>> otherwise.
>>
>> This is meant to satisfy the concern from Linus that if a driver wants
>> to do something beyond the normal nocache semantics it should be
>> something private to that driver [1], and Al's concern that anything
>> uaccess related belongs with the rest of the uaccess code [2].
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
>>
>> Cc: <x86@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[..]
> I took a pretty hard look at the changes in arch/x86/lib/usercopy_64.c, and
> they look correct to me.  The inline assembly for non-temporal copies mixed
> with C for loop control is IMHO much easier to follow than the pure assembly
> of __copy_user_nocache().
>
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Thanks Ross, I appreciate it.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
@ 2017-05-08 20:40           ` Dan Williams
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2017-05-08 20:40 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, Al Viro, Jan Kara, Matthew Wilcox,
	X86 ML, linux-kernel, Christoph Hellwig, linux-block,
	linux-nvdimm@lists.01.org, jmoyer, Ingo Molnar, H. Peter Anvin,
	linux-fsdevel, Thomas Gleixner

On Mon, May 8, 2017 at 1:32 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Apr 28, 2017 at 12:39:12PM -0700, Dan Williams wrote:
>> The pmem driver has a need to transfer data with a persistent memory
>> destination and be able to rely on the fact that the destination writes
>> are not cached. It is sufficient for the writes to be flushed to a
>> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
>> userspace to call fsync() to ensure data-writes have reached a
>> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
>> REQ_FLUSH to the pmem driver which will turn around and fence previous
>> writes with an "sfence".
>>
>> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
>> that guarantee that the destination buffer is not dirty in the cpu cache
>> on completion. The new copy_from_iter_wt and sub-routines will be used
>> to replace the "pmem api" (include/linux/pmem.h +
>> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
>> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
>> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
>> otherwise.
>>
>> This is meant to satisfy the concern from Linus that if a driver wants
>> to do something beyond the normal nocache semantics it should be
>> something private to that driver [1], and Al's concern that anything
>> uaccess related belongs with the rest of the uaccess code [2].
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
>>
>> Cc: <x86@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[..]
> I took a pretty hard look at the changes in arch/x86/lib/usercopy_64.c, and
> they look correct to me.  The inline assembly for non-temporal copies mixed
> with C for loop control is IMHO much easier to follow than the pure assembly
> of __copy_user_nocache().
>
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Thanks Ross, I appreciate it.

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2017-05-08 20:40 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-25  1:22 [NAK] copy_from_iter_ops() Al Viro
2017-04-25  2:35 ` Dan Williams
2017-04-26 21:56 ` [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem Dan Williams
2017-04-26 21:56   ` Dan Williams
2017-04-26 21:56   ` Dan Williams
2017-04-27  6:30   ` Ingo Molnar
2017-04-27  6:30     ` Ingo Molnar
2017-04-27  6:30     ` Ingo Molnar
2017-04-28 19:39     ` [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations Dan Williams
2017-04-28 19:39       ` Dan Williams
2017-04-28 19:39       ` Dan Williams
2017-05-05  6:54       ` Ingo Molnar
2017-05-05  6:54         ` Ingo Molnar
2017-05-05  6:54         ` Ingo Molnar
2017-05-05 14:12         ` Dan Williams
2017-05-05 14:12           ` Dan Williams
2017-05-05 14:12           ` Dan Williams
2017-05-05 20:39       ` Kani, Toshimitsu
2017-05-05 20:39         ` Kani, Toshimitsu
2017-05-05 20:39         ` Kani, Toshimitsu
2017-05-05 20:39         ` Kani, Toshimitsu
2017-05-05 22:25         ` Dan Williams
2017-05-05 22:25           ` Dan Williams
2017-05-05 22:25           ` Dan Williams
2017-05-05 22:44           ` Kani, Toshimitsu
2017-05-05 22:44             ` Kani, Toshimitsu
2017-05-05 22:44             ` Kani, Toshimitsu
2017-05-05 22:44             ` Kani, Toshimitsu
2017-05-06  2:15             ` Dan Williams
2017-05-06  2:15               ` Dan Williams
2017-05-06  2:15               ` Dan Williams
2017-05-06  3:17               ` Kani, Toshimitsu
2017-05-06  3:17                 ` Kani, Toshimitsu
2017-05-06  3:17                 ` Kani, Toshimitsu
2017-05-06  3:17                 ` Kani, Toshimitsu
2017-05-06  9:46               ` Ingo Molnar
2017-05-06  9:46                 ` Ingo Molnar
2017-05-06  9:46                 ` Ingo Molnar
2017-05-06 13:57                 ` Dan Williams
2017-05-06 13:57                   ` Dan Williams
2017-05-06 13:57                   ` Dan Williams
2017-05-07  8:57                   ` Ingo Molnar
2017-05-07  8:57                     ` Ingo Molnar
2017-05-07  8:57                     ` Ingo Molnar
2017-05-08  3:01                     ` Kani, Toshimitsu
2017-05-08  3:01                       ` Kani, Toshimitsu
2017-05-08  3:01                       ` Kani, Toshimitsu
2017-05-08 20:32       ` Ross Zwisler
2017-05-08 20:32         ` Ross Zwisler
2017-05-08 20:32         ` Ross Zwisler
2017-05-08 20:40         ` Dan Williams
2017-05-08 20:40           ` Dan Williams
2017-05-08 20:40           ` Dan Williams

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.