linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/3] use nocache copy in copy_from_iter_nocache()
@ 2016-10-26 15:50 Brian Boylston
  2016-10-26 15:50 ` [PATCH v2 1/3] introduce memcpy_nocache() Brian Boylston
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Brian Boylston @ 2016-10-26 15:50 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-kernel, toshi.kani, oliver.moreno, Brian Boylston,
	Ross Zwisler, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Al Viro, Dan Williams

Currently, copy_from_iter_nocache() uses "nocache" copies only for
iovecs; bvecs and kvecs use normal copies.  This requires
x86's arch_copy_from_iter_pmem() to issue flushes for bvecs and kvecs,
which has a negative impact on performance when splice()ing from a pipe
to a pmem-backed file on a DAX-mounted file system.

This patch set enables nocache copies in copy_from_iter_nocache() for
bvecs and kvecs for arches that support it (x86 initially).  This provides
a 2-3X improvement in splice() pipe-to-DAX-file throughput.

The first patch introduces memcpy_nocache(), which defaults to just
memcpy(), but for which an x86-specific implementation is provided.

For this patch, I sought to use a static inline function for x86, but
I could not find an obvious header file to put it in.
The build seemed to work when I put it in arch/x86/include/asm/uaccess.h,
but that didn't feel completely right.  I also tried
arch/x86/include/asm/pmem.h, but that doesn't feel right either and it
didn't build.  So, I offer it here in arch/x86/lib/misc.c for discussion.

The second patch updates copy_from_iter_nocache() to use the new
memcpy_nocache().

The third patch removes the flushes from x86's arch_copy_from_iter_pmem().

For testing, I ran fio with the posixaio, mmap, sync, psync, vsync, pvsync,
and splice engines, against both ext4 and xfs.  Only the splice engine
showed any change in performance.  For example, for xfs:

Unpatched 4.8:

Run status group 2 (all jobs):
  WRITE: io=37602MB, aggrb=641724KB/s, minb=641724KB/s, maxb=641724KB/s, mint=60001msec, maxt=60001msec

Run status group 3 (all jobs):
  WRITE: io=36244MB, aggrb=618553KB/s, minb=618553KB/s, maxb=618553KB/s, mint=60001msec, maxt=60001msec

With this patch set:

Run status group 2 (all jobs):
  WRITE: io=128055MB, aggrb=2134.3MB/s, minb=2134.3MB/s, maxb=2134.3MB/s, mint=60001msec, maxt=60001msec

Run status group 3 (all jobs):
  WRITE: io=122586MB, aggrb=2043.8MB/s, minb=2043.8MB/s, maxb=2043.8MB/s, mint=60001msec, maxt=60001msec

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Brian Boylston <brian.boylston@hpe.com>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: Oliver Moreno <oliver.moreno@hpe.com>

Changes in v2:
  - Split into multiple patches (Toshi Kani)
  - Introduce memcpy_nocache() (Al Viro)
  - Use nocache for kvecs as well

Brian Boylston (3):
  introduce memcpy_nocache()
  use a nocache copy for bvecs and kvecs in copy_from_iter_nocache()
  x86: remove unneeded flush in arch_copy_from_iter_pmem()

 arch/x86/include/asm/pmem.h      | 19 +------------------
 arch/x86/include/asm/string_32.h |  3 +++
 arch/x86/include/asm/string_64.h |  3 +++
 arch/x86/lib/misc.c              | 12 ++++++++++++
 include/linux/string.h           | 15 +++++++++++++++
 lib/iov_iter.c                   | 14 +++++++++++---
 6 files changed, 45 insertions(+), 21 deletions(-)

-- 
2.8.3

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 1/3] introduce memcpy_nocache()
  2016-10-26 15:50 [PATCH v2 0/3] use nocache copy in copy_from_iter_nocache() Brian Boylston
@ 2016-10-26 15:50 ` Brian Boylston
  2016-10-26 19:30   ` Thomas Gleixner
  2016-10-26 19:51   ` Boaz Harrosh
  2016-10-26 15:50 ` [PATCH v2 2/3] use a nocache copy for bvecs and kvecs in copy_from_iter_nocache() Brian Boylston
  2016-10-26 15:50 ` [PATCH v2 3/3] x86: remove unneeded flush in arch_copy_from_iter_pmem() Brian Boylston
  2 siblings, 2 replies; 26+ messages in thread
From: Brian Boylston @ 2016-10-26 15:50 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-kernel, toshi.kani, oliver.moreno, Brian Boylston,
	Ross Zwisler, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Al Viro, Dan Williams

Introduce memcpy_nocache() as a memcpy() that avoids the processor cache
if possible.  Without arch-specific support, this defaults to just
memcpy().  For now, include arch-specific support for x86.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Brian Boylston <brian.boylston@hpe.com>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: Oliver Moreno <oliver.moreno@hpe.com>
---
 arch/x86/include/asm/string_32.h |  3 +++
 arch/x86/include/asm/string_64.h |  3 +++
 arch/x86/lib/misc.c              | 12 ++++++++++++
 include/linux/string.h           | 15 +++++++++++++++
 4 files changed, 33 insertions(+)

diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 3d3e835..64f80c0 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
 
 #endif
 
+#define __HAVE_ARCH_MEMCPY_NOCACHE
+extern void *memcpy_nocache(void *dest, const void *src, size_t count);
+
 #define __HAVE_ARCH_MEMMOVE
 void *memmove(void *dest, const void *src, size_t n);
 
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 90dbbd9..a8fdd55 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t len);
 #define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len))
 #endif
 
+#define __HAVE_ARCH_MEMCPY_NOCACHE
+extern void *memcpy_nocache(void *dest, const void *src, size_t count);
+
 #define __HAVE_ARCH_MEMSET
 void *memset(void *s, int c, size_t n);
 void *__memset(void *s, int c, size_t n);
diff --git a/arch/x86/lib/misc.c b/arch/x86/lib/misc.c
index 76b373a..c993ab3 100644
--- a/arch/x86/lib/misc.c
+++ b/arch/x86/lib/misc.c
@@ -1,3 +1,6 @@
+#include <linux/export.h>
+#include <linux/uaccess.h>
+
 /*
  * Count the digits of @val including a possible sign.
  *
@@ -19,3 +22,12 @@ int num_digits(int val)
 	}
 	return d;
 }
+
+#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
+void *memcpy_nocache(void *dest, const void *src, size_t count)
+{
+	__copy_from_user_inatomic_nocache(dest, src, count);
+	return dest;
+}
+EXPORT_SYMBOL(memcpy_nocache);
+#endif
diff --git a/include/linux/string.h b/include/linux/string.h
index 26b6f6a..7f40c41 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -102,6 +102,21 @@ extern void * memset(void *,int,__kernel_size_t);
 #ifndef __HAVE_ARCH_MEMCPY
 extern void * memcpy(void *,const void *,__kernel_size_t);
 #endif
+
+#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
+/**
+ * memcpy_nocache - Copy one area of memory to another, avoiding the
+ * processor cache if possible
+ * @dest: Where to copy to
+ * @src: Where to copy from
+ * @count: The size of the area.
+ */
+static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
+{
+	return memcpy(dest, src, count);
+}
+#endif
+
 #ifndef __HAVE_ARCH_MEMMOVE
 extern void * memmove(void *,const void *,__kernel_size_t);
 #endif
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 2/3] use a nocache copy for bvecs and kvecs in copy_from_iter_nocache()
  2016-10-26 15:50 [PATCH v2 0/3] use nocache copy in copy_from_iter_nocache() Brian Boylston
  2016-10-26 15:50 ` [PATCH v2 1/3] introduce memcpy_nocache() Brian Boylston
@ 2016-10-26 15:50 ` Brian Boylston
  2016-10-27  4:46   ` Ross Zwisler
  2016-10-26 15:50 ` [PATCH v2 3/3] x86: remove unneeded flush in arch_copy_from_iter_pmem() Brian Boylston
  2 siblings, 1 reply; 26+ messages in thread
From: Brian Boylston @ 2016-10-26 15:50 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-kernel, toshi.kani, oliver.moreno, Brian Boylston,
	Ross Zwisler, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Al Viro, Dan Williams

Update copy_from_iter_nocache() to use memcpy_nocache()
for bvecs and kvecs.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Brian Boylston <brian.boylston@hpe.com>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: Oliver Moreno <oliver.moreno@hpe.com>
---
 lib/iov_iter.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 7e3138c..71e4531 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -342,6 +342,13 @@ static void memcpy_from_page(char *to, struct page *page, size_t offset, size_t
 	kunmap_atomic(from);
 }
 
+static void memcpy_from_page_nocache(char *to, struct page *page, size_t offset, size_t len)
+{
+	char *from = kmap_atomic(page);
+	memcpy_nocache(to, from + offset, len);
+	kunmap_atomic(from);
+}
+
 static void memcpy_to_page(struct page *page, size_t offset, const char *from, size_t len)
 {
 	char *to = kmap_atomic(page);
@@ -392,9 +399,10 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user_nocache((to += v.iov_len) - v.iov_len,
 					 v.iov_base, v.iov_len),
-		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
-				 v.bv_offset, v.bv_len),
-		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+		memcpy_from_page_nocache((to += v.bv_len) - v.bv_len,
+					 v.bv_page, v.bv_offset, v.bv_len),
+		memcpy_nocache((to += v.iov_len) - v.iov_len,
+			       v.iov_base, v.iov_len)
 	)
 
 	return bytes;
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 3/3] x86: remove unneeded flush in arch_copy_from_iter_pmem()
  2016-10-26 15:50 [PATCH v2 0/3] use nocache copy in copy_from_iter_nocache() Brian Boylston
  2016-10-26 15:50 ` [PATCH v2 1/3] introduce memcpy_nocache() Brian Boylston
  2016-10-26 15:50 ` [PATCH v2 2/3] use a nocache copy for bvecs and kvecs in copy_from_iter_nocache() Brian Boylston
@ 2016-10-26 15:50 ` Brian Boylston
  2016-10-26 19:57   ` Boaz Harrosh
  2 siblings, 1 reply; 26+ messages in thread
From: Brian Boylston @ 2016-10-26 15:50 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-kernel, toshi.kani, oliver.moreno, Brian Boylston,
	Ross Zwisler, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Al Viro, Dan Williams

copy_from_iter_nocache() now uses nocache copies for all types of iovecs
on x86, so the flush in arch_copy_from_iter_pmem() is no longer needed.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Brian Boylston <brian.boylston@hpe.com>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: Oliver Moreno <oliver.moreno@hpe.com>
---
 arch/x86/include/asm/pmem.h | 19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 643eba4..2fbf4ae 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -72,15 +72,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
 		clwb(p);
 }
 
-/*
- * copy_from_iter_nocache() on x86 only uses non-temporal stores for iovec
- * iterators, so for other types (bvec & kvec) we must do a cache write-back.
- */
-static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
-{
-	return iter_is_iovec(i) == false;
-}
-
 /**
  * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
  * @addr:	PMEM destination address
@@ -92,15 +83,7 @@ static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
 static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
 		struct iov_iter *i)
 {
-	size_t len;
-
-	/* TODO: skip the write-back by always using non-temporal stores */
-	len = copy_from_iter_nocache(addr, bytes, i);
-
-	if (__iter_needs_pmem_wb(i))
-		arch_wb_cache_pmem(addr, bytes);
-
-	return len;
+	return copy_from_iter_nocache(addr, bytes, i);
 }
 
 /**
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/3] introduce memcpy_nocache()
  2016-10-26 15:50 ` [PATCH v2 1/3] introduce memcpy_nocache() Brian Boylston
@ 2016-10-26 19:30   ` Thomas Gleixner
  2016-10-28  1:52     ` Boylston, Brian
  2016-10-26 19:51   ` Boaz Harrosh
  1 sibling, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2016-10-26 19:30 UTC (permalink / raw)
  To: Brian Boylston
  Cc: linux-nvdimm, linux-kernel, toshi.kani, oliver.moreno,
	Ross Zwisler, Ingo Molnar, H. Peter Anvin, x86, Al Viro,
	Dan Williams

On Wed, 26 Oct 2016, Brian Boylston wrote:
> --- a/arch/x86/include/asm/string_32.h
> +++ b/arch/x86/include/asm/string_32.h
> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)

> +#define __HAVE_ARCH_MEMCPY_NOCACHE
> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);

Can we please move that prototype to linux/string.h instead of having it in
both files and just define __HAVE_ARCH_MEMCPY_NOCACHE? And that define can
move into x86/include/asm/string.h. There is no point in duplicating all
that stuff.

> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t len);

> +#define __HAVE_ARCH_MEMCPY_NOCACHE
> +extern void *memcpy_nocache(void *dest, const void *src, size_t count)

> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE

You define __HAVE_ARCH_MEMCPY_NOCACHE for both 32 and 64 bit. What is
this #ifdef for ?

> +void *memcpy_nocache(void *dest, const void *src, size_t count)
> +{
> +	__copy_from_user_inatomic_nocache(dest, src, count);

You want to replace a memcpy with this and then use copy from user. Why?
That does not make any sense to me and even if it makes sense for whatever
reason then this wants to be documented and the src argument needs a type
cast to (void __user *)..

Further this uses the inatomic variant. Why? Does a memcpy replacement
suddenly require to be called in atomic context? Again, this makes no sense
and lacks any form of comment. The kernel doc below does not say anything
about that.

Neither provides the changelog any useful information which is
complementary to what the patch actually does.

> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
> +/**
> + * memcpy_nocache - Copy one area of memory to another, avoiding the
> + * processor cache if possible

I'm not sure whether kerneldoc is happy about that line break. Did you
build the docs?

Make the initial line shorter and explain the functionality in detail
below the arguments

> + * @dest: Where to copy to
> + * @src: Where to copy from
> + * @count: The size of the area.

Nit. Can you please make this in tabular form for readability sake?

 * @dest:	Where to copy to
 * @src:	Where to copy from
 * @count:	The size of the area.

> + */
> +static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
> +{
> +	return memcpy(dest, src, count);
> +}
> +#endif

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/3] introduce memcpy_nocache()
  2016-10-26 15:50 ` [PATCH v2 1/3] introduce memcpy_nocache() Brian Boylston
  2016-10-26 19:30   ` Thomas Gleixner
@ 2016-10-26 19:51   ` Boaz Harrosh
  2016-10-28  1:54     ` Boylston, Brian
  1 sibling, 1 reply; 26+ messages in thread
From: Boaz Harrosh @ 2016-10-26 19:51 UTC (permalink / raw)
  To: Brian Boylston, linux-nvdimm
  Cc: oliver.moreno, x86, linux-kernel, Ingo Molnar, Al Viro,
	H. Peter Anvin, Thomas Gleixner

On 10/26/2016 06:50 PM, Brian Boylston wrote:
> Introduce memcpy_nocache() as a memcpy() that avoids the processor cache
> if possible.  Without arch-specific support, this defaults to just
> memcpy().  For now, include arch-specific support for x86.
> 
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: <x86@kernel.org>
> Cc: Al Viro <viro@ZenIV.linux.org.uk>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Brian Boylston <brian.boylston@hpe.com>
> Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
> Reported-by: Oliver Moreno <oliver.moreno@hpe.com>
> ---
>  arch/x86/include/asm/string_32.h |  3 +++
>  arch/x86/include/asm/string_64.h |  3 +++
>  arch/x86/lib/misc.c              | 12 ++++++++++++
>  include/linux/string.h           | 15 +++++++++++++++
>  4 files changed, 33 insertions(+)
> 
> diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
> index 3d3e835..64f80c0 100644
> --- a/arch/x86/include/asm/string_32.h
> +++ b/arch/x86/include/asm/string_32.h
> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
>  
>  #endif
>  
> +#define __HAVE_ARCH_MEMCPY_NOCACHE
> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
> +
>  #define __HAVE_ARCH_MEMMOVE
>  void *memmove(void *dest, const void *src, size_t n);
>  
> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> index 90dbbd9..a8fdd55 100644
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t len);
>  #define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len))
>  #endif
>  
> +#define __HAVE_ARCH_MEMCPY_NOCACHE
> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
> +
>  #define __HAVE_ARCH_MEMSET
>  void *memset(void *s, int c, size_t n);
>  void *__memset(void *s, int c, size_t n);
> diff --git a/arch/x86/lib/misc.c b/arch/x86/lib/misc.c
> index 76b373a..c993ab3 100644
> --- a/arch/x86/lib/misc.c
> +++ b/arch/x86/lib/misc.c
> @@ -1,3 +1,6 @@
> +#include <linux/export.h>
> +#include <linux/uaccess.h>
> +
>  /*
>   * Count the digits of @val including a possible sign.
>   *
> @@ -19,3 +22,12 @@ int num_digits(int val)
>  	}
>  	return d;
>  }
> +
> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
> +void *memcpy_nocache(void *dest, const void *src, size_t count)
> +{
> +	__copy_from_user_inatomic_nocache(dest, src, count);
> +	return dest;
> +}
> +EXPORT_SYMBOL(memcpy_nocache);
> +#endif
> diff --git a/include/linux/string.h b/include/linux/string.h
> index 26b6f6a..7f40c41 100644
> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -102,6 +102,21 @@ extern void * memset(void *,int,__kernel_size_t);
>  #ifndef __HAVE_ARCH_MEMCPY
>  extern void * memcpy(void *,const void *,__kernel_size_t);
>  #endif
> +
> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
> +/**
> + * memcpy_nocache - Copy one area of memory to another, avoiding the
> + * processor cache if possible
> + * @dest: Where to copy to
> + * @src: Where to copy from
> + * @count: The size of the area.
> + */
> +static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
> +{
> +	return memcpy(dest, src, count);
> +}

What about memcpy_to_pmem() in linux/pmem.h it already has all the arch switches.

Feels bad to add yet just another arch switch over __copy_user_nocache

Just feels like too many things that do the same thing. Sigh

Boaz

> +#endif
> +
>  #ifndef __HAVE_ARCH_MEMMOVE
>  extern void * memmove(void *,const void *,__kernel_size_t);
>  #endif
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] x86: remove unneeded flush in arch_copy_from_iter_pmem()
  2016-10-26 15:50 ` [PATCH v2 3/3] x86: remove unneeded flush in arch_copy_from_iter_pmem() Brian Boylston
@ 2016-10-26 19:57   ` Boaz Harrosh
  2016-10-28  1:58     ` Boylston, Brian
  0 siblings, 1 reply; 26+ messages in thread
From: Boaz Harrosh @ 2016-10-26 19:57 UTC (permalink / raw)
  To: Brian Boylston, linux-nvdimm
  Cc: oliver.moreno, x86, linux-kernel, Ingo Molnar, Al Viro,
	H. Peter Anvin, Thomas Gleixner

On 10/26/2016 06:50 PM, Brian Boylston wrote:
> copy_from_iter_nocache() now uses nocache copies for all types of iovecs
> on x86, so the flush in arch_copy_from_iter_pmem() is no longer needed.
> 
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: <x86@kernel.org>
> Cc: Al Viro <viro@ZenIV.linux.org.uk>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Brian Boylston <brian.boylston@hpe.com>
> Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
> Reported-by: Oliver Moreno <oliver.moreno@hpe.com>
> ---
>  arch/x86/include/asm/pmem.h | 19 +------------------
>  1 file changed, 1 insertion(+), 18 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
> index 643eba4..2fbf4ae 100644
> --- a/arch/x86/include/asm/pmem.h
> +++ b/arch/x86/include/asm/pmem.h
> @@ -72,15 +72,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
>  		clwb(p);
>  }
>  
> -/*
> - * copy_from_iter_nocache() on x86 only uses non-temporal stores for iovec
> - * iterators, so for other types (bvec & kvec) we must do a cache write-back.
> - */
> -static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
> -{
> -	return iter_is_iovec(i) == false;
> -}
> -
>  /**
>   * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
>   * @addr:	PMEM destination address
> @@ -92,15 +83,7 @@ static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
>  static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
>  		struct iov_iter *i)
>  {
> -	size_t len;
> -
> -	/* TODO: skip the write-back by always using non-temporal stores */
> -	len = copy_from_iter_nocache(addr, bytes, i);
> -
> -	if (__iter_needs_pmem_wb(i))
> -		arch_wb_cache_pmem(addr, bytes);
> -
> -	return len;
> +	return copy_from_iter_nocache(addr, bytes, i);

I wish you would remove all this completely. Don't see the point for it anymore 

Thanks a lot for doing this. I have this patch for ages in my trees and was too
lasy to fight them through. Yes it is a big boost for many workloads.

Lots of gratitude
Boaz

>  }
>  
>  /**
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/3] use a nocache copy for bvecs and kvecs in copy_from_iter_nocache()
  2016-10-26 15:50 ` [PATCH v2 2/3] use a nocache copy for bvecs and kvecs in copy_from_iter_nocache() Brian Boylston
@ 2016-10-27  4:46   ` Ross Zwisler
  0 siblings, 0 replies; 26+ messages in thread
From: Ross Zwisler @ 2016-10-27  4:46 UTC (permalink / raw)
  To: Brian Boylston
  Cc: linux-nvdimm, linux-kernel, toshi.kani, oliver.moreno,
	Ross Zwisler, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Al Viro, Dan Williams

On Wed, Oct 26, 2016 at 10:50:20AM -0500, Brian Boylston wrote:
> Update copy_from_iter_nocache() to use memcpy_nocache()
> for bvecs and kvecs.
> 
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: <x86@kernel.org>
> Cc: Al Viro <viro@ZenIV.linux.org.uk>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Brian Boylston <brian.boylston@hpe.com>
> Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
> Reported-by: Oliver Moreno <oliver.moreno@hpe.com>
> ---
>  lib/iov_iter.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 7e3138c..71e4531 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -342,6 +342,13 @@ static void memcpy_from_page(char *to, struct page *page, size_t offset, size_t
>  	kunmap_atomic(from);
>  }
>  
> +static void memcpy_from_page_nocache(char *to, struct page *page, size_t offset, size_t len)
> +{
> +	char *from = kmap_atomic(page);
> +	memcpy_nocache(to, from + offset, len);
> +	kunmap_atomic(from);
> +}
> +
>  static void memcpy_to_page(struct page *page, size_t offset, const char *from, size_t len)
>  {
>  	char *to = kmap_atomic(page);
> @@ -392,9 +399,10 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
>  	iterate_and_advance(i, bytes, v,
>  		__copy_from_user_nocache((to += v.iov_len) - v.iov_len,
>  					 v.iov_base, v.iov_len),
> -		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
> -				 v.bv_offset, v.bv_len),
> -		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
> +		memcpy_from_page_nocache((to += v.bv_len) - v.bv_len,
> +					 v.bv_page, v.bv_offset, v.bv_len),
> +		memcpy_nocache((to += v.iov_len) - v.iov_len,
> +			       v.iov_base, v.iov_len)
>  	)
>  
>  	return bytes;
> -- 
> 2.8.3

I generally agree with Boaz's comments to your patch 1 - this feels like yet
another layer where we have indirection based on the architecture.

We already have an arch switch at memcpy_to_pmem() based on whether the
architecture supports the PMEM API.  And we already have
__copy_from_user_nocache(), which based on the architecture can map to either
a non-cached memcpy (x86_32, x86_64), or it can just map to a normal memcpy
via __copy_from_user_inatomic() (this happens in include/linux/uaccss.h, which
I believe is used for all non-x86 architectures).

memcpy_nocache() now does the same thing as __copy_from_user_nocache(), where
we define an uncached memcpy for x86_32 and x86_64, and a normal memcpy
variant for other architectures.  But, weirdly, the x86 version maps to our
to __copy_from_user_nocache(), but it on non-x86 we don't map to
__copy_from_user_nocache() and use its fallback, we provide a new fallback of
our own via a direct call to memcpy()?

And, now in copy_from_iter_nocache() on x86 we call __copy_from_user_nocache()
two different ways: directly, and indirectly through memcpy_nocache() and
memcpy_from_page_nocache()=>memcpy_nocache().

Is there any way to simplify all of this?

All in all I'm on board with doing non-temporal copies in all cases, and I
like the idea behind memcpy_from_page_nocache().  I just think there must be a
way to make it simpler.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v2 1/3] introduce memcpy_nocache()
  2016-10-26 19:30   ` Thomas Gleixner
@ 2016-10-28  1:52     ` Boylston, Brian
  0 siblings, 0 replies; 26+ messages in thread
From: Boylston, Brian @ 2016-10-28  1:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-nvdimm, linux-kernel, Kani, Toshimitsu, Moreno, Oliver,
	Ross Zwisler, Ingo Molnar, H. Peter Anvin, x86, Al Viro,
	Dan Williams, boylston

Thomas Gleixner wrote on 2016-10-26:
> On Wed, 26 Oct 2016, Brian Boylston wrote:
>> --- a/arch/x86/include/asm/string_32.h
>> +++ b/arch/x86/include/asm/string_32.h
>> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
> 
>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
> 
> Can we please move that prototype to linux/string.h instead of having it in
> both files and just define __HAVE_ARCH_MEMCPY_NOCACHE? And that define can
> move into x86/include/asm/string.h. There is no point in duplicating all
> that stuff.

Yes, sounds good.

> 
>> --- a/arch/x86/include/asm/string_64.h
>> +++ b/arch/x86/include/asm/string_64.h
>> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t len);
> 
>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count)
> 
>> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
> 
> You define __HAVE_ARCH_MEMCPY_NOCACHE for both 32 and 64 bit. What is
> this #ifdef for ?

Now that you raise the question, I'm not sure, and I see that the
x86 memcpy() definition isn't similarly wrapped.  I can adjust it.

> 
>> +void *memcpy_nocache(void *dest, const void *src, size_t count)
>> +{
>> +	__copy_from_user_inatomic_nocache(dest, src, count);
> 
> You want to replace a memcpy with this and then use copy from user. Why?
> That does not make any sense to me and even if it makes sense for whatever
> reason then this wants to be documented and the src argument needs a type
> cast to (void __user *)..

I agree that the documentation needs improvement.  The goal is to memcpy
while avoiding the processor cache.  Following x86's arch_memcpy_to_pmem(),
I was thinking that the user nocache implementation on x86 would work,
but I have none of the comments that arch_memcpy_to_pmem() has...

> 
> Further this uses the inatomic variant. Why? Does a memcpy replacement
> suddenly require to be called in atomic context? Again, this makes no sense
> and lacks any form of comment. The kernel doc below does not say anything
> about that.

I borrowed the use of the inatomic variant from arch_memcpy_to_pmem().

Perhaps I need to look at memcpy_to_pmem() as suggested by Boaz.

> 
> Neither provides the changelog any useful information which is
> complementary to what the patch actually does.
> 
>> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
>> +/**
>> + * memcpy_nocache - Copy one area of memory to another, avoiding the
>> + * processor cache if possible
> 
> I'm not sure whether kerneldoc is happy about that line break. Did you
> build the docs?
> 
> Make the initial line shorter and explain the functionality in detail
> below the arguments
> 
>> + * @dest: Where to copy to
>> + * @src: Where to copy from
>> + * @count: The size of the area.
> 
> Nit. Can you please make this in tabular form for readability sake?

No, I did not try to build the docs.  I'll revisit this and your other
doc-related feedback.

Thanks!
Brian

> 
>  * @dest:	Where to copy to
>  * @src:	Where to copy from
>  * @count:	The size of the area.
>> + */
>> +static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
>> +{
>> +	return memcpy(dest, src, count);
>> +}
>> +#endif
> 
> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v2 1/3] introduce memcpy_nocache()
  2016-10-26 19:51   ` Boaz Harrosh
@ 2016-10-28  1:54     ` Boylston, Brian
  2016-11-01 14:25       ` Boaz Harrosh
  0 siblings, 1 reply; 26+ messages in thread
From: Boylston, Brian @ 2016-10-28  1:54 UTC (permalink / raw)
  To: Boaz Harrosh, linux-nvdimm@lists.01.org
  Cc: Moreno, Oliver, x86, linux-kernel, Ingo Molnar, Al Viro,
	H. Peter Anvin, Thomas Gleixner, boylston

Boaz Harrosh wrote on 2016-10-26:
> On 10/26/2016 06:50 PM, Brian Boylston wrote:
>> Introduce memcpy_nocache() as a memcpy() that avoids the processor cache
>> if possible.  Without arch-specific support, this defaults to just
>> memcpy().  For now, include arch-specific support for x86.
>> 
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: <x86@kernel.org>
>> Cc: Al Viro <viro@ZenIV.linux.org.uk>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Brian Boylston <brian.boylston@hpe.com>
>> Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
>> Reported-by: Oliver Moreno <oliver.moreno@hpe.com>
>> ---
>>  arch/x86/include/asm/string_32.h |  3 +++
>>  arch/x86/include/asm/string_64.h |  3 +++
>>  arch/x86/lib/misc.c              | 12 ++++++++++++
>>  include/linux/string.h           | 15 +++++++++++++++
>>  4 files changed, 33 insertions(+)
>> diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
>> index 3d3e835..64f80c0 100644
>> --- a/arch/x86/include/asm/string_32.h
>> +++ b/arch/x86/include/asm/string_32.h
>> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
>> 
>>  #endif
>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>> +
>>  #define __HAVE_ARCH_MEMMOVE
>>  void *memmove(void *dest, const void *src, size_t n);
>> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
>> index 90dbbd9..a8fdd55 100644
>> --- a/arch/x86/include/asm/string_64.h
>> +++ b/arch/x86/include/asm/string_64.h
>> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t len);
>>  #define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len))
>>  #endif
>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>> +
>>  #define __HAVE_ARCH_MEMSET
>>  void *memset(void *s, int c, size_t n);
>>  void *__memset(void *s, int c, size_t n);
>> diff --git a/arch/x86/lib/misc.c b/arch/x86/lib/misc.c
>> index 76b373a..c993ab3 100644
>> --- a/arch/x86/lib/misc.c
>> +++ b/arch/x86/lib/misc.c
>> @@ -1,3 +1,6 @@
>> +#include <linux/export.h>
>> +#include <linux/uaccess.h>
>> +
>>  /*
>>   * Count the digits of @val including a possible sign.
>>   *
>> @@ -19,3 +22,12 @@ int num_digits(int val)
>>  	}
>>  	return d;
>>  }
>> +
>> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
>> +void *memcpy_nocache(void *dest, const void *src, size_t count)
>> +{
>> +	__copy_from_user_inatomic_nocache(dest, src, count);
>> +	return dest;
>> +}
>> +EXPORT_SYMBOL(memcpy_nocache);
>> +#endif
>> diff --git a/include/linux/string.h b/include/linux/string.h
>> index 26b6f6a..7f40c41 100644
>> --- a/include/linux/string.h
>> +++ b/include/linux/string.h
>> @@ -102,6 +102,21 @@ extern void * memset(void *,int,__kernel_size_t);
>>  #ifndef __HAVE_ARCH_MEMCPY
>>  extern void * memcpy(void *,const void *,__kernel_size_t);
>>  #endif
>> +
>> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
>> +/**
>> + * memcpy_nocache - Copy one area of memory to another, avoiding the
>> + * processor cache if possible
>> + * @dest: Where to copy to
>> + * @src: Where to copy from
>> + * @count: The size of the area.
>> + */
>> +static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
>> +{
>> +	return memcpy(dest, src, count);
>> +}
> 
> What about memcpy_to_pmem() in linux/pmem.h it already has all the arch switches.
> 
> Feels bad to add yet just another arch switch over __copy_user_nocache
> 
> Just feels like too many things that do the same thing. Sigh

I agree that this looks like a nicer path.

I had considered adjusting copy_from_iter_nocache() to use memcpy_to_pmem(),
but lib/iov_iter.c doesn't currently #include linux/pmem.h.  Would it be
acceptable to add it?  Also, I wasn't sure if memcpy_to_pmem() would always
mean exactly "memcpy nocache".

I had also considered adjusting copy_from_iter_pmem() (also in linux/pmem.h)
to just use memcpy_to_pmem() directly, but then it can't use the goodness
that is the iterate_and_advance() macro in iov_iter.c.

So, I took a shot with a possibly ill-fated memcpy_nocache().  Thoughts on
either of the above two?  Are these even in line with what you were thinking?

Thanks!
Brian

> 
> Boaz
> 
>> +#endif
>> +
>>  #ifndef __HAVE_ARCH_MEMMOVE
>>  extern void * memmove(void *,const void *,__kernel_size_t);
>>  #endif

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v2 3/3] x86: remove unneeded flush in arch_copy_from_iter_pmem()
  2016-10-26 19:57   ` Boaz Harrosh
@ 2016-10-28  1:58     ` Boylston, Brian
  0 siblings, 0 replies; 26+ messages in thread
From: Boylston, Brian @ 2016-10-28  1:58 UTC (permalink / raw)
  To: Boaz Harrosh, linux-nvdimm@lists.01.org
  Cc: Moreno, Oliver, x86, linux-kernel, Ingo Molnar, Al Viro,
	H. Peter Anvin, Thomas Gleixner, boylston

Boaz Harrosh wrote on 2016-10-26:
> On 10/26/2016 06:50 PM, Brian Boylston wrote:
>> copy_from_iter_nocache() now uses nocache copies for all types of iovecs
>> on x86, so the flush in arch_copy_from_iter_pmem() is no longer needed.
>> 
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: <x86@kernel.org>
>> Cc: Al Viro <viro@ZenIV.linux.org.uk>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Brian Boylston <brian.boylston@hpe.com>
>> Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
>> Reported-by: Oliver Moreno <oliver.moreno@hpe.com>
>> ---
>>  arch/x86/include/asm/pmem.h | 19 +------------------
>>  1 file changed, 1 insertion(+), 18 deletions(-)
>> diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
>> index 643eba4..2fbf4ae 100644
>> --- a/arch/x86/include/asm/pmem.h
>> +++ b/arch/x86/include/asm/pmem.h
>> @@ -72,15 +72,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
>>  		clwb(p);
>>  }
>> -/*
>> - * copy_from_iter_nocache() on x86 only uses non-temporal stores for iovec
>> - * iterators, so for other types (bvec & kvec) we must do a cache write-back.
>> - */
>> -static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
>> -{
>> -	return iter_is_iovec(i) == false;
>> -}
>> -
>>  /**
>>   * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
>>   * @addr:	PMEM destination address
>> @@ -92,15 +83,7 @@ static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
>>  static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
>>  		struct iov_iter *i)
>>  {
>> -	size_t len;
>> -
>> -	/* TODO: skip the write-back by always using non-temporal stores */
>> -	len = copy_from_iter_nocache(addr, bytes, i);
>> -
>> -	if (__iter_needs_pmem_wb(i))
>> -		arch_wb_cache_pmem(addr, bytes);
>> -
>> -	return len;
>> +	return copy_from_iter_nocache(addr, bytes, i);
> 
> I wish you would remove all this completely. Don't see the point for it anymore
> 

At first, I was just targeting bvecs for the splice() case, so I had left the
check and the flush in there for kvecs.  When doing this v2 of the patch set,
I made kvecs nocache as well, so removed the check and flush, but didn't follow
through completely as you suggest.  Will revisit.

Thanks!
Brian

> Thanks a lot for doing this. I have this patch for ages in my trees and was too
> lasy to fight them through. Yes it is a big boost for many workloads.
> 
> Lots of gratitude
> Boaz
> 
>>  }
>>  /**

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/3] introduce memcpy_nocache()
  2016-10-28  1:54     ` Boylston, Brian
@ 2016-11-01 14:25       ` Boaz Harrosh
  2016-12-28 23:43         ` Al Viro
  0 siblings, 1 reply; 26+ messages in thread
From: Boaz Harrosh @ 2016-11-01 14:25 UTC (permalink / raw)
  To: Boylston, Brian, linux-nvdimm@lists.01.org
  Cc: Moreno, Oliver, x86, linux-kernel, Ingo Molnar, Al Viro,
	H. Peter Anvin, Thomas Gleixner, boylston

On 10/28/2016 04:54 AM, Boylston, Brian wrote:
> Boaz Harrosh wrote on 2016-10-26:
>> On 10/26/2016 06:50 PM, Brian Boylston wrote:
>>> Introduce memcpy_nocache() as a memcpy() that avoids the processor cache
>>> if possible.  Without arch-specific support, this defaults to just
>>> memcpy().  For now, include arch-specific support for x86.
>>>
>>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>> Cc: <x86@kernel.org>
>>> Cc: Al Viro <viro@ZenIV.linux.org.uk>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Signed-off-by: Brian Boylston <brian.boylston@hpe.com>
>>> Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
>>> Reported-by: Oliver Moreno <oliver.moreno@hpe.com>
>>> ---
>>>  arch/x86/include/asm/string_32.h |  3 +++
>>>  arch/x86/include/asm/string_64.h |  3 +++
>>>  arch/x86/lib/misc.c              | 12 ++++++++++++
>>>  include/linux/string.h           | 15 +++++++++++++++
>>>  4 files changed, 33 insertions(+)
>>> diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
>>> index 3d3e835..64f80c0 100644
>>> --- a/arch/x86/include/asm/string_32.h
>>> +++ b/arch/x86/include/asm/string_32.h
>>> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
>>>
>>>  #endif
>>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>>> +
>>>  #define __HAVE_ARCH_MEMMOVE
>>>  void *memmove(void *dest, const void *src, size_t n);
>>> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
>>> index 90dbbd9..a8fdd55 100644
>>> --- a/arch/x86/include/asm/string_64.h
>>> +++ b/arch/x86/include/asm/string_64.h
>>> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t len);
>>>  #define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len))
>>>  #endif
>>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>>> +
>>>  #define __HAVE_ARCH_MEMSET
>>>  void *memset(void *s, int c, size_t n);
>>>  void *__memset(void *s, int c, size_t n);
>>> diff --git a/arch/x86/lib/misc.c b/arch/x86/lib/misc.c
>>> index 76b373a..c993ab3 100644
>>> --- a/arch/x86/lib/misc.c
>>> +++ b/arch/x86/lib/misc.c
>>> @@ -1,3 +1,6 @@
>>> +#include <linux/export.h>
>>> +#include <linux/uaccess.h>
>>> +
>>>  /*
>>>   * Count the digits of @val including a possible sign.
>>>   *
>>> @@ -19,3 +22,12 @@ int num_digits(int val)
>>>  	}
>>>  	return d;
>>>  }
>>> +
>>> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
>>> +void *memcpy_nocache(void *dest, const void *src, size_t count)
>>> +{
>>> +	__copy_from_user_inatomic_nocache(dest, src, count);
>>> +	return dest;
>>> +}
>>> +EXPORT_SYMBOL(memcpy_nocache);
>>> +#endif
>>> diff --git a/include/linux/string.h b/include/linux/string.h
>>> index 26b6f6a..7f40c41 100644
>>> --- a/include/linux/string.h
>>> +++ b/include/linux/string.h
>>> @@ -102,6 +102,21 @@ extern void * memset(void *,int,__kernel_size_t);
>>>  #ifndef __HAVE_ARCH_MEMCPY
>>>  extern void * memcpy(void *,const void *,__kernel_size_t);
>>>  #endif
>>> +
>>> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
>>> +/**
>>> + * memcpy_nocache - Copy one area of memory to another, avoiding the
>>> + * processor cache if possible
>>> + * @dest: Where to copy to
>>> + * @src: Where to copy from
>>> + * @count: The size of the area.
>>> + */
>>> +static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
>>> +{
>>> +	return memcpy(dest, src, count);
>>> +}
>>
>> What about memcpy_to_pmem() in linux/pmem.h it already has all the arch switches.
>>
>> Feels bad to add yet just another arch switch over __copy_user_nocache
>>
>> Just feels like too many things that do the same thing. Sigh
> 
> I agree that this looks like a nicer path.
> 
> I had considered adjusting copy_from_iter_nocache() to use memcpy_to_pmem(),
> but lib/iov_iter.c doesn't currently #include linux/pmem.h.  Would it be
> acceptable to add it?  Also, I wasn't sure if memcpy_to_pmem() would always
> mean exactly "memcpy nocache".
> 

I think this is the way to go. In my opinion there is no reason why not to include
pmem.h into lib/iov_iter.c.

And I think memcpy_to_pmem() would always be the fastest arch way to bypass cache
so it should be safe to use this for all cases. It is so in the arches that support
this now, and I cannot imagine a theoretical arch that would differ. But let the
specific arch people holler if this steps on their tows, later when they care about
this at all.

> I had also considered adjusting copy_from_iter_pmem() (also in linux/pmem.h)
> to just use memcpy_to_pmem() directly, but then it can't use the goodness
> that is the iterate_and_advance() macro in iov_iter.c.
> 

No please keep all these local to iov_iter.c. Any future changes should be local
to here and fixed in one place.

> So, I took a shot with a possibly ill-fated memcpy_nocache().  Thoughts on
> either of the above two?  Are these even in line with what you were thinking?
> 

Yes thanks again for pushing this. I think it is important. CC me on patches
and I will send you my Review-by
Boaz

> Thanks!
> Brian
> 
>>
>> Boaz
>>
>>> +#endif
>>> +
>>>  #ifndef __HAVE_ARCH_MEMMOVE
>>>  extern void * memmove(void *,const void *,__kernel_size_t);
>>>  #endif

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/3] introduce memcpy_nocache()
  2016-11-01 14:25       ` Boaz Harrosh
@ 2016-12-28 23:43         ` Al Viro
  2016-12-29 18:23           ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2016-12-28 23:43 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Boylston, Brian, linux-nvdimm@lists.01.org, Moreno, Oliver, x86,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	boylston

On Tue, Nov 01, 2016 at 04:25:12PM +0200, Boaz Harrosh wrote:

> >> What about memcpy_to_pmem() in linux/pmem.h it already has all the arch switches.
> >>
> >> Feels bad to add yet just another arch switch over __copy_user_nocache
> >>
> >> Just feels like too many things that do the same thing. Sigh
> > 
> > I agree that this looks like a nicer path.
> > 
> > I had considered adjusting copy_from_iter_nocache() to use memcpy_to_pmem(),
> > but lib/iov_iter.c doesn't currently #include linux/pmem.h.  Would it be
> > acceptable to add it?  Also, I wasn't sure if memcpy_to_pmem() would always
> > mean exactly "memcpy nocache".
> > 
> 
> I think this is the way to go. In my opinion there is no reason why not to include
> pmem.h into lib/iov_iter.c.
> 
> And I think memcpy_to_pmem() would always be the fastest arch way to bypass cache
> so it should be safe to use this for all cases. It is so in the arches that support
> this now, and I cannot imagine a theoretical arch that would differ. But let the
> specific arch people holler if this steps on their tows, later when they care about
> this at all.
 
First of all, if it's the fastest arch way to bypass cache, why the hell
is it sitting in pmem-related areas?

More to the point, x86 implementation of that thing is tied to uaccess API
for no damn reason whatsoever.  Let's add a real memcpy_nocache() and
be done with that.  I mean, this
        if (WARN(rem, "%s: fault copying %p <- %p unwritten: %d\n",
                                __func__, dst, src, rem))
                BUG();
is *screaming* "API misused here".  And let's stay away from the STAC et.al. -
it's pointless for kernel-to-kernel copies.

BTW, your "it's iovec, only non-temporal stores there" logics in
arch_copy_from_iter_pmem() is simply wrong - for one thing, unaligned
copies will have parts done via normal stores, for another 32bit will
_not_ go for non-caching codepath for short copies.  What semantics do
we really need there?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/3] introduce memcpy_nocache()
  2016-12-28 23:43         ` Al Viro
@ 2016-12-29 18:23           ` Dan Williams
  2016-12-30  3:52             ` Al Viro
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2016-12-29 18:23 UTC (permalink / raw)
  To: Al Viro
  Cc: Boaz Harrosh, linux-nvdimm@lists.01.org, Moreno, Oliver, x86,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	boylston

On Wed, Dec 28, 2016 at 3:43 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Nov 01, 2016 at 04:25:12PM +0200, Boaz Harrosh wrote:
>
>> >> What about memcpy_to_pmem() in linux/pmem.h it already has all the arch switches.
>> >>
>> >> Feels bad to add yet just another arch switch over __copy_user_nocache
>> >>
>> >> Just feels like too many things that do the same thing. Sigh
>> >
>> > I agree that this looks like a nicer path.
>> >
>> > I had considered adjusting copy_from_iter_nocache() to use memcpy_to_pmem(),
>> > but lib/iov_iter.c doesn't currently #include linux/pmem.h.  Would it be
>> > acceptable to add it?  Also, I wasn't sure if memcpy_to_pmem() would always
>> > mean exactly "memcpy nocache".
>> >
>>
>> I think this is the way to go. In my opinion there is no reason why not to include
>> pmem.h into lib/iov_iter.c.
>>
>> And I think memcpy_to_pmem() would always be the fastest arch way to bypass cache
>> so it should be safe to use this for all cases. It is so in the arches that support
>> this now, and I cannot imagine a theoretical arch that would differ. But let the
>> specific arch people holler if this steps on their tows, later when they care about
>> this at all.
>
> First of all, if it's the fastest arch way to bypass cache, why the hell
> is it sitting in pmem-related areas?

Agreed, pmem has little to do with a cache avoiding memcpy. I believe
there are embedded platforms in the field that have system wide
batteries and arrange for cpu caches to be flushed on power loss. So a
cache avoiding memory copy may not always be the best choice for pmem.

> More to the point, x86 implementation of that thing is tied to uaccess API
> for no damn reason whatsoever.  Let's add a real memcpy_nocache() and
> be done with that.  I mean, this
>         if (WARN(rem, "%s: fault copying %p <- %p unwritten: %d\n",
>                                 __func__, dst, src, rem))
>                 BUG();
> is *screaming* "API misused here".  And let's stay away from the STAC et.al. -
> it's pointless for kernel-to-kernel copies.

Yes, that's my turd and I agree we should opt for a generic cache
bypassing copy.

> BTW, your "it's iovec, only non-temporal stores there" logics in
> arch_copy_from_iter_pmem() is simply wrong - for one thing, unaligned
> copies will have parts done via normal stores, for another 32bit will
> _not_ go for non-caching codepath for short copies.  What semantics do
> we really need there?

For typical pmem platforms we need to make sure all the writes are on
the way to memory such than a later sfence can guarantee that all
previous writes are visible to the platform "ADR" logic.  ADR handles
flushing memory controller write buffers to media. At a minimum
arch_copy_from_iter_pmem() needs to trigger a clwb (unordered cache
line writeback) of each touched cache line if it is not using a cache
bypassing store.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/3] introduce memcpy_nocache()
  2016-12-29 18:23           ` Dan Williams
@ 2016-12-30  3:52             ` Al Viro
  2016-12-30  4:56               ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2016-12-30  3:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Boaz Harrosh, linux-nvdimm@lists.01.org, Moreno, Oliver, x86,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	boylston

On Thu, Dec 29, 2016 at 10:23:15AM -0800, Dan Williams wrote:

> > BTW, your "it's iovec, only non-temporal stores there" logics in
> > arch_copy_from_iter_pmem() is simply wrong - for one thing, unaligned
> > copies will have parts done via normal stores, for another 32bit will
> > _not_ go for non-caching codepath for short copies.  What semantics do
> > we really need there?
> 
> For typical pmem platforms we need to make sure all the writes are on
> the way to memory such than a later sfence can guarantee that all
> previous writes are visible to the platform "ADR" logic.  ADR handles
> flushing memory controller write buffers to media. At a minimum
> arch_copy_from_iter_pmem() needs to trigger a clwb (unordered cache
> line writeback) of each touched cache line if it is not using a cache
> bypassing store.

Um...  Then we do have a problem - nocache variant of uaccess primitives
does *not* guarantee that clwb is redundant.

What about the requirements of e.g. tcp_sendmsg() with its use of
skb_add_data_nocache()?  What warranties do we need there?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/3] introduce memcpy_nocache()
  2016-12-30  3:52             ` Al Viro
@ 2016-12-30  4:56               ` Dan Williams
  2016-12-31  2:25                 ` [RFC] memcpy_nocache() and memcpy_writethrough() Al Viro
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2016-12-30  4:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Boaz Harrosh, linux-nvdimm@lists.01.org, Moreno, Oliver, x86,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	boylston

On Thu, Dec 29, 2016 at 7:52 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Dec 29, 2016 at 10:23:15AM -0800, Dan Williams wrote:
>
>> > BTW, your "it's iovec, only non-temporal stores there" logics in
>> > arch_copy_from_iter_pmem() is simply wrong - for one thing, unaligned
>> > copies will have parts done via normal stores, for another 32bit will
>> > _not_ go for non-caching codepath for short copies.  What semantics do
>> > we really need there?
>>
>> For typical pmem platforms we need to make sure all the writes are on
>> the way to memory such than a later sfence can guarantee that all
>> previous writes are visible to the platform "ADR" logic.  ADR handles
>> flushing memory controller write buffers to media. At a minimum
>> arch_copy_from_iter_pmem() needs to trigger a clwb (unordered cache
>> line writeback) of each touched cache line if it is not using a cache
>> bypassing store.
>
> Um...  Then we do have a problem - nocache variant of uaccess primitives
> does *not* guarantee that clwb is redundant.
>
> What about the requirements of e.g. tcp_sendmsg() with its use of
> skb_add_data_nocache()?  What warranties do we need there?

Yes, we need to distinguish the existing "nocache" that tries to avoid
unnecessary cache pollution and this new "must write through" semantic
for writing to persistent memory. I suspect usages of
skb_add_data_nocache() are ok since they are in the transmit path.
Receiving directly into a buffer that is expected to be persisted
immediately is where we would need to be careful, but that is already
backstopped by dirty cacheline tracking. So as far as I can see, we
should only need a new memcpy_writethrough() (?) for the pmem
direct-i/o path at present.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC] memcpy_nocache() and memcpy_writethrough()
  2016-12-30  4:56               ` Dan Williams
@ 2016-12-31  2:25                 ` Al Viro
  2017-01-02  2:35                   ` Elliott, Robert (Persistent Memory)
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2016-12-31  2:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: Boaz Harrosh, linux-nvdimm@lists.01.org, Moreno, Oliver, x86,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	boylston, Linus Torvalds

On Thu, Dec 29, 2016 at 08:56:13PM -0800, Dan Williams wrote:

> > Um...  Then we do have a problem - nocache variant of uaccess primitives
> > does *not* guarantee that clwb is redundant.
> >
> > What about the requirements of e.g. tcp_sendmsg() with its use of
> > skb_add_data_nocache()?  What warranties do we need there?
> 
> Yes, we need to distinguish the existing "nocache" that tries to avoid
> unnecessary cache pollution and this new "must write through" semantic
> for writing to persistent memory. I suspect usages of
> skb_add_data_nocache() are ok since they are in the transmit path.
> Receiving directly into a buffer that is expected to be persisted
> immediately is where we would need to be careful, but that is already
> backstopped by dirty cacheline tracking. So as far as I can see, we
> should only need a new memcpy_writethrough() (?) for the pmem
> direct-i/o path at present.

OK...  Right now we have several places playing with nocache:
	* dax_iomap_actor().  Writethrough warranties needed, nocache
side serves to reduce the cache impact *and* avoid the need for clwb
for writethrough.
	* several memcpy_to_pmem() users - acpi_nfit_blk_single_io(),
nsio_rw_bytes(), write_pmem().  No clwb attempted; is it needed there?
	* hfi1_copy_sge().  Cache pollution avoidance?  The source is
in the kernel, looks like memcpy_nocache() candidate.
	* ntb_memcpy_tx().  Really fishy one - it's from kernel to iomem,
with nocache userland->kernel copying primitive abused on x86.  As soon
as e.g. powerpc or sparc grows ARCH_HAS_NOCACHE_UACCESS, we are in trouble
there.  What is it actually trying to achieve?  memcpy_toio() with
cache pollution avoidance?
	* networking copy_from_iter_full_nocache() users - cache pollution
avoidance, AFAICS; no writethrough warranties sought.

Why does pmem need writethrough warranties, anyway?  All explanations I've
found on the net had been along the lines of "we should not store a pointer
to pmem data structure until the structure itself had been committed to
pmem itself" and it looks like something that ought to be a job for barriers
- after all, we don't want the pointer store to be observed by _anything_
in the system until the earlier stores are visible, so what makes pmem
different from e.g. another CPU or a PCI busmaster, or...

I'm trying to figure out what would be the right API here; sure, we can
add separate memcpy_writethrough()/__copy_from_user_inatomic_writethrough()/
copy_from_iter_writethrough(), but I would like to understand what's going
on first.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFC] memcpy_nocache() and memcpy_writethrough()
  2016-12-31  2:25                 ` [RFC] memcpy_nocache() and memcpy_writethrough() Al Viro
@ 2017-01-02  2:35                   ` Elliott, Robert (Persistent Memory)
  2017-01-02  5:09                     ` Al Viro
  0 siblings, 1 reply; 26+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-01-02  2:35 UTC (permalink / raw)
  To: Al Viro, Dan Williams
  Cc: Boaz Harrosh, linux-nvdimm@lists.01.org, Moreno, Oliver, x86,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	boylston, Linus Torvalds

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Al Viro
> Sent: Friday, December 30, 2016 8:26 PM
> Subject: [RFC] memcpy_nocache() and memcpy_writethrough()
> 
...
> Why does pmem need writethrough warranties, anyway?  

Using either 
* nontemporal store instructions; or
* following regular store instructions with a sequence of cache flush
and store fence instructions (e.g., clflushopt or clwb + sfence)

ensures that write data has reached an "ADR-safe zone" that the system
promises will be persistent even if there is a surprise power loss or
a CPU suffers from an error that isn't totally catastrophic (e.g., the
CPU getting disconnected from the SDRAM will always lose data on an
NVDIMM-N).

The ACPI NFIT Flush Hints provide a guarantee that data is safe even
in the case of a CPU error, but that feature is not present in all
systems for all types of persistent memory.

> All explanations I've found on the net had been along the lines of
> "we should not store a pointer to pmem data structure until the
> structure itself had been committed to pmem itself" and it looks
> like something that ought to be a job for barriers - after all,
> we don't want the pointer store to be observed by _anything_
> in the system until the earlier stores are visible, so what makes
> pmem different from e.g. another CPU or a PCI busmaster, or...

Newly written data becomes globally visible before it becomes ADR-safe.
This means software could act on the new data before a power loss, then
see the old data reappear after the power loss - not good.  Software
needs to understand that any data in the process of being written is
indeterminate until the persistence guarantee is met.  The BTT shows
one way that software can avoid that problem.

---
Robert Elliott, HPE Persistent Memory

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] memcpy_nocache() and memcpy_writethrough()
  2017-01-02  2:35                   ` Elliott, Robert (Persistent Memory)
@ 2017-01-02  5:09                     ` Al Viro
  2017-01-03 21:14                       ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2017-01-02  5:09 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory)
  Cc: Dan Williams, Boaz Harrosh, linux-nvdimm@lists.01.org, Moreno,
	Oliver, x86, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, boylston, Linus Torvalds

On Mon, Jan 02, 2017 at 02:35:36AM +0000, Elliott, Robert (Persistent Memory) wrote:
> > -----Original Message-----
> > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> > owner@vger.kernel.org] On Behalf Of Al Viro
> > Sent: Friday, December 30, 2016 8:26 PM
> > Subject: [RFC] memcpy_nocache() and memcpy_writethrough()
> > 
> ...
> > Why does pmem need writethrough warranties, anyway?  
> 
> Using either 
> * nontemporal store instructions; or
> * following regular store instructions with a sequence of cache flush
> and store fence instructions (e.g., clflushopt or clwb + sfence)
> 
> ensures that write data has reached an "ADR-safe zone" that the system
> promises will be persistent even if there is a surprise power loss or
> a CPU suffers from an error that isn't totally catastrophic (e.g., the
> CPU getting disconnected from the SDRAM will always lose data on an
> NVDIMM-N).

Wait a sec...  In which places do you need sfence in all that?  movnt*
itself can be reordered, right?  So using that for copying and storing
the pointer afterwards would still need sfence inbetween, unless I'm
seriously misunderstanding the situation...

> Newly written data becomes globally visible before it becomes ADR-safe.
> This means software could act on the new data before a power loss, then
> see the old data reappear after the power loss - not good.  Software
> needs to understand that any data in the process of being written is
> indeterminate until the persistence guarantee is met.  The BTT shows
> one way that software can avoid that problem.

Joy.  What happens in terms of latency?  I.e. how much of a stall does
clwb inflict?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] memcpy_nocache() and memcpy_writethrough()
  2017-01-02  5:09                     ` Al Viro
@ 2017-01-03 21:14                       ` Dan Williams
  2017-01-03 23:22                         ` Al Viro
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2017-01-03 21:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Elliott, Robert (Persistent Memory),
	Boaz Harrosh, linux-nvdimm@lists.01.org, Moreno, Oliver, x86,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	boylston, Linus Torvalds

On Sun, Jan 1, 2017 at 9:09 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Jan 02, 2017 at 02:35:36AM +0000, Elliott, Robert (Persistent Memory) wrote:
>> > -----Original Message-----
>> > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
>> > owner@vger.kernel.org] On Behalf Of Al Viro
>> > Sent: Friday, December 30, 2016 8:26 PM
>> > Subject: [RFC] memcpy_nocache() and memcpy_writethrough()
>> >
>> ...
>> > Why does pmem need writethrough warranties, anyway?
>>
>> Using either
>> * nontemporal store instructions; or
>> * following regular store instructions with a sequence of cache flush
>> and store fence instructions (e.g., clflushopt or clwb + sfence)
>>
>> ensures that write data has reached an "ADR-safe zone" that the system
>> promises will be persistent even if there is a surprise power loss or
>> a CPU suffers from an error that isn't totally catastrophic (e.g., the
>> CPU getting disconnected from the SDRAM will always lose data on an
>> NVDIMM-N).
>
> Wait a sec...  In which places do you need sfence in all that?  movnt*
> itself can be reordered, right?  So using that for copying and storing
> the pointer afterwards would still need sfence inbetween, unless I'm
> seriously misunderstanding the situation...

Robert was describing the overall flow / mechanics, but I think it is
easier to visualize the sfence as a flush command sent to a disk
device with a volatile cache. In fact, that's how we implemented it in
the pmem block device driver. The pmem block device registers itself
as requiring REQ_FLUSH to be sent to persist writes. The driver issues
sfence on the assumption that all writes to pmem have either bypassed
the cache with movnt, or are scheduled for write-back via one of the
flush instructions (clflush, clwb, or clflushopt).

>> Newly written data becomes globally visible before it becomes ADR-safe.
>> This means software could act on the new data before a power loss, then
>> see the old data reappear after the power loss - not good.  Software
>> needs to understand that any data in the process of being written is
>> indeterminate until the persistence guarantee is met.  The BTT shows
>> one way that software can avoid that problem.
>
> Joy.  What happens in terms of latency?  I.e. how much of a stall does
> clwb inflict?

Unlike clflush, clwb is unordered, so it has lower overhead. It
schedules writeback, but does not wait for it to complete. The
clflushopt instruction is also unordered, but in addition to writeback
it also invalidates the line.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] memcpy_nocache() and memcpy_writethrough()
  2017-01-03 21:14                       ` Dan Williams
@ 2017-01-03 23:22                         ` Al Viro
  2017-01-03 23:46                           ` Linus Torvalds
  2017-01-04  1:38                           ` Dan Williams
  0 siblings, 2 replies; 26+ messages in thread
From: Al Viro @ 2017-01-03 23:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: Elliott, Robert (Persistent Memory),
	Boaz Harrosh, linux-nvdimm@lists.01.org, Moreno, Oliver, x86,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	boylston, Linus Torvalds

On Tue, Jan 03, 2017 at 01:14:11PM -0800, Dan Williams wrote:

> Robert was describing the overall flow / mechanics, but I think it is
> easier to visualize the sfence as a flush command sent to a disk
> device with a volatile cache. In fact, that's how we implemented it in
> the pmem block device driver. The pmem block device registers itself
> as requiring REQ_FLUSH to be sent to persist writes. The driver issues
> sfence on the assumption that all writes to pmem have either bypassed
> the cache with movnt, or are scheduled for write-back via one of the
> flush instructions (clflush, clwb, or clflushopt).

*blink*

1) memcpy_to_pmem() seems to rely upon the __copy_from_user_nocache()
having only used movnt; it does not attempt clwb at all.

2) __copy_from_user_nocache() for short copies does not use movnt at all.
In that case neither sfence nor clwb is issued.

3) it uses movnt only for part of copying in case of misaligned copy;
No clwb is issued, but sfence *is* - at the very end in 64bit case,
between movnt and copying the tail - in 32bit one.  Incidentally,
while 64bit case takes care to align the destination for movnt part,
32bit one does not.

How much of the above is broken and what do the callers rely upon?  In
particular, is that sfence the right thing for pmem usecases?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] memcpy_nocache() and memcpy_writethrough()
  2017-01-03 23:22                         ` Al Viro
@ 2017-01-03 23:46                           ` Linus Torvalds
  2017-01-04  0:57                             ` Dan Williams
  2017-01-04  1:38                           ` Dan Williams
  1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2017-01-03 23:46 UTC (permalink / raw)
  To: Al Viro
  Cc: Dan Williams, Elliott, Robert (Persistent Memory),
	Boaz Harrosh, linux-nvdimm@lists.01.org, Moreno, Oliver, x86,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	boylston

On Tue, Jan 3, 2017 at 3:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> 1) memcpy_to_pmem() seems to rely upon the __copy_from_user_nocache()
> having only used movnt; it does not attempt clwb at all.
>
> 2) __copy_from_user_nocache() for short copies does not use movnt at all.
> In that case neither sfence nor clwb is issued.

Quite frankly, the whole "memcpy_nocache()" idea or (ab-)using
copy_user_nocache() just needs to die. It's idiotic.

As you point out, it's also fundamentally buggy crap.

Throw it away. There is no possible way this is ever valid or
portable. We're not going to lie and claim that it is.

If some driver ends up using "movnt" by hand, that is up to that
*driver*. But no way in hell should we care about this one whit in the
sense of <linux/uaccess.h>. Get rid of that shit.

So Al - just ignore this whole issue. It's not your headache. Any code
that tries to depend on some non-caching memcpy is terminally buggy,
and those code paths need to fix themselves, not ask others to fix
their braindamage for them.

                 Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] memcpy_nocache() and memcpy_writethrough()
  2017-01-03 23:46                           ` Linus Torvalds
@ 2017-01-04  0:57                             ` Dan Williams
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2017-01-04  0:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Elliott, Robert (Persistent Memory),
	Boaz Harrosh, linux-nvdimm@lists.01.org, Moreno, Oliver, x86,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	boylston

On Tue, Jan 3, 2017 at 3:46 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Jan 3, 2017 at 3:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> 1) memcpy_to_pmem() seems to rely upon the __copy_from_user_nocache()
>> having only used movnt; it does not attempt clwb at all.
>>
>> 2) __copy_from_user_nocache() for short copies does not use movnt at all.
>> In that case neither sfence nor clwb is issued.
>
> Quite frankly, the whole "memcpy_nocache()" idea or (ab-)using
> copy_user_nocache() just needs to die. It's idiotic.
>
> As you point out, it's also fundamentally buggy crap.
>
> Throw it away. There is no possible way this is ever valid or
> portable. We're not going to lie and claim that it is.
>
> If some driver ends up using "movnt" by hand, that is up to that
> *driver*. But no way in hell should we care about this one whit in the
> sense of <linux/uaccess.h>. Get rid of that shit.
>
> So Al - just ignore this whole issue. It's not your headache. Any code
> that tries to depend on some non-caching memcpy is terminally buggy,
> and those code paths need to fix themselves, not ask others to fix
> their braindamage for them.

It's not Al's headache and our usage of __copy_from_user_nocache is a
blatant abuse, but the discussion is worth having because this is not
the first time we've struggled with the pmem api and the balance
between what functionality should be in fs/dax.c vs
drivers/nvdimm/pmem.c.

The stumbling block in the past to relegating all pmem accesses to the
driver is not wanting to further expand block_device_operations with
more dax specifics beyond the ->direct_access() operation we already
have.

I can think of gross ways of moving dax_iomap_actor() into the driver,
but perhaps less gross than burdening the uaccess.h maintainer with
pmem abuses.

This would also allow us to drop the needless cache maintenance for
dax capable drivers like brd that are fronting volatile memory.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] memcpy_nocache() and memcpy_writethrough()
  2017-01-03 23:22                         ` Al Viro
  2017-01-03 23:46                           ` Linus Torvalds
@ 2017-01-04  1:38                           ` Dan Williams
  2017-01-04  1:59                             ` Al Viro
  1 sibling, 1 reply; 26+ messages in thread
From: Dan Williams @ 2017-01-04  1:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Elliott, Robert (Persistent Memory),
	Boaz Harrosh, linux-nvdimm@lists.01.org, Moreno, Oliver, x86,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	boylston, Linus Torvalds

On Tue, Jan 3, 2017 at 3:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Jan 03, 2017 at 01:14:11PM -0800, Dan Williams wrote:
>
>> Robert was describing the overall flow / mechanics, but I think it is
>> easier to visualize the sfence as a flush command sent to a disk
>> device with a volatile cache. In fact, that's how we implemented it in
>> the pmem block device driver. The pmem block device registers itself
>> as requiring REQ_FLUSH to be sent to persist writes. The driver issues
>> sfence on the assumption that all writes to pmem have either bypassed
>> the cache with movnt, or are scheduled for write-back via one of the
>> flush instructions (clflush, clwb, or clflushopt).
>
> *blink*
>
> 1) memcpy_to_pmem() seems to rely upon the __copy_from_user_nocache()
> having only used movnt; it does not attempt clwb at all.

Yes, and there was a fix a while back to make sure it always used
movnt so clwb after the fact is not required:

a82eee742452 x86/uaccess/64: Handle the caching of 4-byte nocache
copies properly in __copy_user_nocache()

> 2) __copy_from_user_nocache() for short copies does not use movnt at all.
> In that case neither sfence nor clwb is issued.

For the 32bit case, yes, but the pmem driver should warn about this
when it checks platform persistent memory capabilities (i.e. x86 32bit
not supported). Ugh, we may have lost that warning for this specific
case recently, I'll go double check and fix it up.

> 3) it uses movnt only for part of copying in case of misaligned copy;
> No clwb is issued, but sfence *is* - at the very end in 64bit case,
> between movnt and copying the tail - in 32bit one.  Incidentally,
> while 64bit case takes care to align the destination for movnt part,
> 32bit one does not.
>
> How much of the above is broken and what do the callers rely upon?

32bit issues are known, but 64bit path is ok since that fix above.

> In particular, is that sfence the right thing for pmem usecases?

That sfence is not there for pmem purposes. The dax / pmem usage does
not expect memcpy_to_pmem() to fence as it may have more writes to
queue up and amortize all the writes with a later fence. This seems to
be even more evidence for moving this functionality away from the
uaccess routines to somewhere more pmem specific.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] memcpy_nocache() and memcpy_writethrough()
  2017-01-04  1:38                           ` Dan Williams
@ 2017-01-04  1:59                             ` Al Viro
  2017-01-04  2:14                               ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2017-01-04  1:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Elliott, Robert (Persistent Memory),
	Boaz Harrosh, linux-nvdimm@lists.01.org, Moreno, Oliver, x86,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	boylston, Linus Torvalds

On Tue, Jan 03, 2017 at 05:38:54PM -0800, Dan Williams wrote:
> > 1) memcpy_to_pmem() seems to rely upon the __copy_from_user_nocache()
> > having only used movnt; it does not attempt clwb at all.
> 
> Yes, and there was a fix a while back to make sure it always used
> movnt so clwb after the fact is not required:
> 
> a82eee742452 x86/uaccess/64: Handle the caching of 4-byte nocache
> copies properly in __copy_user_nocache()
> 
> > 2) __copy_from_user_nocache() for short copies does not use movnt at all.
> > In that case neither sfence nor clwb is issued.
> 
> For the 32bit case, yes, but the pmem driver should warn about this
> when it checks platform persistent memory capabilities (i.e. x86 32bit
> not supported). Ugh, we may have lost that warning for this specific
> case recently, I'll go double check and fix it up.
> 
> > 3) it uses movnt only for part of copying in case of misaligned copy;
> > No clwb is issued, but sfence *is* - at the very end in 64bit case,
> > between movnt and copying the tail - in 32bit one.  Incidentally,
> > while 64bit case takes care to align the destination for movnt part,
> > 32bit one does not.
> >
> > How much of the above is broken and what do the callers rely upon?
> 
> 32bit issues are known, but 64bit path is ok since that fix above.

Bollocks.  That fix above does *NOT* eliminate all cached stores.  Just look
at the damn function - it still does cached stores for until the target is
aligned and it does the same for tail when end of destination is not aligned.
Right there in arch/x86/lib/copy_user_64.S.

> > In particular, is that sfence the right thing for pmem usecases?
> 
> That sfence is not there for pmem purposes. The dax / pmem usage does
> not expect memcpy_to_pmem() to fence as it may have more writes to
> queue up and amortize all the writes with a later fence. This seems to
> be even more evidence for moving this functionality away from the
> uaccess routines to somewhere more pmem specific.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] memcpy_nocache() and memcpy_writethrough()
  2017-01-04  1:59                             ` Al Viro
@ 2017-01-04  2:14                               ` Dan Williams
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2017-01-04  2:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Elliott, Robert (Persistent Memory),
	Boaz Harrosh, linux-nvdimm@lists.01.org, Moreno, Oliver, x86,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	boylston, Linus Torvalds

On Tue, Jan 3, 2017 at 5:59 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Jan 03, 2017 at 05:38:54PM -0800, Dan Williams wrote:
>> > 1) memcpy_to_pmem() seems to rely upon the __copy_from_user_nocache()
>> > having only used movnt; it does not attempt clwb at all.
>>
>> Yes, and there was a fix a while back to make sure it always used
>> movnt so clwb after the fact is not required:
>>
>> a82eee742452 x86/uaccess/64: Handle the caching of 4-byte nocache
>> copies properly in __copy_user_nocache()
>>
>> > 2) __copy_from_user_nocache() for short copies does not use movnt at all.
>> > In that case neither sfence nor clwb is issued.
>>
>> For the 32bit case, yes, but the pmem driver should warn about this
>> when it checks platform persistent memory capabilities (i.e. x86 32bit
>> not supported). Ugh, we may have lost that warning for this specific
>> case recently, I'll go double check and fix it up.
>>
>> > 3) it uses movnt only for part of copying in case of misaligned copy;
>> > No clwb is issued, but sfence *is* - at the very end in 64bit case,
>> > between movnt and copying the tail - in 32bit one.  Incidentally,
>> > while 64bit case takes care to align the destination for movnt part,
>> > 32bit one does not.
>> >
>> > How much of the above is broken and what do the callers rely upon?
>>
>> 32bit issues are known, but 64bit path is ok since that fix above.
>
> Bollocks.  That fix above does *NOT* eliminate all cached stores.  Just look
> at the damn function - it still does cached stores for until the target is
> aligned and it does the same for tail when end of destination is not aligned.
> Right there in arch/x86/lib/copy_user_64.S.

No, it does not eliminate all cache stores, but the cases where we use
it have naturally aligned targets.

Yes, it is terrible to then call wrap it in a memcpy_to_pmem() wrapper
which does not document these alignment constraints.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-01-04  2:14 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-26 15:50 [PATCH v2 0/3] use nocache copy in copy_from_iter_nocache() Brian Boylston
2016-10-26 15:50 ` [PATCH v2 1/3] introduce memcpy_nocache() Brian Boylston
2016-10-26 19:30   ` Thomas Gleixner
2016-10-28  1:52     ` Boylston, Brian
2016-10-26 19:51   ` Boaz Harrosh
2016-10-28  1:54     ` Boylston, Brian
2016-11-01 14:25       ` Boaz Harrosh
2016-12-28 23:43         ` Al Viro
2016-12-29 18:23           ` Dan Williams
2016-12-30  3:52             ` Al Viro
2016-12-30  4:56               ` Dan Williams
2016-12-31  2:25                 ` [RFC] memcpy_nocache() and memcpy_writethrough() Al Viro
2017-01-02  2:35                   ` Elliott, Robert (Persistent Memory)
2017-01-02  5:09                     ` Al Viro
2017-01-03 21:14                       ` Dan Williams
2017-01-03 23:22                         ` Al Viro
2017-01-03 23:46                           ` Linus Torvalds
2017-01-04  0:57                             ` Dan Williams
2017-01-04  1:38                           ` Dan Williams
2017-01-04  1:59                             ` Al Viro
2017-01-04  2:14                               ` Dan Williams
2016-10-26 15:50 ` [PATCH v2 2/3] use a nocache copy for bvecs and kvecs in copy_from_iter_nocache() Brian Boylston
2016-10-27  4:46   ` Ross Zwisler
2016-10-26 15:50 ` [PATCH v2 3/3] x86: remove unneeded flush in arch_copy_from_iter_pmem() Brian Boylston
2016-10-26 19:57   ` Boaz Harrosh
2016-10-28  1:58     ` Boylston, Brian

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).