linux-kernel.vger.kernel.org archive mirror
* [PATCH] memcpy_flushcache: use cache flushing for larger lengths
@ 2020-04-07 15:01 Mikulas Patocka
  2020-04-07 16:09 ` Andy Lutomirski
  2020-04-07 17:52 ` Dan Williams
  0 siblings, 2 replies; 18+ messages in thread
From: Mikulas Patocka @ 2020-04-07 15:01 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Peter Zijlstra
  Cc: x86, Dan Williams, Linux Kernel Mailing List, dm-devel

[ resending this to x86 maintainers ]

Hi

I tested the performance of various methods of writing to Optane-based 
persistent memory, and found that non-temporal stores achieve a 
throughput of 1.3 GB/s, while 8 cached stores immediately followed by 
clflushopt or clwb achieve a throughput of 1.6 GB/s.

memcpy_flushcache uses non-temporal stores; I modified it to use cached 
stores + clflushopt, and it improved the performance of the dm-writecache 
target significantly:

dm-writecache throughput:
(dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
writecache block size   512             1024            2048            4096
movnti                  496 MB/s        642 MB/s        725 MB/s        744 MB/s
clflushopt              373 MB/s        688 MB/s        1.1 GB/s        1.2 GB/s

For block size 512, movnti works better; for larger block sizes, 
clflushopt is better.
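
In userspace terms, the store-then-flush pattern I measured is roughly 
the following (a minimal sketch using compiler intrinsics; the helper 
name is made up and this is not the actual benchmark code):

	#include <immintrin.h>
	#include <stdint.h>

	/* Copy one 64-byte cache line with 8 cached stores, then flush it.
	 * Build with -mclflushopt; dst must be 64-byte aligned. */
	static void copy_line_flush(uint64_t *dst, const uint64_t *src)
	{
		for (int i = 0; i < 8; i++)
			dst[i] = src[i];	/* ordinary cached stores */
		_mm_clflushopt(dst);		/* write the line back and evict it */
	}

	/* After the last line of a region: _mm_sfence(); to order the flushes */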

I was also testing the novafs filesystem; it is not upstream, but it 
benefited from a similar change in __memcpy_flushcache and 
__copy_user_nocache:
write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s
write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s


I submit this patch for __memcpy_flushcache that improves dm-writecache 
performance.

Other ideas - should we introduce memcpy_to_pmem instead of modifying 
memcpy_flushcache and move this logic there? Or should I modify the 
dm-writecache target directly to use clflushopt with no change to the 
architecture-specific code?

Mikulas




From: Mikulas Patocka <mpatocka@redhat.com>

I tested dm-writecache performance on a machine with Optane nvdimm and it
turned out that for larger writes, cached stores + cache flushing perform
better than non-temporal stores. This is the throughput of dm-writecache
measured with this command:
dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct

block size	512		1024		2048		4096
movnti		496 MB/s	642 MB/s	725 MB/s	744 MB/s
clflushopt	373 MB/s	688 MB/s	1.1 GB/s	1.2 GB/s

We can see that for smaller blocks, movnti performs better, but for larger
blocks, clflushopt has better performance.

This patch changes the function __memcpy_flushcache accordingly, so that
with size >= 768 it performs cached stores and cache flushing. Note that
we must not use the new branch if the CPU doesn't have clflushopt - in
that case, the kernel would use the inefficient "clflush" instruction, which
has very bad performance.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 arch/x86/lib/usercopy_64.c |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2020-03-24 15:15:36.644945091 -0400
+++ linux-2.6/arch/x86/lib/usercopy_64.c	2020-03-30 07:17:51.450290007 -0400
@@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
 			return;
 	}
 
+	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) {
+		while (!IS_ALIGNED(dest, 64)) {
+			asm("movq    (%0), %%r8\n"
+			    "movnti  %%r8,   (%1)\n"
+			    :: "r" (source), "r" (dest)
+			    : "memory", "r8");
+			dest += 8;
+			source += 8;
+			size -= 8;
+		}
+		do {
+			asm("movq    (%0), %%r8\n"
+			    "movq   8(%0), %%r9\n"
+			    "movq  16(%0), %%r10\n"
+			    "movq  24(%0), %%r11\n"
+			    "movq    %%r8,   (%1)\n"
+			    "movq    %%r9,  8(%1)\n"
+			    "movq   %%r10, 16(%1)\n"
+			    "movq   %%r11, 24(%1)\n"
+			    "movq  32(%0), %%r8\n"
+			    "movq  40(%0), %%r9\n"
+			    "movq  48(%0), %%r10\n"
+			    "movq  56(%0), %%r11\n"
+			    "movq    %%r8, 32(%1)\n"
+			    "movq    %%r9, 40(%1)\n"
+			    "movq   %%r10, 48(%1)\n"
+			    "movq   %%r11, 56(%1)\n"
+			    :: "r" (source), "r" (dest)
+			    : "memory", "r8", "r9", "r10", "r11");
+			clflushopt((void *)dest);
+			dest += 64;
+			source += 64;
+			size -= 64;
+		} while (size >= 64);
+	}
+
 	/* 4x8 movnti loop */
 	while (size >= 32) {
 		asm("movq    (%0), %%r8\n"



* Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths
  2020-04-07 15:01 [PATCH] memcpy_flushcache: use cache flushing for larger lengths Mikulas Patocka
@ 2020-04-07 16:09 ` Andy Lutomirski
  2020-04-07 16:33   ` Mikulas Patocka
  2020-04-07 17:52 ` Dan Williams
  1 sibling, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2020-04-07 16:09 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Peter Zijlstra, x86, Dan Williams, Linux Kernel Mailing List,
	dm-devel


> On Apr 7, 2020, at 8:01 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> [ resending this to x86 maintainers ]
> 
> Hi
> 
> I tested the performance of various methods of writing to Optane-based
> persistent memory, and found that non-temporal stores achieve a 
> throughput of 1.3 GB/s, while 8 cached stores immediately followed by 
> clflushopt or clwb achieve a throughput of 1.6 GB/s.
> 
> memcpy_flushcache uses non-temporal stores; I modified it to use cached 
> stores + clflushopt, and it improved the performance of the dm-writecache 
> target significantly:
> 
> dm-writecache throughput:
> (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
> writecache block size   512             1024            2048            4096
> movnti                  496 MB/s        642 MB/s        725 MB/s        744 MB/s
> clflushopt              373 MB/s        688 MB/s        1.1 GB/s        1.2 GB/s
> 
> For block size 512, movnti works better; for larger block sizes, 
> clflushopt is better.
> 
> I was also testing the novafs filesystem; it is not upstream, but it 
> benefited from a similar change in __memcpy_flushcache and 
> __copy_user_nocache:
> write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s
> write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s
> 
> 
> I submit this patch for __memcpy_flushcache that improves dm-writecache 
> performance.
> 
> Other ideas - should we introduce memcpy_to_pmem instead of modifying 
> memcpy_flushcache and move this logic there? Or should I modify the 
> dm-writecache target directly to use clflushopt with no change to the 
> architecture-specific code?
> 
> Mikulas
> 
> 
> 
> 
> From: Mikulas Patocka <mpatocka@redhat.com>
> 
> I tested dm-writecache performance on a machine with Optane nvdimm and it
> turned out that for larger writes, cached stores + cache flushing perform
> better than non-temporal stores. This is the throughput of dm-writecache
> measured with this command:
> dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct
> 
> block size    512        1024        2048        4096
> movnti        496 MB/s    642 MB/s    725 MB/s    744 MB/s
> clflushopt    373 MB/s    688 MB/s    1.1 GB/s    1.2 GB/s
> 
> We can see that for smaller blocks, movnti performs better, but for larger
> blocks, clflushopt has better performance.
> 
> This patch changes the function __memcpy_flushcache accordingly, so that
> with size >= 768 it performs cached stores and cache flushing. Note that
> we must not use the new branch if the CPU doesn't have clflushopt - in
> that case, the kernel would use the inefficient "clflush" instruction, which
> has very bad performance.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> 
> ---
> arch/x86/lib/usercopy_64.c |   36 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
> 
> Index: linux-2.6/arch/x86/lib/usercopy_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/lib/usercopy_64.c    2020-03-24 15:15:36.644945091 -0400
> +++ linux-2.6/arch/x86/lib/usercopy_64.c    2020-03-30 07:17:51.450290007 -0400
> @@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
>            return;
>    }
> 
> +    if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) {
> +        while (!IS_ALIGNED(dest, 64)) {
> +            asm("movq    (%0), %%r8\n"
> +                "movnti  %%r8,   (%1)\n"
> +                :: "r" (source), "r" (dest)
> +                : "memory", "r8");
> +            dest += 8;
> +            source += 8;
> +            size -= 8;
> +        }
> +        do {
> +            asm("movq    (%0), %%r8\n"
> +                "movq   8(%0), %%r9\n"
> +                "movq  16(%0), %%r10\n"
> +                "movq  24(%0), %%r11\n"
> +                "movq    %%r8,   (%1)\n"
> +                "movq    %%r9,  8(%1)\n"
> +                "movq   %%r10, 16(%1)\n"
> +                "movq   %%r11, 24(%1)\n"
> +                "movq  32(%0), %%r8\n"
> +                "movq  40(%0), %%r9\n"
> +                "movq  48(%0), %%r10\n"
> +                "movq  56(%0), %%r11\n"
> +                "movq    %%r8, 32(%1)\n"
> +                "movq    %%r9, 40(%1)\n"
> +                "movq   %%r10, 48(%1)\n"
> +                "movq   %%r11, 56(%1)\n"
> +                :: "r" (source), "r" (dest)
> +                : "memory", "r8", "r9", "r10", "r11");

Does this actually work better than the corresponding C code?

Also, that memory clobber probably isn’t doing your code generation any favors.  Experimentally, you have the constraints wrong. An “r” constraint doesn’t tell GCC that you are dereferencing the pointer.  You need to use “m” with a correctly-sized type.  But I bet plain C is at least as good.

> +            clflushopt((void *)dest);
> +            dest += 64;
> +            source += 64;
> +            size -= 64;
> +        } while (size >= 64);
> +    }
> +
>    /* 4x8 movnti loop */
>    while (size >= 32) {
>        asm("movq    (%0), %%r8\n"
> 


* Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths
  2020-04-07 16:09 ` Andy Lutomirski
@ 2020-04-07 16:33   ` Mikulas Patocka
  0 siblings, 0 replies; 18+ messages in thread
From: Mikulas Patocka @ 2020-04-07 16:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Peter Zijlstra, x86, Dan Williams, Linux Kernel Mailing List,
	dm-devel




On Tue, 7 Apr 2020, Andy Lutomirski wrote:

> 
> > On Apr 7, 2020, at 8:01 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > 
> > [ resending this to x86 maintainers ]
> > 
> > Hi
> > 
> > I tested the performance of various methods of writing to Optane-based
> > persistent memory, and found that non-temporal stores achieve a 
> > throughput of 1.3 GB/s, while 8 cached stores immediately followed by 
> > clflushopt or clwb achieve a throughput of 1.6 GB/s.
> > 
> > memcpy_flushcache uses non-temporal stores; I modified it to use cached 
> > stores + clflushopt, and it improved the performance of the dm-writecache 
> > target significantly:
> > 
> > dm-writecache throughput:
> > (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
> > writecache block size   512             1024            2048            4096
> > movnti                  496 MB/s        642 MB/s        725 MB/s        744 MB/s
> > clflushopt              373 MB/s        688 MB/s        1.1 GB/s        1.2 GB/s
> > 
> > For block size 512, movnti works better; for larger block sizes, 
> > clflushopt is better.
> > 
> > I was also testing the novafs filesystem; it is not upstream, but it 
> > benefited from a similar change in __memcpy_flushcache and 
> > __copy_user_nocache:
> > write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s
> > write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s
> > 
> > 
> > I submit this patch for __memcpy_flushcache that improves dm-writecache 
> > performance.
> > 
> > Other ideas - should we introduce memcpy_to_pmem instead of modifying 
> > memcpy_flushcache and move this logic there? Or should I modify the 
> > dm-writecache target directly to use clflushopt with no change to the 
> > architecture-specific code?
> > 
> > Mikulas
> > 
> > 
> > 
> > 
> > From: Mikulas Patocka <mpatocka@redhat.com>
> > 
> > I tested dm-writecache performance on a machine with Optane nvdimm and it
> > turned out that for larger writes, cached stores + cache flushing perform
> > better than non-temporal stores. This is the throughput of dm-writecache
> > measured with this command:
> > dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct
> > 
> > block size    512        1024        2048        4096
> > movnti        496 MB/s    642 MB/s    725 MB/s    744 MB/s
> > clflushopt    373 MB/s    688 MB/s    1.1 GB/s    1.2 GB/s
> > 
> > We can see that for smaller blocks, movnti performs better, but for larger
> > blocks, clflushopt has better performance.
> > 
> > This patch changes the function __memcpy_flushcache accordingly, so that
> > with size >= 768 it performs cached stores and cache flushing. Note that
> > we must not use the new branch if the CPU doesn't have clflushopt - in
> > that case, the kernel would use the inefficient "clflush" instruction, which
> > has very bad performance.
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > 
> > ---
> > arch/x86/lib/usercopy_64.c |   36 ++++++++++++++++++++++++++++++++++++
> > 1 file changed, 36 insertions(+)
> > 
> > Index: linux-2.6/arch/x86/lib/usercopy_64.c
> > ===================================================================
> > --- linux-2.6.orig/arch/x86/lib/usercopy_64.c    2020-03-24 15:15:36.644945091 -0400
> > +++ linux-2.6/arch/x86/lib/usercopy_64.c    2020-03-30 07:17:51.450290007 -0400
> > @@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
> >            return;
> >    }
> > 
> > +    if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) {
> > +        while (!IS_ALIGNED(dest, 64)) {
> > +            asm("movq    (%0), %%r8\n"
> > +                "movnti  %%r8,   (%1)\n"
> > +                :: "r" (source), "r" (dest)
> > +                : "memory", "r8");
> > +            dest += 8;
> > +            source += 8;
> > +            size -= 8;
> > +        }
> > +        do {
> > +            asm("movq    (%0), %%r8\n"
> > +                "movq   8(%0), %%r9\n"
> > +                "movq  16(%0), %%r10\n"
> > +                "movq  24(%0), %%r11\n"
> > +                "movq    %%r8,   (%1)\n"
> > +                "movq    %%r9,  8(%1)\n"
> > +                "movq   %%r10, 16(%1)\n"
> > +                "movq   %%r11, 24(%1)\n"
> > +                "movq  32(%0), %%r8\n"
> > +                "movq  40(%0), %%r9\n"
> > +                "movq  48(%0), %%r10\n"
> > +                "movq  56(%0), %%r11\n"
> > +                "movq    %%r8, 32(%1)\n"
> > +                "movq    %%r9, 40(%1)\n"
> > +                "movq   %%r10, 48(%1)\n"
> > +                "movq   %%r11, 56(%1)\n"
> > +                :: "r" (source), "r" (dest)
> > +                : "memory", "r8", "r9", "r10", "r11");
> 
> Does this actually work better than the corresponding C code?
> 
> Also, that memory clobber probably isn’t doing your code generation any 
> favors.  Experimentally, you have the constraints wrong. An “r” 

The existing "movnti" loop uses exactly the same constraints (and the 
"memory" clobber).

> constraint doesn’t tell GCC that you are dereferencing the pointer.  
> You need to use “m” with a correctly-sized type.

But you would use
	"=m"(*(char *)dest),"=m"(*((char *)dest + 8)),"=m"((char *)dest + 16))...
and so on, until you run out of argument numbers.
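
For a single quadword, the "m"-constraint form would look something like 
this (an untested sketch, just to illustrate the constraint spelling):

	static inline void movnti_qword(void *dst, const void *src)
	{
		asm("movq   %1, %%r8\n\t"
		    "movnti %%r8, %0"		/* non-temporal 8-byte store */
		    : "=m" (*(unsigned long *)dst)
		    : "m" (*(const unsigned long *)src)
		    : "r8");
	}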

> But I bet plain C is at least as good.

I tried to replace it with
	memcpy((void *)dest, (void *)src, 64);

The compiler inlined the memcpy function into 8 loads and 8 stores. 
However, the whole function __memcpy_flushcache consumed one more saved 
register and the machine code was a few bytes longer.

Mikulas


* Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths
  2020-04-07 15:01 [PATCH] memcpy_flushcache: use cache flushing for larger lengths Mikulas Patocka
  2020-04-07 16:09 ` Andy Lutomirski
@ 2020-04-07 17:52 ` Dan Williams
  2020-04-08 18:54   ` Mikulas Patocka
  1 sibling, 1 reply; 18+ messages in thread
From: Dan Williams @ 2020-04-07 17:52 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Peter Zijlstra, X86 ML, Linux Kernel Mailing List,
	device-mapper development

On Tue, Apr 7, 2020 at 8:02 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> [ resending this to x86 maintainers ]
>
> Hi
>
> I tested the performance of various methods of writing to Optane-based
> persistent memory, and found that non-temporal stores achieve a
> throughput of 1.3 GB/s, while 8 cached stores immediately followed by
> clflushopt or clwb achieve a throughput of 1.6 GB/s.
>
> memcpy_flushcache uses non-temporal stores; I modified it to use cached
> stores + clflushopt, and it improved the performance of the dm-writecache
> target significantly:
>
> dm-writecache throughput:
> (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
> writecache block size   512             1024            2048            4096
> movnti                  496 MB/s        642 MB/s        725 MB/s        744 MB/s
> clflushopt              373 MB/s        688 MB/s        1.1 GB/s        1.2 GB/s
>
> For block size 512, movnti works better; for larger block sizes,
> clflushopt is better.

This should use clwb instead of clflushopt; the clwb macro
automatically converts back to clflushopt if clwb is not supported.
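
(The fallback is done with boot-time instruction patching via the
alternatives mechanism in arch/x86/include/asm/special_insns.h;
conceptually it is equivalent to this sketch - the helper name and the
runtime branches here are illustrative only:)

	static inline void clwb_or_fallback(volatile void *p)
	{
		if (static_cpu_has(X86_FEATURE_CLWB))
			asm volatile(".byte 0x66, 0x0f, 0xae, 0x30"	/* clwb (%%rax) */
				     :: "a" (p) : "memory");
		else if (static_cpu_has(X86_FEATURE_CLFLUSHOPT))
			asm volatile(".byte 0x66; clflush %0"		/* clflushopt */
				     : "+m" (*(volatile char *)p));
		else
			asm volatile("clflush %0"
				     : "+m" (*(volatile char *)p));
	}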

>
> I was also testing the novafs filesystem; it is not upstream, but it
> benefited from a similar change in __memcpy_flushcache and
> __copy_user_nocache:
> write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s
> write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s
>
>
> I submit this patch for __memcpy_flushcache that improves dm-writecache
> performance.
>
> Other ideas - should we introduce memcpy_to_pmem instead of modifying
> memcpy_flushcache and move this logic there? Or should I modify the
> dm-writecache target directly to use clflushopt with no change to the
> architecture-specific code?

This also needs to mention your analysis that showed that this can
have negative cache pollution effects [1], so I'm not sure how to
decide when to make the tradeoff. Once we have movdir64b the tradeoff
equation changes yet again:

[1]: https://lore.kernel.org/linux-nvdimm/alpine.LRH.2.02.2004010941310.23210@file01.intranet.prod.int.rdu2.redhat.com/
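
(For reference, movdir64b performs a 64-byte direct store - cache-bypassing,
with 64-byte write atomicity - from a memory source to a 64-byte-aligned
destination. A rough sketch of a wrapper, illustrative only, with the
instruction emitted as raw opcode bytes since assemblers do not know the
mnemonic yet:)

	static inline void movdir64b_copy(void *dst, const void *src)
	{
		/* MOVDIR64B (%rdx), %rax - 64-byte direct store to (%rax) */
		asm volatile(".byte 0x66, 0x0f, 0x38, 0xf8, 0x02"
			     : "+m" (*(char (*)[64])dst)
			     : "m" (*(const char (*)[64])src),
			       "a" (dst), "d" (src));
	}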


>
> Mikulas
>
>
>
>
> From: Mikulas Patocka <mpatocka@redhat.com>
>
> I tested dm-writecache performance on a machine with Optane nvdimm and it
> turned out that for larger writes, cached stores + cache flushing perform
> better than non-temporal stores. This is the throughput of dm-writecache
> measured with this command:
> dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct
>
> block size      512             1024            2048            4096
> movnti          496 MB/s        642 MB/s        725 MB/s        744 MB/s
> clflushopt      373 MB/s        688 MB/s        1.1 GB/s        1.2 GB/s
>
> We can see that for smaller blocks, movnti performs better, but for larger
> blocks, clflushopt has better performance.
>
> This patch changes the function __memcpy_flushcache accordingly, so that
> with size >= 768 it performs cached stores and cache flushing. Note that
> we must not use the new branch if the CPU doesn't have clflushopt - in
> that case, the kernel would use the inefficient "clflush" instruction, which
> has very bad performance.
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>
> ---
>  arch/x86/lib/usercopy_64.c |   36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
>
> Index: linux-2.6/arch/x86/lib/usercopy_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/lib/usercopy_64.c   2020-03-24 15:15:36.644945091 -0400
> +++ linux-2.6/arch/x86/lib/usercopy_64.c        2020-03-30 07:17:51.450290007 -0400
> @@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
>                         return;
>         }
>
> +       if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) {
> +               while (!IS_ALIGNED(dest, 64)) {
> +                       asm("movq    (%0), %%r8\n"
> +                           "movnti  %%r8,   (%1)\n"
> +                           :: "r" (source), "r" (dest)
> +                           : "memory", "r8");
> +                       dest += 8;
> +                       source += 8;
> +                       size -= 8;
> +               }
> +               do {
> +                       asm("movq    (%0), %%r8\n"
> +                           "movq   8(%0), %%r9\n"
> +                           "movq  16(%0), %%r10\n"
> +                           "movq  24(%0), %%r11\n"
> +                           "movq    %%r8,   (%1)\n"
> +                           "movq    %%r9,  8(%1)\n"
> +                           "movq   %%r10, 16(%1)\n"
> +                           "movq   %%r11, 24(%1)\n"
> +                           "movq  32(%0), %%r8\n"
> +                           "movq  40(%0), %%r9\n"
> +                           "movq  48(%0), %%r10\n"
> +                           "movq  56(%0), %%r11\n"
> +                           "movq    %%r8, 32(%1)\n"
> +                           "movq    %%r9, 40(%1)\n"
> +                           "movq   %%r10, 48(%1)\n"
> +                           "movq   %%r11, 56(%1)\n"
> +                           :: "r" (source), "r" (dest)
> +                           : "memory", "r8", "r9", "r10", "r11");
> +                       clflushopt((void *)dest);
> +                       dest += 64;
> +                       source += 64;
> +                       size -= 64;
> +               } while (size >= 64);
> +       }
> +
>         /* 4x8 movnti loop */
>         while (size >= 32) {
>                 asm("movq    (%0), %%r8\n"
>


* Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths
  2020-04-07 17:52 ` Dan Williams
@ 2020-04-08 18:54   ` Mikulas Patocka
  2020-04-08 19:29     ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Mikulas Patocka @ 2020-04-08 18:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Peter Zijlstra, X86 ML, Linux Kernel Mailing List,
	device-mapper development



On Tue, 7 Apr 2020, Dan Williams wrote:

> On Tue, Apr 7, 2020 at 8:02 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> > [ resending this to x86 maintainers ]
> >
> > Hi
> >
> > I tested the performance of various methods of writing to Optane-based
> > persistent memory, and found that non-temporal stores achieve a
> > throughput of 1.3 GB/s, while 8 cached stores immediately followed by
> > clflushopt or clwb achieve a throughput of 1.6 GB/s.
> >
> > memcpy_flushcache uses non-temporal stores; I modified it to use cached
> > stores + clflushopt, and it improved the performance of the dm-writecache
> > target significantly:
> >
> > dm-writecache throughput:
> > (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
> > writecache block size   512             1024            2048            4096
> > movnti                  496 MB/s        642 MB/s        725 MB/s        744 MB/s
> > clflushopt              373 MB/s        688 MB/s        1.1 GB/s        1.2 GB/s
> >
> > For block size 512, movnti works better; for larger block sizes,
> > clflushopt is better.
> 
> This should use clwb instead of clflushopt; the clwb macro
> automatically converts back to clflushopt if clwb is not supported.

But we want to invalidate the cache; we do not expect the CPU to access 
this data anymore (it will be accessed by a DMA engine during writeback).

> > I was also testing the novafs filesystem; it is not upstream, but it
> > benefited from a similar change in __memcpy_flushcache and
> > __copy_user_nocache:
> > write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s
> > write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s
> >
> >
> > I submit this patch for __memcpy_flushcache that improves dm-writecache
> > performance.
> >
> > Other ideas - should we introduce memcpy_to_pmem instead of modifying
> > memcpy_flushcache and move this logic there? Or should I modify the
> > dm-writecache target directly to use clflushopt with no change to the
> > architecture-specific code?
> 
> This also needs to mention your analysis that showed that this can
> have negative cache pollution effects [1], so I'm not sure how to
> decide when to make the tradeoff. Once we have movdir64b the tradeoff
> equation changes yet again:
> 
> [1]: https://lore.kernel.org/linux-nvdimm/alpine.LRH.2.02.2004010941310.23210@file01.intranet.prod.int.rdu2.redhat.com/

I analyzed it some more. I have created this program that tests writecache 
w.r.t. cache pollution:

http://people.redhat.com/~mpatocka/testcases/pmem/misc/l1-test-2.c

It fills the cache with a chain of random pointers and then walks these 
pointers to evaluate cache pollution. Between the walks, it writes data to 
the dm-writecache target.
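
(The core of the test is a dependent pointer chase; a rough sketch of the
technique - the real code is in the linked l1-test-2.c, and these names
are illustrative:)

	#include <stdlib.h>

	struct node {
		struct node *next;
		char pad[56];		/* pad to one 64-byte cache line */
	};

	/* Link the nodes into a single random cycle (Fisher-Yates shuffle). */
	static struct node *build_chain(struct node *nodes, size_t n)
	{
		size_t *idx = malloc(n * sizeof *idx);
		struct node *head;
		size_t i;

		for (i = 0; i < n; i++)
			idx[i] = i;
		for (i = n - 1; i > 0; i--) {
			size_t j = rand() % (i + 1);
			size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
		}
		for (i = 0; i < n - 1; i++)
			nodes[idx[i]].next = &nodes[idx[i + 1]];
		nodes[idx[n - 1]].next = &nodes[idx[0]];
		head = &nodes[idx[0]];
		free(idx);
		return head;
	}

	/* Each hop depends on the previous load, so the walk time directly
	 * reflects how many chain lines the intervening writes evicted. */
	static struct node *walk(struct node *p, long steps)
	{
		while (steps--)
			p = p->next;
		return p;	/* returning p keeps the loop from being optimized out */
	}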

With the original kernel, the result is:
8503 - 11366
real    0m7.985s
user    0m0.585s
sys     0m7.390s

With dm-writecache hacked to use cached writes + clflushopt:
8513 - 11379
real    0m5.045s
user    0m0.670s
sys     0m4.365s

So, the hacked dm-writecache is significantly faster, while the cache 
micro-benchmark doesn't show any more cache pollution.

That's for dm-writecache. Are there some other significant users of 
memcpy_flushcache that need to be checked?

Mikulas



* Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths
  2020-04-08 18:54   ` Mikulas Patocka
@ 2020-04-08 19:29     ` Dan Williams
  2020-04-09 14:36       ` Mikulas Patocka
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2020-04-08 19:29 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Peter Zijlstra, X86 ML, Linux Kernel Mailing List,
	device-mapper development

On Wed, Apr 8, 2020 at 11:54 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
>
> On Tue, 7 Apr 2020, Dan Williams wrote:
>
> > On Tue, Apr 7, 2020 at 8:02 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > >
> > > [ resending this to x86 maintainers ]
> > >
> > > Hi
> > >
> > > I tested the performance of various methods of writing to Optane-based
> > > persistent memory, and found that non-temporal stores achieve a
> > > throughput of 1.3 GB/s, while 8 cached stores immediately followed by
> > > clflushopt or clwb achieve a throughput of 1.6 GB/s.
> > >
> > > memcpy_flushcache uses non-temporal stores; I modified it to use cached
> > > stores + clflushopt, and it improved the performance of the dm-writecache
> > > target significantly:
> > >
> > > dm-writecache throughput:
> > > (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
> > > writecache block size   512             1024            2048            4096
> > > movnti                  496 MB/s        642 MB/s        725 MB/s        744 MB/s
> > > clflushopt              373 MB/s        688 MB/s        1.1 GB/s        1.2 GB/s
> > >
> > > For block size 512, movnti works better; for larger block sizes,
> > > clflushopt is better.
> >
> > This should use clwb instead of clflushopt; the clwb macro
> > automatically converts back to clflushopt if clwb is not supported.
>
> But we want to invalidate the cache; we do not expect the CPU to access
> this data anymore (it will be accessed by a DMA engine during writeback).

The clflushopt and clwb instructions should have identical overhead,
but clwb wins on the rare chance the written data is needed again
soon. If it is never needed again then the cost of dropping a clean
cache line is the same as if the line was invalidated in the first
instance. In both cases (clflushopt and clwb) the snoop traffic
overhead is still paid whether the written-back line is still present
in the cache or not.

>
> > > I was also testing the novafs filesystem; it is not upstream, but it
> > > benefited from a similar change in __memcpy_flushcache and
> > > __copy_user_nocache:
> > > write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s
> > > write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s
> > >
> > >
> > > I submit this patch for __memcpy_flushcache that improves dm-writecache
> > > performance.
> > >
> > > Other ideas - should we introduce memcpy_to_pmem instead of modifying
> > > memcpy_flushcache and move this logic there? Or should I modify the
> > > dm-writecache target directly to use clflushopt with no change to the
> > > architecture-specific code?
> >
> > This also needs to mention your analysis that showed that this can
> > have negative cache pollution effects [1], so I'm not sure how to
> > decide when to make the tradeoff. Once we have movdir64b the tradeoff
> > equation changes yet again:
> >
> > [1]: https://lore.kernel.org/linux-nvdimm/alpine.LRH.2.02.2004010941310.23210@file01.intranet.prod.int.rdu2.redhat.com/
>
> I analyzed it some more. I have created this program that tests writecache
> w.r.t. cache pollution:
>
> http://people.redhat.com/~mpatocka/testcases/pmem/misc/l1-test-2.c
>
> It fills the cache with a chain of random pointers and then walks these
> pointers to evaluate cache pollution. Between the walks, it writes data to
> the dm-writecache target.
>
> With the original kernel, the result is:
> 8503 - 11366
> real    0m7.985s
> user    0m0.585s
> sys     0m7.390s
>
> With dm-writecache hacked to use cached writes + clflushopt:
> 8513 - 11379
> real    0m5.045s
> user    0m0.670s
> sys     0m4.365s
>
> So, the hacked dm-writecache is significantly faster, while the cache
> micro-benchmark doesn't show any more cache pollution.

Nice. These are now the pmem numbers, or dram? Otherwise, what changed
that made nt-writes on pmem perform better compared to your previous
test? I'm just trying to track the results.

> That's for dm-writecache. Are there some other significant users of
> memcpy_flushcache that need to be checked?

The only other user is direct and dax-I/O to the pmem driver.


* Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths
  2020-04-08 19:29     ` Dan Williams
@ 2020-04-09 14:36       ` Mikulas Patocka
  2020-04-16  8:24         ` Mikulas Patocka
  0 siblings, 1 reply; 18+ messages in thread
From: Mikulas Patocka @ 2020-04-09 14:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Peter Zijlstra, X86 ML, Linux Kernel Mailing List,
	device-mapper development



On Wed, 8 Apr 2020, Dan Williams wrote:

> On Wed, Apr 8, 2020 at 11:54 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> >
> > On Tue, 7 Apr 2020, Dan Williams wrote:
> >
> > > On Tue, Apr 7, 2020 at 8:02 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > >
> > > This should use clwb instead of clflushopt; the clwb macro
> > > automatically converts back to clflushopt if clwb is not supported.
> >
> > But we want to invalidate the cache; we do not expect the CPU to access
> > this data anymore (it will be accessed by a DMA engine during writeback).
> 
> The clflushopt and clwb instructions should have identical overhead,
> but clwb wins on the rare chance the written data is needed again
> soon. If it is never needed again then the cost of dropping a clean
> cache line is the same as if the line was invalidated in the first
> instance. In both cases (clflushopt and clwb) the snoop traffic
> overhead is still paid whether the written-back line is still present
> in the cache or not.

But my concern is that clflushopt removes the line from the cache and 
makes room for another line (this is the desired behavior), while clwb 
keeps the line cached, so the line would have to compete with other cache 
lines in the same associative set.

Do you know how the CPU selects the cache line to be replaced?

dm-writecache is intended to be used for workloads like database logs that 
need extra-low commit latency. The committed data is not read back during 
normal operation.

> > > > Other ideas - should we introduce memcpy_to_pmem instead of modifying
> > > > memcpy_flushcache and move this logic there? Or should I modify the
> > > > dm-writecache target directly to use clflushopt with no change to the
> > > > architecture-specific code?
> > >
> > > This also needs to mention your analysis that showed that this can
> > > have negative cache pollution effects [1], so I'm not sure how to
> > > decide when to make the tradeoff. Once we have movdir64b the tradeoff
> > > equation changes yet again:
> > >
> > > [1]: https://lore.kernel.org/linux-nvdimm/alpine.LRH.2.02.2004010941310.23210@file01.intranet.prod.int.rdu2.redhat.com/
> >
> > I analyzed it some more. I have created this program that tests writecache
> > w.r.t. cache pollution:
> >
> > http://people.redhat.com/~mpatocka/testcases/pmem/misc/l1-test-2.c
> >
> > It fills the cache with a chain of random pointers and then walks these
> > pointers to evaluate cache pollution. Between the walks, it writes data to
> > the dm-writecache target.
> >
> > With the original kernel, the result is:
> > 8503 - 11366
> > real    0m7.985s
> > user    0m0.585s
> > sys     0m7.390s
> >
> > With dm-writecache hacked to use cached writes + clflushopt:
> > 8513 - 11379
> > real    0m5.045s
> > user    0m0.670s
> > sys     0m4.365s
> >
> > So, the hacked dm-writecache is significantly faster, while the cache
> > micro-benchmark doesn't show any more cache pollution.
> 
> Nice. These are now the pmem numbers, or dram?

pmem


With dm-writecache on emulated pmem (with the memmap argument), we get

With the original kernel:
8508 - 11378
real    0m4.960s
user    0m0.638s
sys     0m4.312s

With dm-writecache hacked to use cached writes + clflushopt:
8505 - 11378
real    0m4.151s
user    0m0.560s
sys     0m3.582s

So - clflushopt is still slightly better.

> Otherwise, what changed that made nt-writes on pmem perform better 
> compared to your previous test? I'm just trying to track the results.

I re-ran the previous test 
( http://people.redhat.com/~mpatocka/testcases/pmem/misc/l1-test.c )
and the result is this:

Write + clflushopt:
./l1-test /dev/ram0 f
8502 - 22616
./l1-test /dev/dax3.0 f
8502 - 22902
./l1-test /dev/dax4.0 f
8500 - 11970

Write + clwb:
./l1-test /dev/ram0 w
8502 - 22602
./l1-test /dev/dax3.0 w
8502 - 22454
./l1-test /dev/dax4.0 w
8502 - 11566

Non-temporal stores:
./l1-test /dev/ram0 n
8504 - 22162
./l1-test /dev/dax3.0 n
8502 - 12336
./l1-test /dev/dax4.0 n
8502 - 10662

(/dev/dax3.0 is the real persistent memory, /dev/dax4.0 is pmem emulated 
with the memmap parameter)

"./l1-test /dev/ram0 n" is slower than "./l1-test /dev/dax4.0 n" while 
both of these tests are on RAM. The pmem is mapped with large pages and 
mem map for ramdisk is not - perhaps this is making the difference?

"./l1-test /dev/dax3.0 n" is better than "./l1-test /dev/dax3.0 w" and 
"./l1-test /dev/dax3.0 f" - although the benchmaks done on dm-writecache 
show that cached writes + clflushopt perform better. I don't know why 
there is this disparity.

> > That's for dm-writecache. Are there some other significant users of
> > memcpy_flushcache that need to be checked?
> 
> The only other user is direct and dax-I/O to the pmem driver.

Mikulas



* Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths
  2020-04-09 14:36       ` Mikulas Patocka
@ 2020-04-16  8:24         ` Mikulas Patocka
  2020-04-16 18:28           ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Mikulas Patocka @ 2020-04-16  8:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Peter Zijlstra, X86 ML, Linux Kernel Mailing List,
	device-mapper development



On Thu, 9 Apr 2020, Mikulas Patocka wrote:

> With dm-writecache on emulated pmem (with the memmap argument), we get
> 
> With the original kernel:
> 8508 - 11378
> real    0m4.960s
> user    0m0.638s
> sys     0m4.312s
> 
> With dm-writecache hacked to use cached writes + clflushopt:
> 8505 - 11378
> real    0m4.151s
> user    0m0.560s
> sys     0m3.582s

I did some multithreaded tests: 
http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem-multithreaded.txt

And it turns out that for singlethreaded access, write+clwb performs 
better, while for multithreaded access, non-temporal stores perform 
better.

1       sequential write-nt 8 bytes             1.3 GB/s
2       sequential write-nt 8 bytes             2.5 GB/s
3       sequential write-nt 8 bytes             2.8 GB/s
4       sequential write-nt 8 bytes             2.8 GB/s
5       sequential write-nt 8 bytes             2.5 GB/s

1       sequential write 8 bytes + clwb         1.6 GB/s
2       sequential write 8 bytes + clwb         2.4 GB/s
3       sequential write 8 bytes + clwb         1.7 GB/s
4       sequential write 8 bytes + clwb         1.2 GB/s
5       sequential write 8 bytes + clwb         0.8 GB/s

For one thread, we can see that write-nt 8 bytes has 1.3 GB/s and write 
8+clwb has 1.6 GB/s, but for multiple threads, write-nt has better 
throughput.

The dm-writecache target is singlethreaded (all the copying is done while 
holding the writecache lock), so it benefits from clwb.

Should memcpy_flushcache be changed to write+clwb? Or are there some 
multithreaded users of memcpy_flushcache that would be hurt by this 
change?

Mikulas



* Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths
  2020-04-16  8:24         ` Mikulas Patocka
@ 2020-04-16 18:28           ` Dan Williams
  2020-04-17 12:47             ` [PATCH] x86: introduce memcpy_flushcache_clflushopt Mikulas Patocka
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2020-04-16 18:28 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Peter Zijlstra, X86 ML, Linux Kernel Mailing List,
	device-mapper development

On Thu, Apr 16, 2020 at 1:24 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
>
> On Thu, 9 Apr 2020, Mikulas Patocka wrote:
>
> > With dm-writecache on emulated pmem (with the memmap argument), we get
> >
> > With the original kernel:
> > 8508 - 11378
> > real    0m4.960s
> > user    0m0.638s
> > sys     0m4.312s
> >
> > With dm-writecache hacked to use cached writes + clflushopt:
> > 8505 - 11378
> > real    0m4.151s
> > user    0m0.560s
> > sys     0m3.582s
>
> I did some multithreaded tests:
> http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem-multithreaded.txt
>
> And it turns out that for singlethreaded access, write+clwb performs
> better, while for multithreaded access, non-temporal stores perform
> better.
>
> 1       sequential write-nt 8 bytes             1.3 GB/s
> 2       sequential write-nt 8 bytes             2.5 GB/s
> 3       sequential write-nt 8 bytes             2.8 GB/s
> 4       sequential write-nt 8 bytes             2.8 GB/s
> 5       sequential write-nt 8 bytes             2.5 GB/s
>
> 1       sequential write 8 bytes + clwb         1.6 GB/s
> 2       sequential write 8 bytes + clwb         2.4 GB/s
> 3       sequential write 8 bytes + clwb         1.7 GB/s
> 4       sequential write 8 bytes + clwb         1.2 GB/s
> 5       sequential write 8 bytes + clwb         0.8 GB/s
>
> For one thread, we can see that write-nt 8 bytes has 1.3 GB/s and write
> 8+clwb has 1.6 GB/s, but for multiple threads, write-nt has better
> throughput.
>
> The dm-writecache target is singlethreaded (all the copying is done while
> holding the writecache lock), so it benefits from clwb.
>
> Should memcpy_flushcache be changed to write+clwb? Or are there some
> multithreaded users of memcpy_flushcache that would be hurt by this
> change?

Maybe this is asking for a specific memcpy_flushcache_inatomic()
implementation for your use case, but leave nt-writes for the general
case?


* [PATCH] x86: introduce memcpy_flushcache_clflushopt
  2020-04-16 18:28           ` Dan Williams
@ 2020-04-17 12:47             ` Mikulas Patocka
  2020-04-17 17:57               ` Dan Williams
  2020-04-18 13:27               ` [PATCH] x86: introduce memcpy_flushcache_clflushopt David Laight
  0 siblings, 2 replies; 18+ messages in thread
From: Mikulas Patocka @ 2020-04-17 12:47 UTC (permalink / raw)
  To: Dan Williams
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Peter Zijlstra, X86 ML, Linux Kernel Mailing List,
	device-mapper development



On Thu, 16 Apr 2020, Dan Williams wrote:

> On Thu, Apr 16, 2020 at 1:24 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> >
> > On Thu, 9 Apr 2020, Mikulas Patocka wrote:
> >
> > > With dm-writecache on emulated pmem (with the memmap argument), we get
> > >
> > > With the original kernel:
> > > 8508 - 11378
> > > real    0m4.960s
> > > user    0m0.638s
> > > sys     0m4.312s
> > >
> > > With dm-writecache hacked to use cached writes + clflushopt:
> > > 8505 - 11378
> > > real    0m4.151s
> > > user    0m0.560s
> > > sys     0m3.582s
> >
> > I did some multithreaded tests:
> > http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem-multithreaded.txt
> >
> > And it turns out that for singlethreaded access, write+clwb performs
> > better, while for multithreaded access, non-temporal stores perform
> > better.
> >
> > 1       sequential write-nt 8 bytes             1.3 GB/s
> > 2       sequential write-nt 8 bytes             2.5 GB/s
> > 3       sequential write-nt 8 bytes             2.8 GB/s
> > 4       sequential write-nt 8 bytes             2.8 GB/s
> > 5       sequential write-nt 8 bytes             2.5 GB/s
> >
> > 1       sequential write 8 bytes + clwb         1.6 GB/s
> > 2       sequential write 8 bytes + clwb         2.4 GB/s
> > 3       sequential write 8 bytes + clwb         1.7 GB/s
> > 4       sequential write 8 bytes + clwb         1.2 GB/s
> > 5       sequential write 8 bytes + clwb         0.8 GB/s
> >
> > For one thread, we can see that write-nt 8 bytes has 1.3 GB/s and write
> > 8+clwb has 1.6 GB/s, but for multiple threads, write-nt has better
> > throughput.
> >
> > The dm-writecache target is singlethreaded (all the copying is done while
> > holding the writecache lock), so it benefits from clwb.
> >
> > Should memcpy_flushcache be changed to write+clwb? Or are there some
> > multithreaded users of memcpy_flushcache that would be hurt by this
> > change?
> 
> Maybe this is asking for a specific memcpy_flushcache_inatomic()
> implementation for your use case, but leave nt-writes for the general
> case?

Yes - I have created this patch that adds a new function 
memcpy_flushcache_clflushopt and makes dm-writecache use it.

Mikulas



From: Mikulas Patocka <mpatocka@redhat.com>

Implement the function memcpy_flushcache_clflushopt, which flushes the
cache just like memcpy_flushcache - except that it uses cached writes and
explicit cache flushing instead of non-temporal stores.

Explicit cache flushing performs better in some cases (e.g. the
dm-writecache target with block size greater than 512), while non-temporal
stores perform better in other cases (mostly multithreaded workloads) - so
we provide these two functions, and the user should test which one is
faster for the particular workload.

dm-writecache throughput (on real Optane-based persistent memory):
block size	512		1024		2048		4096
movnti		496 MB/s	642 MB/s	725 MB/s	744 MB/s
clflushopt	373 MB/s	688 MB/s	1.1 GB/s	1.2 GB/s

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 arch/x86/include/asm/string_64.h |   10 ++++++++++
 arch/x86/lib/usercopy_64.c       |   32 ++++++++++++++++++++++++++++++++
 drivers/md/dm-writecache.c       |    5 ++++-
 include/linux/string.h           |    6 ++++++
 4 files changed, 52 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/include/asm/string_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/string_64.h	2020-04-17 14:06:35.139999000 +0200
+++ linux-2.6/arch/x86/include/asm/string_64.h	2020-04-17 14:06:35.129999000 +0200
@@ -114,6 +114,14 @@ memcpy_mcsafe(void *dst, const void *src
 	return 0;
 }
 
+/*
+ * In some cases (mostly single-threaded workload), clflushopt is faster
+ * than non-temporal stores. In other situations, non-temporal stores are
+ * faster. So, we provide two functions:
+ *	memcpy_flushcache using non-temporal stores
+ *	memcpy_flushcache_clflushopt using clflushopt
+ * The caller should test which one is faster for the particular workload.
+ */
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
 void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
@@ -135,6 +143,8 @@ static __always_inline void memcpy_flush
 	}
 	__memcpy_flushcache(dst, src, cnt);
 }
+#define __HAVE_ARCH_MEMCPY_FLUSHCACHE_CLFLUSHOPT 1
+void memcpy_flushcache_clflushopt(void *dst, const void *src, size_t cnt);
 #endif
 
 #endif /* __KERNEL__ */
Index: linux-2.6/include/linux/string.h
===================================================================
--- linux-2.6.orig/include/linux/string.h	2020-04-17 14:06:35.139999000 +0200
+++ linux-2.6/include/linux/string.h	2020-04-17 14:06:35.129999000 +0200
@@ -175,6 +175,12 @@ static inline void memcpy_flushcache(voi
 	memcpy(dst, src, cnt);
 }
 #endif
+#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE_CLFLUSHOPT
+static inline void memcpy_flushcache_clflushopt(void *dst, const void *src, size_t cnt)
+{
+	memcpy_flushcache(dst, src, cnt);
+}
+#endif
 void *memchr_inv(const void *s, int c, size_t n);
 char *strreplace(char *s, char old, char new);
 
Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2020-04-17 14:06:35.139999000 +0200
+++ linux-2.6/arch/x86/lib/usercopy_64.c	2020-04-17 14:25:18.569999000 +0200
@@ -199,6 +199,38 @@ void __memcpy_flushcache(void *_dst, con
 }
 EXPORT_SYMBOL_GPL(__memcpy_flushcache);
 
+void memcpy_flushcache_clflushopt(void *_dst, const void *_src, size_t size)
+{
+	unsigned long dest = (unsigned long) _dst;
+	unsigned long source = (unsigned long) _src;
+
+	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && likely(boot_cpu_data.x86_clflush_size == 64)) {
+		if (unlikely(!IS_ALIGNED(dest, 64))) {
+			size_t len = min_t(size_t, size, ALIGN(dest, 64) - dest);
+
+			memcpy((void *) dest, (void *) source, len);
+			clflushopt((void *)dest);
+			dest += len;
+			source += len;
+			size -= len;
+		}
+		while (size >= 64) {
+			memcpy((void *)dest, (void *)source, 64);
+			clflushopt((void *)dest);
+			dest += 64;
+			source += 64;
+			size -= 64;
+		}
+		if (unlikely(size != 0)) {
+			memcpy((void *)dest, (void *)source, size);
+			clflushopt((void *)dest);
+		}
+		return;
+	}
+	memcpy_flushcache((void *)dest, (void *)source, size);
+}
+EXPORT_SYMBOL_GPL(memcpy_flushcache_clflushopt);
+
 void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
 		size_t len)
 {
Index: linux-2.6/drivers/md/dm-writecache.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-writecache.c	2020-04-17 14:06:35.139999000 +0200
+++ linux-2.6/drivers/md/dm-writecache.c	2020-04-17 14:06:35.129999000 +0200
@@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
 			}
 		} else {
 			flush_dcache_page(bio_page(bio));
-			memcpy_flushcache(data, buf, size);
+			if (likely(size > 512))
+				memcpy_flushcache_clflushopt(data, buf, size);
+			else
+				memcpy_flushcache(data, buf, size);
 		}
 
 		bvec_kunmap_irq(buf, &flags);



* Re: [PATCH] x86: introduce memcpy_flushcache_clflushopt
  2020-04-17 12:47             ` [PATCH] x86: introduce memcpy_flushcache_clflushopt Mikulas Patocka
@ 2020-04-17 17:57               ` Dan Williams
  2020-04-17 20:45                 ` Thomas Gleixner
  2020-04-18 13:27               ` [PATCH] x86: introduce memcpy_flushcache_clflushopt David Laight
  1 sibling, 1 reply; 18+ messages in thread
From: Dan Williams @ 2020-04-17 17:57 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Peter Zijlstra, X86 ML, Linux Kernel Mailing List,
	device-mapper development

On Fri, Apr 17, 2020 at 5:47 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
>
> On Thu, 16 Apr 2020, Dan Williams wrote:
>
> > On Thu, Apr 16, 2020 at 1:24 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > >
> > >
> > >
> > > On Thu, 9 Apr 2020, Mikulas Patocka wrote:
> > >
> > > > With dm-writecache on emulated pmem (with the memmap argument), we get
> > > >
> > > > With the original kernel:
> > > > 8508 - 11378
> > > > real    0m4.960s
> > > > user    0m0.638s
> > > > sys     0m4.312s
> > > >
> > > > With dm-writecache hacked to use cached writes + clflushopt:
> > > > 8505 - 11378
> > > > real    0m4.151s
> > > > user    0m0.560s
> > > > sys     0m3.582s
> > >
> > > I did some multithreaded tests:
> > > http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem-multithreaded.txt
> > >
> > > And it turns out that for singlethreaded access, write+clwb performs
> > > better, while for multithreaded access, non-temporal stores perform
> > > better.
> > >
> > > 1       sequential write-nt 8 bytes             1.3 GB/s
> > > 2       sequential write-nt 8 bytes             2.5 GB/s
> > > 3       sequential write-nt 8 bytes             2.8 GB/s
> > > 4       sequential write-nt 8 bytes             2.8 GB/s
> > > 5       sequential write-nt 8 bytes             2.5 GB/s
> > >
> > > 1       sequential write 8 bytes + clwb         1.6 GB/s
> > > 2       sequential write 8 bytes + clwb         2.4 GB/s
> > > 3       sequential write 8 bytes + clwb         1.7 GB/s
> > > 4       sequential write 8 bytes + clwb         1.2 GB/s
> > > 5       sequential write 8 bytes + clwb         0.8 GB/s
> > >
> > > For one thread, we can see that write-nt 8 bytes has 1.3 GB/s and write
> > > 8+clwb has 1.6 GB/s, but for multiple threads, write-nt has better
> > > throughput.
> > >
> > > The dm-writecache target is singlethreaded (all the copying is done while
> > > holding the writecache lock), so it benefits from clwb.
> > >
> > > Should memcpy_flushcache be changed to write+clwb? Or are there some
> > > multithreaded users of memcpy_flushcache that would be hurt by this
> > > change?
> >
> > Maybe this is asking for a specific memcpy_flushcache_inatomic()
> > implementation for your use case, but leave nt-writes for the general
> > case?
>
> Yes - I have created this patch that adds a new function
> memcpy_flushcache_clflushopt and makes dm-writecache use it.
>
> Mikulas
>
>
>
> From: Mikulas Patocka <mpatocka@redhat.com>
>
> Implement the function memcpy_flushcache_clflushopt, which flushes the
> cache just like memcpy_flushcache - except that it uses cached writes and
> explicit cache flushing instead of non-temporal stores.
>
> Explicit cache flushing performs better in some cases (e.g. the
> dm-writecache target with block size greater than 512), while non-temporal
> stores perform better in other cases (mostly multithreaded workloads) - so
> we provide these two functions, and the user should test which one is
> faster for the particular workload.
>
> dm-writecache throughput (on real Optane-based persistent memory):
> block size      512             1024            2048            4096
> movnti          496 MB/s        642 MB/s        725 MB/s        744 MB/s
> clflushopt      373 MB/s        688 MB/s        1.1 GB/s        1.2 GB/s
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>
> ---
>  arch/x86/include/asm/string_64.h |   10 ++++++++++
>  arch/x86/lib/usercopy_64.c       |   32 ++++++++++++++++++++++++++++++++
>  drivers/md/dm-writecache.c       |    5 ++++-
>  include/linux/string.h           |    6 ++++++
>  4 files changed, 52 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/arch/x86/include/asm/string_64.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/string_64.h     2020-04-17 14:06:35.139999000 +0200
> +++ linux-2.6/arch/x86/include/asm/string_64.h  2020-04-17 14:06:35.129999000 +0200
> @@ -114,6 +114,14 @@ memcpy_mcsafe(void *dst, const void *src
>         return 0;
>  }
>
> +/*
> + * In some cases (mostly single-threaded workload), clflushopt is faster
> + * than non-temporal stores. In other situations, non-temporal stores are
> + * faster. So, we provide two functions:
> + *     memcpy_flushcache using non-temporal stores
> + *     memcpy_flushcache_clflushopt using clflushopt
> + * The caller should test which one is faster for the particular workload.
> + */
>  #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
>  #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
>  void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
> @@ -135,6 +143,8 @@ static __always_inline void memcpy_flush
>         }
>         __memcpy_flushcache(dst, src, cnt);
>  }
> +#define __HAVE_ARCH_MEMCPY_FLUSHCACHE_CLFLUSHOPT 1
> +void memcpy_flushcache_clflushopt(void *dst, const void *src, size_t cnt);

This naming promotes an x86ism and it does not help the caller
understand why 'flushcache_clflushopt' is preferred over 'flushcache'.
The goal of naming it _inatomic() was specifically for the observation
that your driver coordinates atomic access and does not benefit from
the cache friendliness that non-temporal stores afford. That said
_inatomic() is arguably not a good choice either because that refers
to whether the copy is prepared to take a fault or not. What about
_exclusive() or _single()? Anything but _clflushopt() that conveys no
contextual information.

Other than quibbling with the name, and one more comment below, this
looks ok to me.

>  #endif
>
>  #endif /* __KERNEL__ */
> Index: linux-2.6/include/linux/string.h
> ===================================================================
> --- linux-2.6.orig/include/linux/string.h       2020-04-17 14:06:35.139999000 +0200
> +++ linux-2.6/include/linux/string.h    2020-04-17 14:06:35.129999000 +0200
> @@ -175,6 +175,12 @@ static inline void memcpy_flushcache(voi
>         memcpy(dst, src, cnt);
>  }
>  #endif
> +#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE_CLFLUSHOPT
> +static inline void memcpy_flushcache_clflushopt(void *dst, const void *src, size_t cnt)
> +{
> +       memcpy_flushcache(dst, src, cnt);
> +}
> +#endif
>  void *memchr_inv(const void *s, int c, size_t n);
>  char *strreplace(char *s, char old, char new);
>
> Index: linux-2.6/arch/x86/lib/usercopy_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/lib/usercopy_64.c   2020-04-17 14:06:35.139999000 +0200
> +++ linux-2.6/arch/x86/lib/usercopy_64.c        2020-04-17 14:25:18.569999000 +0200
> @@ -199,6 +199,38 @@ void __memcpy_flushcache(void *_dst, con
>  }
>  EXPORT_SYMBOL_GPL(__memcpy_flushcache);
>
> +void memcpy_flushcache_clflushopt(void *_dst, const void *_src, size_t size)
> +{
> +       unsigned long dest = (unsigned long) _dst;
> +       unsigned long source = (unsigned long) _src;
> +
> +       if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && likely(boot_cpu_data.x86_clflush_size == 64)) {
> +               if (unlikely(!IS_ALIGNED(dest, 64))) {
> +                       size_t len = min_t(size_t, size, ALIGN(dest, 64) - dest);
> +
> +                       memcpy((void *) dest, (void *) source, len);
> +                       clflushopt((void *)dest);
> +                       dest += len;
> +                       source += len;
> +                       size -= len;
> +               }
> +               while (size >= 64) {
> +                       memcpy((void *)dest, (void *)source, 64);
> +                       clflushopt((void *)dest);
> +                       dest += 64;
> +                       source += 64;
> +                       size -= 64;
> +               }
> +               if (unlikely(size != 0)) {
> +                       memcpy((void *)dest, (void *)source, size);
> +                       clflushopt((void *)dest);
> +               }
> +               return;
> +       }
> +       memcpy_flushcache((void *)dest, (void *)source, size);
> +}
> +EXPORT_SYMBOL_GPL(memcpy_flushcache_clflushopt);
> +
>  void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
>                 size_t len)
>  {
> Index: linux-2.6/drivers/md/dm-writecache.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-writecache.c   2020-04-17 14:06:35.139999000 +0200
> +++ linux-2.6/drivers/md/dm-writecache.c        2020-04-17 14:06:35.129999000 +0200
> @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
>                         }
>                 } else {
>                         flush_dcache_page(bio_page(bio));
> -                       memcpy_flushcache(data, buf, size);
> +                       if (likely(size > 512))

This needs some reference to how this magic number is chosen and how a
future developer might determine whether the value needs to be
adjusted.

Will also need to remember to come back and re-evaluate this once
memcpy_flushcache() is enabled to use movdir64b, which might invalidate
the performance advantage you are currently seeing with
cache-allocating writes plus flushing.

> +                               memcpy_flushcache_clflushopt(data, buf, size);
> +                       else
> +                               memcpy_flushcache(data, buf, size);
>                 }
>
>                 bvec_kunmap_irq(buf, &flags);
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: introduce memcpy_flushcache_clflushopt
  2020-04-17 17:57               ` Dan Williams
@ 2020-04-17 20:45                 ` Thomas Gleixner
  2020-04-20 13:47                   ` [PATCH v2] x86: introduce memcpy_flushcache_single Mikulas Patocka
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2020-04-17 20:45 UTC (permalink / raw)
  To: Dan Williams, Mikulas Patocka
  Cc: Ingo Molnar, Borislav Petkov, H. Peter Anvin, Peter Zijlstra,
	X86 ML, Linux Kernel Mailing List, device-mapper development

Dan Williams <dan.j.williams@intel.com> writes:
> On Fri, Apr 17, 2020 at 5:47 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>> +#define __HAVE_ARCH_MEMCPY_FLUSHCACHE_CLFLUSHOPT 1
>> +void memcpy_flushcache_clflushopt(void *dst, const void *src, size_t cnt);
>
> This naming promotes an x86ism and it does not help the caller
> understand why 'flushcache_clflushopt' is preferred over 'flushcache'.

Right.

> The goal of naming it _inatomic() was specifically for the observation
> that your driver coordinates atomic access and does not benefit from
> the cache friendliness that non-temporal stores afford. That said
> _inatomic() is arguably not a good choice either because that refers
> to whether the copy is prepared to take a fault or not. What about
> _exclusive() or _single()? Anything but _clflushopt(), which conveys no
> contextual information.
>
> Other than quibbling with the name, and one more comment below, this
> looks ok to me.
>
>> Index: linux-2.6/drivers/md/dm-writecache.c
>> ===================================================================
>> --- linux-2.6.orig/drivers/md/dm-writecache.c   2020-04-17 14:06:35.139999000 +0200
>> +++ linux-2.6/drivers/md/dm-writecache.c        2020-04-17 14:06:35.129999000 +0200
>> @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
>>                         }
>>                 } else {
>>                         flush_dcache_page(bio_page(bio));
>> -                       memcpy_flushcache(data, buf, size);
>> +                       if (likely(size > 512))
>
> This needs some reference to how this magic number is chosen and how a
> future developer might determine whether the value needs to be
> adjusted.

I don't think it's a good idea to make this decision in generic code as
architectures or even CPU models might have different constraints on the
size.

So I'd rather let the architecture implementation decide and make this

                         flush_dcache_page(bio_page(bio));
 -                       memcpy_flushcache(data, buf, size);
 +                       memcpy_flushcache_bikesheddedname(data, buf, size);

and have the default fallback memcpy_flushcache() and let the
architecture sort the size limit and the underlying technology out.

So x86 can use clflushopt or implement it with movdir64b and any other
architecture can provide their own magic soup without changing the
callsite.

Thanks,

        tglx
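
As an illustration of the weak-default pattern being described - a
sketch only, using Thomas's placeholder name until the naming is
settled (the guard macro here is likewise hypothetical):

	/* generic side, e.g. include/linux/string.h */
	#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE_BIKESHED
	static inline void memcpy_flushcache_bikesheddedname(void *dst,
			const void *src, size_t cnt)
	{
		memcpy_flushcache(dst, src, cnt);	/* default fallback */
	}
	#endif

The architecture then defines the macro and provides its own
implementation, choosing the size cutoff and the flush instruction
(clflushopt, movdir64b, ...) internally.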




^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] x86: introduce memcpy_flushcache_clflushopt
  2020-04-17 12:47             ` [PATCH] x86: introduce memcpy_flushcache_clflushopt Mikulas Patocka
  2020-04-17 17:57               ` Dan Williams
@ 2020-04-18 13:27               ` David Laight
  2020-04-18 15:21                 ` Mikulas Patocka
  1 sibling, 1 reply; 18+ messages in thread
From: David Laight @ 2020-04-18 13:27 UTC (permalink / raw)
  To: 'Mikulas Patocka', Dan Williams
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Peter Zijlstra, X86 ML, Linux Kernel Mailing List,
	device-mapper development

From: Mikulas Patocka
> Sent: 17 April 2020 13:47
...
> Index: linux-2.6/drivers/md/dm-writecache.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-writecache.c	2020-04-17 14:06:35.139999000 +0200
> +++ linux-2.6/drivers/md/dm-writecache.c	2020-04-17 14:06:35.129999000 +0200
> @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
>  			}
>  		} else {
>  			flush_dcache_page(bio_page(bio));
> -			memcpy_flushcache(data, buf, size);
> +			if (likely(size > 512))
> +				memcpy_flushcache_clflushopt(data, buf, size);
> +			else
> +				memcpy_flushcache(data, buf, size);

Hmmm... have you looked at how long clflush actually takes?
It isn't too bad if you just do a small number, but using it
to flush large buffers can be very slow.

I've an Ivy bridge system where the X-server process requests the
frame buffer be flushed out every 10 seconds (no idea why).
With my 2560x1440 monitor this takes over 3ms.

This really needs a cond_resched() every few clflush instructions.

	David
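
As an illustration of the suggestion, a minimal sketch (not from any
posted patch; FLUSH_RESCHED_CHUNK is an assumed tuning constant, while
clflush_cache_range() and cond_resched() are existing kernel helpers):

	/* flush a large buffer in bounded chunks so the loop cannot
	 * monopolize the CPU; offer to reschedule between chunks */
	#define FLUSH_RESCHED_CHUNK	(32 * 64)	/* assumed: 32 cache lines */

	static void flush_range_resched(void *addr, size_t size)
	{
		while (size) {
			unsigned int chunk = min_t(size_t, size, FLUSH_RESCHED_CHUNK);

			clflush_cache_range(addr, chunk);
			addr += chunk;
			size -= chunk;
			cond_resched();
		}
	}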



^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] x86: introduce memcpy_flushcache_clflushopt
  2020-04-18 13:27               ` [PATCH] x86: introduce memcpy_flushcache_clflushopt David Laight
@ 2020-04-18 15:21                 ` Mikulas Patocka
  2020-04-19 17:48                   ` David Laight
  0 siblings, 1 reply; 18+ messages in thread
From: Mikulas Patocka @ 2020-04-18 15:21 UTC (permalink / raw)
  To: David Laight
  Cc: Dan Williams, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Peter Zijlstra, X86 ML,
	Linux Kernel Mailing List, device-mapper development



On Sat, 18 Apr 2020, David Laight wrote:

> From: Mikulas Patocka
> > Sent: 17 April 2020 13:47
> ...
> > Index: linux-2.6/drivers/md/dm-writecache.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/md/dm-writecache.c	2020-04-17 14:06:35.139999000 +0200
> > +++ linux-2.6/drivers/md/dm-writecache.c	2020-04-17 14:06:35.129999000 +0200
> > @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
> >  			}
> >  		} else {
> >  			flush_dcache_page(bio_page(bio));
> > -			memcpy_flushcache(data, buf, size);
> > +			if (likely(size > 512))
> > +				memcpy_flushcache_clflushopt(data, buf, size);
> > +			else
> > +				memcpy_flushcache(data, buf, size);
> 
> Hmmm... have you looked at how long clflush actually takes?
> It isn't too bad if you just do a small number, but using it
> to flush large buffers can be very slow.

Yes, I have. It's here: 
http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem.txt

sequential write 8 + clflush	- 0.3 GB/s on nvdimm
sequential write 8 + clflushopt - 1.6 GB/s on nvdimm
sequential write-nt 8 bytes	- 1.3 GB/s on nvdimm
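
For reference, what the "write 8 + clflushopt" row measures, as an
assumed reconstruction (a sketch, not the actual benchmark source):
8-byte stores, one clflushopt per completed cache line, and a single
sfence at the end.

	/* assumed reconstruction; buf is taken to be 64-byte aligned
	 * and size a multiple of 64 */
	static void write8_clflushopt(char *buf, size_t size, u64 val)
	{
		size_t i;

		for (i = 0; i < size; i += 8) {
			*(volatile u64 *)(buf + i) = val;
			if ((i & 63) == 56)	/* last word of the line */
				clflushopt(buf + i - 56);
		}
		wmb();	/* sfence on x86: order the outstanding flushes */
	}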

> I've an Ivy bridge system where the X-server process requests the
> frame buffer be flushed out every 10 seconds (no idea why).
> With my 2560x1440 monitor this takes over 3ms.
> 
> This really needs a cond_resched() every few clflush instructions.
> 
> 	David

AFAIK Ivy Bridge doesn't have clflushopt; it only has clflush. clflush
only allows one outstanding cache line flush, so it's very slow.
clflushopt and clwb relaxed this restriction and there can be multiple
cache-invalidation requests in flight until the user serializes them with
the sfence instruction.

The patch checks for clflushopt with 
"static_cpu_has(X86_FEATURE_CLFLUSHOPT)" and if it is not present, it 
falls back to non-temporal stores.

Mikulas


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] x86: introduce memcpy_flushcache_clflushopt
  2020-04-18 15:21                 ` Mikulas Patocka
@ 2020-04-19 17:48                   ` David Laight
  2020-04-20  4:49                     ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: David Laight @ 2020-04-19 17:48 UTC (permalink / raw)
  To: 'Mikulas Patocka'
  Cc: Dan Williams, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Peter Zijlstra, X86 ML,
	Linux Kernel Mailing List, device-mapper development

From: Mikulas Patocka
> Sent: 18 April 2020 16:21
> 
> On Sat, 18 Apr 2020, David Laight wrote:
> 
> > From: Mikulas Patocka
> > > Sent: 17 April 2020 13:47
> > ...
> > > Index: linux-2.6/drivers/md/dm-writecache.c
> > > ===================================================================
> > > --- linux-2.6.orig/drivers/md/dm-writecache.c	2020-04-17 14:06:35.139999000 +0200
> > > +++ linux-2.6/drivers/md/dm-writecache.c	2020-04-17 14:06:35.129999000 +0200
> > > @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
> > >  			}
> > >  		} else {
> > >  			flush_dcache_page(bio_page(bio));
> > > -			memcpy_flushcache(data, buf, size);
> > > +			if (likely(size > 512))
> > > +				memcpy_flushcache_clflushopt(data, buf, size);
> > > +			else
> > > +				memcpy_flushcache(data, buf, size);
> >
> > Hmmm... have you looked at how long clflush actually takes?
> > It isn't too bad if you just do a small number, but using it
> > to flush large buffers can be very slow.
> 
> Yes, I have. It's here:
> http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem.txt
> 
> sequential write 8 + clflush	- 0.3 GB/s on nvdimm
> sequential write 8 + clflushopt - 1.6 GB/s on nvdimm
> sequential write-nt 8 bytes	- 1.3 GB/s on nvdimm

That table doesn't give enough information to be useful.
The cpu speed, memory speed and transfer lengths are all relevant.

> > I've an Ivy bridge system where the X-server process requests the
> > frame buffer be flushed out every 10 seconds (no idea why).
> > With my 2560x1440 monitor this takes over 3ms.
> >
> > This really needs a cond_resched() every few clflush instructions.
> >
> > 	David
> 
> AFAIK Ivy Bridge doesn't have clflushopt; it only has clflush. clflush
> only allows one outstanding cache line flush, so it's very slow.
> clflushopt and clwb relaxed this restriction and there can be multiple
> cache-invalidation requests in flight until the user serializes them with
> the sfence instruction.

It isn't that simple.
While clflush on Ivybridge is slower than clflushopt on newer processors,
both instructions are (relatively) fast for something like 16 or 32
iterations. After that they get much slower.
I can't remember where I found the relevant figures; even the ones I
found didn't show how large the transfers needed to be before the bytes/sec
became constant.

> The patch checks for clflushopt with
> "static_cpu_has(X86_FEATURE_CLFLUSHOPT)" and if it is not present, it
> falls back to non-temporal stores.

Ok, I was expecting you'd be falling back to clflush first.

	David



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: introduce memcpy_flushcache_clflushopt
  2020-04-19 17:48                   ` David Laight
@ 2020-04-20  4:49                     ` Dan Williams
  0 siblings, 0 replies; 18+ messages in thread
From: Dan Williams @ 2020-04-20  4:49 UTC (permalink / raw)
  To: David Laight
  Cc: Mikulas Patocka, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Peter Zijlstra, X86 ML,
	Linux Kernel Mailing List, device-mapper development

On Sun, Apr 19, 2020 at 10:49 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Mikulas Patocka
> > Sent: 18 April 2020 16:21
> >
> > On Sat, 18 Apr 2020, David Laight wrote:
> >
> > > From: Mikulas Patocka
> > > > Sent: 17 April 2020 13:47
> > > ...
> > > > Index: linux-2.6/drivers/md/dm-writecache.c
> > > > ===================================================================
> > > > --- linux-2.6.orig/drivers/md/dm-writecache.c     2020-04-17 14:06:35.139999000 +0200
> > > > +++ linux-2.6/drivers/md/dm-writecache.c  2020-04-17 14:06:35.129999000 +0200
> > > > @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
> > > >                   }
> > > >           } else {
> > > >                   flush_dcache_page(bio_page(bio));
> > > > -                 memcpy_flushcache(data, buf, size);
> > > > +                 if (likely(size > 512))
> > > > +                         memcpy_flushcache_clflushopt(data, buf, size);
> > > > +                 else
> > > > +                         memcpy_flushcache(data, buf, size);
> > >
> > > Hmmm... have you looked at how long clflush actually takes?
> > > It isn't too bad if you just do a small number, but using it
> > > to flush large buffers can be very slow.
> >
> > Yes, I have. It's here:
> > http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem.txt
> >
> > sequential write 8 + clflush  - 0.3 GB/s on nvdimm
> > sequential write 8 + clflushopt - 1.6 GB/s on nvdimm
> > sequential write-nt 8 bytes   - 1.3 GB/s on nvdimm
>
> That table doesn't give enough information to be useful.
> The cpu speed, memory speed and transfer lengths are all relevant.
>
> > > I've an Ivy bridge system where the X-server process requests the
> > > frame buffer be flushed out every 10 seconds (no idea why).
> > > With my 2560x1440 monitor this takes over 3ms.
> > >
> > > This really needs a cond_resched() every few clflush instructions.
> > >
> > >     David
> >
> > AFAIK Ivy Bridge doesn't have clflushopt; it only has clflush. clflush
> > only allows one outstanding cache line flush, so it's very slow.
> > clflushopt and clwb relaxed this restriction and there can be multiple
> > cache-invalidation requests in flight until the user serializes them with
> > the sfence instruction.
>
> It isn't that simple.
> While clflush on Ivybridge is slower than clflushopt on newer processors,
> both instructions are (relatively) fast for something like 16 or 32
> iterations. After that they get much slower.
> I can't remember where I found the relevant figures; even the ones I
> found didn't show how large the transfers needed to be before the bytes/sec
> became constant.
>
> > The patch checks for clflushopt with
> > "static_cpu_has(X86_FEATURE_CLFLUSHOPT)" and if it is not present, it
> > falls back to non-temporal stores.
>
> Ok, I was expecting you'd be falling back to clflush first.

clflush is a serializing instruction, clflushopt and non-temporal
stores are not.
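
To illustrate the difference with hypothetical flush loops over a
64-byte-aligned range (p, start and end are assumed unsigned long
cursors; clflush(), clflushopt() and wmb() are the existing x86
helpers):

	/* clflush executions are ordered against each other, so the
	 * flushes cannot overlap */
	for (p = start; p < end; p += 64)
		clflush((void *)p);

	/* clflushopt flushes may all be in flight concurrently; the
	 * final sfence orders them */
	for (p = start; p < end; p += 64)
		clflushopt((void *)p);
	wmb();	/* sfence */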

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2] x86: introduce memcpy_flushcache_single
  2020-04-17 20:45                 ` Thomas Gleixner
@ 2020-04-20 13:47                   ` Mikulas Patocka
  2020-04-21 18:43                     ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Mikulas Patocka @ 2020-04-20 13:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dan Williams, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Peter Zijlstra, X86 ML, Linux Kernel Mailing List,
	device-mapper development



On Fri, 17 Apr 2020, Thomas Gleixner wrote:

> Dan Williams <dan.j.williams@intel.com> writes:
> 
> > The goal of naming it _inatomic() was specifically for the observation
> > that your driver coordinates atomic access and does not benefit from
> > the cache friendliness that non-temporal stores afford. That said
> > _inatomic() is arguably not a good choice either because that refers
> > to whether the copy is prepared to take a fault or not. What about
> > _exclusive() or _single()? Anything but _clflushopt(), which conveys no
> > contextual information.

OK. I renamed it to memcpy_flushcache_single

> > Other than quibbling with the name, and one more comment below, this
> > looks ok to me.
> >
> >> Index: linux-2.6/drivers/md/dm-writecache.c
> >> ===================================================================
> >> --- linux-2.6.orig/drivers/md/dm-writecache.c   2020-04-17 14:06:35.139999000 +0200
> >> +++ linux-2.6/drivers/md/dm-writecache.c        2020-04-17 14:06:35.129999000 +0200
> >> @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
> >>                         }
> >>                 } else {
> >>                         flush_dcache_page(bio_page(bio));
> >> -                       memcpy_flushcache(data, buf, size);
> >> +                       if (likely(size > 512))
> >
> > This needs some reference to how this magic number is chosen and how a
> > future developer might determine whether the value needs to be
> > adjusted.
> 
> I don't think it's a good idea to make this decision in generic code as
> architectures or even CPU models might have different constraints on the
> size.
> 
> So I'd rather let the architecture implementation decide and make this
> 
>                          flush_dcache_page(bio_page(bio));
>  -                       memcpy_flushcache(data, buf, size);
>  +                       memcpy_flushcache_bikesheddedname(data, buf, size);
> 
> and have the default fallback memcpy_flushcache() and let the
> architecture sort the size limit and the underlying technology out.
> 
> So x86 can use clflushopt or implement it with movdir64b and any other
> architecture can provide their own magic soup without changing the
> callsite.
> 
> Thanks,
> 
>         tglx

OK - so I moved the decision to memcpy_flushcache_single and I added a 
comment that explains the magic number.

Mikulas




From: Mikulas Patocka <mpatocka@redhat.com>

Implement the function memcpy_flushcache_single, which flushes the cache
just like memcpy_flushcache - except that it uses cached writes and
explicit cache flushing instead of non-temporal stores.

Explicit cache flushing performs better in single-threaded cases (i.e. the
dm-writecache target with block size greater than 512), while non-temporal
stores perform better in other cases (mostly multithreaded workloads) - so
we provide these two functions and the caller should select whichever is
faster for the particular workload.

dm-writecache throughput (on real Optane-based persistent memory):
block size	512		1024		2048		4096
movnti		496 MB/s	642 MB/s	725 MB/s	744 MB/s
clflushopt	373 MB/s	688 MB/s	1.1 GB/s	1.2 GB/s

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 arch/x86/include/asm/string_64.h |   10 ++++++++
 arch/x86/lib/usercopy_64.c       |   46 +++++++++++++++++++++++++++++++++++++++
 drivers/md/dm-writecache.c       |    2 -
 include/linux/string.h           |    6 +++++
 4 files changed, 63 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/include/asm/string_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/string_64.h	2020-04-20 15:31:46.939999000 +0200
+++ linux-2.6/arch/x86/include/asm/string_64.h	2020-04-20 15:31:46.929999000 +0200
@@ -114,6 +114,14 @@ memcpy_mcsafe(void *dst, const void *src
 	return 0;
 }
 
+/*
+ * In some cases (mostly single-threaded workload), clflushopt is faster
+ * than non-temporal stores. In other situations, non-temporal stores are
+ * faster. So, we provide two functions:
+ *	memcpy_flushcache using non-temporal stores
+ *	memcpy_flushcache_single using clflushopt
+ * The caller should test which one is faster for the particular workload.
+ */
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
 void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
@@ -135,6 +143,8 @@ static __always_inline void memcpy_flush
 	}
 	__memcpy_flushcache(dst, src, cnt);
 }
+#define __HAVE_ARCH_MEMCPY_FLUSHCACHE_SINGLE 1
+void memcpy_flushcache_single(void *dst, const void *src, size_t cnt);
 #endif
 
 #endif /* __KERNEL__ */
Index: linux-2.6/include/linux/string.h
===================================================================
--- linux-2.6.orig/include/linux/string.h	2020-04-20 15:31:46.939999000 +0200
+++ linux-2.6/include/linux/string.h	2020-04-20 15:31:46.929999000 +0200
@@ -175,6 +175,12 @@ static inline void memcpy_flushcache(voi
 	memcpy(dst, src, cnt);
 }
 #endif
+#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE_SINGLE
+static inline void memcpy_flushcache_single(void *dst, const void *src, size_t cnt)
+{
+	memcpy_flushcache(dst, src, cnt);
+}
+#endif
 void *memchr_inv(const void *s, int c, size_t n);
 char *strreplace(char *s, char old, char new);
 
Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2020-04-20 15:31:46.939999000 +0200
+++ linux-2.6/arch/x86/lib/usercopy_64.c	2020-04-20 15:38:13.159999000 +0200
@@ -199,6 +199,52 @@ void __memcpy_flushcache(void *_dst, con
 }
 EXPORT_SYMBOL_GPL(__memcpy_flushcache);
 
+void memcpy_flushcache_single(void *_dst, const void *_src, size_t size)
+{
+	unsigned long dest = (unsigned long) _dst;
+	unsigned long source = (unsigned long) _src;
+
+	/*
+	 * dm-writecache throughput (on real Optane-based persistent memory):
+	 * measured with dd:
+	 *
+	 * block size	512		1024		2048		4096
+	 * movnti	496 MB/s	642 MB/s	725 MB/s	744 MB/s
+	 * clflushopt	373 MB/s	688 MB/s	1.1 GB/s	1.2 GB/s
+	 *
+	 * We see that movnti performs better for 512-byte blocks, and
+	 * clflushopt performs better for 1024-byte and larger blocks. So, we
+	 * prefer clflushopt for sizes >= 768.
+	 */
+
+	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && likely(boot_cpu_data.x86_clflush_size == 64) &&
+	    likely(size >= 768)) {
+		if (unlikely(!IS_ALIGNED(dest, 64))) {
+			size_t len = min_t(size_t, size, ALIGN(dest, 64) - dest);
+
+			memcpy((void *) dest, (void *) source, len);
+			clflushopt((void *)dest);
+			dest += len;
+			source += len;
+			size -= len;
+		}
+		do {
+			memcpy((void *)dest, (void *)source, 64);
+			clflushopt((void *)dest);
+			dest += 64;
+			source += 64;
+			size -= 64;
+		} while (size >= 64);
+		if (unlikely(size != 0)) {
+			memcpy((void *)dest, (void *)source, size);
+			clflushopt((void *)dest);
+		}
+		return;
+	}
+	memcpy_flushcache((void *)dest, (void *)source, size);
+}
+EXPORT_SYMBOL_GPL(memcpy_flushcache_single);
+
 void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
 		size_t len)
 {
Index: linux-2.6/drivers/md/dm-writecache.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-writecache.c	2020-04-20 15:31:46.939999000 +0200
+++ linux-2.6/drivers/md/dm-writecache.c	2020-04-20 15:32:35.549999000 +0200
@@ -1166,7 +1166,7 @@ static void bio_copy_block(struct dm_wri
 			}
 		} else {
 			flush_dcache_page(bio_page(bio));
-			memcpy_flushcache(data, buf, size);
+			memcpy_flushcache_single(data, buf, size);
 		}
 
 		bvec_kunmap_irq(buf, &flags);


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: introduce memcpy_flushcache_single
  2020-04-20 13:47                   ` [PATCH v2] x86: introduce memcpy_flushcache_single Mikulas Patocka
@ 2020-04-21 18:43                     ` Dan Williams
  0 siblings, 0 replies; 18+ messages in thread
From: Dan Williams @ 2020-04-21 18:43 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Peter Zijlstra, X86 ML, Linux Kernel Mailing List,
	device-mapper development

On Mon, Apr 20, 2020 at 6:48 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
>
> On Fri, 17 Apr 2020, Thomas Gleixner wrote:
>
> > Dan Williams <dan.j.williams@intel.com> writes:
> >
> > > The goal of naming it _inatomic() was specifically for the observation
> > > that your driver coordinates atomic access and does not benefit from
> > > the cache friendliness that non-temporal stores afford. That said
> > > _inatomic() is arguably not a good choice either because that refers
> > > to whether the copy is prepared to take a fault or not. What about
> > > _exclusive() or _single()? Anything but _clflushopt(), which conveys no
> > > contextual information.
>
> OK. I renamed it to memcpy_flushcache_single
>
> > > Other than quibbling with the name, and one more comment below, this
> > > looks ok to me.
> > >
> > >> Index: linux-2.6/drivers/md/dm-writecache.c
> > >> ===================================================================
> > >> --- linux-2.6.orig/drivers/md/dm-writecache.c   2020-04-17 14:06:35.139999000 +0200
> > >> +++ linux-2.6/drivers/md/dm-writecache.c        2020-04-17 14:06:35.129999000 +0200
> > >> @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
> > >>                         }
> > >>                 } else {
> > >>                         flush_dcache_page(bio_page(bio));
> > >> -                       memcpy_flushcache(data, buf, size);
> > >> +                       if (likely(size > 512))
> > >
> > > This needs some reference to how this magic number is chosen and how a
> > > future developer might determine whether the value needs to be
> > > adjusted.
> >
> > I don't think it's a good idea to make this decision in generic code as
> > architectures or even CPU models might have different constraints on the
> > size.
> >
> > So I'd rather let the architecture implementation decide and make this
> >
> >                          flush_dcache_page(bio_page(bio));
> >  -                       memcpy_flushcache(data, buf, size);
> >  +                       memcpy_flushcache_bikesheddedname(data, buf, size);
> >
> > and have the default fallback memcpy_flushcache() and let the
> > architecture sort the size limit and the underlying technology out.
> >
> > So x86 can use clflushopt or implement it with movdir64b and any other
> > architecture can provide their own magic soup without changing the
> > callsite.
> >
> > Thanks,
> >
> >         tglx
>
> OK - so I moved the decision to memcpy_flushcache_single and I added a
> comment that explains the magic number.
>
> Mikulas
>
>
>
>
> From: Mikulas Patocka <mpatocka@redhat.com>
>
> Implement the function memcpy_flushcache_single, which flushes the cache
> just like memcpy_flushcache - except that it uses cached writes and
> explicit cache flushing instead of non-temporal stores.
>
> Explicit cache flushing performs better in single-threaded cases (i.e. the
> dm-writecache target with block size greater than 512), while non-temporal
> stores perform better in other cases (mostly multithreaded workloads) - so
> we provide these two functions and the caller should select whichever is
> faster for the particular workload.

I would mention that dm-writecache is choosing to use
memcpy_flushcache_single() because it is regularly invoked under a
lock.

"The dm-writecache target is singlethreaded (all the copying is done
while holding the writecache lock), so it benefits from clwb." [1]

[1]: http://lore.kernel.org/r/alpine.LRH.2.02.2004160411460.7833@file01.intranet.prod.int.rdu2.redhat.com

>
> dm-writecache throughput (on real Optane-based persistent memory):
> block size      512             1024            2048            4096
> movnti          496 MB/s        642 MB/s        725 MB/s        744 MB/s
> clflushopt      373 MB/s        688 MB/s        1.1 GB/s        1.2 GB/s
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>
> ---
>  arch/x86/include/asm/string_64.h |   10 ++++++++
>  arch/x86/lib/usercopy_64.c       |   46 +++++++++++++++++++++++++++++++++++++++
>  drivers/md/dm-writecache.c       |    2 -
>  include/linux/string.h           |    6 +++++
>  4 files changed, 63 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/arch/x86/include/asm/string_64.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/string_64.h     2020-04-20 15:31:46.939999000 +0200
> +++ linux-2.6/arch/x86/include/asm/string_64.h  2020-04-20 15:31:46.929999000 +0200
> @@ -114,6 +114,14 @@ memcpy_mcsafe(void *dst, const void *src
>         return 0;
>  }
>
> +/*
> + * In some cases (mostly single-threaded workload), clflushopt is faster
> + * than non-temporal stores. In other situations, non-temporal stores are
> + * faster. So, we provide two functions:
> + *     memcpy_flushcache using non-temporal stores
> + *     memcpy_flushcache_single using clflushopt
> + * The caller should test which one is faster for the particular workload.
> + */
>  #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
>  #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
>  void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
> @@ -135,6 +143,8 @@ static __always_inline void memcpy_flush
>         }
>         __memcpy_flushcache(dst, src, cnt);
>  }
> +#define __HAVE_ARCH_MEMCPY_FLUSHCACHE_SINGLE 1
> +void memcpy_flushcache_single(void *dst, const void *src, size_t cnt);
>  #endif
>
>  #endif /* __KERNEL__ */
> Index: linux-2.6/include/linux/string.h
> ===================================================================
> --- linux-2.6.orig/include/linux/string.h       2020-04-20 15:31:46.939999000 +0200
> +++ linux-2.6/include/linux/string.h    2020-04-20 15:31:46.929999000 +0200
> @@ -175,6 +175,12 @@ static inline void memcpy_flushcache(voi
>         memcpy(dst, src, cnt);
>  }
>  #endif
> +#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE_SINGLE
> +static inline void memcpy_flushcache_single(void *dst, const void *src, size_t cnt)
> +{
> +       memcpy_flushcache(dst, src, cnt);
> +}
> +#endif
>  void *memchr_inv(const void *s, int c, size_t n);
>  char *strreplace(char *s, char old, char new);
>
> Index: linux-2.6/arch/x86/lib/usercopy_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/lib/usercopy_64.c   2020-04-20 15:31:46.939999000 +0200
> +++ linux-2.6/arch/x86/lib/usercopy_64.c        2020-04-20 15:38:13.159999000 +0200
> @@ -199,6 +199,52 @@ void __memcpy_flushcache(void *_dst, con
>  }
>  EXPORT_SYMBOL_GPL(__memcpy_flushcache);
>
> +void memcpy_flushcache_single(void *_dst, const void *_src, size_t size)
> +{
> +       unsigned long dest = (unsigned long) _dst;
> +       unsigned long source = (unsigned long) _src;
> +
> +       /*
> +        * dm-writecache throughput (on real Optane-based persistent memory):
> +        * measured with dd:

Why mention Optane? There are several types of persistent memory.
Typical persistent memory to date behaves like DDR because it is
battery-backed. So if you're going to mention the memory type, I would
also include the DDR details.

At a minimum, include the lore link in the changelog to the wider
analysis you contributed on the mailing list.

> +        *
> +        * block size   512             1024            2048            4096
> +        * movnti       496 MB/s        642 MB/s        725 MB/s        744 MB/s
> +        * clflushopt   373 MB/s        688 MB/s        1.1 GB/s        1.2 GB/s
> +        *
> +        * We see that movnti performs better for 512-byte blocks, and
> +        * clflushopt performs better for 1024-byte and larger blocks. So, we
> +        * prefer clflushopt for sizes >= 768.
> +        */
> +
> +       if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && likely(boot_cpu_data.x86_clflush_size == 64) &&
> +           likely(size >= 768)) {
> +               if (unlikely(!IS_ALIGNED(dest, 64))) {
> +                       size_t len = min_t(size_t, size, ALIGN(dest, 64) - dest);
> +
> +                       memcpy((void *) dest, (void *) source, len);
> +                       clflushopt((void *)dest);
> +                       dest += len;
> +                       source += len;
> +                       size -= len;
> +               }
> +               do {
> +                       memcpy((void *)dest, (void *)source, 64);
> +                       clflushopt((void *)dest);
> +                       dest += 64;
> +                       source += 64;
> +                       size -= 64;
> +               } while (size >= 64);
> +               if (unlikely(size != 0)) {
> +                       memcpy((void *)dest, (void *)source, size);
> +                       clflushopt((void *)dest);
> +               }
> +               return;
> +       }
> +       memcpy_flushcache((void *)dest, (void *)source, size);
> +}
> +EXPORT_SYMBOL_GPL(memcpy_flushcache_single);
> +
>  void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
>                 size_t len)
>  {
> Index: linux-2.6/drivers/md/dm-writecache.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-writecache.c   2020-04-20 15:31:46.939999000 +0200
> +++ linux-2.6/drivers/md/dm-writecache.c        2020-04-20 15:32:35.549999000 +0200
> @@ -1166,7 +1166,7 @@ static void bio_copy_block(struct dm_wri
>                         }
>                 } else {
>                         flush_dcache_page(bio_page(bio));
> -                       memcpy_flushcache(data, buf, size);
> +                       memcpy_flushcache_single(data, buf, size);
>                 }
>
>                 bvec_kunmap_irq(buf, &flags);
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-04-21 18:43 UTC | newest]

Thread overview: 18+ messages
2020-04-07 15:01 [PATCH] memcpy_flushcache: use cache flusing for larger lengths Mikulas Patocka
2020-04-07 16:09 ` Andy Lutomirski
2020-04-07 16:33   ` Mikulas Patocka
2020-04-07 17:52 ` Dan Williams
2020-04-08 18:54   ` Mikulas Patocka
2020-04-08 19:29     ` Dan Williams
2020-04-09 14:36       ` Mikulas Patocka
2020-04-16  8:24         ` Mikulas Patocka
2020-04-16 18:28           ` Dan Williams
2020-04-17 12:47             ` [PATCH] x86: introduce memcpy_flushcache_clflushopt Mikulas Patocka
2020-04-17 17:57               ` Dan Williams
2020-04-17 20:45                 ` Thomas Gleixner
2020-04-20 13:47                   ` [PATCH v2] x86: introduce memcpy_flushcache_single Mikulas Patocka
2020-04-21 18:43                     ` Dan Williams
2020-04-18 13:27               ` [PATCH] x86: introduce memcpy_flushcache_clflushopt David Laight
2020-04-18 15:21                 ` Mikulas Patocka
2020-04-19 17:48                   ` David Laight
2020-04-20  4:49                     ` Dan Williams
