On Tue, 7 Apr 2020, Andy Lutomirski wrote: > > > On Apr 7, 2020, at 8:01 AM, Mikulas Patocka wrote: > > > > [ resending this to x86 maintainers ] > > > > Hi > > > > I tested performance of various methods how to write to optane-based > > persistent memory, and found out that non-temporal stores achieve > > throughput 1.3 GB/s. 8 cached stores immediatelly followed by clflushopt > > or clwb achieve throughput 1.6 GB/s. > > > > memcpy_flushcache uses non-temporal stores, I modified it to use cached > > stores + clflushopt and it improved performance of the dm-writecache > > target significantly: > > > > dm-writecache throughput: > > (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct) > > writecache block size 512 1024 2048 4096 > > movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s > > clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s > > > > For block size 512, movnti works better, for larger block sizes, > > clflushopt is better. > > > > I was also testing the novafs filesystem, it is not upstream, but it > > benefitted from similar change in __memcpy_flushcache and > > __copy_user_nocache: > > write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s > > write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s > > > > > > I submit this patch for __memcpy_flushcache that improves dm-writecache > > performance. > > > > Other ideas - should we introduce memcpy_to_pmem instead of modifying > > memcpy_flushcache and move this logic there? Or should I modify the > > dm-writecache target directly to use clflushopt with no change to the > > architecture-specific code? > > > > Mikulas > > > > > > > > > > From: Mikulas Patocka > > > > I tested dm-writecache performance on a machine with Optane nvdimm and it > > turned out that for larger writes, cached stores + cache flushing perform > > better than non-temporal stores. This is the throughput of dm-writecache > > measured with this command: > > dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct > > > > block size 512 1024 2048 4096 > > movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s > > clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s > > > > We can see that for smaller block, movnti performs better, but for larger > > blocks, clflushopt has better performance. > > > > This patch changes the function __memcpy_flushcache accordingly, so that > > with size >= 768 it performs cached stores and cache flushing. Note that > > we must not use the new branch if the CPU doesn't have clflushopt - in > > that case, the kernel would use inefficient "clflush" instruction that has > > very bad performance. > > > > Signed-off-by: Mikulas Patocka > > > > --- > > arch/x86/lib/usercopy_64.c | 36 ++++++++++++++++++++++++++++++++++++ > > 1 file changed, 36 insertions(+) > > > > Index: linux-2.6/arch/x86/lib/usercopy_64.c > > =================================================================== > > --- linux-2.6.orig/arch/x86/lib/usercopy_64.c 2020-03-24 15:15:36.644945091 -0400 > > +++ linux-2.6/arch/x86/lib/usercopy_64.c 2020-03-30 07:17:51.450290007 -0400 > > @@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con > > return; > > } > > > > + if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) { > > + while (!IS_ALIGNED(dest, 64)) { > > + asm("movq (%0), %%r8\n" > > + "movnti %%r8, (%1)\n" > > + :: "r" (source), "r" (dest) > > + : "memory", "r8"); > > + dest += 8; > > + source += 8; > > + size -= 8; > > + } > > + do { > > + asm("movq (%0), %%r8\n" > > + "movq 8(%0), %%r9\n" > > + "movq 16(%0), %%r10\n" > > + "movq 24(%0), %%r11\n" > > + "movq %%r8, (%1)\n" > > + "movq %%r9, 8(%1)\n" > > + "movq %%r10, 16(%1)\n" > > + "movq %%r11, 24(%1)\n" > > + "movq 32(%0), %%r8\n" > > + "movq 40(%0), %%r9\n" > > + "movq 48(%0), %%r10\n" > > + "movq 56(%0), %%r11\n" > > + "movq %%r8, 32(%1)\n" > > + "movq %%r9, 40(%1)\n" > > + "movq %%r10, 48(%1)\n" > > + "movq %%r11, 56(%1)\n" > > + :: "r" (source), "r" (dest) > > + : "memory", "r8", "r9", "r10", "r11"); > > Does this actually work better than the corresponding C code? > > Also, that memory clobber probably isn’t doing your code generation any > favors. Experimentally, you have the constraints wrong. An “r” The existing "movnti" loop uses exactly the same constraints (and the "memory" clobber). > constraint doesn’t tell GCC that you are dereferencing the pointer. > You need to use “m” with a correctly-sized type. But you would use "=m"(*(char *)dest),"=m"(*((char *)dest + 8)),"=m"((char *)dest + 16))... and so on, until you run out of argument numbers. > But I bet plain C is at least as good. I tried to replace it with memcpy((void *)dest, (void *)src, 64); The compiler inlined the memcpy function into 8 loads and 8 stores. However, the whole function __memcpy_flushcache consumed one more saved register and the machine code was a few bytes longer. Mikulas