From: Mikulas Patocka <mpatocka@redhat.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
"H. Peter Anvin" <hpa@zytor.com>,
Peter Zijlstra <peterz@infradead.org>,
x86@kernel.org, Dan Williams <dan.j.williams@intel.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
dm-devel@redhat.com
Subject: Re: [PATCH] memcpy_flushcache: use cache flusing for larger lengths
Date: Tue, 7 Apr 2020 12:33:32 -0400 (EDT) [thread overview]
Message-ID: <alpine.LRH.2.02.2004071223130.7791@file01.intranet.prod.int.rdu2.redhat.com> (raw)
In-Reply-To: <583AD128-5B10-4414-A35B-FEACF30B7C5A@amacapital.net>
[-- Attachment #1: Type: TEXT/PLAIN, Size: 5746 bytes --]
On Tue, 7 Apr 2020, Andy Lutomirski wrote:
>
> > On Apr 7, 2020, at 8:01 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> > [ resending this to x86 maintainers ]
> >
> > Hi
> >
> > I tested performance of various methods how to write to optane-based
> > persistent memory, and found out that non-temporal stores achieve
> > throughput 1.3 GB/s. 8 cached stores immediatelly followed by clflushopt
> > or clwb achieve throughput 1.6 GB/s.
> >
> > memcpy_flushcache uses non-temporal stores, I modified it to use cached
> > stores + clflushopt and it improved performance of the dm-writecache
> > target significantly:
> >
> > dm-writecache throughput:
> > (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
> > writecache block size 512 1024 2048 4096
> > movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s
> > clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s
> >
> > For block size 512, movnti works better, for larger block sizes,
> > clflushopt is better.
> >
> > I was also testing the novafs filesystem, it is not upstream, but it
> > benefitted from similar change in __memcpy_flushcache and
> > __copy_user_nocache:
> > write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s
> > write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s
> >
> >
> > I submit this patch for __memcpy_flushcache that improves dm-writecache
> > performance.
> >
> > Other ideas - should we introduce memcpy_to_pmem instead of modifying
> > memcpy_flushcache and move this logic there? Or should I modify the
> > dm-writecache target directly to use clflushopt with no change to the
> > architecture-specific code?
> >
> > Mikulas
> >
> >
> >
> >
> > From: Mikulas Patocka <mpatocka@redhat.com>
> >
> > I tested dm-writecache performance on a machine with Optane nvdimm and it
> > turned out that for larger writes, cached stores + cache flushing perform
> > better than non-temporal stores. This is the throughput of dm-writecache
> > measured with this command:
> > dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct
> >
> > block size 512 1024 2048 4096
> > movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s
> > clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s
> >
> > We can see that for smaller block, movnti performs better, but for larger
> > blocks, clflushopt has better performance.
> >
> > This patch changes the function __memcpy_flushcache accordingly, so that
> > with size >= 768 it performs cached stores and cache flushing. Note that
> > we must not use the new branch if the CPU doesn't have clflushopt - in
> > that case, the kernel would use inefficient "clflush" instruction that has
> > very bad performance.
> >
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> >
> > ---
> > arch/x86/lib/usercopy_64.c | 36 ++++++++++++++++++++++++++++++++++++
> > 1 file changed, 36 insertions(+)
> >
> > Index: linux-2.6/arch/x86/lib/usercopy_64.c
> > ===================================================================
> > --- linux-2.6.orig/arch/x86/lib/usercopy_64.c 2020-03-24 15:15:36.644945091 -0400
> > +++ linux-2.6/arch/x86/lib/usercopy_64.c 2020-03-30 07:17:51.450290007 -0400
> > @@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
> > return;
> > }
> >
> > + if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) {
> > + while (!IS_ALIGNED(dest, 64)) {
> > + asm("movq (%0), %%r8\n"
> > + "movnti %%r8, (%1)\n"
> > + :: "r" (source), "r" (dest)
> > + : "memory", "r8");
> > + dest += 8;
> > + source += 8;
> > + size -= 8;
> > + }
> > + do {
> > + asm("movq (%0), %%r8\n"
> > + "movq 8(%0), %%r9\n"
> > + "movq 16(%0), %%r10\n"
> > + "movq 24(%0), %%r11\n"
> > + "movq %%r8, (%1)\n"
> > + "movq %%r9, 8(%1)\n"
> > + "movq %%r10, 16(%1)\n"
> > + "movq %%r11, 24(%1)\n"
> > + "movq 32(%0), %%r8\n"
> > + "movq 40(%0), %%r9\n"
> > + "movq 48(%0), %%r10\n"
> > + "movq 56(%0), %%r11\n"
> > + "movq %%r8, 32(%1)\n"
> > + "movq %%r9, 40(%1)\n"
> > + "movq %%r10, 48(%1)\n"
> > + "movq %%r11, 56(%1)\n"
> > + :: "r" (source), "r" (dest)
> > + : "memory", "r8", "r9", "r10", "r11");
>
> Does this actually work better than the corresponding C code?
>
> Also, that memory clobber probably isn’t doing your code generation any
> favors. Experimentally, you have the constraints wrong. An “r”
The existing "movnti" loop uses exactly the same constraints (and the
"memory" clobber).
> constraint doesn’t tell GCC that you are dereferencing the pointer.
> You need to use “m” with a correctly-sized type.
But you would use
"=m"(*(char *)dest),"=m"(*((char *)dest + 8)),"=m"((char *)dest + 16))...
and so on, until you run out of argument numbers.
> But I bet plain C is at least as good.
I tried to replace it with
memcpy((void *)dest, (void *)src, 64);
The compiler inlined the memcpy function into 8 loads and 8 stores.
However, the whole function __memcpy_flushcache consumed one more saved
register and the machine code was a few bytes longer.
Mikulas
next prev parent reply other threads:[~2020-04-07 16:33 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-04-07 15:01 [PATCH] memcpy_flushcache: use cache flusing for larger lengths Mikulas Patocka
2020-04-07 16:09 ` Andy Lutomirski
2020-04-07 16:33 ` Mikulas Patocka [this message]
2020-04-07 17:52 ` Dan Williams
2020-04-08 18:54 ` Mikulas Patocka
2020-04-08 19:29 ` Dan Williams
2020-04-09 14:36 ` Mikulas Patocka
2020-04-16 8:24 ` Mikulas Patocka
2020-04-16 18:28 ` Dan Williams
2020-04-17 12:47 ` [PATCH] x86: introduce memcpy_flushcache_clflushopt Mikulas Patocka
2020-04-17 17:57 ` Dan Williams
2020-04-17 20:45 ` Thomas Gleixner
2020-04-20 13:47 ` [PATCH v2] x86: introduce memcpy_flushcache_single Mikulas Patocka
2020-04-21 18:43 ` Dan Williams
2020-04-18 13:27 ` [PATCH] x86: introduce memcpy_flushcache_clflushopt David Laight
2020-04-18 15:21 ` Mikulas Patocka
2020-04-19 17:48 ` David Laight
2020-04-20 4:49 ` Dan Williams
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LRH.2.02.2004071223130.7791@file01.intranet.prod.int.rdu2.redhat.com \
--to=mpatocka@redhat.com \
--cc=bp@alien8.de \
--cc=dan.j.williams@intel.com \
--cc=dm-devel@redhat.com \
--cc=hpa@zytor.com \
--cc=linux-kernel@vger.kernel.org \
--cc=luto@amacapital.net \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).