From: Dan Williams <dan.j.williams@intel.com>
To: Mikulas Patocka <mpatocka@redhat.com>
Cc: Mike Snitzer <msnitzer@redhat.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	"dm-devel@redhat.com" <dm-devel@redhat.com>,
	X86 ML <x86@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2] memcpy_flushcache: use cache flusing for larger lengths
Date: Tue, 31 Mar 2020 14:19:32 -0700	[thread overview]
Message-ID: <CAPcyv4ijR185RLmtT+A+WZxJs309qPfdqj5eUDEkMgFbxsV+uw@mail.gmail.com> (raw)
In-Reply-To: <alpine.LRH.2.02.2003310709090.2117@file01.intranet.prod.int.rdu2.redhat.com>

[ add x86 and LKML ]

On Tue, Mar 31, 2020 at 5:27 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> On Tue, 31 Mar 2020, Elliott, Robert (Servers) wrote:
>
> > > -----Original Message-----
> > > From: Mikulas Patocka <mpatocka@redhat.com>
> > > Sent: Monday, March 30, 2020 6:32 AM
> > > To: Dan Williams <dan.j.williams@intel.com>; Vishal Verma
> > > <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>; Ira
> > > Weiny <ira.weiny@intel.com>; Mike Snitzer <msnitzer@redhat.com>
> > > Cc: linux-nvdimm@lists.01.org; dm-devel@redhat.com
> > > Subject: [PATCH v2] memcpy_flushcache: use cache flusing for larger
> > > lengths
> > >
> > > I tested dm-writecache performance on a machine with an Optane nvdimm
> > > and it turned out that for larger writes, cached stores + cache
> > > flushing perform better than non-temporal stores. This is the
> > > throughput of dm-writecache measured with this command:
> > > dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct
> > >
> > > block size    512        1024       2048       4096
> > > movnti        496 MB/s   642 MB/s   725 MB/s   744 MB/s
> > > clflushopt    373 MB/s   688 MB/s   1.1 GB/s   1.2 GB/s
> > >
> > > We can see that for smaller blocks, movnti performs better, but for
> > > larger blocks, clflushopt has better performance.
> >
> > There are other interactions to consider... see threads from the last
> > few years on the linux-nvdimm list.
>
> dm-writecache is the only Linux driver that uses memcpy_flushcache on
> persistent memory. There is also the btt driver; it uses the "do_io"
> method to write to persistent memory and I don't know where this method
> comes from.
>
> Anyway, if patching memcpy_flushcache conflicts with something else, we
> should introduce memcpy_flushcache_to_pmem.
>
> > For example, software generally expects that read()s take a long time and
> > avoids re-reading from disk; the normal pattern is to hold the data in
> > memory and read it from there. By using normal stores, CPU caches end up
> > holding a bunch of persistent memory data that is probably not going to
> > be read again any time soon, bumping out more useful data. In contrast,
> > movnti avoids filling the CPU caches.
>
> But if I write one cacheline and flush it immediately, it would consume
> just one associative entry in the cache.
>
> > Another option is the AVX vmovntdq instruction (if available), the
> > most recent of which does 64-byte (cache line) sized transfers from
> > zmm registers. There's a hefty context switching overhead (e.g.,
> > 304 clocks), and the CPU often runs AVX instructions at a slower
> > clock frequency, so it's hard to judge when it's worthwhile.
>
> The benchmark shows that 64-byte non-temporal avx512 vmovntdq is as good
> as 8-, 16- or 32-byte writes:
>
>                                          ram        nvdimm
> sequential write-nt 4 bytes              4.1 GB/s   1.3 GB/s
> sequential write-nt 8 bytes              4.1 GB/s   1.3 GB/s
> sequential write-nt 16 bytes (sse)       4.1 GB/s   1.3 GB/s
> sequential write-nt 32 bytes (avx)       4.2 GB/s   1.3 GB/s
> sequential write-nt 64 bytes (avx512)    4.1 GB/s   1.3 GB/s
>
> With cached writes (where each cache line is immediately followed by clwb
> or clflushopt), 8-, 16- or 32-byte writes perform better than non-temporal
> stores and avx512 performs worse:
>
> sequential write 8 + clwb                5.1 GB/s   1.6 GB/s
> sequential write 16 (sse) + clwb         5.1 GB/s   1.6 GB/s
> sequential write 32 (avx) + clwb         4.4 GB/s   1.5 GB/s
> sequential write 64 (avx512) + clwb      1.7 GB/s   0.6 GB/s

This is indeed compelling straight-line data. My concern, similar to
Robert's, is what it does to the rest of the system. In addition to
increasing cache pollution, which I agree is difficult to quantify, it
may also increase read-for-ownership traffic. Could you collect 'perf
stat' for this clwb vs nt comparison to check whether any of this
incidental overhead shows up in the numbers? Here is a 'perf stat'
line that might capture it:

perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetch-misses -r 5 $benchmark

In both cases, nt and explicit clwb, there's nothing that prevents the
dirty cacheline or the fill buffer from being written back / flushed
before the full line is populated, and maybe you are hitting that
scenario differently with the two approaches? I did not immediately see
a perf counter for events like this.

Going forward, I think this gets better with the movdir64b instruction
because that can guarantee full-line-sized store-buffer writes. Maybe
the perf data can help make a decision about whether we go with your
patch in the near term?

> > In user space, glibc faces similar choices for its memcpy() functions;
> > glibc memcpy() uses non-temporal stores for transfers > 75% of the
> > L3 cache size divided by the number of cores. For example, with
> > glibc-2.26-16.fc27 (August 2017), on a Broadwell system with an
> > E5-2699 (36 cores, 45 MiB L3 cache), non-temporal stores are used
> > for memcpy()s over 36 MiB.
>
> BTW, what does glibc do with reads? Does it flush them from the cache
> after they are consumed?
>
> AFAIK glibc doesn't support persistent memory - i.e. there is no function
> that flushes data and the user has to use inline assembly for that.

Yes, and I don't know of any copy routines that try to limit the cache
pollution of pulling the source data for a copy, only the destination.

> It'd be nice if glibc, PMDK, and the kernel used the same algorithms.

Yes, it would. Although I think PMDK would make a different decision
than the kernel when optimizing for highest bandwidth for the local
application vs bandwidth efficiency across all applications.
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
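[ Editor's note: the policy under discussion in the patch — non-temporal
stores for small lengths, cached stores followed by a per-line flush
above a cutoff — can be sketched in portable C. This is a standalone
illustration only, not the kernel's memcpy_flushcache: the 1024-byte
threshold, the function names, and the plain-memcpy stand-in for the
movnti path are all assumptions for the sketch, and clflushopt is
compiled in only when the build enables it. ]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative cutoff: below this, the real code would use movnti;
 * above it, cached stores + clflushopt. The value is a placeholder,
 * not the threshold Mikulas measured. */
#define FLUSHCACHE_NT_MAX 1024

static void flush_cacheline(void *addr)
{
#if defined(__x86_64__) && defined(__CLFLUSHOPT__)
	/* Flush one 64-byte cache line back to memory. */
	asm volatile("clflushopt %0" : "+m" (*(volatile char *)addr));
#else
	(void)addr;	/* no-op fallback so the sketch builds anywhere */
#endif
}

static void memcpy_flushcache_sketch(void *dst, const void *src, size_t n)
{
	if (n <= FLUSHCACHE_NT_MAX) {
		/* Small copy: the kernel path would use movnti here;
		 * plain memcpy stands in for it in this sketch. */
		memcpy(dst, src, n);
		return;
	}
	/* Large copy: cached stores, then flush every 64-byte line
	 * that the destination range touches. */
	memcpy(dst, src, n);
	for (uintptr_t p = (uintptr_t)dst & ~(uintptr_t)63;
	     p < (uintptr_t)dst + n; p += 64)
		flush_cacheline((void *)p);
}
```

Either branch leaves the destination bytewise identical to the source;
only the cache behavior (and, on real pmem, the persistence ordering,
which would still need an sfence) differs.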