* Optane nvdimm performance
@ 2020-03-29 20:25 Mikulas Patocka
2020-03-30 12:37 ` Bruggeman, Otto (external - Partner)
0 siblings, 1 reply; 3+ messages in thread
From: Mikulas Patocka @ 2020-03-29 20:25 UTC (permalink / raw)
To: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny, Mike Snitzer
Cc: linux-nvdimm, dm-devel
Hi

I performed some microbenchmarks on a system with a real Optane-based nvdimm
and found that the fastest way to write to persistent memory is to fill a
cache line with eight 8-byte writes and then issue clwb or clflushopt on that
cache line. With this method we can achieve 1.6 GB/s throughput for linear
writes; non-temporal stores achieve only 1.3 GB/s.
The results are here:
http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem.txt
The benchmarks are here:
http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/
The winning benchmark is this:
http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/thrp-write-8-clwb.c
However, the kernel does not use this fastest method; it uses non-temporal
stores instead.
I took the novafs filesystem (see git clone
https://github.com/NVSL/linux-nova); it uses
__copy_from_user_inatomic_nocache, which calls __copy_user_nocache, which
performs non-temporal stores. I hacked __copy_user_nocache to use clwb
instead of non-temporal stores, and it improved filesystem performance
significantly.
This is the patch:
http://people.redhat.com/~mpatocka/testcases/pmem/benchmarks/copy-nocache.patch
(for kernel 5.1, because novafs needs that version), and these are the
benchmark results:
http://people.redhat.com/~mpatocka/testcases/pmem/benchmarks/fs-bench.txt
- you can see that "test2" has twice the write throughput of "test1".
I took the dm-writecache driver; it uses memcpy_flushcache to write data to
persistent memory. I hacked memcpy_flushcache to use clwb instead of
non-temporal stores.
The result: for 512-byte writes, non-temporal stores perform better than
cache flushing; for 1024-byte and larger writes, cache flushing performs
better than non-temporal stores. (I also tried cached writes + clwb for
dm-writecache metadata updates, but it performed badly.)
Do you have an explanation for why non-temporal stores are better for
512-byte copies but worse for 1024-byte copies (for example, filling up
some buffers inside the CPU)?
In the next email, I'm sending a patch that makes memcpy_flushcache use
clflushopt for transfers larger than 768 bytes.
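The idea of that patch can be illustrated with a size-based dispatch like the
following sketch. The 768-byte threshold comes from the measurements above;
both copy paths are reduced to plain memcpy placeholders here (the real code
would use movnti stores on one path and cached stores + clflushopt/clwb on
the other), and memcpy_flushcache_sketch is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FLUSH_THRESHOLD 768 /* crossover point from the measurements above */

/* Placeholder for the existing non-temporal copy path. */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n); /* real code: movnti stores + sfence */
}

/* Placeholder for cached stores followed by cache-line flushes. */
static void copy_cached_flush(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n); /* real code: ordinary stores + clflushopt + sfence */
}

/* Pick the faster path by transfer size: non-temporal stores win for
 * small copies, cached writes + flushing win for larger ones. */
static void memcpy_flushcache_sketch(void *dst, const void *src, size_t n)
{
	if (n > FLUSH_THRESHOLD)
		copy_cached_flush(dst, src, n);
	else
		copy_nontemporal(dst, src, n);
}
```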
Mikulas
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
* RE: Optane nvdimm performance
2020-03-29 20:25 Optane nvdimm performance Mikulas Patocka
@ 2020-03-30 12:37 ` Bruggeman, Otto (external - Partner)
2020-03-30 15:42 ` Bruggeman, Otto (external - Partner)
0 siblings, 1 reply; 3+ messages in thread
From: Bruggeman, Otto (external - Partner) @ 2020-03-30 12:37 UTC (permalink / raw)
To: Mikulas Patocka, Dan Williams, Vishal Verma, Dave Jiang,
Ira Weiny, Mike Snitzer
Cc: linux-nvdimm, dm-devel
FYI. Let's see what answers come in...
-----Original Message-----
From: Mikulas Patocka <mpatocka@redhat.com>
Sent: Sunday, March 29, 2020 10:26 PM
To: Dan Williams <dan.j.williams@intel.com>; Vishal Verma <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>; Ira Weiny <ira.weiny@intel.com>; Mike Snitzer <msnitzer@redhat.com>
Cc: linux-nvdimm@lists.01.org; dm-devel@redhat.com
Subject: Optane nvdimm performance
[original message quoted in full]
* RE: Optane nvdimm performance
2020-03-30 12:37 ` Bruggeman, Otto (external - Partner)
@ 2020-03-30 15:42 ` Bruggeman, Otto (external - Partner)
0 siblings, 0 replies; 3+ messages in thread
From: Bruggeman, Otto (external - Partner) @ 2020-03-30 15:42 UTC (permalink / raw)
To: Bruggeman, Otto (external - Partner),
Mikulas Patocka, Dan Williams, Vishal Verma, Dave Jiang,
Ira Weiny, Mike Snitzer
Cc: linux-nvdimm, dm-devel
My apologies, I meant to forward this mail and managed to press the wrong button...
-----Original Message-----
From: Bruggeman, Otto (external - Partner) <otto.bruggeman@sap.com>
Sent: Monday, March 30, 2020 2:38 PM
To: Mikulas Patocka <mpatocka@redhat.com>; Dan Williams <dan.j.williams@intel.com>; Vishal Verma <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>; Ira Weiny <ira.weiny@intel.com>; Mike Snitzer <msnitzer@redhat.com>
Cc: linux-nvdimm@lists.01.org; dm-devel@redhat.com
Subject: [CAUTION] RE: Optane nvdimm performance
[previous message and original message quoted in full]