From: Ankur Arora <ankur.a.arora@oracle.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: kirill@shutemov.name, mhocko@kernel.org,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com,
	Ankur Arora <ankur.a.arora@oracle.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	Kim Phillips <kim.phillips@amd.com>,
	Reinette Chatre <reinette.chatre@intel.com>,
	Tony Luck <tony.luck@intel.com>,
	Tom Lendacky <thomas.lendacky@amd.com>,
	Wei Huang <wei.huang2@amd.com>
Subject: [PATCH 8/8] x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen
Date: Wed, 14 Oct 2020 01:32:59 -0700
Message-ID: <20201014083300.19077-9-ankur.a.arora@oracle.com>
In-Reply-To: <20201014083300.19077-1-ankur.a.arora@oracle.com>

System:           Oracle E2-2C
CPU:              2 nodes * 64 cores/node * 2 threads/core
                  AMD EPYC 7742 (Rome, 23:49:0)
Memory:           2048 GB evenly split between nodes
Microcode:        0x8301038
scaling_governor: performance
L3 size:          16 * 16MB
cpufreq/boost:    0

Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosq
(X86_FEATURE_REP_GOOD) and x86-64-movnt (X86_FEATURE_NT_GOOD):

              x86-64-stosq (5 runs)     x86-64-movnt (5 runs)      speedup
              -----------------------   -----------------------    -------
     size       BW        (   pstdev)     BW        (   pstdev)

     16MB      15.39 GB/s ( +- 9.14%)    14.56 GB/s ( +-19.43%)     -5.39%
    128MB      11.04 GB/s ( +- 4.87%)    14.49 GB/s ( +-13.22%)    +31.25%
   1024MB      11.86 GB/s ( +- 0.83%)    16.54 GB/s ( +- 0.04%)    +39.46%
   4096MB      11.89 GB/s ( +- 0.61%)    16.49 GB/s ( +- 0.28%)    +38.68%

The next workload exercises the page-clearing path directly by faulting over
an anonymous mmap region backed by 1GB pages. This workload is similar to the
creation phase of pinned guests in QEMU.

$ cat pf-test.c
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/mman.h>
 #include <linux/mman.h>

 #define HPAGE_BITS 30

 int main(int argc, char **argv) {
	unsigned long i;
	unsigned long len;	/* In GB */
	unsigned long offset = 0;
	unsigned long numpages;
	char *base;

	if (argc < 2)
		return 1;

	len = strtoul(argv[1], NULL, 0);
	len *= 1UL << 30;
	numpages = len >> HPAGE_BITS;

	/* Anonymous mapping: pass fd == -1 for portability. */
	base = mmap(NULL, len, PROT_READ|PROT_WRITE,
	            MAP_PRIVATE | MAP_ANONYMOUS |
		    MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
	if (base == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Touch one byte per 1GB page so that each page gets faulted in. */
	for (i = 0; i < numpages; i++) {
	        *((volatile char *)base + offset) = *(base + offset);
	        offset += 1UL << HPAGE_BITS;
	}

	return 0;
 }

The specific test is for a 128GB region but this is a single-threaded
O(n) workload so the exact region size is not material.

Page-clearing throughput for clear_page_rep(): 11.33 GBps
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128

 Performance counter stats for 'bin/pf-test 128' (5 runs):

    25,130,082,910      cpu-cycles                #    2.226 GHz                      ( +-  0.44% )  (54.54%)
     1,368,762,311      instructions              #    0.05  insn per cycle           ( +-  0.02% )  (54.54%)
     4,265,726,534      cache-references          #  377.794 M/sec                    ( +-  0.02% )  (54.54%)
       119,021,793      cache-misses              #    2.790 % of all cache refs      ( +-  3.90% )  (54.55%)
       413,825,787      branch-instructions       #   36.650 M/sec                    ( +-  0.01% )  (54.55%)
           236,847      branch-misses             #    0.06% of all branches          ( +- 18.80% )  (54.56%)
     2,152,320,887      L1-dcache-load-misses     #   40.40% of all L1-dcache accesses  ( +-  0.01% )  (54.55%)
     5,326,873,560      L1-dcache-loads           #  471.775 M/sec                    ( +-  0.20% )  (54.55%)
       828,943,234      L1-dcache-prefetches      #   73.415 M/sec                    ( +-  0.55% )  (54.54%)
            18,914      dTLB-loads                #    0.002 M/sec                    ( +- 47.23% )  (54.54%)
             4,423      dTLB-load-misses          #   23.38% of all dTLB cache accesses  ( +- 27.75% )  (54.54%)

           11.2917 +- 0.0499 seconds time elapsed  ( +-  0.44% )

Page-clearing throughput for clear_page_nt(): 16.29 GBps
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128

 Performance counter stats for 'bin/pf-test 128' (5 runs):

    17,523,166,924      cpu-cycles                #    2.230 GHz                      ( +-  0.03% )  (45.43%)
    24,801,270,826      instructions              #    1.42  insn per cycle           ( +-  0.01% )  (45.45%)
     2,151,391,033      cache-references          #  273.845 M/sec                    ( +-  0.01% )  (45.46%)
           168,555      cache-misses              #    0.008 % of all cache refs      ( +-  4.87% )  (45.47%)
     2,490,226,446      branch-instructions       #  316.974 M/sec                    ( +-  0.01% )  (45.48%)
           117,604      branch-misses             #    0.00% of all branches          ( +-  1.56% )  (45.48%)
           273,492      L1-dcache-load-misses     #    0.06% of all L1-dcache accesses  ( +-  2.14% )  (45.47%)
       490,340,458      L1-dcache-loads           #   62.414 M/sec                    ( +-  0.02% )  (45.45%)
            20,517      L1-dcache-prefetches      #    0.003 M/sec                    ( +-  9.61% )  (45.44%)
             7,413      dTLB-loads                #    0.944 K/sec                    ( +-  8.37% )  (45.44%)
             2,031      dTLB-load-misses          #   27.40% of all dTLB cache accesses  ( +-  8.30% )  (45.43%)

           7.85674 +- 0.00270 seconds time elapsed  ( +-  0.03% )

The L1-dcache-load-misses (L2$ access from DC Miss) count drops from
~2.15G to ~273K, which suggests we aren't doing write-allocate or
RFO. The L1-dcache-prefetches are also substantially lower (~829M vs
~20K).

Note that the IPC (0.05 vs 1.42) and the instruction and branch counts
are quite different, but that's just an artifact of switching from a
single 'REP; STOSQ' per PAGE_SIZE region to a MOVNTI loop.

The page-clearing BW improves by ~44% (11.33 GBps -> 16.29 GBps), in
line with the ~39% seen above for the larger memset sizes. Additionally,
a quick 'perf bench mem memset' comparison on AMD Naples (AMD EPYC 7551)
shows similar performance gains. So, enable X86_FEATURE_NT_GOOD for
AMD Zen (family 0x17).

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/cpu/amd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index dcc3d943c68f..c57eb6c28aa1 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -918,6 +918,9 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
 	set_cpu_cap(c, X86_FEATURE_ZEN);
 
+	if (c->x86 == 0x17)
+		set_cpu_cap(c, X86_FEATURE_NT_GOOD);
+
 #ifdef CONFIG_NUMA
 	node_reclaim_distance = 32;
 #endif
-- 
2.9.3


