From: Jan Beulich <jbeulich@suse.com>
To: Ankur Arora <ankur.a.arora@oracle.com>
Cc: andrew.cooper3@citrix.com, roger.pau@citrix.com,
	xen-devel@lists.xenproject.org
Subject: Re: x86: memset() / clear_page() / page scrubbing
Date: Fri, 9 Apr 2021 08:38:59 +0200	[thread overview]
Message-ID: <4d8202b3-ffe8-c4e7-e477-d8e7dc294c33@suse.com> (raw)
In-Reply-To: <20210409060845.3503745-1-ankur.a.arora@oracle.com>

[-- Attachment #1: Type: text/plain, Size: 4545 bytes --]

On 09.04.2021 08:08, Ankur Arora wrote:
> I'm working on somewhat related optimizations on Linux (clear_page(),
> going in the opposite direction, from REP STOSB to MOVNT) and have
> some comments/questions below.

Interesting.

> On 4/8/2021 6:58 AM, Jan Beulich wrote:
>> All,
>>
>> since over the years we've been repeatedly talking of changing the
>> implementation of these fundamental functions, I've taken some time
>> to do some measurements (just for possible clear_page() alternatives
>> to keep things manageable). I'm not sure I want to spend as much time
>> subsequently on memcpy() / copy_page() (or more, because there are
>> yet more combinations of arguments to consider), so for the moment I
>> think the route we're going to pick here is going to more or less
>> also apply to those.
>>
>> The present clear_page() is the way it is because of the desire to
>> avoid disturbing the cache. The effect of REP STOS on the L1 cache
>> (compared to the present use of MOVNTI) is more or less noticeable on
>> all hardware, and at least on Intel hardware more noticeable when the
>> cache starts out clean. For L2 the results are more mixed when
>> comparing cache-clean and cache-filled cases, but the difference
>> between MOVNTI and REP STOS remains or (at least on Zen2 and older
>> Intel hardware) becomes more prominent.
> 
> Could you give me any pointers on the cache effects of this? This
> obviously makes sense, but I couldn't come up with any benchmarks
> which would show it in a straightforward fashion.

No benchmarks in that sense, but a local debugging patch measuring
things before bringing up APs, to have a reasonably predictable
environment. I have attached it for your reference.
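
For concreteness, the two clearing primitives being compared boil down
to roughly the following (a simplified C sketch, not the exact Xen
routines; the real counterparts are the existing MOVNTI-based
clear_page_sse2() and the clear_page_stosq() added by the attached
patch):

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

/* MOVNTI-based clear: non-temporal stores avoid pulling the target
 * page into the cache hierarchy, at the cost of higher latency. */
static void clear_page_nt(void *page)
{
    uint64_t *p = page;
    size_t i;

    for ( i = 0; i < PAGE_SIZE / sizeof(*p); i++ )
        asm volatile ( "movnti %1, %0" : "=m" (p[i]) : "r" (0UL) );
    asm volatile ( "sfence" ::: "memory" );   /* order the NT stores */
}

/* REP STOSQ-based clear: ordinary cacheable stores - lower latency on
 * most parts, but the cleared page displaces existing cache lines. */
static void clear_page_rep(void *page)
{
    unsigned long cnt = PAGE_SIZE / 8;

    asm volatile ( "rep stosq"
                   : "+D" (page), "+c" (cnt)
                   : "a" (0UL)
                   : "memory" );
}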

>> Otoh REP STOS, as was to be expected, in most cases has meaningfully
>> lower latency than MOVNTI.
>>
>> Because I was curious I also included AVX (32-byte stores), AVX512
>> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
>> clear win except on the vendors' first generations implementing it
>> (but I've left out any playing with CR0.TS, which is what I expect
>> would take this out as an option), AVX512 isn't a win on Skylake
>> (perhaps newer hardware does better). CLZERO has slightly higher
>> impact on L1 than MOVNTI, but lower than REP STOS.
> 
> Could you elaborate on what kind of difference in L1 impact you are
> talking about? Evacuation of cachelines?

Replacement of cache lines, yes. As you may see from that patch, I
prefill the cache, do the clearing, and then measure how much longer
the same operation that was used for prefilling takes afterwards. If
the clearing left the cache completely alone (or if the hw prefetcher
was really good), there would be no difference.
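
In condensed form, what the CHK() sequence in the attached patch does
per variant is roughly the following (a sketch; rdtsc_ordered() and
probe() stand in for the helpers used there, and the buffers are
assumed to have been allocated and sized to L1D/L2 by the caller):

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096UL

/* Stand-ins for the helpers used in the attached patch. */
extern uint64_t rdtsc_ordered(void);
extern unsigned int probe(const unsigned int *space, unsigned int nr);

struct clear_timing { uint64_t pre, clr, post; };

/* Prefill the cache via "fill", time a fixed read pattern over it,
 * clear "buf" with the variant under test, then re-time the same
 * reads: post vs. pre shows the cache impact, clr the latency. */
static struct clear_timing measure(void (*clear_variant)(void *),
                                   void *fill, unsigned int fill_pages,
                                   void *buf, unsigned int clr_pages)
{
    struct clear_timing t;
    uint64_t start;
    unsigned int pfn;

    memset(fill, 0x5a, fill_pages * PAGE_SIZE);    /* warm the cache */

    start = rdtsc_ordered();
    (void)probe(fill, fill_pages);                 /* baseline reads */
    t.pre = rdtsc_ordered() - start;

    start = rdtsc_ordered();
    for ( pfn = 0; pfn < clr_pages; pfn++ )
        clear_variant((char *)buf + pfn * PAGE_SIZE);
    t.clr = rdtsc_ordered() - start;               /* clearing latency */

    start = rdtsc_ordered();
    (void)probe(fill, fill_pages);                 /* reads after clearing */
    t.post = rdtsc_ordered() - start;

    return t;
}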

>> Its latency is between
>> both when the caches are warm, and better than both when the caches
>> are cold.
>>
>> Therefore I think that we want to distinguish page clearing (where
>> we care about latency) from (background) page scrubbing (where I
>> think the goal ought to be to avoid disturbing the caches). That
>> would make it
>> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>>   synchronous scrubbing),
>> - MOVNTI for scrub_page() (when done from idle context), unless
>>   CLZERO is available.
>> Whether in addition we should take into consideration activity of
>> other (logical) CPUs sharing caches I don't know - this feels like
>> it could get complex pretty quickly.
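
Spelled out, that split might look roughly like this (a hypothetical
sketch only; clear_page_sse2() is the existing routine, while the
STOSQ/CLZERO variants and the feature flag are assumed along the
lines of the attached patch):

#include <stdbool.h>

/* Low-level primitives assumed to exist, per the attached patch. */
extern void clear_page_sse2(void *page);    /* MOVNTI: cache-preserving */
extern void clear_page_stosq(void *page);   /* REP STOSQ: low latency   */
extern void clear_page_clzero(void *page);  /* CLZERO, where available  */
extern bool cpu_has_clzero;                 /* assumed feature flag     */

/* Latency matters here: the caller is waiting for the page. */
void clear_page(void *page)
{
    clear_page_stosq(page);
}

/* Background scrubbing from idle context: avoid disturbing the caches.
 * (Scrubbing is treated as plain zeroing here for simplicity.) */
void scrub_page(void *page)
{
    if ( cpu_has_clzero )
        clear_page_clzero(page);
    else
        clear_page_sse2(page);
}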
> 
> The one other case might be for ~L3 (or larger) regions. In my tests,
> MOVNT/CLZERO is almost always better (the one exception being Skylake)
> wrt both cache and latency for larger extents.

Good to know - will keep this in mind.

> In the particular cases I was looking at (mmap+MAP_POPULATE and
> page-fault path), that makes the choice of always using MOVNT/CLZERO
> easy for GB pages, but fuzzier for 2MB pages.
> 
> Not sure if the large-page case is interesting for you though.

Well, we never fill large pages in one go, yet the scrubbing may
touch many individual pages in close succession. But for the
(background) scrubbing my recommendation is to use MOVNT/CLZERO
anyway, irrespective of volume. While upon large page allocations
we may also end up scrubbing many pages in close succession, I'm
not sure that's worth optimizing for - we at least hope for the
pages to have been scrubbed in the background before they get
re-used. Plus we don't (currently) know up front how many of them
may still need scrubbing; this isn't difficult to at least
estimate, but may require yet another loop over the constituent
pages.
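
That extra loop could be as simple as the following (a hypothetical
sketch; the per-page predicate stands in for whatever "still needs
scrubbing" indication, e.g. a PGC_need_scrub-style flag, ends up
being used):

#include <stdbool.h>

/* Minimal stand-in for struct page_info, just for this sketch. */
struct page_info { unsigned long count_info; };

/* Assumed per-page "still dirty" test; placeholder flag check only. */
static bool page_needs_scrub(const struct page_info *pg)
{
    return pg->count_info & 1;
}

/* Estimate how many of the 2^order constituent pages of an allocation
 * still need scrubbing - the "yet another loop" mentioned above. */
static unsigned long count_unscrubbed(const struct page_info *pg,
                                      unsigned int order)
{
    unsigned long i, dirty = 0;

    for ( i = 0; i < (1UL << order); i++ )
        if ( page_needs_scrub(&pg[i]) )
            dirty++;

    return dirty;
}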

Jan

[-- Attachment #2: x86-clear-page-ERMS.patch --]
[-- Type: text/plain, Size: 6505 bytes --]


TODO: remove (or split out) //temp-s
Note: Ankur indicates that for ~L3-size or larger regions MOVNT/CLZERO is better even latency-wise

--- unstable.orig/xen/arch/x86/clear_page.S	2021-02-25 09:28:14.175636881 +0100
+++ unstable/xen/arch/x86/clear_page.S	2021-02-25 10:04:04.315325973 +0100
@@ -16,3 +16,66 @@ ENTRY(clear_page_sse2)
 
         sfence
         ret
+
+ENTRY(clear_page_stosb)
+        mov     $PAGE_SIZE, %ecx
+        xor     %eax,%eax
+        rep stosb
+        ret
+
+ENTRY(clear_page_stosl)
+        mov     $PAGE_SIZE/4, %ecx
+        xor     %eax, %eax
+        rep stosl
+        ret
+
+ENTRY(clear_page_stosq)
+        mov     $PAGE_SIZE/8, %ecx
+        xor     %eax, %eax
+        rep stosq
+        ret
+
+ENTRY(clear_page_avx)
+        mov     $PAGE_SIZE/128, %ecx
+        vpxor   %xmm0, %xmm0, %xmm0
+0:      vmovntdq %ymm0,   (%rdi)
+        vmovntdq %ymm0, 32(%rdi)
+        vmovntdq %ymm0, 64(%rdi)
+        vmovntdq %ymm0, 96(%rdi)
+        sub     $-128, %rdi
+        sub     $1, %ecx
+        jnz     0b
+        sfence
+        ret
+
+#if __GNUC__ > 4
+ENTRY(clear_page_avx512)
+        mov     $PAGE_SIZE/256, %ecx
+        vpxor   %xmm0, %xmm0, %xmm0
+0:      vmovntdq %zmm0,    (%rdi)
+        vmovntdq %zmm0,  64(%rdi)
+        vmovntdq %zmm0, 128(%rdi)
+        vmovntdq %zmm0, 192(%rdi)
+        add     $256, %rdi
+        sub     $1, %ecx
+        jnz     0b
+        sfence
+        ret
+#endif
+
+#if __GNUC__ > 5
+ENTRY(clear_page_clzero)
+        mov     %rdi, %rax
+        mov     $PAGE_SIZE/256, %ecx
+0:      clzero
+        add     $64, %rax
+        clzero
+        add     $64, %rax
+        clzero
+        add     $64, %rax
+        clzero
+        add     $64, %rax
+        sub     $1, %ecx
+        jnz     0b
+        ret
+#endif
--- unstable.orig/xen/arch/x86/cpu/common.c	2021-02-09 16:20:45.000000000 +0100
+++ unstable/xen/arch/x86/cpu/common.c	2021-02-09 16:20:45.000000000 +0100
@@ -238,6 +238,7 @@ int get_model_name(struct cpuinfo_x86 *c
 }
 
 
+extern unsigned l1d_size, l2_size;//temp
 void display_cacheinfo(struct cpuinfo_x86 *c)
 {
 	unsigned int dummy, ecx, edx, size;
@@ -250,6 +251,7 @@ void display_cacheinfo(struct cpuinfo_x8
 				              " D cache %uK (%u bytes/line)\n",
 				       edx >> 24, edx & 0xFF, ecx >> 24, ecx & 0xFF);
 			c->x86_cache_size = (ecx >> 24) + (edx >> 24);
+if(ecx >>= 24) l1d_size = ecx;//temp
 		}
 	}
 
@@ -260,6 +262,7 @@ void display_cacheinfo(struct cpuinfo_x8
 
 	size = ecx >> 16;
 	if (size) {
+l2_size =//temp
 		c->x86_cache_size = size;
 
 		if (opt_cpu_info)
--- unstable.orig/xen/arch/x86/cpu/intel_cacheinfo.c	2021-02-25 09:28:14.175636881 +0100
+++ unstable/xen/arch/x86/cpu/intel_cacheinfo.c	2021-02-09 16:20:23.000000000 +0100
@@ -116,6 +116,7 @@ static int find_num_cache_leaves(void)
 	return i;
 }
 
+extern unsigned l1d_size, l2_size;//temp
 void init_intel_cacheinfo(struct cpuinfo_x86 *c)
 {
 	unsigned int trace = 0, l1i = 0, l1d = 0, l2 = 0, l3 = 0; /* Cache sizes */
@@ -230,12 +231,14 @@ void init_intel_cacheinfo(struct cpuinfo
 	}
 
 	if (new_l1d)
+l1d_size =//temp
 		l1d = new_l1d;
 
 	if (new_l1i)
 		l1i = new_l1i;
 
 	if (new_l2) {
+l2_size =//temp
 		l2 = new_l2;
 	}
 
--- unstable.orig/xen/arch/x86/mm.c	2021-02-25 09:28:41.215745784 +0100
+++ unstable/xen/arch/x86/mm.c	2021-04-06 15:44:32.478099453 +0200
@@ -284,6 +284,22 @@ static void __init assign_io_page(struct
     page->count_info |= PGC_allocated | 1;
 }
 
+static unsigned __init noinline probe(const unsigned*spc, unsigned nr) {//temp
+#define PAGE_ENTS (PAGE_SIZE / sizeof(*spc))
+ unsigned i, j, acc;
+ for(acc = i = 0; i < PAGE_SIZE / 64; ++i)
+  for(j = 0; j < nr; ++j)
+   acc += spc[j * PAGE_ENTS + ((i * (64 / sizeof(*spc)) * 7) & (PAGE_ENTS - 1))];
+ return acc & (i * nr - 1);
+#undef PAGE_ENTS
+}
+extern void clear_page_stosb(void*);//temp
+extern void clear_page_stosl(void*);//temp
+extern void clear_page_stosq(void*);//temp
+extern void clear_page_avx(void*);//temp
+extern void clear_page_avx512(void*);//temp
+extern void clear_page_clzero(void*);//temp
+unsigned l1d_size = KB(16), l2_size;//temp
 void __init arch_init_memory(void)
 {
     unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
@@ -392,6 +408,67 @@ void __init arch_init_memory(void)
     }
 #endif
 
+{//temp
+ unsigned order = get_order_from_pages(PFN_DOWN(l2_size << 10)) ?: 1;
+ void*fill = alloc_xenheap_pages(order, 0);
+ void*buf = alloc_xenheap_pages(order - 1, 0);
+ unsigned long cr0 = read_cr0();
+ printk("erms=%d fsrm=%d fzrm=%d fsrs=%d fsrcs=%d l1d=%uk l2=%uk\n",
+        !!boot_cpu_has(X86_FEATURE_ERMS), !!boot_cpu_has(X86_FEATURE_FSRM),
+        !!boot_cpu_has(X86_FEATURE_FZRM), !!boot_cpu_has(X86_FEATURE_FSRS),
+        !!boot_cpu_has(X86_FEATURE_FSRCS), l1d_size, l2_size);
+ clts();
+ for(unsigned pass = 0; pass < 4; ++pass) {
+  printk("L%d w/%s flush:\n", 2 - !(pass & 2), pass & 1 ? "" : "o");
+  wbinvd();
+  for(i = 0; fill && buf && i < 3; ++i) {
+   unsigned nr = PFN_DOWN((pass & 2 ? l2_size : l1d_size) << 10);
+   uint64_t start, pre, clr, post;
+
+#define CHK(kind) do { \
+ /* local_irq_disable(); */ \
+\
+ memset(buf, __LINE__ | (__LINE__ >> 8), nr * PAGE_SIZE / 2); \
+ if(pass & 1) wbinvd(); else mb(); \
+ memset(fill, __LINE__ | (__LINE__ >> 8), nr * PAGE_SIZE); \
+ mb(); \
+\
+ if(boot_cpu_has(X86_FEATURE_IBRSB) || boot_cpu_has(X86_FEATURE_IBPB)) \
+  wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB); \
+ start = rdtsc_ordered(); \
+ if(probe(fill, nr)) BUG(); \
+ pre = rdtsc_ordered() - start; \
+\
+ start = rdtsc_ordered(); \
+ for(pfn = 0; pfn < nr / 2; ++pfn) \
+  clear_page_##kind(buf + pfn * PAGE_SIZE); \
+ clr = rdtsc_ordered() - start; \
+\
+ if(boot_cpu_has(X86_FEATURE_IBRSB) || boot_cpu_has(X86_FEATURE_IBPB)) \
+  wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB); \
+ start = rdtsc_ordered(); \
+ if(probe(fill, nr)) BUG(); \
+ post = rdtsc_ordered() - start; \
+\
+ /* local_irq_enable(); */ \
+ printk(" pre=%lx " #kind "=%lx post=%lx\n", pre, clr, post); \
+} while(0)
+
+   CHK(sse2);
+   CHK(stosb);
+   CHK(stosl);
+   CHK(stosq);
+   if(boot_cpu_has(X86_FEATURE_AVX)) CHK(avx);
+   if(__GNUC__ > 4 && boot_cpu_has(X86_FEATURE_AVX512F)) CHK(avx512);
+   if(__GNUC__ > 5 && boot_cpu_has(X86_FEATURE_CLZERO)) CHK(clzero);
+
+#undef CHK
+  }
+ }
+ write_cr0(cr0);
+ free_xenheap_pages(buf, order - 1);
+ free_xenheap_pages(fill, order);
+}
     /* Generate a symbol to be used in linker script */
     ASM_CONSTANT(FIXADDR_X_SIZE, FIXADDR_X_SIZE);
 }
