* x86: memset() / clear_page() / page scrubbing
@ 2021-04-08 13:58 Jan Beulich
  2021-04-09  6:08 ` Ankur Arora
  2021-04-13 13:17 ` Andrew Cooper
  0 siblings, 2 replies; 9+ messages in thread
From: Jan Beulich @ 2021-04-08 13:58 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Roger Pau Monné

All,

since over the years we've been repeatedly talking of changing the
implementation of these fundamental functions, I've taken some time
to do some measurements (just for possible clear_page() alternatives
to keep things manageable). I'm not sure I want to spend as much time
subsequently on memcpy() / copy_page() (or more, because there are
yet more combinations of arguments to consider), so for the moment I
assume that whatever route we pick here will more or less also apply
to those.

The present copy_page() is the way it is because of the desire to
avoid disturbing the cache. The effect of REP STOS on the L1 cache
(compared to the present use of MOVNTI) is more or less noticeable on
all hardware, and at least on Intel hardware more noticeable when the
cache starts out clean. For L2 the results are more mixed when
comparing cache-clean and cache-filled cases, but the difference
between MOVNTI and REP STOS remains or (at least on Zen2 and older
Intel hardware) becomes more prominent.

Otoh REP STOS, as was to be expected, in most cases has meaningfully
lower latency than MOVNTI.

Because I was curious I also included AVX (32-byte stores), AVX512
(64-byte stores), and AMD's CLZERO in my testing. While AVX is a
clear win except on the vendors' first generations implementing it
(though I've left out any playing with CR0.TS, which I expect is
what would rule AVX out as an option anyway), AVX512 isn't a clear
win, at least not on Skylake (perhaps newer hardware does better).
CLZERO has slightly higher impact on L1 than MOVNTI, but lower than
REP STOS. Its latency is between the two when the caches are warm,
and better than both when the caches are cold.

Therefore I think that we want to distinguish page clearing (where
we care about latency) from (background) page scrubbing (where I
think the goal ought to be to avoid disturbing the caches). That
would make it
- REP STOS{L,Q} for clear_page() (perhaps also to be used for
  synchronous scrubbing),
- MOVNTI for scrub_page() (when done from idle context), unless
  CLZERO is available.
I don't know whether we should in addition take into consideration
the activity of other (logical) CPUs sharing the caches - this feels
like it could get complex pretty quickly.
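
To make that split a bit more concrete, a minimal sketch could look
like the below (illustration only: the clear_page_hot() /
scrub_page_cold() names are made up, and the real thing would of
course use alternatives patching rather than runtime checks):

    /* Latency-sensitive path: plain REP STOSQ. */
    static void clear_page_hot(void *va)
    {
        unsigned long cnt = PAGE_SIZE / 8;

        asm volatile ( "rep stosq"
                       : "+D" (va), "+c" (cnt)
                       : "a" (0UL)
                       : "memory" );
    }

    /* Background scrubbing: avoid disturbing the caches.  CLZERO where
     * available, MOVNTI otherwise; both are weakly ordered stores,
     * hence the trailing SFENCE. */
    static void scrub_page_cold(void *va)
    {
        unsigned int i;

        if ( boot_cpu_has(X86_FEATURE_CLZERO) )
            for ( i = 0; i < PAGE_SIZE; i += 64 )
                asm volatile ( "clzero"
                               :: "a" ((char *)va + i) : "memory" );
        else
            for ( i = 0; i < PAGE_SIZE / 8; ++i )
                asm volatile ( "movnti %1, %0"
                               : "=m" (((unsigned long *)va)[i])
                               : "r" (0UL) );

        asm volatile ( "sfence" ::: "memory" );
    }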

For memset() we already simply use REP STOSB. I don't see a strong
need to change that, but it may be worth considering bringing it
closer to memcpy() - i.e. doing the main chunk with REP STOS{L,Q}.
They perform somewhat better in a number of cases (including when
ERMS is advertised, i.e. on my Haswell and Skylake, which isn't
what I would have expected). We may want to put the whole thing in
a .S file though, seeing that the C function right now consists of
little more than an asm().
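
In C terms the shape I have in mind is roughly the below (just a
sketch with an illustrative name, not the eventual implementation,
which - as said - may better live in a .S file):

    /* Bulk of the fill via REP STOSQ, with a REP STOSB tail for the
     * remaining 0-7 bytes. */
    void *memset_sketch(void *dst, int c, unsigned long n)
    {
        void *d = dst;
        unsigned long pattern = 0x0101010101010101UL * (unsigned char)c;
        unsigned long qwords = n / 8, tail = n % 8;

        asm volatile ( "rep stosq"
                       : "+D" (d), "+c" (qwords)
                       : "a" (pattern)
                       : "memory" );
        asm volatile ( "rep stosb"
                       : "+D" (d), "+c" (tail)
                       : "a" (pattern)
                       : "memory" );

        return dst;
    }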

For memcpy() I'm inclined to suggest that we simply use REP MOVSB
on ERMS hardware, and stay with what we have everywhere else.
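
I.e. on such hardware simply something like this (sketch only, with
a made-up name; the real function would presumably be selected via
alternatives):

    void *memcpy_erms_sketch(void *dst, const void *src, unsigned long n)
    {
        void *d = dst;

        /* ERMS: a single REP MOVSB for the whole copy. */
        asm volatile ( "rep movsb"
                       : "+D" (d), "+S" (src), "+c" (n)
                       :
                       : "memory" );

        return dst;
    }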

copy_page() (or really copy_domain_page()) doesn't have many uses,
so I'm not sure how worthwhile it is to do much optimization there.
It might be an option to simply expand it to memcpy(), like Arm
does.

Looking forward, on CPUs having "Fast Short REP CMPSB/SCASB" we
may want to figure out whether using these for strlen(), strcmp(),
strchr(), memchr(), and/or memcmp() would be a win.
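
strlen(), for example, could then become something as simple as the
below (sketch only, illustrative name; whether this actually wins is
exactly what would need measuring):

    unsigned long strlen_scasb(const char *s)
    {
        const char *p = s;
        unsigned long n = ~0UL;

        /* Scan for the NUL byte; the "memory" clobber merely tells the
         * compiler the string contents get read. */
        asm volatile ( "repne scasb"
                       : "+D" (p), "+c" (n)
                       : "a" (0)
                       : "memory" );

        /* %rdi ends up one past the terminating NUL. */
        return p - s - 1;
    }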

Thoughts anyone, before I start creating actual patches?

Jan



* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-08 13:58 x86: memset() / clear_page() / page scrubbing Jan Beulich
@ 2021-04-09  6:08 ` Ankur Arora
  2021-04-09  6:38   ` Jan Beulich
  2021-04-13 13:17 ` Andrew Cooper
  1 sibling, 1 reply; 9+ messages in thread
From: Ankur Arora @ 2021-04-09  6:08 UTC (permalink / raw)
  To: jbeulich; +Cc: andrew.cooper3, roger.pau, xen-devel

Hi Jan,

I'm working on somewhat related optimizations on Linux (clear_page(),
going in the opposite direction, from REP STOSB to MOVNT) and have
some comments/questions below.

(Discussion on v1 here:
https://lore.kernel.org/lkml/20201014083300.19077-1-ankur.a.arora@oracle.com/)

On 4/8/2021 6:58 AM, Jan Beulich wrote:
> All,
>
> since over the years we've been repeatedly talking of changing the
> implementation of these fundamental functions, I've taken some time
> to do some measurements (just for possible clear_page() alternatives
> to keep things manageable). I'm not sure I want to spend as much time
> subsequently on memcpy() / copy_page() (or more, because there are
> yet more combinations of arguments to consider), so for the moment I
> think the route we're going to pick here is going to more or less
> also apply to those.
>
> The present copy_page() is the way it is because of the desire to
> avoid disturbing the cache. The effect of REP STOS on the L1 cache
> (compared to the present use of MOVNTI) is more or less noticable on
> all hardware, and at least on Intel hardware more noticable when the
> cache starts out clean. For L2 the results are more mixed when
> comparing cache-clean and cache-filled cases, but the difference
> between MOVNTI and REP STOS remains or (at least on Zen2 and older
> Intel hardware) becomes more prominent.

Could you give me any pointers on the cache effects here? This
obviously makes sense, but I couldn't come up with any benchmarks
which would show this in a straightforward fashion.

>
> Otoh REP STOS, as was to be expected, in most cases has meaningfully
> lower latency than MOVNTI.
>
> Because I was curious I also included AVX (32-byte stores), AVX512
> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
> clear win except on the vendors' first generations implementing it
> (but I've left out any playing with CR0.TS, which is what I expect
> would take this out as an option), AVX512 isn't on Skylake (perhaps
> newer hardware does better). CLZERO has slightly higher impact on
> L1 than MOVNTI, but lower than REP STOS.

Could you elaborate on what kind of difference in L1 impact you are
talking about? Evacuation of cachelines?

> Its latency is between
> both when the caches are warm, and better than both when the caches
> are cold.
>
> Therefore I think that we want to distinguish page clearing (where
> we care about latency) from (background) page scrubbing (where I
> think the goal ought to be to avoid disturbing the caches). That
> would make it
> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>   synchronous scrubbing),
> - MOVNTI for scrub_page() (when done from idle context), unless
>   CLZERO is available.
> Whether in addition we should take into consideration activity of
> other (logical) CPUs sharing caches I don't know - this feels like
> it could get complex pretty quickly.

The one other case might be for ~L3 (or larger) regions. In my tests,
MOVNT/CLZERO is almost always better (the one exception being Skylake)
wrt both cache and latency for larger extents.

In the particular cases I was looking at (mmap+MAP_POPULATE and
page-fault path), that makes the choice of always using MOVNT/CLZERO
easy for GB pages, but fuzzier for 2MB pages.

Not sure if the large-page case is interesting for you though.
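
In case it is, the kind of size-based cut-off I mean would look
roughly like this (purely illustrative: the name, the "around L3
size" threshold parameter, and the assumption of 8-byte-multiple
sizes are all just for the sketch):

    void clear_extent(void *va, unsigned long bytes, unsigned long l3_bytes)
    {
        if ( bytes < l3_bytes )
        {
            unsigned long cnt = bytes / 8;

            /* Small extents: cached stores via REP STOSQ. */
            asm volatile ( "rep stosq"
                           : "+D" (va), "+c" (cnt)
                           : "a" (0UL)
                           : "memory" );
        }
        else
        {
            unsigned long i, *p = va;

            /* ~L3-sized or larger: non-temporal stores, then fence. */
            for ( i = 0; i < bytes / 8; ++i )
                asm volatile ( "movnti %1, %0"
                               : "=m" (p[i])
                               : "r" (0UL) );
            asm volatile ( "sfence" ::: "memory" );
        }
    }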


Thanks
Ankur

>
> For memset() we already simply use REP STOSB. I don't see a strong
> need to change that, but it may be worth to consider bringing it
> closer to memcpy() - try to do the main chunk with REP STOS{L,Q}.
> They perform somewhat better in a number of cases (including when
> ERMS is advertised, i.e. on my Haswell and Skylake, which isn't
> what I would have expected). We may want to put the whole thing in
> a .S file though, seeing that the C function right now consists of
> little more than an asm().
>
> For memcpy() I'm inclined to suggest that we simply use REP MOVSB
> on ERMS hardware, and stay with what we have everywhere else.
>
> copy_page() (or really copy_domain_page()) doesn't have many uses,
> so I'm not sure how worthwhile it is to do much optimization there.
> It might be an option to simply expand it to memcpy(), like Arm
> does.
>
> Looking forward, on CPUs having "Fast Short REP CMPSB/SCASB" we
> may want to figure out whether using these for strlen(), strcmp(),
> strchr(), memchr(), and/or memcmp() would be a win.
>
> Thoughts anyone, before I start creating actual patches?
>
> Jan
>



* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-09  6:08 ` Ankur Arora
@ 2021-04-09  6:38   ` Jan Beulich
  2021-04-09 21:01     ` Ankur Arora
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2021-04-09  6:38 UTC (permalink / raw)
  To: Ankur Arora; +Cc: andrew.cooper3, roger.pau, xen-devel

[-- Attachment #1: Type: text/plain, Size: 4545 bytes --]

On 09.04.2021 08:08, Ankur Arora wrote:
> I'm working on somewhat related optimizations on Linux (clear_page(),
> going in the opposite direction, from REP STOSB to MOVNT) and have
> some comments/questions below.

Interesting.

> On 4/8/2021 6:58 AM, Jan Beulich wrote:
>> All,
>>
>> since over the years we've been repeatedly talking of changing the
>> implementation of these fundamental functions, I've taken some time
>> to do some measurements (just for possible clear_page() alternatives
>> to keep things manageable). I'm not sure I want to spend as much time
>> subsequently on memcpy() / copy_page() (or more, because there are
>> yet more combinations of arguments to consider), so for the moment I
>> think the route we're going to pick here is going to more or less
>> also apply to those.
>>
>> The present copy_page() is the way it is because of the desire to
>> avoid disturbing the cache. The effect of REP STOS on the L1 cache
>> (compared to the present use of MOVNTI) is more or less noticable on
>> all hardware, and at least on Intel hardware more noticable when the
>> cache starts out clean. For L2 the results are more mixed when
>> comparing cache-clean and cache-filled cases, but the difference
>> between MOVNTI and REP STOS remains or (at least on Zen2 and older
>> Intel hardware) becomes more prominent.
> 
> Could you give me any pointers on the cache-effects on this? This
> obviously makes sense but I couldn't come up with any benchmarks
> which would show this in a straight-forward fashion.

No benchmarks in that sense, but a local debugging patch measuring
things before bringing up APs, to have a reasonably predictable
environment. I have attached it for your reference.

>> Otoh REP STOS, as was to be expected, in most cases has meaningfully
>> lower latency than MOVNTI.
>>
>> Because I was curious I also included AVX (32-byte stores), AVX512
>> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
>> clear win except on the vendors' first generations implementing it
>> (but I've left out any playing with CR0.TS, which is what I expect
>> would take this out as an option), AVX512 isn't on Skylake (perhaps
>> newer hardware does better). CLZERO has slightly higher impact on
>> L1 than MOVNTI, but lower than REP STOS.
> 
> Could you elaborate on what kind of difference in L1 impact you are
> talking about? Evacuation of cachelines?

Replacement of ones, yes. As you may see from that patch, I prefill
the cache, do the clearing, and then measure how much longer the
operation used for prefilling takes afterwards. If the clearing
left the cache completely alone (or if the hw prefetcher was really
good), there would be no difference.

>> Its latency is between
>> both when the caches are warm, and better than both when the caches
>> are cold.
>>
>> Therefore I think that we want to distinguish page clearing (where
>> we care about latency) from (background) page scrubbing (where I
>> think the goal ought to be to avoid disturbing the caches). That
>> would make it
>> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>>   synchronous scrubbing),
>> - MOVNTI for scrub_page() (when done from idle context), unless
>>   CLZERO is available.
>> Whether in addition we should take into consideration activity of
>> other (logical) CPUs sharing caches I don't know - this feels like
>> it could get complex pretty quickly.
> 
> The one other case might be for ~L3 (or larger) regions. In my tests,
> MOVNT/CLZERO is almost always better (the one exception being Skylake)
> wrt both cache and latency for larger extents.

Good to know - will keep this in mind.

> In the particular cases I was looking at (mmap+MAP_POPULATE and
> page-fault path), that makes the choice of always using MOVNT/CLZERO
> easy for GB pages, but fuzzier for 2MB pages.
> 
> Not sure if the large-page case is interesting for you though.

Well, we never fill large pages in one go, yet the scrubbing may
touch many individual pages in close succession. But for the
(background) scrubbing my recommendation is to use MOVNT/CLZERO
anyway, irrespective of volume. While upon large page allocations
we may also end up scrubbing many pages in close succession, I'm
not sure that's worth optimizing for - we at least hope for the
pages to have got scrubbed in the background before they get
re-used. Plus we don't (currently) know up front how many of them
may still need scrubbing; this isn't difficult to at least
estimate, but may require yet another loop over the constituent
pages.

Jan

[-- Attachment #2: x86-clear-page-ERMS.patch --]
[-- Type: text/plain, Size: 6505 bytes --]


TODO: remove (or split out) //temp-s
Note: Ankur indicates that for ~L3-size or larger regions MOVNT/CLZERO is better even latency-wise

--- unstable.orig/xen/arch/x86/clear_page.S	2021-02-25 09:28:14.175636881 +0100
+++ unstable/xen/arch/x86/clear_page.S	2021-02-25 10:04:04.315325973 +0100
@@ -16,3 +16,66 @@ ENTRY(clear_page_sse2)
 
         sfence
         ret
+
+ENTRY(clear_page_stosb)
+        mov     $PAGE_SIZE, %ecx
+        xor     %eax,%eax
+        rep stosb
+        ret
+
+ENTRY(clear_page_stosl)
+        mov     $PAGE_SIZE/4, %ecx
+        xor     %eax, %eax
+        rep stosl
+        ret
+
+ENTRY(clear_page_stosq)
+        mov     $PAGE_SIZE/8, %ecx
+        xor     %eax, %eax
+        rep stosq
+        ret
+
+ENTRY(clear_page_avx)
+        mov     $PAGE_SIZE/128, %ecx
+        vpxor   %xmm0, %xmm0, %xmm0
+0:      vmovntdq %ymm0,   (%rdi)
+        vmovntdq %ymm0, 32(%rdi)
+        vmovntdq %ymm0, 64(%rdi)
+        vmovntdq %ymm0, 96(%rdi)
+        sub     $-128, %rdi
+        sub     $1, %ecx
+        jnz     0b
+        sfence
+        ret
+
+#if __GNUC__ > 4
+ENTRY(clear_page_avx512)
+        mov     $PAGE_SIZE/256, %ecx
+        vpxor   %xmm0, %xmm0, %xmm0
+0:      vmovntdq %zmm0,    (%rdi)
+        vmovntdq %zmm0,  64(%rdi)
+        vmovntdq %zmm0, 128(%rdi)
+        vmovntdq %zmm0, 192(%rdi)
+        add     $256, %rdi
+        sub     $1, %ecx
+        jnz     0b
+        sfence
+        ret
+#endif
+
+#if __GNUC__ > 5
+ENTRY(clear_page_clzero)
+        mov     %rdi, %rax
+        mov     $PAGE_SIZE/256, %ecx
+0:      clzero
+        add     $64, %rax
+        clzero
+        add     $64, %rax
+        clzero
+        add     $64, %rax
+        clzero
+        add     $64, %rax
+        sub     $1, %ecx
+        jnz     0b
+        ret
+#endif
--- unstable.orig/xen/arch/x86/cpu/common.c	2021-02-09 16:20:45.000000000 +0100
+++ unstable/xen/arch/x86/cpu/common.c	2021-02-09 16:20:45.000000000 +0100
@@ -238,6 +238,7 @@ int get_model_name(struct cpuinfo_x86 *c
 }
 
 
+extern unsigned l1d_size, l2_size;//temp
 void display_cacheinfo(struct cpuinfo_x86 *c)
 {
 	unsigned int dummy, ecx, edx, size;
@@ -250,6 +251,7 @@ void display_cacheinfo(struct cpuinfo_x8
 				              " D cache %uK (%u bytes/line)\n",
 				       edx >> 24, edx & 0xFF, ecx >> 24, ecx & 0xFF);
 			c->x86_cache_size = (ecx >> 24) + (edx >> 24);
+if(ecx >>= 24) l1d_size = ecx;//temp
 		}
 	}
 
@@ -260,6 +262,7 @@ void display_cacheinfo(struct cpuinfo_x8
 
 	size = ecx >> 16;
 	if (size) {
+l2_size =//temp
 		c->x86_cache_size = size;
 
 		if (opt_cpu_info)
--- unstable.orig/xen/arch/x86/cpu/intel_cacheinfo.c	2021-02-25 09:28:14.175636881 +0100
+++ unstable/xen/arch/x86/cpu/intel_cacheinfo.c	2021-02-09 16:20:23.000000000 +0100
@@ -116,6 +116,7 @@ static int find_num_cache_leaves(void)
 	return i;
 }
 
+extern unsigned l1d_size, l2_size;//temp
 void init_intel_cacheinfo(struct cpuinfo_x86 *c)
 {
 	unsigned int trace = 0, l1i = 0, l1d = 0, l2 = 0, l3 = 0; /* Cache sizes */
@@ -230,12 +231,14 @@ void init_intel_cacheinfo(struct cpuinfo
 	}
 
 	if (new_l1d)
+l1d_size =//temp
 		l1d = new_l1d;
 
 	if (new_l1i)
 		l1i = new_l1i;
 
 	if (new_l2) {
+l2_size =//temp
 		l2 = new_l2;
 	}
 
--- unstable.orig/xen/arch/x86/mm.c	2021-02-25 09:28:41.215745784 +0100
+++ unstable/xen/arch/x86/mm.c	2021-04-06 15:44:32.478099453 +0200
@@ -284,6 +284,22 @@ static void __init assign_io_page(struct
     page->count_info |= PGC_allocated | 1;
 }
 
+static unsigned __init noinline probe(const unsigned*spc, unsigned nr) {//temp
+#define PAGE_ENTS (PAGE_SIZE / sizeof(*spc))
+ unsigned i, j, acc;
+ for(acc = i = 0; i < PAGE_SIZE / 64; ++i)
+  for(j = 0; j < nr; ++j)
+   acc += spc[j * PAGE_ENTS + ((i * (64 / sizeof(*spc)) * 7) & (PAGE_ENTS - 1))];
+ return acc & (i * nr - 1);
+#undef PAGE_ENTS
+}
+extern void clear_page_stosb(void*);//temp
+extern void clear_page_stosl(void*);//temp
+extern void clear_page_stosq(void*);//temp
+extern void clear_page_avx(void*);//temp
+extern void clear_page_avx512(void*);//temp
+extern void clear_page_clzero(void*);//temp
+unsigned l1d_size = KB(16), l2_size;//temp
 void __init arch_init_memory(void)
 {
     unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
@@ -392,6 +408,67 @@ void __init arch_init_memory(void)
     }
 #endif
 
+{//temp
+ unsigned order = get_order_from_pages(PFN_DOWN(l2_size << 10)) ?: 1;
+ void*fill = alloc_xenheap_pages(order, 0);
+ void*buf = alloc_xenheap_pages(order - 1, 0);
+ unsigned long cr0 = read_cr0();
+ printk("erms=%d fsrm=%d fzrm=%d fsrs=%d fsrcs=%d l1d=%uk l2=%uk\n",
+        !!boot_cpu_has(X86_FEATURE_ERMS), !!boot_cpu_has(X86_FEATURE_FSRM),
+        !!boot_cpu_has(X86_FEATURE_FZRM), !!boot_cpu_has(X86_FEATURE_FSRS),
+        !!boot_cpu_has(X86_FEATURE_FSRCS), l1d_size, l2_size);
+ clts();
+ for(unsigned pass = 0; pass < 4; ++pass) {
+  printk("L%d w/%s flush:\n", 2 - !(pass & 2), pass & 1 ? "" : "o");
+  wbinvd();
+  for(i = 0; fill && buf && i < 3; ++i) {
+   unsigned nr = PFN_DOWN((pass & 2 ? l2_size : l1d_size) << 10);
+   uint64_t start, pre, clr, post;
+
+#define CHK(kind) do { \
+ /* local_irq_disable(); */ \
+\
+ memset(buf, __LINE__ | (__LINE__ >> 8), nr * PAGE_SIZE / 2); \
+ if(pass & 1) wbinvd(); else mb(); \
+ memset(fill, __LINE__ | (__LINE__ >> 8), nr * PAGE_SIZE); \
+ mb(); \
+\
+ if(boot_cpu_has(X86_FEATURE_IBRSB) || boot_cpu_has(X86_FEATURE_IBPB)) \
+  wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB); \
+ start = rdtsc_ordered(); \
+ if(probe(fill, nr)) BUG(); \
+ pre = rdtsc_ordered() - start; \
+\
+ start = rdtsc_ordered(); \
+ for(pfn = 0; pfn < nr / 2; ++pfn) \
+  clear_page_##kind(buf + pfn * PAGE_SIZE); \
+ clr = rdtsc_ordered() - start; \
+\
+ if(boot_cpu_has(X86_FEATURE_IBRSB) || boot_cpu_has(X86_FEATURE_IBPB)) \
+  wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB); \
+ start = rdtsc_ordered(); \
+ if(probe(fill, nr)) BUG(); \
+ post = rdtsc_ordered() - start; \
+\
+ /* local_irq_enable(); */ \
+ printk(" pre=%lx " #kind "=%lx post=%lx\n", pre, clr, post); \
+} while(0)
+
+   CHK(sse2);
+   CHK(stosb);
+   CHK(stosl);
+   CHK(stosq);
+   if(boot_cpu_has(X86_FEATURE_AVX)) CHK(avx);
+   if(__GNUC__ > 4 && boot_cpu_has(X86_FEATURE_AVX512F)) CHK(avx512);
+   if(__GNUC__ > 5 && boot_cpu_has(X86_FEATURE_CLZERO)) CHK(clzero);
+
+#undef CHK
+  }
+ }
+ write_cr0(cr0);
+ free_xenheap_pages(buf, order - 1);
+ free_xenheap_pages(fill, order);
+}
     /* Generate a symbol to be used in linker script */
     ASM_CONSTANT(FIXADDR_X_SIZE, FIXADDR_X_SIZE);
 }


* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-09  6:38   ` Jan Beulich
@ 2021-04-09 21:01     ` Ankur Arora
  2021-04-12  9:15       ` Jan Beulich
  0 siblings, 1 reply; 9+ messages in thread
From: Ankur Arora @ 2021-04-09 21:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, roger.pau, xen-devel

On 2021-04-08 11:38 p.m., Jan Beulich wrote:
> On 09.04.2021 08:08, Ankur Arora wrote:
>> I'm working on somewhat related optimizations on Linux (clear_page(),
>> going in the opposite direction, from REP STOSB to MOVNT) and have
>> some comments/questions below.
> 
> Interesting.
> 
>> On 4/8/2021 6:58 AM, Jan Beulich wrote:
>>> All,
>>>
>>> since over the years we've been repeatedly talking of changing the
>>> implementation of these fundamental functions, I've taken some time
>>> to do some measurements (just for possible clear_page() alternatives
>>> to keep things manageable). I'm not sure I want to spend as much time
>>> subsequently on memcpy() / copy_page() (or more, because there are
>>> yet more combinations of arguments to consider), so for the moment I
>>> think the route we're going to pick here is going to more or less
>>> also apply to those.
>>>
>>> The present copy_page() is the way it is because of the desire to
>>> avoid disturbing the cache. The effect of REP STOS on the L1 cache
>>> (compared to the present use of MOVNTI) is more or less noticable on
>>> all hardware, and at least on Intel hardware more noticable when the
>>> cache starts out clean. For L2 the results are more mixed when
>>> comparing cache-clean and cache-filled cases, but the difference
>>> between MOVNTI and REP STOS remains or (at least on Zen2 and older
>>> Intel hardware) becomes more prominent.
>>
>> Could you give me any pointers on the cache-effects on this? This
>> obviously makes sense but I couldn't come up with any benchmarks
>> which would show this in a straight-forward fashion.
> 
> No benchmarks in that sense, but a local debugging patch measuring
> things before bringing up APs, to have a reasonably predictable
> environment. I have attached it for your reference.

Thanks, that does look like a pretty good predictable test.
(Btw, there might be an oversight in the clear_page_clzero() logic.
I believe that also needs an sfence.)

Just curious: you had commented out the local irq disable/enable clauses.
Is that because you decided that the code ran at an early enough
point that they were not required, or for some other reason?

> 
>>> Otoh REP STOS, as was to be expected, in most cases has meaningfully
>>> lower latency than MOVNTI.
>>>
>>> Because I was curious I also included AVX (32-byte stores), AVX512
>>> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
>>> clear win except on the vendors' first generations implementing it
>>> (but I've left out any playing with CR0.TS, which is what I expect
>>> would take this out as an option), AVX512 isn't on Skylake (perhaps
>>> newer hardware does better). CLZERO has slightly higher impact on
>>> L1 than MOVNTI, but lower than REP STOS.
>>
>> Could you elaborate on what kind of difference in L1 impact you are
>> talking about? Evacuation of cachelines?
> 
> Replacement of ones, yes. As you may see from that patch, I prefill
> the cache, do the clearing, and then measure how much longer the
> same operation takes that was used for prefilling. If the clearing
> left the cache completely alone (or if the hw prefetcher was really
> good), there would be no difference.

Yeah, that does sound like a good way to get an idea of how much
the clear_page_*() variants perturb the cache.

> 
>>> Its latency is between
>>> both when the caches are warm, and better than both when the caches
>>> are cold.
>>>
>>> Therefore I think that we want to distinguish page clearing (where
>>> we care about latency) from (background) page scrubbing (where I
>>> think the goal ought to be to avoid disturbing the caches). That
>>> would make it
>>> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>>>    synchronous scrubbing),
>>> - MOVNTI for scrub_page() (when done from idle context), unless
>>>    CLZERO is available.
>>> Whether in addition we should take into consideration activity of
>>> other (logical) CPUs sharing caches I don't know - this feels like
>>> it could get complex pretty quickly.
>>
>> The one other case might be for ~L3 (or larger) regions. In my tests,
>> MOVNT/CLZERO is almost always better (the one exception being Skylake)
>> wrt both cache and latency for larger extents.
> 
> Good to know - will keep this in mind.
> 
>> In the particular cases I was looking at (mmap+MAP_POPULATE and
>> page-fault path), that makes the choice of always using MOVNT/CLZERO
>> easy for GB pages, but fuzzier for 2MB pages.
>>
>> Not sure if the large-page case is interesting for you though.
> 
> Well, we never fill large pages in one go, yet the scrubbing may
> touch many individual pages in close succession. But for the
> (background) scrubbing my recommendation is to use MOVNT/CLZERO
> anyway, irrespective of volume. While upon large page allocations
> we may also end up scrubbing many pages in close succession, I'm
> not sure that's worth optimizing for - we at least hope for the
> pages to have got scrubbed in the background before they get
> re-used. Plus we don't (currently) know up front how many of them
> may still need scrubbing; this isn't difficult to at least
> estimate, but may require yet another loop over the constituent
> pages.

Agreed, MOVNT/CLZERO do seem ideally suited for background scrubbing.
Alas, AFAICS Linux currently only does foreground cleaning. The
only reason I can think of for that "decision" is maybe that
there is one trusted user with a significant footprint -- the page
cache -- where pages can be allocated without needing to be cleared.

That said, given that background scrubbing is a fairly cheap way of
time-shifting work to idle without negatively affecting the cache,
it does make sense to move towards it for at least a subset of pages.

The only potential negative could be higher power consumption,
because idle then spends less time in C-states. That said, that
also seems like a wash, given that this only shifts when we do
the clearing.
Would you have any intuition on whether the power consumption of
the non-temporal primitives is meaningfully different from that of
REP STOS and friends?

Ankur

> 
> Jan
> 



* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-09 21:01     ` Ankur Arora
@ 2021-04-12  9:15       ` Jan Beulich
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Beulich @ 2021-04-12  9:15 UTC (permalink / raw)
  To: Ankur Arora; +Cc: andrew.cooper3, roger.pau, xen-devel

On 09.04.2021 23:01, Ankur Arora wrote:
> On 2021-04-08 11:38 p.m., Jan Beulich wrote:
>> On 09.04.2021 08:08, Ankur Arora wrote:
>>> On 4/8/2021 6:58 AM, Jan Beulich wrote:
>>>> The present copy_page() is the way it is because of the desire to
>>>> avoid disturbing the cache. The effect of REP STOS on the L1 cache
>>>> (compared to the present use of MOVNTI) is more or less noticable on
>>>> all hardware, and at least on Intel hardware more noticable when the
>>>> cache starts out clean. For L2 the results are more mixed when
>>>> comparing cache-clean and cache-filled cases, but the difference
>>>> between MOVNTI and REP STOS remains or (at least on Zen2 and older
>>>> Intel hardware) becomes more prominent.
>>>
>>> Could you give me any pointers on the cache-effects on this? This
>>> obviously makes sense but I couldn't come up with any benchmarks
>>> which would show this in a straight-forward fashion.
>>
>> No benchmarks in that sense, but a local debugging patch measuring
>> things before bringing up APs, to have a reasonably predictable
>> environment. I have attached it for your reference.
> 
> Thanks, that does look like a pretty good predictable test.
> (Btw, there might be an oversight in the clear_page_clzero() logic.
> I believe that also needs an sfence.)

Oh, good point.

> Just curious: you had commented out the local irq disable/enable clauses.
> Is that because you decided that it the code ran at an early enough
> point that they were not required or some other reason?

It's not so much "early enough to not be required" but "too early to
be valid to enable interrupts". And then I didn't want to switch to
save/restore, so left them just as comments.

> Would you have any intuition on, if the power consumption of
> the non-temporal primitives is meaningfully different from
> REP STOS and friends?

If power can be saved when caches don't get modified (no idea if
that's possible, as the cached data still needs to be kept intact),
then non-temporal stores might be better.

Jan



* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-08 13:58 x86: memset() / clear_page() / page scrubbing Jan Beulich
  2021-04-09  6:08 ` Ankur Arora
@ 2021-04-13 13:17 ` Andrew Cooper
  2021-04-14  8:12   ` Jan Beulich
  1 sibling, 1 reply; 9+ messages in thread
From: Andrew Cooper @ 2021-04-13 13:17 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Roger Pau Monné

On 08/04/2021 14:58, Jan Beulich wrote:
> All,
>
> since over the years we've been repeatedly talking of changing the
> implementation of these fundamental functions, I've taken some time
> to do some measurements (just for possible clear_page() alternatives
> to keep things manageable). I'm not sure I want to spend as much time
> subsequently on memcpy() / copy_page() (or more, because there are
> yet more combinations of arguments to consider), so for the moment I
> think the route we're going to pick here is going to more or less
> also apply to those.
>
> The present copy_page() is the way it is because of the desire to
> avoid disturbing the cache. The effect of REP STOS on the L1 cache
> (compared to the present use of MOVNTI) is more or less noticable on
> all hardware, and at least on Intel hardware more noticable when the
> cache starts out clean. For L2 the results are more mixed when
> comparing cache-clean and cache-filled cases, but the difference
> between MOVNTI and REP STOS remains or (at least on Zen2 and older
> Intel hardware) becomes more prominent.
>
> Otoh REP STOS, as was to be expected, in most cases has meaningfully
> lower latency than MOVNTI.
>
> Because I was curious I also included AVX (32-byte stores), AVX512
> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
> clear win except on the vendors' first generations implementing it
> (but I've left out any playing with CR0.TS, which is what I expect
> would take this out as an option), AVX512 isn't on Skylake (perhaps
> newer hardware does better). CLZERO has slightly higher impact on
> L1 than MOVNTI, but lower than REP STOS. Its latency is between
> both when the caches are warm, and better than both when the caches
> are cold.
>
> Therefore I think that we want to distinguish page clearing (where
> we care about latency) from (background) page scrubbing (where I
> think the goal ought to be to avoid disturbing the caches). That
> would make it
> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>   synchronous scrubbing),
> - MOVNTI for scrub_page() (when done from idle context), unless
>   CLZERO is available.
> Whether in addition we should take into consideration activity of
> other (logical) CPUs sharing caches I don't know - this feels like
> it could get complex pretty quickly.
>
> For memset() we already simply use REP STOSB. I don't see a strong
> need to change that, but it may be worth to consider bringing it
> closer to memcpy() - try to do the main chunk with REP STOS{L,Q}.
> They perform somewhat better in a number of cases (including when
> ERMS is advertised, i.e. on my Haswell and Skylake, which isn't
> what I would have expected). We may want to put the whole thing in
> a .S file though, seeing that the C function right now consists of
> little more than an asm().
>
> For memcpy() I'm inclined to suggest that we simply use REP MOVSB
> on ERMS hardware, and stay with what we have everywhere else.
>
> copy_page() (or really copy_domain_page()) doesn't have many uses,
> so I'm not sure how worthwhile it is to do much optimization there.
> It might be an option to simply expand it to memcpy(), like Arm
> does.
>
> Looking forward, on CPUs having "Fast Short REP CMPSB/SCASB" we
> may want to figure out whether using these for strlen(), strcmp(),
> strchr(), memchr(), and/or memcmp() would be a win.
>
> Thoughts anyone, before I start creating actual patches?

Do you have actual numbers from these experiments?  I've seen your patch
from the thread, but at a minimum it's missing some hunks adding new
CPUID bits.  I do worry however whether the testing is likely to be
realistic for non-idle scenarios.

It is very little surprise that AVX-512 on Skylake is poor.  The
frequency hit from using %zmm is staggering.  IceLake is expected to be
better, but almost certainly won't exceed REP MOVSB, which is optimised
in microcode for the data width of the CPU.

For memset(), please don't move in the direction of memcpy().  memcpy()
is problematic because the common case is likely to be a multiple of 8
bytes, meaning that we feed 0 into the trailing REP MOVSB, and this is
a hit worth avoiding.  The "Fast Zero length $FOO" bits on future parts indicate
when passing %ecx=0 is likely to be faster than branching around the
invocation.

With ERMS/etc, our logic should be REP MOVSB/STOSB only, without any
cleverness about larger word sizes.  The Linux forms do this fairly well
already, and probably better than Xen, although there might be some room
for improvement IMO.

It is worth noting that we have extra variations of memset/memcpy where
__builtin_memcpy() gets expanded inline, and the result is a
compiler-chosen sequence, and doesn't hit any of our optimised
sequences.  I'm not sure what to do about this, because there is surely
a larger win from the cases which can be turned into a single mov, or an
elided store/copy, than using a potentially inefficient sequence in the
rare cases.  Maybe there is room for a fine-tuning option to say "just
call memset() if you're going to expand it inline".


For all set/copy operations, whether you want non-temporal or not
depends on when/where the lines are next going to be consumed.  Page
scrubbing in idle context is the only example I can think of where we
aren't plausibly going to consume the destination imminently.  Even
clear/copy page in a hypercall doesn't want to be non-temporal, because
chances are good that the vcpu is going to touch the page on return.

~Andrew




* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-13 13:17 ` Andrew Cooper
@ 2021-04-14  8:12   ` Jan Beulich
  2021-04-15 16:21     ` Andrew Cooper
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2021-04-14  8:12 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, xen-devel

[-- Attachment #1: Type: text/plain, Size: 5516 bytes --]

On 13.04.2021 15:17, Andrew Cooper wrote:
> Do you have actual numbers from these experiments?

Attached is the collected raw output from a number of systems.

>  I've seen your patch
> from the thread, but at a minimum its missing some hunks adding new
> CPUID bits.

It's not missing hunks - these additions are in a prereq patch that
I meant to post together with whatever this analysis would lead to.
If you think I should submit the prereqs ahead of time, I can of
course do so.

>  I do worry however whether the testing is likely to be
> realistic for non-idle scenarios.

Of course it's not going to be - in non-idle scenarios we'll always
be somewhere in the middle. Therefore I wanted to have numbers at
the edges (hot and cold cache respectively), as any other numbers
are going to be much harder to obtain in a way that they would
actually be meaningful (and hence reasonably stable).

> It is very little surprise that AVX-512 on Skylake is poor.  The
> frequency hit from using %zmm is staggering.  IceLake is expected to be
> better, but almost certainly won't exceed REP MOVSB, which is optimised
> in microcode for the data width of the CPU.

Right, much like AVX has improved but didn't get anywhere near
REP MOVS.

> For memset(), please don't move in the direction of memcpy().  memcpy()
> is problematic because the common case is likely to be a multiple of 8
> bytes, meaning that we feed 0 into the REP MOVSB, and this a hit wanting
> avoiding.

And you say this despite me having pointed out that REP STOSL may
be faster in a number of cases? Or do you mean to suggest we should
branch around the trailing REP {MOV,STO}SB?

>  The "Fast Zero length $FOO" bits on future parts indicate
> when passing %ecx=0 is likely to be faster than branching around the
> invocation.

IOW down the road we could use alternatives patching to remove such
branches. But this of course is only if we don't end up using
exclusively REP MOVSB / REP STOSB there anyway, as you seem to be
suggesting ...

> With ERMS/etc, our logic should be a REP MOVSB/STOSB only, without any
> cleverness about larger word sizes.  The Linux forms do this fairly well
> already, and probably better than Xen, although there might be some room
> for improvement IMO.

... here.

As to the Linux implementations - for memcpy_erms() I don't think
I see any room for improvement in the function itself. We could do
alternatives patching somewhat differently (and I probably would).
For memset_erms() the tiny bit of improvement over Linux's code
that I would see is to avoid the partial register access when
loading %al. But to be honest - in both cases I wouldn't have
bothered looking at their code anyway, if you hadn't pointed me
there.
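
Just to illustrate what I mean (sketch only, illustrative name):
zero-extend the fill value instead of Linux's MOVB into %al, e.g.

    void *memset_erms_sketch(void *dst, int c, unsigned long n)
    {
        void *d = dst;

        /* Full-width zero-extending load avoids the partial %al write. */
        asm volatile ( "movzbl %b2, %%eax\n\t"
                       "rep stosb"
                       : "+D" (d), "+c" (n)
                       : "q" (c)
                       : "rax", "memory" );

        return dst;
    }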

> It is worth nothing that we have extra variations of memset/memcpy where
> __builtin_memcpy() gets expanded inline, and the result is a
> compiler-chosen sequence, and doesn't hit any of our optimised
> sequences.  I'm not sure what to do about this, because there is surely
> a larger win from the cases which can be turned into a single mov, or an
> elided store/copy, than using a potentially inefficient sequence in the
> rare cases.  Maybe there is room for a fine-tuning option to say "just
> call memset() if you're going to expand it inline".

You mean "just call memset() instead of expanding it inline"?

If the inline expansion is merely REP STOS, I'm not sure we'd
actually gain anything from keeping the compiler from expanding it
inline. But if the inline construct was more complicated (as I
observe e.g. in map_vcpu_info() with gcc 10), then it would likely
be nice if there was such a control. I'll take note to see if I
can find anything.

But this isn't relevant for {clear,copy}_page().

> For all set/copy operations, whether you want non-temporal or not
> depends on when/where the lines are next going to be consumed.  Page
> scrubbing in idle context is the only example I can think of where we
> aren't plausibly going to consume the destination imminently.  Even
> clear/copy page in a hypercall doesn't want to be non-temporal, because
> chances are good that the vcpu is going to touch the page on return.

I'm afraid the situation isn't as black-and-white. Take HAP or
IOMMU page table allocations, for example: They need to clear the
full page, yes. But often this is just to then insert one single
entry, i.e. re-use exactly one of the cache lines. Or take initial
population of guest RAM: The larger the guest, the less likely it
is for every individual page to get accessed again before its
contents get evicted from the caches. Judging from what Ankur said,
once we get to around L3 capacity, MOVNT / CLZERO may be preferable
there.

I think in cases where we don't know how the page is going to be
used subsequently, we ought to favor latency over cache pollution
avoidance. But in cases where we know the subsequent usage pattern,
we may want to direct scrubbing / zeroing accordingly. Yet of
course it's not very helpful that there's no way to avoid
polluting caches and still have reasonably low latency, so using
some heuristics may be unavoidable.

And of course another goal of mine would be to avoid double zeroing
of pages: When scrubbing uses clear_page() anyway, there's no point
in the caller then calling clear_page() again. IMO, just like we
have xzalloc(), we should also have MEMF_zero. Internally the page
allocator can know whether a page was already scrubbed, and it
does know for sure whether scrubbing means zeroing.

Jan

[-- Attachment #2: xen-clear-page.txt --]
[-- Type: text/plain, Size: 18093 bytes --]

Aorus (Skylake):

(XEN) erms=1 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=32k l2=1024k
(XEN) L1 w/o flush:
(XEN)  pre=5aa sse2=17c8 post=466
(XEN)  pre=302 stosb=544 post=6f2
(XEN)  pre=2f6 stosl=4de post=500
(XEN)  pre=308 stosq=4bc post=4b6
(XEN)  pre=300 avx=14d4 post=2fa
(XEN)  pre=2ea avx512=11ca post=300
(XEN)  pre=32c sse2=1620 post=330
(XEN)  pre=326 stosb=55a post=4b0
(XEN)  pre=332 stosl=4f2 post=4a2
(XEN)  pre=336 stosq=4ec post=47c
(XEN)  pre=332 avx=14f4 post=324
(XEN)  pre=3a2 avx512=1204 post=35c
(XEN)  pre=322 sse2=1606 post=330
(XEN)  pre=324 stosb=564 post=466
(XEN)  pre=31e stosl=4f8 post=49c
(XEN)  pre=322 stosq=4fa post=3e0
(XEN)  pre=340 avx=14f6 post=328
(XEN)  pre=326 avx512=120c post=322
(XEN) L1 w/ flush:
(XEN)  pre=2e4 sse2=c00 post=3e6
(XEN)  pre=34c stosb=916 post=722
(XEN)  pre=358 stosl=908 post=7b4
(XEN)  pre=360 stosq=a72 post=732
(XEN)  pre=33e avx=b3c post=33c
(XEN)  pre=348 avx512=a38 post=342
(XEN)  pre=342 sse2=c24 post=33e
(XEN)  pre=34e stosb=998 post=77c
(XEN)  pre=352 stosl=910 post=6e4
(XEN)  pre=356 stosq=94c post=74a
(XEN)  pre=334 avx=b44 post=332
(XEN)  pre=36e avx512=bca post=336
(XEN)  pre=356 sse2=c1a post=336
(XEN)  pre=35c stosb=92a post=6f0
(XEN)  pre=32e stosl=970 post=864
(XEN)  pre=358 stosq=94c post=756
(XEN)  pre=344 avx=b4c post=326
(XEN)  pre=34c avx512=a5c post=372
(XEN) L2 w/o flush:
(XEN)  pre=15f7c sse2=2eff8 post=c272
(XEN)  pre=cf8c stosb=cbf6 post=c6a4
(XEN)  pre=ce5c stosl=cc7e post=c6bc
(XEN)  pre=d3b6 stosq=7f5e6 post=d898
(XEN)  pre=cf56 avx=2d7de post=be1a
(XEN)  pre=cfe6 avx512=349c6 post=caf8
(XEN)  pre=dcee sse2=2f93e post=c97e
(XEN)  pre=dd6e stosb=d000 post=d102
(XEN)  pre=dad0 stosl=d034 post=d12e
(XEN)  pre=db00 stosq=d0ee post=d0b2
(XEN)  pre=dabc avx=2dec8 post=c830
(XEN)  pre=dc04 avx512=2dbbe post=c8aa
(XEN)  pre=db74 sse2=2f8e6 post=c89e
(XEN)  pre=dd4c stosb=d0a6 post=d16c
(XEN)  pre=da6c stosl=cfd0 post=d388
(XEN)  pre=d8c8 stosq=d054 post=d0b4
(XEN)  pre=db2e avx=2de78 post=cb3c
(XEN)  pre=d9ea avx512=2d9d6 post=c8f0
(XEN) L2 w/ flush:
(XEN)  pre=16000 sse2=16cf2 post=bfc4
(XEN)  pre=1604c stosb=12ab8 post=c66c
(XEN)  pre=16054 stosl=12624 post=c7a6
(XEN)  pre=16008 stosq=127b4 post=c54e
(XEN)  pre=15f7c avx=15a98 post=bd50
(XEN)  pre=16046 avx512=15760 post=13c52
(XEN)  pre=15f8a sse2=16dc0 post=bfb8
(XEN)  pre=15fb4 stosb=1293a post=c6da
(XEN)  pre=15f7c stosl=12672 post=c574
(XEN)  pre=15fee stosq=1245e post=c6fe
(XEN)  pre=15fc8 avx=15aae post=c01c
(XEN)  pre=1608c avx512=1ca32 post=c9ce
(XEN)  pre=15fba sse2=16cdc post=c076
(XEN)  pre=15ffe stosb=12992 post=c9b0
(XEN)  pre=16050 stosl=1290a post=c53e
(XEN)  pre=16002 stosq=129a6 post=c540
(XEN)  pre=15f98 avx=159ee post=bc50
(XEN)  pre=15fca avx512=159bc post=13d9a


Rome:

(XEN) erms=0 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=32k l2=512k
(XEN) L1 w/o flush:
(XEN)  pre=4c4 sse2=eec post=384
(XEN)  pre=35c stosb=230 post=398
(XEN)  pre=35c stosl=230 post=3d4
(XEN)  pre=35c stosq=258 post=410
(XEN)  pre=370 avx=dd4 post=370
(XEN)  pre=370 clzero=758 post=35c
(XEN)  pre=35c sse2=e10 post=370
(XEN)  pre=35c stosb=21c post=370
(XEN)  pre=35c stosl=230 post=3fc
(XEN)  pre=35c stosq=21c post=3ac
(XEN)  pre=35c avx=d98 post=35c
(XEN)  pre=35c clzero=758 post=35c
(XEN)  pre=35c sse2=e24 post=35c
(XEN)  pre=35c stosb=21c post=3d4
(XEN)  pre=35c stosl=21c post=3d4
(XEN)  pre=370 stosq=21c post=3ac
(XEN)  pre=35c avx=d84 post=35c
(XEN)  pre=35c clzero=758 post=370
(XEN) L1 w/ flush:
(XEN)  pre=438 sse2=a50 post=35c
(XEN)  pre=438 stosb=d34 post=398
(XEN)  pre=438 stosl=d0c post=384
(XEN)  pre=438 stosq=aa0 post=384
(XEN)  pre=44c avx=924 post=370
(XEN)  pre=44c clzero=5f0 post=370
(XEN)  pre=438 sse2=a50 post=35c
(XEN)  pre=438 stosb=c30 post=398
(XEN)  pre=44c stosl=d20 post=3c0
(XEN)  pre=438 stosq=b04 post=370
(XEN)  pre=438 avx=938 post=370
(XEN)  pre=44c clzero=6f4 post=35c
(XEN)  pre=438 sse2=a3c post=35c
(XEN)  pre=44c stosb=adc post=384
(XEN)  pre=438 stosl=aa0 post=3c0
(XEN)  pre=44c stosq=a3c post=370
(XEN)  pre=438 avx=924 post=35c
(XEN)  pre=438 clzero=5c8 post=370
(XEN) L2 w/o flush:
(XEN)  pre=670c sse2=fe88 post=108ec
(XEN)  pre=6e28 stosb=24cc post=14500
(XEN)  pre=7120 stosl=2468 post=14d0c
(XEN)  pre=7490 stosq=247c post=1507c
(XEN)  pre=7a6c avx=fc6c post=119cc
(XEN)  pre=72b0 clzero=73f0 post=118b4
(XEN)  pre=7184 sse2=fdfc post=11e2c
(XEN)  pre=6f04 stosb=247c post=14b90
(XEN)  pre=7288 stosl=2530 post=15054
(XEN)  pre=75d0 stosq=24a4 post=15b30
(XEN)  pre=6fe0 avx=fc94 post=11864
(XEN)  pre=7198 clzero=74cc post=11d50
(XEN)  pre=751c sse2=fdfc post=121b0
(XEN)  pre=7350 stosb=24cc post=15360
(XEN)  pre=6e64 stosl=24b8 post=14f00
(XEN)  pre=7738 stosq=2440 post=14a8c
(XEN)  pre=6f90 avx=fcf8 post=11bc0
(XEN)  pre=729c clzero=747c post=11ae4
(XEN) L2 w/ flush:
(XEN)  pre=580c sse2=a870 post=10554
(XEN)  pre=5744 stosb=9c7c post=152ac
(XEN)  pre=5924 stosl=9a24 post=15c48
(XEN)  pre=56cc stosq=9df8 post=157fc
(XEN)  pre=5898 avx=a640 post=10338
(XEN)  pre=5974 clzero=69dc post=10dec
(XEN)  pre=5be0 sse2=a870 post=10ba8
(XEN)  pre=57a8 stosb=9ed4 post=15a40
(XEN)  pre=594c stosl=9d6c post=16198
(XEN)  pre=5438 stosq=9dd0 post=15860
(XEN)  pre=57d0 avx=a49c post=10b80
(XEN)  pre=52bc clzero=69dc post=f938
(XEN)  pre=56e0 sse2=ab54 post=10b08
(XEN)  pre=5654 stosb=9f88 post=1584c
(XEN)  pre=5654 stosl=a014 post=14ab4
(XEN)  pre=58c0 stosq=9a38 post=15dc4
(XEN)  pre=57a8 avx=a640 post=10c0c
(XEN)  pre=5618 clzero=69dc post=10554


Precision 7810 (Haswell):

(XEN) erms=1 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=32k l2=256k
(XEN) L1 w/o flush:
(XEN)  pre=618 sse2=1324 post=41c
(XEN)  pre=3c4 stosb=6fc post=74c
(XEN)  pre=3ac stosl=6c4 post=728
(XEN)  pre=39c stosq=6b0 post=720
(XEN)  pre=3ac avx=df4 post=3e4
(XEN)  pre=38c sse2=f4c post=3a8
(XEN)  pre=38c stosb=6e4 post=748
(XEN)  pre=390 stosl=698 post=6f8
(XEN)  pre=380 stosq=6ac post=6ec
(XEN)  pre=3a4 avx=e28 post=3a8
(XEN)  pre=384 sse2=f50 post=374
(XEN)  pre=398 stosb=6ec post=6d4
(XEN)  pre=380 stosl=69c post=700
(XEN)  pre=3b8 stosq=698 post=6cc
(XEN)  pre=394 avx=e64 post=390
(XEN) L1 w/ flush:
(XEN)  pre=49c sse2=109c post=380
(XEN)  pre=480 stosb=1c08 post=864
(XEN)  pre=4d0 stosl=1bc8 post=820
(XEN)  pre=488 stosq=1bb8 post=834
(XEN)  pre=3ac avx=ddc post=388
(XEN)  pre=498 sse2=ef8 post=384
(XEN)  pre=474 stosb=1cb0 post=85c
(XEN)  pre=4a4 stosl=1bc4 post=85c
(XEN)  pre=47c stosq=1bcc post=828
(XEN)  pre=480 avx=df0 post=38c
(XEN)  pre=498 sse2=f08 post=370
(XEN)  pre=480 stosb=1ed4 post=880
(XEN)  pre=47c stosl=1bb0 post=848
(XEN)  pre=48c stosq=1ba0 post=850
(XEN)  pre=488 avx=de4 post=394
(XEN) L2 w/o flush:
(XEN)  pre=6450 sse2=7f78 post=39c8
(XEN)  pre=5478 stosb=3ab8 post=4b74
(XEN)  pre=4f68 stosl=3978 post=4d84
(XEN)  pre=4ca0 stosq=395c post=4e60
(XEN)  pre=52b4 avx=7974 post=3c84
(XEN)  pre=4fa8 sse2=7f24 post=3a80
(XEN)  pre=5118 stosb=3ad8 post=4e18
(XEN)  pre=4df0 stosl=3908 post=4ce8
(XEN)  pre=5028 stosq=396c post=4ef0
(XEN)  pre=5110 avx=7968 post=3ba4
(XEN)  pre=5088 sse2=7f20 post=3b1c
(XEN)  pre=4db8 stosb=3908 post=4ec4
(XEN)  pre=4eb4 stosl=3a00 post=4c00
(XEN)  pre=4f90 stosq=3970 post=4d98
(XEN)  pre=4f3c avx=7950 post=3a78
(XEN) L2 w/ flush:
(XEN)  pre=6380 sse2=786c post=3948
(XEN)  pre=6400 stosb=10478 post=4740
(XEN)  pre=6430 stosl=10564 post=46cc
(XEN)  pre=6430 stosq=10608 post=46c4
(XEN)  pre=6498 avx=7548 post=3978
(XEN)  pre=6418 sse2=7868 post=3934
(XEN)  pre=6350 stosb=10988 post=4798
(XEN)  pre=6410 stosl=10508 post=4678
(XEN)  pre=63dc stosq=105a8 post=46fc
(XEN)  pre=6500 avx=7564 post=39d0
(XEN)  pre=63b0 sse2=7890 post=397c
(XEN)  pre=648c stosb=10868 post=47f0
(XEN)  pre=64a0 stosl=106f4 post=46b4
(XEN)  pre=646c stosq=10468 post=4734
(XEN)  pre=63ec avx=75c4 post=3938


Dinar:

(XEN) erms=0 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=16k l2=2048k
(XEN) L1 w/o flush:
(XEN)  pre=7e6 sse2=1c06 post=79d
(XEN)  pre=70a stosb=668 post=84f
(XEN)  pre=6dc stosl=676 post=83f
(XEN)  pre=6cf stosq=65b post=872
(XEN)  pre=6e0 avx=1a84 post=706
(XEN)  pre=709 sse2=19aa post=6ce
(XEN)  pre=6b7 stosb=601 post=844
(XEN)  pre=6e8 stosl=613 post=85e
(XEN)  pre=6a1 stosq=614 post=824
(XEN)  pre=6b9 avx=1a66 post=695
(XEN)  pre=6e2 sse2=199b post=6af
(XEN)  pre=6e7 stosb=602 post=839
(XEN)  pre=6cc stosl=61b post=845
(XEN)  pre=6ad stosq=607 post=815
(XEN)  pre=6ac avx=1a81 post=693
(XEN) L1 w/ flush:
(XEN)  pre=804 sse2=c48 post=6da
(XEN)  pre=7ca stosb=e16 post=82b
(XEN)  pre=7a3 stosl=ef0 post=81e
(XEN)  pre=7d7 stosq=dde post=829
(XEN)  pre=7ae avx=1562 post=6c0
(XEN)  pre=7c9 sse2=c3a post=6d8
(XEN)  pre=7ec stosb=db0 post=82b
(XEN)  pre=7f0 stosl=e3e post=84d
(XEN)  pre=7f1 stosq=de8 post=827
(XEN)  pre=7dd avx=157a post=6bd
(XEN)  pre=7d2 sse2=c49 post=6c4
(XEN)  pre=7a4 stosb=dfe post=848
(XEN)  pre=7ce stosl=e8c post=831
(XEN)  pre=7b3 stosq=daa post=81d
(XEN)  pre=7f8 avx=156b post=6d0
(XEN) L2 w/o flush:
(XEN)  pre=5e24f sse2=7ff69 post=40af6
(XEN)  pre=3c515 stosb=4ddc7 post=9f3bf
(XEN)  pre=3cfb9 stosl=4da3c post=9efcb
(XEN)  pre=3bc5c stosq=4dbd3 post=9ec1c
(XEN)  pre=3c927 avx=a6cc0 post=42aa1
(XEN)  pre=3cf6d sse2=7fe95 post=4223d
(XEN)  pre=3c55f stosb=4e035 post=9f25d
(XEN)  pre=3cd63 stosl=4dd8b post=9f14f
(XEN)  pre=3b8d3 stosq=4de1f post=9f050
(XEN)  pre=3c66f avx=a6cad post=43886
(XEN)  pre=3c990 sse2=7feb9 post=42a6d
(XEN)  pre=3c1a0 stosb=4dd45 post=9f04a
(XEN)  pre=3d0ae stosl=4de64 post=9f02b
(XEN)  pre=3c0ae stosq=4d9dc post=9edb8
(XEN)  pre=3d0b4 avx=a6c97 post=41e67
(XEN) L2 w/ flush:
(XEN)  pre=39194 sse2=55efd post=3a2a9
(XEN)  pre=391cf stosb=5a8bc post=95a1d
(XEN)  pre=3913c stosl=5a5a7 post=8fced
(XEN)  pre=3938b stosq=5a68b post=967d4
(XEN)  pre=38232 avx=9d328 post=3a4fe
(XEN)  pre=393a6 sse2=56027 post=3a2fe
(XEN)  pre=3917a stosb=59f3f post=9518a
(XEN)  pre=390c2 stosl=5a0f3 post=951bc
(XEN)  pre=3922e stosq=5a7f6 post=952db
(XEN)  pre=39443 avx=9d407 post=3a4c4
(XEN)  pre=38635 sse2=55fb8 post=3a557
(XEN)  pre=38237 stosb=5a2fb post=92f3a
(XEN)  pre=3914e stosl=5a8e5 post=8bb48
(XEN)  pre=39058 stosq=5a5dc post=96726
(XEN)  pre=3913c avx=9d33d post=3a2d1


Romley (Sandybridge):

(XEN) erms=0 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=32k l2=256k
(XEN) L1 w/o flush:
(XEN)  pre=954 sse2=2958 post=798
(XEN)  pre=792 stosb=e7c post=af2
(XEN)  pre=732 stosl=b70 post=b28
(XEN)  pre=768 stosq=bdc post=ac2
(XEN)  pre=74a avx=26ac post=750
(XEN)  pre=774 sse2=27d2 post=708
(XEN)  pre=738 stosb=e4c post=ada
(XEN)  pre=714 stosl=b22 post=a98
(XEN)  pre=732 stosq=b34 post=ac2
(XEN)  pre=714 avx=2730 post=714
(XEN)  pre=714 sse2=27d8 post=70e
(XEN)  pre=72c stosb=e3a post=ab0
(XEN)  pre=714 stosl=b04 post=a74
(XEN)  pre=732 stosq=b04 post=a92
(XEN)  pre=714 avx=4fc8 post=714
(XEN) L1 w/ flush:
(XEN)  pre=7c8 sse2=2784 post=708
(XEN)  pre=72c stosb=2100 post=ca8
(XEN)  pre=80a stosl=1ed2 post=c1e
(XEN)  pre=7f2 stosq=2052 post=c90
(XEN)  pre=714 avx=2652 post=714
(XEN)  pre=7d4 sse2=2772 post=732
(XEN)  pre=7c8 stosb=2466 post=be2
(XEN)  pre=828 stosl=2004 post=c72
(XEN)  pre=7d4 stosq=20b2 post=c96
(XEN)  pre=81c avx=2682 post=714
(XEN)  pre=7d4 sse2=2754 post=72c
(XEN)  pre=7c8 stosb=2358 post=bca
(XEN)  pre=828 stosl=1ecc post=c48
(XEN)  pre=7c8 stosq=20b8 post=c00
(XEN)  pre=81c avx=26f4 post=714
(XEN) L2 w/o flush:
(XEN)  pre=9cf6 sse2=14b9e post=5706
(XEN)  pre=7ce0 stosb=6f00 post=74a6
(XEN)  pre=78ea stosl=5e26 post=79c8
(XEN)  pre=7926 stosq=5ec2 post=7848
(XEN)  pre=7920 avx=1410c post=5c70
(XEN)  pre=7bde sse2=14a06 post=5dea
(XEN)  pre=7ab2 stosb=6dda post=78c0
(XEN)  pre=7a6a stosl=5f34 post=792c
(XEN)  pre=7752 stosq=6054 post=7bfc
(XEN)  pre=7974 avx=14172 post=5de4
(XEN)  pre=7a76 sse2=14a54 post=5dc0
(XEN)  pre=77d6 stosb=6cd8 post=779a
(XEN)  pre=774c stosl=5dcc post=7c38
(XEN)  pre=788a stosq=5e62 post=7a04
(XEN)  pre=7722 avx=16aca post=5e2c
(XEN) L2 w/ flush:
(XEN)  pre=9cea sse2=14172 post=571e
(XEN)  pre=9c3c stosb=113e2 post=6d50
(XEN)  pre=9d56 stosl=10926 post=6ca8
(XEN)  pre=9ca2 stosq=10950 post=6db6
(XEN)  pre=9d44 avx=13b06 post=5700
(XEN)  pre=9df8 sse2=141cc post=56a6
(XEN)  pre=9cc0 stosb=112a4 post=6ca8
(XEN)  pre=9d50 stosl=109c8 post=6ca2
(XEN)  pre=9c84 stosq=10a10 post=6cf0
(XEN)  pre=9c84 avx=13b30 post=56e8
(XEN)  pre=9cde sse2=141ea post=579c
(XEN)  pre=9c7e stosb=11370 post=6c2a
(XEN)  pre=9d44 stosl=108de post=6c3c
(XEN)  pre=9bf4 stosq=1096e post=6ccc
(XEN)  pre=9c7e avx=13b18 post=56ac


Westmere:

(XEN) erms=0 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=32k l2=256k
(XEN) L1 w/o flush:
(XEN)  pre=1184 sse2=2058 post=c60
(XEN)  pre=ad4 stosb=b60 post=1a24
(XEN)  pre=9d4 stosl=874 post=1348
(XEN)  pre=9e8 stosq=8d4 post=dd0
(XEN)  pre=9dc sse2=1dfc post=9e8
(XEN)  pre=9e8 stosb=a6c post=da4
(XEN)  pre=9d4 stosl=854 post=dd4
(XEN)  pre=9e8 stosq=8a4 post=d3c
(XEN)  pre=9d8 sse2=1e1c post=9ec
(XEN)  pre=9e8 stosb=a44 post=cc8
(XEN)  pre=9d4 stosl=81c post=d0c
(XEN)  pre=9ec stosq=810 post=cc8
(XEN) L1 w/ flush:
(XEN)  pre=b18 sse2=196c post=a84
(XEN)  pre=b08 stosb=15b8 post=116c
(XEN)  pre=b10 stosl=1440 post=163c
(XEN)  pre=a48 stosq=13d8 post=13b4
(XEN)  pre=b1c sse2=199c post=a3c
(XEN)  pre=bb8 stosb=15c4 post=12e8
(XEN)  pre=b0c stosl=1324 post=1430
(XEN)  pre=a48 stosq=135c post=12c4
(XEN)  pre=b1c sse2=199c post=a3c
(XEN)  pre=b18 stosb=1818 post=1320
(XEN)  pre=b10 stosl=1324 post=11bc
(XEN)  pre=a48 stosq=135c post=122c
(XEN) L2 w/o flush:
(XEN)  pre=8e20 sse2=f490 post=504c
(XEN)  pre=77a4 stosb=7804 post=6854
(XEN)  pre=778c stosl=7280 post=636c
(XEN)  pre=7594 stosq=7234 post=60c8
(XEN)  pre=70bc sse2=f3c4 post=55e0
(XEN)  pre=7014 stosb=77e8 post=5f68
(XEN)  pre=73f8 stosl=7264 post=62b8
(XEN)  pre=72ec stosq=7208 post=62fc
(XEN)  pre=6d80 sse2=f370 post=51a0
(XEN)  pre=6e34 stosb=7804 post=5f84
(XEN)  pre=7058 stosl=723c post=5fb8
(XEN)  pre=6f1c stosq=725c post=6188
(XEN) L2 w/ flush:
(XEN)  pre=8e48 sse2=cbc4 post=5034
(XEN)  pre=8d5c stosb=999c post=58d0
(XEN)  pre=8da0 stosl=912c post=590c
(XEN)  pre=8c10 stosq=8f80 post=5a0c
(XEN)  pre=8e10 sse2=cbd0 post=5030
(XEN)  pre=8cb0 stosb=9878 post=5960
(XEN)  pre=8de4 stosl=9060 post=58e4
(XEN)  pre=8c0c stosq=8fa0 post=5a10
(XEN)  pre=8d4c sse2=cbd0 post=502c
(XEN)  pre=8cf8 stosb=9834 post=58f0
(XEN)  pre=8de4 stosl=90d0 post=58e4
(XEN)  pre=8c0c stosq=9178 post=5998


Latitude E6410 (Sandybridge):

(XEN) erms=0 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=32k l2=256k
(XEN) L1 w/o flush:
(XEN)  pre=68d sse2=3c06 post=460
(XEN)  pre=41f stosb=8a0 post=823
(XEN)  pre=413 stosl=6ae post=789
(XEN)  pre=413 stosq=6e3 post=78f
(XEN)  pre=413 sse2=3989 post=410
(XEN)  pre=422 stosb=81d post=771
(XEN)  pre=413 stosl=675 post=77d
(XEN)  pre=3f9 stosq=667 post=6fb
(XEN)  pre=437 sse2=38b7 post=416
(XEN)  pre=407 stosb=802 post=727
(XEN)  pre=407 stosl=65e post=754
(XEN)  pre=404 stosq=65b post=6ef
(XEN) L1 w/ flush:
(XEN)  pre=5b4 sse2=20ca post=433
(XEN)  pre=55f stosb=15a2 post=861
(XEN)  pre=565 stosl=1252 post=861
(XEN)  pre=559 stosq=1444 post=84d
(XEN)  pre=57c sse2=21ae post=436
(XEN)  pre=55f stosb=157e post=897
(XEN)  pre=56d stosl=1255 post=83b
(XEN)  pre=657 stosq=1282 post=86d
(XEN)  pre=565 sse2=21bd post=43a
(XEN)  pre=57c stosb=153d post=88d
(XEN)  pre=56b stosl=1247 post=87c
(XEN)  pre=573 stosq=1258 post=87f
(XEN) L2 w/o flush:
(XEN)  pre=602b sse2=1d4d4 post=3669
(XEN)  pre=4a6c stosb=4b79 post=44b4
(XEN)  pre=4976 stosl=4383 post=48d0
(XEN)  pre=4d95 stosq=435d post=47ba
(XEN)  pre=4bf3 sse2=1d333 post=39f7
(XEN)  pre=4bed stosb=4b3c post=4671
(XEN)  pre=5003 stosl=435d post=4de8
(XEN)  pre=4f0a stosq=4377 post=4874
(XEN)  pre=4d1e sse2=1d368 post=3e6e
(XEN)  pre=4f25 stosb=4b4a post=47a5
(XEN)  pre=4abf stosl=4316 post=47cc
(XEN)  pre=4f19 stosq=4351 post=48bb
(XEN) L2 w/ flush:
(XEN)  pre=60cb sse2=10310 post=3672
(XEN)  pre=60ce stosb=956c post=436b
(XEN)  pre=603d stosl=8a70 post=438f
(XEN)  pre=5fe1 stosq=876d post=442f
(XEN)  pre=60f8 sse2=103dc post=36aa
(XEN)  pre=6010 stosb=94db post=436e
(XEN)  pre=60fe stosl=8a7c post=43a7
(XEN)  pre=605b stosq=876d post=4473
(XEN)  pre=6093 sse2=10485 post=36b9
(XEN)  pre=604c stosb=93c4 post=43b3
(XEN)  pre=60b3 stosl=8c03 post=435f
(XEN)  pre=607e stosq=895c post=43fc


Tulsa (Fam0f Xeon (7100?)):

(XEN) erms=0 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=16k l2=1024k
(XEN) L1 w/o flush:
(XEN)  pre=caf sse2=3cd4 post=b39
(XEN)  pre=b28 stosb=192b post=1485
(XEN)  pre=b9f stosl=d7b post=d37
(XEN)  pre=b28 stosq=c6b post=c8d
(XEN)  pre=b17 sse2=3223 post=ae4
(XEN)  pre=a8f stosb=bd2 post=b4a
(XEN)  pre=a8f stosl=aa0 post=af5
(XEN)  pre=ab1 stosq=a8f post=be3
(XEN)  pre=ac2 sse2=3212 post=ae4
(XEN)  pre=a8f stosb=bc1 post=ae4
(XEN)  pre=aa0 stosl=a6d post=ad3
(XEN)  pre=aa0 stosq=a6d post=ae4
(XEN) L1 w/ flush:
(XEN)  pre=b06 sse2=628c post=c27
(XEN)  pre=ae4 stosb=958c post=14eb
(XEN)  pre=b17 stosl=959d post=16a5
(XEN)  pre=b06 stosq=9669 post=15d9
(XEN)  pre=ae4 sse2=6127 post=bc1
(XEN)  pre=a7e stosb=9636 post=15b7
(XEN)  pre=aa0 stosl=92e4 post=1474
(XEN)  pre=a8f stosq=95e1 post=161d
(XEN)  pre=ab1 sse2=62bf post=c05
(XEN)  pre=a8f stosb=96e0 post=180a
(XEN)  pre=a8f stosl=9702 post=15a6
(XEN)  pre=aa0 stosq=9438 post=15d9
(XEN) L2 w/o flush:
(XEN)  pre=5342d sse2=d49ed post=21c8c
(XEN)  pre=25498 stosb=69d4b post=315c5
(XEN)  pre=249b4 stosl=6982e post=3166f
(XEN)  pre=2470c stosq=69e28 post=30bbe
(XEN)  pre=23f25 sse2=d4536 post=1fb14
(XEN)  pre=23e26 stosb=6ad2a post=30a6a
(XEN)  pre=23fcf stosl=68ed1 post=2fd22
(XEN)  pre=23e48 stosq=69be6 post=308e3
(XEN)  pre=23e9d sse2=d4459 post=20c9c
(XEN)  pre=23f69 stosb=6a70e post=30c9b
(XEN)  pre=24035 stosl=69069 post=302e9
(XEN)  pre=258fa stosq=69aa3 post=30cbd
(XEN) L2 w/ flush:
(XEN)  pre=263cd sse2=1306a2 post=21bd1
(XEN)  pre=265fe stosb=27aeb2 post=3177f
(XEN)  pre=26466 stosl=27f516 post=311c9
(XEN)  pre=26004 stosq=27cb40 post=3153d
(XEN)  pre=25641 sse2=13031d post=21b16
(XEN)  pre=26411 stosb=27f57c post=31ecd
(XEN)  pre=262f0 stosl=27b5bc post=315a3
(XEN)  pre=25f49 stosq=27b974 post=312a6
(XEN)  pre=25520 sse2=1310ed post=21b38
(XEN)  pre=2617a stosb=27d107 post=314d7
(XEN)  pre=261ad stosl=27bd81 post=309f3
(XEN)  pre=25ff3 stosq=27dff8 post=3164d


* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-14  8:12   ` Jan Beulich
@ 2021-04-15 16:21     ` Andrew Cooper
  2021-04-21 13:55       ` Jan Beulich
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Cooper @ 2021-04-15 16:21 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, xen-devel

On 14/04/2021 09:12, Jan Beulich wrote:
> On 13.04.2021 15:17, Andrew Cooper wrote:
>> Do you have actual numbers from these experiments?
> Attached is the collected raw output from a number of systems.

Wow, Tulsa is vintage.  Is that new enough to have nonstop_tsc?

It's also quite possibly old enough to fail Linux's REP_GOOD check,
which is something we're missing in Xen.

>>   I've seen your patch
>> from the thread, but at a minimum its missing some hunks adding new
>> CPUID bits.
> It's not missing hunks - these additions are in a prereq patch that
> I meant to post together with whatever this analysis would lead to.
> If you think I should submit the prereqs ahead of time, I can of
> course do so.

Well - it's necessary for anyone wanting to compile your patch.

All the bits seem like they're ones we ought to mirror through to guests
irrespective of optimisations taken in Xen, so I don't see a problem
with such a patch going in stand-alone.

>>   I do worry however whether the testing is likely to be
>> realistic for non-idle scenarios.
> Of course it's not going to be - in non-idle scenarios we'll always
> be somewhere in the middle. Therefore I wanted to have numbers at
> the edges (hot and cold cache respectively), as any other numbers
> are going to be much harder to obtain in a way that they would
> actually be meaningful (and hence reasonably stable).

In non-idle scenarios, all numbers can easily be worse across the board
than as measured.

Cacheline snoops, and in particular repeated snoops during the
clear/copy operation, will cause far worse overheads than simply being
cache-cold to begin with.

>> It is very little surprise that AVX-512 on Skylake is poor.  The
>> frequency hit from using %zmm is staggering.  IceLake is expected to be
>> better, but almost certainly won't exceed REP MOVSB, which is optimised
>> in microcode for the data width of the CPU.
> Right, much like AVX has improved but didn't get anywhere near
> REP MOVS.

The other thing I forgot to mention is the legacy/VEX pipeline stall
penalty, which will definitely dwarf short operations, and on some CPUs
doesn't even amortise itself over a 4k operation.

Whether a vector algorithm suffers a stall or not typically depends on
the instructions last executed in guest context.

Furthermore, while on the current CPU you might manage to get a vector
algorithm to be faster than an integer one, you will be forcing a
frequency drop on every other CPU in the socket, and the net hit to the
system can be worse than just using the integer algorithm to begin with.


IMO, vector optimised algorithms are a minefield we don't want to go
wandering in, particularly as ERMS is common these days, and here to stay.

>> For memset(), please don't move in the direction of memcpy().  memcpy()
>> is problematic because the common case is likely to be a multiple of 8
>> bytes, meaning that we feed 0 into the REP MOVSB, and this is a hit wanting
>> avoiding.
> And you say this despite me having pointed out that REP STOSL may
> be faster in a number of cases? Or do you mean to suggest we should
> branch around the trailing REP {MOV,STO}SB?
>
>>   The "Fast Zero length $FOO" bits on future parts indicate
>> when passing %ecx=0 is likely to be faster than branching around the
>> invocation.
> IOW down the road we could use alternatives patching to remove such
> branches. But this of course is only if we don't end up using
> exclusively REP MOVSB / REP STOSB there anyway, as you seem to be
> suggesting ...
>
>> With ERMS/etc, our logic should be a REP MOVSB/STOSB only, without any
>> cleverness about larger word sizes.  The Linux forms do this fairly well
>> already, and probably better than Xen, although there might be some room
>> for improvement IMO.
> ... here.
>
> As to the Linux implementations - for memcpy_erms() I don't think
> I see any room for improvement in the function itself. We could do
> alternatives patching somewhat differently (and I probably would).
> For memset_erms() the tiny bit of improvement over Linux'es code
> that I would see is to avoid the partial register access when
> loading %al. But to be honest - in both cases I wouldn't have
> bothered looking at their code anyway, if you hadn't pointed me
> there.

Answering multiple of the points together.

Yes, the partial register access on %al was one thing I spotted, and
movzbl would be an improvement.  The alternatives are a bit weird, but
they're best as they are IMO.  It makes a useful enough difference to
backtraces/etc, and unconditional jmp's are about as close to free as
you can get these days.
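
As a concrete illustration (my sketch, not the Linux or Xen code; it
assumes size_t is in scope), the REP STOSB core can take the fill byte
as an already zero-extended value, so the compiler can emit a MOVZBL
rather than a partial write to %al:

    static inline void *memset_erms_sketch(void *s, int c, size_t n)
    {
        void *ret = s;

        /* Zero-extending the fill byte here avoids the partial %al access. */
        asm volatile ( "rep stosb"
                       : "+D" (s), "+c" (n)
                       : "a" ((unsigned int)(unsigned char)c)
                       : "memory" );

        return ret;
    }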

On an ERMS system, we want to use REP MOVSB unilaterally.  It is my
understanding that it is faster across the board than any algorithm
variation trying to use wider accesses.

For non-ERMS systems, the split MOVSQ/MOVSB is still a win, but my
expectation is that conditionally jumping over the latter MOVSB would be
a win.
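
A hedged sketch of that shape (my illustration, not the current Xen
memcpy(); size_t assumed in scope), with the conditional jump over the
trailing REP MOVSB written out:

    static void *memcpy_movs_sketch(void *dst, const void *src, size_t n)
    {
        void *ret = dst;
        size_t quads = n >> 3, rest = n & 7;

        asm volatile ( "rep movsq"
                       : "+D" (dst), "+S" (src), "+c" (quads)
                       :: "memory" );

        /*
         * On non-ERMS parts, branching over a zero-length REP MOVSB is
         * expected to be cheaper than executing it with a zero count.
         */
        if ( rest )
            asm volatile ( "rep movsb"
                           : "+D" (dst), "+S" (src), "+c" (rest)
                           :: "memory" );

        return ret;
    }

On hardware advertising the "fast zero length" bits discussed below, the
if() could in principle be patched out again via alternatives.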

The "Fast zero length" CPUID bits don't exist for no reason, and while
passing 0 into memcpy/cmp() is exceedingly rare - possibly non-existent
- and not worth optimising, passing a multiple of 8 in probably is worth
optimising.  (Obviously, this depends on the underlying mem*() functions
seeing a multiple of 8 for a meaningful number of their inputs, but I'd
expect this to be the case).

>> It is worth noting that we have extra variations of memset/memcpy where
>> __builtin_memcpy() gets expanded inline, and the result is a
>> compiler-chosen sequence, and doesn't hit any of our optimised
>> sequences.  I'm not sure what to do about this, because there is surely
>> a larger win from the cases which can be turned into a single mov, or an
>> elided store/copy, than using a potentially inefficient sequence in the
>> rare cases.  Maybe there is room for a fine-tuning option to say "just
>> call memset() if you're going to expand it inline".
> You mean "just call memset() instead of expanding it inline"?

I think what I really mean is "if the result of optimising memset() is
going to result in a REP instruction, call memset() instead".

You want the compiler to do conversion to single mov's/etc, but what you
don't want is ...

> If the inline expansion is merely REP STOS, I'm not sure we'd
> actually gain anything from keeping the compiler from expanding it
> inline. But if the inline construct was more complicated (as I
> observe e.g. in map_vcpu_info() with gcc 10), then it would likely
> be nice if there was such a control. I'll take note to see if I
> can find anything.

... this.  What GCC currently expands inline is a REP MOVS{L,Q}, with
the first and final element done manually ahead of the REP, presumably
for prefetching/pagewalk reasons.

The exact sequence varies due to the surrounding code, and while it's
probably a decent stab for -O2/3 on "any arbitrary 64bit CPU", it's not
helpful when we've got a system-optimised mem*() to hand.

> But this isn't relevant for {clear,copy}_page().
>
>> For all set/copy operations, whether you want non-temporal or not
>> depends on when/where the lines are next going to be consumed.  Page
>> scrubbing in idle context is the only example I can think of where we
>> aren't plausibly going to consume the destination imminently.  Even
>> clear/copy page in a hypercall doesn't want to be non-temporal, because
>> chances are good that the vcpu is going to touch the page on return.
> I'm afraid the situation isn't as black-and-white. Take HAP or
> IOMMU page table allocations, for example: They need to clear the
> full page, yes. But often this is just to then insert one single
> entry, i.e. re-use exactly one of the cache lines.

I consider this an orthogonal problem.  When we're not double-scrubbing
most memory Xen uses, most of this goes away.

Even if we do need to scrub a pagetable to use, we're never(?) complete
at the end of the scrub, and need to make further writes imminently. 
These never want non-temporal accesses, because you never want to write
into a recently-evicted line, and there's no plausible way that trying to
mix and match temporal and non-temporal stores is going to be a
worthwhile optimisation to try.

> Or take initial
> population of guest RAM: The larger the guest, the less likely it
> is for every individual page to get accessed again before its
> contents get evicted from the caches. Judging from what Ankur said,
> once we get to around L3 capacity, MOVNT / CLZERO may be preferable
> there.

Initial population of guests doesn't matter at all, because nothing
(outside of the single threaded toolstack process issuing the
construction hypercalls) is going to touch the pages until the VM is
unpaused.  The only async accesses I can think of are xenstored and
xenconsoled starting up, and those are after the RAM is populated.

In cases like this, current might be a good way of choosing between
temporal and non-temporal accesses.

As before, not double scrubbing will further improve things.

> I think in cases where we don't know how the page is going to be
> used subsequently, we ought to favor latency over cache pollution
> avoidance.

I broadly agree.  I think the cases where it's reasonably safe to use the
pollution-avoidance are fairly obvious, and there is a steep cost to
wrongly-using non-temporal accesses.

> But in cases where we know the subsequent usage pattern,
> we may want to direct scrubbing / zeroing accordingly. Yet of
> course it's not very helpful that there's no way to avoid
> polluting caches and still have reasonably low latency, so using
> some heuristics may be unavoidable.

I don't think any heuristics beyond current, or possibly
d->creation_finished are going to be worthwhile, but I think these alone
can net us a decent win.

> And of course another goal of mine would be to avoid double zeroing
> of pages: When scrubbing uses clear_page() anyway, there's no point
> in the caller then calling clear_page() again. IMO, just like we
> have xzalloc(), we should also have MEMF_zero. Internally the page
> allocator can know whether a page was already scrubbed, and it
> does know for sure whether scrubbing means zeroing.

I think we've discussed this before.  I'm in favour, but I'm absolutely
certain that that wants to be spelled MEMF_dirty (or equiv), so forgetting
it fails safe, and code which is using dirty allocations is clearly
identified and can be audited easily.

~Andrew




* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-15 16:21     ` Andrew Cooper
@ 2021-04-21 13:55       ` Jan Beulich
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Beulich @ 2021-04-21 13:55 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, xen-devel

On 15.04.2021 18:21, Andrew Cooper wrote:
> On 14/04/2021 09:12, Jan Beulich wrote:
>> On 13.04.2021 15:17, Andrew Cooper wrote:
>>> Do you have actual numbers from these experiments?
>> Attached is the collected raw output from a number of systems.
> 
> Wow Tulsa is vintage.  Is that new enough to have nonstop_tsc ?

No.

>>> For memset(), please don't move in the direction of memcpy().  memcpy()
>>> is problematic because the common case is likely to be a multiple of 8
>>> bytes, meaning that we feed 0 into the REP MOVSB, and this is a hit wanting
>>> avoiding.
>> And you say this despite me having pointed out that REP STOSL may
>> be faster in a number of cases? Or do you mean to suggest we should
>> branch around the trailing REP {MOV,STO}SB?
>>
>>>   The "Fast Zero length $FOO" bits on future parts indicate
>>> when passing %ecx=0 is likely to be faster than branching around the
>>> invocation.
>> IOW down the road we could use alternatives patching to remove such
>> branches. But this of course is only if we don't end up using
>> exclusively REP MOVSB / REP STOSB there anyway, as you seem to be
>> suggesting ...
>>
>>> With ERMS/etc, our logic should be a REP MOVSB/STOSB only, without any
>>> cleverness about larger word sizes.  The Linux forms do this fairly well
>>> already, and probably better than Xen, although there might be some room
>>> for improvement IMO.
>> ... here.
>>
>> As to the Linux implementations - for memcpy_erms() I don't think
>> I see any room for improvement in the function itself. We could do
>> alternatives patching somewhat differently (and I probably would).
>> For memset_erms() the tiny bit of improvement over Linux'es code
>> that I would see is to avoid the partial register access when
>> loading %al. But to be honest - in both cases I wouldn't have
>> bothered looking at their code anyway, if you hadn't pointed me
>> there.
> 
> Answering multiple of the points together.
> 
> Yes, the partial register access on %al was one thing I spotted, and
> movzbl would be an improvement.  The alternatives are a bit weird, but
> they're best as they are IMO.  It makes a useful enough difference to
> backtraces/etc, and unconditional jmp's are about as close to free as
> you can get these days.
> 
> On an ERMS system, we want to use REP MOVSB unilaterally.  It is my
> understanding that it is faster across the board than any algorithm
> variation trying to use wider accesses.

Not according to the numbers I've collected. There are cases where
clearing a full page via REP STOS{L,Q} is (often just a little)
faster. Whether this also applies to MOVS I can't tell.
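
For reference, the REP STOSQ shape being compared against the MOVNTI
("sse2") variant is essentially the following (a sketch of mine, not
necessarily the code that produced the numbers above; PAGE_SIZE as
provided by the Xen headers):

    /* Minimal REP STOSQ based clear_page() sketch. */
    static void clear_page_stosq(void *page)
    {
        unsigned long count = PAGE_SIZE / sizeof(unsigned long);

        asm volatile ( "rep stosq"
                       : "+D" (page), "+c" (count)
                       : "a" (0UL)
                       : "memory" );
    }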

>>> It is worth noting that we have extra variations of memset/memcpy where
>>> __builtin_memcpy() gets expanded inline, and the result is a
>>> compiler-chosen sequence, and doesn't hit any of our optimised
>>> sequences.  I'm not sure what to do about this, because there is surely
>>> a larger win from the cases which can be turned into a single mov, or an
>>> elided store/copy, than using a potentially inefficient sequence in the
>>> rare cases.  Maybe there is room for a fine-tuning option to say "just
>>> call memset() if you're going to expand it inline".
>> You mean "just call memset() instead of expanding it inline"?
> 
> I think what I really mean is "if the result of optimising memset() is
> going to result in a REP instruction, call memset() instead".
> 
> You want the compiler to do conversion to single mov's/etc, but what you
> don't want is ...
> 
>> If the inline expansion is merely REP STOS, I'm not sure we'd
>> actually gain anything from keeping the compiler from expanding it
>> inline. But if the inline construct was more complicated (as I
>> observe e.g. in map_vcpu_info() with gcc 10), then it would likely
>> be nice if there was such a control. I'll take note to see if I
>> can find anything.
> 
> ... this.  What GCC currently expands inline is a REP MOVS{L,Q}, with
> the first and final element done manually ahead of the REP, presumably
> for prefetching/pagewalk reasons.

Not sure about the reasons, but the compiler doesn't always do it
like this - there are also cases of plain REP STOSQ. My initial
guess is that the splitting of the first and last elements happens when the
compiler couldn't prove the buffer is 8-byte aligned and a
multiple of 8 bytes in size.
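
For what it's worth, gcc's x86 backend does have a -mstringop-strategy=
option (plus finer grained -mmemcpy-strategy= / -mmemset-strategy= forms)
which looks like it could provide such a control, e.g. forcing out-of-line
calls with -mstringop-strategy=libcall. Whether the gcc versions we
support accept it, and whether the granularity is good enough to keep the
single-MOV conversions, would need checking.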

>>> For all set/copy operations, whether you want non-temporal or not
>>> depends on when/where the lines are next going to be consumed.  Page
>>> scrubbing in idle context is the only example I can think of where we
>>> aren't plausibly going to consume the destination imminently.  Even
>>> clear/copy page in a hypercall doesn't want to be non-temporal, because
>>> chances are good that the vcpu is going to touch the page on return.
>> I'm afraid the situation isn't as black-and-white. Take HAP or
>> IOMMU page table allocations, for example: They need to clear the
>> full page, yes. But often this is just to then insert one single
>> entry, i.e. re-use exactly one of the cache lines.
> 
> I consider this an orthogonal problem.  When we're not double-scrubbing
> most memory Xen uses, most of this goes away.
> 
> Even if we do need to scrub a pagetable to use, we're never(?) complete
> at the end of the scrub, and need to make further writes imminently. 

Right, but often to just one of the cache lines.

> These never want non-temporal accesses, because you never want to write
> into recently-evicted line, and there's no plausible way that trying to
> mix and match temporal and non-temporal stores is going to be a
> worthwhile optimisation to try.

Is a single MOV following (with some distance and with SFENCE in
between) a sequence of MOVNTI going to have an effect worse than
the same MOV trying to store to a cache line that's not in cache?
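
To make the pattern concrete, a hedged sketch of a MOVNTI based scrub
(loosely modelled on the "sse2" variant appearing in the numbers; loop
shape and names are mine):

    /*
     * Non-temporal page scrub sketch.  The trailing SFENCE orders the
     * MOVNTI stores ahead of any subsequent ordinary MOV to the page.
     */
    static void scrub_page_nt_sketch(void *page)
    {
        unsigned long *p = page;
        unsigned int i;

        for ( i = 0; i < PAGE_SIZE / sizeof(*p); i += 4 )
            asm volatile ( "movnti %1, (%0)\n\t"
                           "movnti %1, 8(%0)\n\t"
                           "movnti %1, 16(%0)\n\t"
                           "movnti %1, 24(%0)"
                           :: "r" (p + i), "r" (0UL)
                           : "memory" );

        asm volatile ( "sfence" ::: "memory" );
    }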

>> Or take initial
>> population of guest RAM: The larger the guest, the less likely it
>> is for every individual page to get accessed again before its
>> contents get evicted from the caches. Judging from what Ankur said,
>> once we get to around L3 capacity, MOVNT / CLZERO may be preferable
>> there.
> 
> Initial population of guests doesn't matter at all, because nothing
> (outside of the single threaded toolstack process issuing the
> construction hypercalls) is going to touch the pages until the VM is
> unpaused.  The only async accesses I can think of are xenstored and
> xenconsoled starting up, and those are after the RAM is populated.
> 
> In cases like this, current might be a good way of choosing between
> temporal and non-temporal accesses.
> 
> As before, not double scrubbing will further improve things.
> 
>> I think in cases where we don't know how the page is going to be
>> used subsequently, we ought to favor latency over cache pollution
>> avoidance.
> 
> I broadly agree.  I think the cases where it's reasonably safe to use the
> pollution-avoidance are fairly obvious, and there is a steep cost to
> wrongly-using non-temporal accesses.
> 
>> But in cases where we know the subsequent usage pattern,
>> we may want to direct scrubbing / zeroing accordingly. Yet of
>> course it's not very helpful that there's no way to avoid
>> polluting caches and still have reasonably low latency, so using
>> some heuristics may be unavoidable.
> 
> I don't think any heuristics beyond current, or possibly
> d->creation_finished are going to be worthwhile, but I think these alone
> can net us a decent win.
> 
>> And of course another goal of mine would be to avoid double zeroing
>> of pages: When scrubbing uses clear_page() anyway, there's no point
>> in the caller then calling clear_page() again. IMO, just like we
>> have xzalloc(), we should also have MEMF_zero. Internally the page
>> allocator can know whether a page was already scrubbed, and it
>> does know for sure whether scrubbing means zeroing.
> 
> I think we've discussed this before.  I'm in favour, but I'm absolutely
> certain that that wants to be spelled MEMF_dirty (or equiv), so forgetting
> it fails safe, and code which is using dirty allocations is clearly
> identified and can be audited easily.

Well, there's a difference between scrubbing and zeroing. We already
have MEMF_no_scrub. And we already force callers to think about
whether they want zeroed memory (outside of the page allocator), by
having both xmalloc() and xzalloc() (and their relatives). So while
for scrubbing I could see your point, I'm not sure we should force
everyone who doesn't need zeroed pages to pass MEMF_dirty (or
whatever the name, as I don't particularly like this one). It's quite
the other way around - right now no pages come out of the page
allocator in a known state content-wise. Parties presently calling
clear_page() right afterwards could easily, cleanly, and in a risk-
free manner be converted to use MEMF_zero.
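
To illustrate the kind of conversion I have in mind (MEMF_zero is
hypothetical here; alloc_domheap_page(), clear_domain_page(), and
page_to_mfn() are the existing interfaces):

    /* Today: allocate, then unconditionally zero. */
    static struct page_info *alloc_zeroed_page(struct domain *d,
                                               unsigned int memflags)
    {
        struct page_info *pg = alloc_domheap_page(d, memflags);

        if ( pg )
            clear_domain_page(page_to_mfn(pg));

        /*
         * With MEMF_zero the clear_domain_page() call would simply go away:
         *     pg = alloc_domheap_page(d, memflags | MEMF_zero);
         * and the allocator would zero only pages not already scrubbed
         * to zero.
         */
        return pg;
    }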

Jan



Thread overview: 9+ messages
2021-04-08 13:58 x86: memset() / clear_page() / page scrubbing Jan Beulich
2021-04-09  6:08 ` Ankur Arora
2021-04-09  6:38   ` Jan Beulich
2021-04-09 21:01     ` Ankur Arora
2021-04-12  9:15       ` Jan Beulich
2021-04-13 13:17 ` Andrew Cooper
2021-04-14  8:12   ` Jan Beulich
2021-04-15 16:21     ` Andrew Cooper
2021-04-21 13:55       ` Jan Beulich
