* x86: memset() / clear_page() / page scrubbing
@ 2021-04-08 13:58 Jan Beulich
  2021-04-09  6:08 ` Ankur Arora
  2021-04-13 13:17 ` Andrew Cooper
  0 siblings, 2 replies; 9+ messages in thread
From: Jan Beulich @ 2021-04-08 13:58 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Roger Pau Monné

All,

since over the years we've been repeatedly talking of changing the
implementation of these fundamental functions, I've taken some time
to do some measurements (just for possible clear_page() alternatives
to keep things manageable). I'm not sure I want to spend as much time
subsequently on memcpy() / copy_page() (or more, because there are
yet more combinations of arguments to consider), so for the moment I
assume that whatever route we pick here will more or less also apply
to those.

The present copy_page() is the way it is because of the desire to
avoid disturbing the cache. The effect of REP STOS on the L1 cache
(compared to the present use of MOVNTI) is more or less noticeable on
all hardware, and at least on Intel hardware more noticeable when the
cache starts out clean. For L2 the results are more mixed when
comparing cache-clean and cache-filled cases, but the difference
between MOVNTI and REP STOS remains or (at least on Zen2 and older
Intel hardware) becomes more prominent.

Otoh REP STOS, as was to be expected, in most cases has meaningfully
lower latency than MOVNTI.

Because I was curious I also included AVX (32-byte stores), AVX512
(64-byte stores), and AMD's CLZERO in my testing. While AVX is a
clear win except on the vendors' first generations implementing it
(though I've left out any playing with CR0.TS, which I expect is
what would rule AVX out as an option anyway), AVX512 isn't a clear
win, at least not on Skylake (perhaps newer hardware does better).
CLZERO has slightly higher impact on L1 than MOVNTI, but lower than
REP STOS. Its latency is between the two when the caches are warm,
and better than both when the caches are cold.

Therefore I think that we want to distinguish page clearing (where
we care about latency) from (background) page scrubbing (where I
think the goal ought to be to avoid disturbing the caches). That
would make it
- REP STOS{L,Q} for clear_page() (perhaps also to be used for
  synchronous scrubbing),
- MOVNTI for scrub_page() (when done from idle context), unless
  CLZERO is available.
I don't know whether we should in addition take into consideration
the activity of other (logical) CPUs sharing the caches - this feels
like it could get complex pretty quickly.
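
To make that split a bit more concrete, a minimal sketch could look
like the below (illustration only: the clear_page_hot() /
scrub_page_cold() names are made up, and the real thing would of
course use alternatives patching rather than runtime checks):

    /* Latency-sensitive path: plain REP STOSQ. */
    static void clear_page_hot(void *va)
    {
        unsigned long cnt = PAGE_SIZE / 8;

        asm volatile ( "rep stosq"
                       : "+D" (va), "+c" (cnt)
                       : "a" (0UL)
                       : "memory" );
    }

    /* Background scrubbing: avoid disturbing the caches.  CLZERO where
     * available, MOVNTI otherwise; both are weakly ordered stores,
     * hence the trailing SFENCE. */
    static void scrub_page_cold(void *va)
    {
        unsigned int i;

        if ( boot_cpu_has(X86_FEATURE_CLZERO) )
            for ( i = 0; i < PAGE_SIZE; i += 64 )
                asm volatile ( "clzero"
                               :: "a" ((char *)va + i) : "memory" );
        else
            for ( i = 0; i < PAGE_SIZE / 8; ++i )
                asm volatile ( "movnti %1, %0"
                               : "=m" (((unsigned long *)va)[i])
                               : "r" (0UL) );

        asm volatile ( "sfence" ::: "memory" );
    }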

For memset() we already simply use REP STOSB. I don't see a strong
need to change that, but it may be worth considering bringing it
closer to memcpy() - i.e. doing the main chunk with REP STOS{L,Q}.
They perform somewhat better in a number of cases (including when
ERMS is advertised, i.e. on my Haswell and Skylake, which isn't
what I would have expected). We may want to put the whole thing in
a .S file though, seeing that the C function right now consists of
little more than an asm().
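
In C terms the shape I have in mind is roughly the below (just a
sketch with an illustrative name, not the eventual implementation,
which - as said - may better live in a .S file):

    /* Bulk of the fill via REP STOSQ, with a REP STOSB tail for the
     * remaining 0-7 bytes. */
    void *memset_sketch(void *dst, int c, unsigned long n)
    {
        void *d = dst;
        unsigned long pattern = 0x0101010101010101UL * (unsigned char)c;
        unsigned long qwords = n / 8, tail = n % 8;

        asm volatile ( "rep stosq"
                       : "+D" (d), "+c" (qwords)
                       : "a" (pattern)
                       : "memory" );
        asm volatile ( "rep stosb"
                       : "+D" (d), "+c" (tail)
                       : "a" (pattern)
                       : "memory" );

        return dst;
    }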

For memcpy() I'm inclined to suggest that we simply use REP MOVSB
on ERMS hardware, and stay with what we have everywhere else.
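
I.e. on such hardware simply something like this (sketch only, with
a made-up name; the real function would presumably be selected via
alternatives):

    void *memcpy_erms_sketch(void *dst, const void *src, unsigned long n)
    {
        void *d = dst;

        /* ERMS: a single REP MOVSB for the whole copy. */
        asm volatile ( "rep movsb"
                       : "+D" (d), "+S" (src), "+c" (n)
                       :
                       : "memory" );

        return dst;
    }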

copy_page() (or really copy_domain_page()) doesn't have many uses,
so I'm not sure how worthwhile it is to do much optimization there.
It might be an option to simply expand it to memcpy(), like Arm
does.

Looking forward, on CPUs having "Fast Short REP CMPSB/SCASB" we
may want to figure out whether using these for strlen(), strcmp(),
strchr(), memchr(), and/or memcmp() would be a win.
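
strlen(), for example, could then become something as simple as the
below (sketch only, illustrative name; whether this actually wins is
exactly what would need measuring):

    unsigned long strlen_scasb(const char *s)
    {
        const char *p = s;
        unsigned long n = ~0UL;

        /* Scan for the NUL byte; the "memory" clobber merely tells the
         * compiler the string contents get read. */
        asm volatile ( "repne scasb"
                       : "+D" (p), "+c" (n)
                       : "a" (0)
                       : "memory" );

        /* %rdi ends up one past the terminating NUL. */
        return p - s - 1;
    }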

Thoughts anyone, before I start creating actual patches?

Jan



* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-08 13:58 x86: memset() / clear_page() / page scrubbing Jan Beulich
@ 2021-04-09  6:08 ` Ankur Arora
  2021-04-09  6:38   ` Jan Beulich
  2021-04-13 13:17 ` Andrew Cooper
  1 sibling, 1 reply; 9+ messages in thread
From: Ankur Arora @ 2021-04-09  6:08 UTC (permalink / raw)
  To: jbeulich; +Cc: andrew.cooper3, roger.pau, xen-devel

Hi Jan,

I'm working on somewhat related optimizations on Linux (clear_page(),
going in the opposite direction, from REP STOSB to MOVNT) and have
some comments/questions below.

(Discussion on v1 here:
https://lore.kernel.org/lkml/20201014083300.19077-1-ankur.a.arora@oracle.com/)

On 4/8/2021 6:58 AM, Jan Beulich wrote:
> All,
>
> since over the years we've been repeatedly talking of changing the
> implementation of these fundamental functions, I've taken some time
> to do some measurements (just for possible clear_page() alternatives
> to keep things manageable). I'm not sure I want to spend as much time
> subsequently on memcpy() / copy_page() (or more, because there are
> yet more combinations of arguments to consider), so for the moment I
> think the route we're going to pick here is going to more or less
> also apply to those.
>
> The present copy_page() is the way it is because of the desire to
> avoid disturbing the cache. The effect of REP STOS on the L1 cache
> (compared to the present use of MOVNTI) is more or less noticable on
> all hardware, and at least on Intel hardware more noticable when the
> cache starts out clean. For L2 the results are more mixed when
> comparing cache-clean and cache-filled cases, but the difference
> between MOVNTI and REP STOS remains or (at least on Zen2 and older
> Intel hardware) becomes more prominent.

Could you give me any pointers on the cache effects here? This
obviously makes sense, but I couldn't come up with any benchmarks
which would show this in a straightforward fashion.

>
> Otoh REP STOS, as was to be expected, in most cases has meaningfully
> lower latency than MOVNTI.
>
> Because I was curious I also included AVX (32-byte stores), AVX512
> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
> clear win except on the vendors' first generations implementing it
> (but I've left out any playing with CR0.TS, which is what I expect
> would take this out as an option), AVX512 isn't on Skylake (perhaps
> newer hardware does better). CLZERO has slightly higher impact on
> L1 than MOVNTI, but lower than REP STOS.

Could you elaborate on what kind of difference in L1 impact you are
talking about? Evacuation of cachelines?

> Its latency is between
> both when the caches are warm, and better than both when the caches
> are cold.
>
> Therefore I think that we want to distinguish page clearing (where
> we care about latency) from (background) page scrubbing (where I
> think the goal ought to be to avoid disturbing the caches). That
> would make it
> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>   synchronous scrubbing),
> - MOVNTI for scrub_page() (when done from idle context), unless
>   CLZERO is available.
> Whether in addition we should take into consideration activity of
> other (logical) CPUs sharing caches I don't know - this feels like
> it could get complex pretty quickly.

The one other case might be for ~L3 (or larger) regions. In my tests,
MOVNT/CLZERO is almost always better (the one exception being Skylake)
wrt both cache and latency for larger extents.

In the particular cases I was looking at (mmap+MAP_POPULATE and
page-fault path), that makes the choice of always using MOVNT/CLZERO
easy for GB pages, but fuzzier for 2MB pages.

Not sure if the large-page case is interesting for you though.
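
In case it is, the kind of size-based cut-off I mean would look
roughly like this (purely illustrative: the name, the "around L3
size" threshold parameter, and the assumption of 8-byte-multiple
sizes are all just for the sketch):

    void clear_extent(void *va, unsigned long bytes, unsigned long l3_bytes)
    {
        if ( bytes < l3_bytes )
        {
            unsigned long cnt = bytes / 8;

            /* Small extents: cached stores via REP STOSQ. */
            asm volatile ( "rep stosq"
                           : "+D" (va), "+c" (cnt)
                           : "a" (0UL)
                           : "memory" );
        }
        else
        {
            unsigned long i, *p = va;

            /* ~L3-sized or larger: non-temporal stores, then fence. */
            for ( i = 0; i < bytes / 8; ++i )
                asm volatile ( "movnti %1, %0"
                               : "=m" (p[i])
                               : "r" (0UL) );
            asm volatile ( "sfence" ::: "memory" );
        }
    }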


Thanks
Ankur

>
> For memset() we already simply use REP STOSB. I don't see a strong
> need to change that, but it may be worth to consider bringing it
> closer to memcpy() - try to do the main chunk with REP STOS{L,Q}.
> They perform somewhat better in a number of cases (including when
> ERMS is advertised, i.e. on my Haswell and Skylake, which isn't
> what I would have expected). We may want to put the whole thing in
> a .S file though, seeing that the C function right now consists of
> little more than an asm().
>
> For memcpy() I'm inclined to suggest that we simply use REP MOVSB
> on ERMS hardware, and stay with what we have everywhere else.
>
> copy_page() (or really copy_domain_page()) doesn't have many uses,
> so I'm not sure how worthwhile it is to do much optimization there.
> It might be an option to simply expand it to memcpy(), like Arm
> does.
>
> Looking forward, on CPUs having "Fast Short REP CMPSB/SCASB" we
> may want to figure out whether using these for strlen(), strcmp(),
> strchr(), memchr(), and/or memcmp() would be a win.
>
> Thoughts anyone, before I start creating actual patches?
>
> Jan
>



* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-09  6:08 ` Ankur Arora
@ 2021-04-09  6:38   ` Jan Beulich
  2021-04-09 21:01     ` Ankur Arora
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2021-04-09  6:38 UTC (permalink / raw)
  To: Ankur Arora; +Cc: andrew.cooper3, roger.pau, xen-devel

[-- Attachment #1: Type: text/plain, Size: 4545 bytes --]

On 09.04.2021 08:08, Ankur Arora wrote:
> I'm working on somewhat related optimizations on Linux (clear_page(),
> going in the opposite direction, from REP STOSB to MOVNT) and have
> some comments/questions below.

Interesting.

> On 4/8/2021 6:58 AM, Jan Beulich wrote:
>> All,
>>
>> since over the years we've been repeatedly talking of changing the
>> implementation of these fundamental functions, I've taken some time
>> to do some measurements (just for possible clear_page() alternatives
>> to keep things manageable). I'm not sure I want to spend as much time
>> subsequently on memcpy() / copy_page() (or more, because there are
>> yet more combinations of arguments to consider), so for the moment I
>> think the route we're going to pick here is going to more or less
>> also apply to those.
>>
>> The present copy_page() is the way it is because of the desire to
>> avoid disturbing the cache. The effect of REP STOS on the L1 cache
>> (compared to the present use of MOVNTI) is more or less noticable on
>> all hardware, and at least on Intel hardware more noticable when the
>> cache starts out clean. For L2 the results are more mixed when
>> comparing cache-clean and cache-filled cases, but the difference
>> between MOVNTI and REP STOS remains or (at least on Zen2 and older
>> Intel hardware) becomes more prominent.
> 
> Could you give me any pointers on the cache-effects on this? This
> obviously makes sense but I couldn't come up with any benchmarks
> which would show this in a straight-forward fashion.

No benchmarks in that sense, but a local debugging patch measuring
things before bringing up APs, to have a reasonably predictable
environment. I have attached it for your reference.

>> Otoh REP STOS, as was to be expected, in most cases has meaningfully
>> lower latency than MOVNTI.
>>
>> Because I was curious I also included AVX (32-byte stores), AVX512
>> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
>> clear win except on the vendors' first generations implementing it
>> (but I've left out any playing with CR0.TS, which is what I expect
>> would take this out as an option), AVX512 isn't on Skylake (perhaps
>> newer hardware does better). CLZERO has slightly higher impact on
>> L1 than MOVNTI, but lower than REP STOS.
> 
> Could you elaborate on what kind of difference in L1 impact you are
> talking about? Evacuation of cachelines?

Replacement of ones, yes. As you may see from that patch, I prefill
the cache, do the clearing, and then measure how much longer the
operation used for prefilling takes afterwards. If the clearing
left the cache completely alone (or if the hw prefetcher was really
good), there would be no difference.

>> Its latency is between
>> both when the caches are warm, and better than both when the caches
>> are cold.
>>
>> Therefore I think that we want to distinguish page clearing (where
>> we care about latency) from (background) page scrubbing (where I
>> think the goal ought to be to avoid disturbing the caches). That
>> would make it
>> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>>   synchronous scrubbing),
>> - MOVNTI for scrub_page() (when done from idle context), unless
>>   CLZERO is available.
>> Whether in addition we should take into consideration activity of
>> other (logical) CPUs sharing caches I don't know - this feels like
>> it could get complex pretty quickly.
> 
> The one other case might be for ~L3 (or larger) regions. In my tests,
> MOVNT/CLZERO is almost always better (the one exception being Skylake)
> wrt both cache and latency for larger extents.

Good to know - will keep this in mind.

> In the particular cases I was looking at (mmap+MAP_POPULATE and
> page-fault path), that makes the choice of always using MOVNT/CLZERO
> easy for GB pages, but fuzzier for 2MB pages.
> 
> Not sure if the large-page case is interesting for you though.

Well, we never fill large pages in one go, yet the scrubbing may
touch many individual pages in close succession. But for the
(background) scrubbing my recommendation is to use MOVNT/CLZERO
anyway, irrespective of volume. While upon large page allocations
we may also end up scrubbing many pages in close succession, I'm
not sure that's worth optimizing for - we at least hope for the
pages to have got scrubbed in the background before they get
re-used. Plus we don't (currently) know up front how many of them
may still need scrubbing; this isn't difficult to at least
estimate, but may require yet another loop over the constituent
pages.

Jan

[-- Attachment #2: x86-clear-page-ERMS.patch --]
[-- Type: text/plain, Size: 6505 bytes --]


TODO: remove (or split out) //temp-s
Note: Ankur indicates that for ~L3-size or larger regions MOVNT/CLZERO is better even latency-wise

--- unstable.orig/xen/arch/x86/clear_page.S	2021-02-25 09:28:14.175636881 +0100
+++ unstable/xen/arch/x86/clear_page.S	2021-02-25 10:04:04.315325973 +0100
@@ -16,3 +16,66 @@ ENTRY(clear_page_sse2)
 
         sfence
         ret
+
+ENTRY(clear_page_stosb)
+        mov     $PAGE_SIZE, %ecx
+        xor     %eax,%eax
+        rep stosb
+        ret
+
+ENTRY(clear_page_stosl)
+        mov     $PAGE_SIZE/4, %ecx
+        xor     %eax, %eax
+        rep stosl
+        ret
+
+ENTRY(clear_page_stosq)
+        mov     $PAGE_SIZE/8, %ecx
+        xor     %eax, %eax
+        rep stosq
+        ret
+
+ENTRY(clear_page_avx)
+        mov     $PAGE_SIZE/128, %ecx
+        vpxor   %xmm0, %xmm0, %xmm0
+0:      vmovntdq %ymm0,   (%rdi)
+        vmovntdq %ymm0, 32(%rdi)
+        vmovntdq %ymm0, 64(%rdi)
+        vmovntdq %ymm0, 96(%rdi)
+        sub     $-128, %rdi
+        sub     $1, %ecx
+        jnz     0b
+        sfence
+        ret
+
+#if __GNUC__ > 4
+ENTRY(clear_page_avx512)
+        mov     $PAGE_SIZE/256, %ecx
+        vpxor   %xmm0, %xmm0, %xmm0
+0:      vmovntdq %zmm0,    (%rdi)
+        vmovntdq %zmm0,  64(%rdi)
+        vmovntdq %zmm0, 128(%rdi)
+        vmovntdq %zmm0, 192(%rdi)
+        add     $256, %rdi
+        sub     $1, %ecx
+        jnz     0b
+        sfence
+        ret
+#endif
+
+#if __GNUC__ > 5
+ENTRY(clear_page_clzero)
+        mov     %rdi, %rax
+        mov     $PAGE_SIZE/256, %ecx
+0:      clzero
+        add     $64, %rax
+        clzero
+        add     $64, %rax
+        clzero
+        add     $64, %rax
+        clzero
+        add     $64, %rax
+        sub     $1, %ecx
+        jnz     0b
+        ret
+#endif
--- unstable.orig/xen/arch/x86/cpu/common.c	2021-02-09 16:20:45.000000000 +0100
+++ unstable/xen/arch/x86/cpu/common.c	2021-02-09 16:20:45.000000000 +0100
@@ -238,6 +238,7 @@ int get_model_name(struct cpuinfo_x86 *c
 }
 
 
+extern unsigned l1d_size, l2_size;//temp
 void display_cacheinfo(struct cpuinfo_x86 *c)
 {
 	unsigned int dummy, ecx, edx, size;
@@ -250,6 +251,7 @@ void display_cacheinfo(struct cpuinfo_x8
 				              " D cache %uK (%u bytes/line)\n",
 				       edx >> 24, edx & 0xFF, ecx >> 24, ecx & 0xFF);
 			c->x86_cache_size = (ecx >> 24) + (edx >> 24);
+if(ecx >>= 24) l1d_size = ecx;//temp
 		}
 	}
 
@@ -260,6 +262,7 @@ void display_cacheinfo(struct cpuinfo_x8
 
 	size = ecx >> 16;
 	if (size) {
+l2_size =//temp
 		c->x86_cache_size = size;
 
 		if (opt_cpu_info)
--- unstable.orig/xen/arch/x86/cpu/intel_cacheinfo.c	2021-02-25 09:28:14.175636881 +0100
+++ unstable/xen/arch/x86/cpu/intel_cacheinfo.c	2021-02-09 16:20:23.000000000 +0100
@@ -116,6 +116,7 @@ static int find_num_cache_leaves(void)
 	return i;
 }
 
+extern unsigned l1d_size, l2_size;//temp
 void init_intel_cacheinfo(struct cpuinfo_x86 *c)
 {
 	unsigned int trace = 0, l1i = 0, l1d = 0, l2 = 0, l3 = 0; /* Cache sizes */
@@ -230,12 +231,14 @@ void init_intel_cacheinfo(struct cpuinfo
 	}
 
 	if (new_l1d)
+l1d_size =//temp
 		l1d = new_l1d;
 
 	if (new_l1i)
 		l1i = new_l1i;
 
 	if (new_l2) {
+l2_size =//temp
 		l2 = new_l2;
 	}
 
--- unstable.orig/xen/arch/x86/mm.c	2021-02-25 09:28:41.215745784 +0100
+++ unstable/xen/arch/x86/mm.c	2021-04-06 15:44:32.478099453 +0200
@@ -284,6 +284,22 @@ static void __init assign_io_page(struct
     page->count_info |= PGC_allocated | 1;
 }
 
+static unsigned __init noinline probe(const unsigned*spc, unsigned nr) {//temp
+#define PAGE_ENTS (PAGE_SIZE / sizeof(*spc))
+ unsigned i, j, acc;
+ for(acc = i = 0; i < PAGE_SIZE / 64; ++i)
+  for(j = 0; j < nr; ++j)
+   acc += spc[j * PAGE_ENTS + ((i * (64 / sizeof(*spc)) * 7) & (PAGE_ENTS - 1))];
+ return acc & (i * nr - 1);
+#undef PAGE_ENTS
+}
+extern void clear_page_stosb(void*);//temp
+extern void clear_page_stosl(void*);//temp
+extern void clear_page_stosq(void*);//temp
+extern void clear_page_avx(void*);//temp
+extern void clear_page_avx512(void*);//temp
+extern void clear_page_clzero(void*);//temp
+unsigned l1d_size = KB(16), l2_size;//temp
 void __init arch_init_memory(void)
 {
     unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
@@ -392,6 +408,67 @@ void __init arch_init_memory(void)
     }
 #endif
 
+{//temp
+ unsigned order = get_order_from_pages(PFN_DOWN(l2_size << 10)) ?: 1;
+ void*fill = alloc_xenheap_pages(order, 0);
+ void*buf = alloc_xenheap_pages(order - 1, 0);
+ unsigned long cr0 = read_cr0();
+ printk("erms=%d fsrm=%d fzrm=%d fsrs=%d fsrcs=%d l1d=%uk l2=%uk\n",
+        !!boot_cpu_has(X86_FEATURE_ERMS), !!boot_cpu_has(X86_FEATURE_FSRM),
+        !!boot_cpu_has(X86_FEATURE_FZRM), !!boot_cpu_has(X86_FEATURE_FSRS),
+        !!boot_cpu_has(X86_FEATURE_FSRCS), l1d_size, l2_size);
+ clts();
+ for(unsigned pass = 0; pass < 4; ++pass) {
+  printk("L%d w/%s flush:\n", 2 - !(pass & 2), pass & 1 ? "" : "o");
+  wbinvd();
+  for(i = 0; fill && buf && i < 3; ++i) {
+   unsigned nr = PFN_DOWN((pass & 2 ? l2_size : l1d_size) << 10);
+   uint64_t start, pre, clr, post;
+
+#define CHK(kind) do { \
+ /* local_irq_disable(); */ \
+\
+ memset(buf, __LINE__ | (__LINE__ >> 8), nr * PAGE_SIZE / 2); \
+ if(pass & 1) wbinvd(); else mb(); \
+ memset(fill, __LINE__ | (__LINE__ >> 8), nr * PAGE_SIZE); \
+ mb(); \
+\
+ if(boot_cpu_has(X86_FEATURE_IBRSB) || boot_cpu_has(X86_FEATURE_IBPB)) \
+  wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB); \
+ start = rdtsc_ordered(); \
+ if(probe(fill, nr)) BUG(); \
+ pre = rdtsc_ordered() - start; \
+\
+ start = rdtsc_ordered(); \
+ for(pfn = 0; pfn < nr / 2; ++pfn) \
+  clear_page_##kind(buf + pfn * PAGE_SIZE); \
+ clr = rdtsc_ordered() - start; \
+\
+ if(boot_cpu_has(X86_FEATURE_IBRSB) || boot_cpu_has(X86_FEATURE_IBPB)) \
+  wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB); \
+ start = rdtsc_ordered(); \
+ if(probe(fill, nr)) BUG(); \
+ post = rdtsc_ordered() - start; \
+\
+ /* local_irq_enable(); */ \
+ printk(" pre=%lx " #kind "=%lx post=%lx\n", pre, clr, post); \
+} while(0)
+
+   CHK(sse2);
+   CHK(stosb);
+   CHK(stosl);
+   CHK(stosq);
+   if(boot_cpu_has(X86_FEATURE_AVX)) CHK(avx);
+   if(__GNUC__ > 4 && boot_cpu_has(X86_FEATURE_AVX512F)) CHK(avx512);
+   if(__GNUC__ > 5 && boot_cpu_has(X86_FEATURE_CLZERO)) CHK(clzero);
+
+#undef CHK
+  }
+ }
+ write_cr0(cr0);
+ free_xenheap_pages(buf, order - 1);
+ free_xenheap_pages(fill, order);
+}
     /* Generate a symbol to be used in linker script */
     ASM_CONSTANT(FIXADDR_X_SIZE, FIXADDR_X_SIZE);
 }


* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-09  6:38   ` Jan Beulich
@ 2021-04-09 21:01     ` Ankur Arora
  2021-04-12  9:15       ` Jan Beulich
  0 siblings, 1 reply; 9+ messages in thread
From: Ankur Arora @ 2021-04-09 21:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, roger.pau, xen-devel

On 2021-04-08 11:38 p.m., Jan Beulich wrote:
> On 09.04.2021 08:08, Ankur Arora wrote:
>> I'm working on somewhat related optimizations on Linux (clear_page(),
>> going in the opposite direction, from REP STOSB to MOVNT) and have
>> some comments/questions below.
> 
> Interesting.
> 
>> On 4/8/2021 6:58 AM, Jan Beulich wrote:
>>> All,
>>>
>>> since over the years we've been repeatedly talking of changing the
>>> implementation of these fundamental functions, I've taken some time
>>> to do some measurements (just for possible clear_page() alternatives
>>> to keep things manageable). I'm not sure I want to spend as much time
>>> subsequently on memcpy() / copy_page() (or more, because there are
>>> yet more combinations of arguments to consider), so for the moment I
>>> think the route we're going to pick here is going to more or less
>>> also apply to those.
>>>
>>> The present copy_page() is the way it is because of the desire to
>>> avoid disturbing the cache. The effect of REP STOS on the L1 cache
>>> (compared to the present use of MOVNTI) is more or less noticable on
>>> all hardware, and at least on Intel hardware more noticable when the
>>> cache starts out clean. For L2 the results are more mixed when
>>> comparing cache-clean and cache-filled cases, but the difference
>>> between MOVNTI and REP STOS remains or (at least on Zen2 and older
>>> Intel hardware) becomes more prominent.
>>
>> Could you give me any pointers on the cache-effects on this? This
>> obviously makes sense but I couldn't come up with any benchmarks
>> which would show this in a straight-forward fashion.
> 
> No benchmarks in that sense, but a local debugging patch measuring
> things before bringing up APs, to have a reasonably predictable
> environment. I have attached it for your reference.

Thanks, that does look like a pretty good predictable test.
(Btw, there might be an oversight in the clear_page_clzero() logic.
I believe that also needs an sfence.)

Just curious: you had commented out the local irq disable/enable clauses.
Is that because you decided that the code ran at an early enough
point that they were not required, or for some other reason?

> 
>>> Otoh REP STOS, as was to be expected, in most cases has meaningfully
>>> lower latency than MOVNTI.
>>>
>>> Because I was curious I also included AVX (32-byte stores), AVX512
>>> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
>>> clear win except on the vendors' first generations implementing it
>>> (but I've left out any playing with CR0.TS, which is what I expect
>>> would take this out as an option), AVX512 isn't on Skylake (perhaps
>>> newer hardware does better). CLZERO has slightly higher impact on
>>> L1 than MOVNTI, but lower than REP STOS.
>>
>> Could you elaborate on what kind of difference in L1 impact you are
>> talking about? Evacuation of cachelines?
> 
> Replacement of ones, yes. As you may see from that patch, I prefill
> the cache, do the clearing, and then measure how much longer the
> same operation takes that was used for prefilling. If the clearing
> left the cache completely alone (or if the hw prefetcher was really
> good), there would be no difference.

Yeah, that does sound like a good way to get an idea of how much
the clear_page_*() variants perturb the cache.

> 
>>> Its latency is between
>>> both when the caches are warm, and better than both when the caches
>>> are cold.
>>>
>>> Therefore I think that we want to distinguish page clearing (where
>>> we care about latency) from (background) page scrubbing (where I
>>> think the goal ought to be to avoid disturbing the caches). That
>>> would make it
>>> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>>>    synchronous scrubbing),
>>> - MOVNTI for scrub_page() (when done from idle context), unless
>>>    CLZERO is available.
>>> Whether in addition we should take into consideration activity of
>>> other (logical) CPUs sharing caches I don't know - this feels like
>>> it could get complex pretty quickly.
>>
>> The one other case might be for ~L3 (or larger) regions. In my tests,
>> MOVNT/CLZERO is almost always better (the one exception being Skylake)
>> wrt both cache and latency for larger extents.
> 
> Good to know - will keep this in mind.
> 
>> In the particular cases I was looking at (mmap+MAP_POPULATE and
>> page-fault path), that makes the choice of always using MOVNT/CLZERO
>> easy for GB pages, but fuzzier for 2MB pages.
>>
>> Not sure if the large-page case is interesting for you though.
> 
> Well, we never fill large pages in one go, yet the scrubbing may
> touch many individual pages in close succession. But for the
> (background) scrubbing my recommendation is to use MOVNT/CLZERO
> anyway, irrespective of volume. While upon large page allocations
> we may also end up scrubbing many pages in close succession, I'm
> not sure that's worth optimizing for - we at least hope for the
> pages to have got scrubbed in the background before they get
> re-used. Plus we don't (currently) know up front how many of them
> may still need scrubbing; this isn't difficult to at least
> estimate, but may require yet another loop over the constituent
> pages.

Agreed, MOVNT/CLZERO do seem ideally suited for background scrubbing.
Alas, AFAICS Linux currently only does foreground cleaning. The
only reason I can think of for that "decision" is maybe that
there is one trusted user with a significant footprint -- the page
cache -- where pages can be allocated without needing to be cleared.

That said, given that background scrubbing is a fairly cheap way of
time-shifting work to idle without negatively affecting the cache,
it does make sense to move towards it for at least a subset of pages.

The only potential negative could be higher power consumption,
because idle then spends less time in C-states. That said, that
also seems like a wash, given that this only shifts when we do
the clearing.
Would you have any intuition on whether the power consumption of
the non-temporal primitives is meaningfully different from that of
REP STOS and friends?

Ankur

> 
> Jan
> 



* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-09 21:01     ` Ankur Arora
@ 2021-04-12  9:15       ` Jan Beulich
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Beulich @ 2021-04-12  9:15 UTC (permalink / raw)
  To: Ankur Arora; +Cc: andrew.cooper3, roger.pau, xen-devel

On 09.04.2021 23:01, Ankur Arora wrote:
> On 2021-04-08 11:38 p.m., Jan Beulich wrote:
>> On 09.04.2021 08:08, Ankur Arora wrote:
>>> On 4/8/2021 6:58 AM, Jan Beulich wrote:
>>>> The present copy_page() is the way it is because of the desire to
>>>> avoid disturbing the cache. The effect of REP STOS on the L1 cache
>>>> (compared to the present use of MOVNTI) is more or less noticable on
>>>> all hardware, and at least on Intel hardware more noticable when the
>>>> cache starts out clean. For L2 the results are more mixed when
>>>> comparing cache-clean and cache-filled cases, but the difference
>>>> between MOVNTI and REP STOS remains or (at least on Zen2 and older
>>>> Intel hardware) becomes more prominent.
>>>
>>> Could you give me any pointers on the cache-effects on this? This
>>> obviously makes sense but I couldn't come up with any benchmarks
>>> which would show this in a straight-forward fashion.
>>
>> No benchmarks in that sense, but a local debugging patch measuring
>> things before bringing up APs, to have a reasonably predictable
>> environment. I have attached it for your reference.
> 
> Thanks, that does look like a pretty good predictable test.
> (Btw, there might be an oversight in the clear_page_clzero() logic.
> I believe that also needs an sfence.)

Oh, good point.

> Just curious: you had commented out the local irq disable/enable clauses.
> Is that because you decided that it the code ran at an early enough
> point that they were not required or some other reason?

It's not so much "early enough to not be required" but "too early to
be valid to enable interrupts". And then I didn't want to switch to
save/restore, so left them just as comments.

> Would you have any intuition on, if the power consumption of
> the non-temporal primitives is meaningfully different from
> REP STOS and friends?

If power can be saved when caches don't get modified (no idea if
that's possible, as the cached data still needs to be kept intact),
then non-temporal stores might be better.

Jan



* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-08 13:58 x86: memset() / clear_page() / page scrubbing Jan Beulich
  2021-04-09  6:08 ` Ankur Arora
@ 2021-04-13 13:17 ` Andrew Cooper
  2021-04-14  8:12   ` Jan Beulich
  1 sibling, 1 reply; 9+ messages in thread
From: Andrew Cooper @ 2021-04-13 13:17 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Roger Pau Monné

On 08/04/2021 14:58, Jan Beulich wrote:
> All,
>
> since over the years we've been repeatedly talking of changing the
> implementation of these fundamental functions, I've taken some time
> to do some measurements (just for possible clear_page() alternatives
> to keep things manageable). I'm not sure I want to spend as much time
> subsequently on memcpy() / copy_page() (or more, because there are
> yet more combinations of arguments to consider), so for the moment I
> think the route we're going to pick here is going to more or less
> also apply to those.
>
> The present copy_page() is the way it is because of the desire to
> avoid disturbing the cache. The effect of REP STOS on the L1 cache
> (compared to the present use of MOVNTI) is more or less noticable on
> all hardware, and at least on Intel hardware more noticable when the
> cache starts out clean. For L2 the results are more mixed when
> comparing cache-clean and cache-filled cases, but the difference
> between MOVNTI and REP STOS remains or (at least on Zen2 and older
> Intel hardware) becomes more prominent.
>
> Otoh REP STOS, as was to be expected, in most cases has meaningfully
> lower latency than MOVNTI.
>
> Because I was curious I also included AVX (32-byte stores), AVX512
> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
> clear win except on the vendors' first generations implementing it
> (but I've left out any playing with CR0.TS, which is what I expect
> would take this out as an option), AVX512 isn't on Skylake (perhaps
> newer hardware does better). CLZERO has slightly higher impact on
> L1 than MOVNTI, but lower than REP STOS. Its latency is between
> both when the caches are warm, and better than both when the caches
> are cold.
>
> Therefore I think that we want to distinguish page clearing (where
> we care about latency) from (background) page scrubbing (where I
> think the goal ought to be to avoid disturbing the caches). That
> would make it
> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>   synchronous scrubbing),
> - MOVNTI for scrub_page() (when done from idle context), unless
>   CLZERO is available.
> Whether in addition we should take into consideration activity of
> other (logical) CPUs sharing caches I don't know - this feels like
> it could get complex pretty quickly.
>
> For memset() we already simply use REP STOSB. I don't see a strong
> need to change that, but it may be worth to consider bringing it
> closer to memcpy() - try to do the main chunk with REP STOS{L,Q}.
> They perform somewhat better in a number of cases (including when
> ERMS is advertised, i.e. on my Haswell and Skylake, which isn't
> what I would have expected). We may want to put the whole thing in
> a .S file though, seeing that the C function right now consists of
> little more than an asm().
>
> For memcpy() I'm inclined to suggest that we simply use REP MOVSB
> on ERMS hardware, and stay with what we have everywhere else.
>
> copy_page() (or really copy_domain_page()) doesn't have many uses,
> so I'm not sure how worthwhile it is to do much optimization there.
> It might be an option to simply expand it to memcpy(), like Arm
> does.
>
> Looking forward, on CPUs having "Fast Short REP CMPSB/SCASB" we
> may want to figure out whether using these for strlen(), strcmp(),
> strchr(), memchr(), and/or memcmp() would be a win.
>
> Thoughts anyone, before I start creating actual patches?

Do you have actual numbers from these experiments?  I've seen your patch
from the thread, but at a minimum it's missing some hunks adding new
CPUID bits.  I do worry however whether the testing is likely to be
realistic for non-idle scenarios.

It is very little surprise that AVX-512 on Skylake is poor.  The
frequency hit from using %zmm is staggering.  IceLake is expected to be
better, but almost certainly won't exceed REP MOVSB, which is optimised
in microcode for the data width of the CPU.

For memset(), please don't move in the direction of memcpy().  memcpy()
is problematic because the common case is likely to be a multiple of 8
bytes, meaning that we feed 0 into the trailing REP MOVSB, and this is
a hit worth avoiding.  The "Fast Zero length $FOO" bits on future parts indicate
when passing %ecx=0 is likely to be faster than branching around the
invocation.

With ERMS/etc, our logic should be REP MOVSB/STOSB only, without any
cleverness about larger word sizes.  The Linux forms do this fairly well
already, and probably better than Xen, although there might be some room
for improvement IMO.

It is worth noting that we have extra variations of memset/memcpy where
__builtin_memcpy() gets expanded inline, and the result is a
compiler-chosen sequence, and doesn't hit any of our optimised
sequences.  I'm not sure what to do about this, because there is surely
a larger win from the cases which can be turned into a single mov, or an
elided store/copy, than using a potentially inefficient sequence in the
rare cases.  Maybe there is room for a fine-tuning option to say "just
call memset() if you're going to expand it inline".


For all set/copy operations, whether you want non-temporal or not
depends on when/where the lines are next going to be consumed.  Page
scrubbing in idle context is the only example I can think of where we
aren't plausibly going to consume the destination imminently.  Even
clear/copy page in a hypercall doesn't want to be non-temporal, because
chances are good that the vcpu is going to touch the page on return.

~Andrew




* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-13 13:17 ` Andrew Cooper
@ 2021-04-14  8:12   ` Jan Beulich
  2021-04-15 16:21     ` Andrew Cooper
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2021-04-14  8:12 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, xen-devel

[-- Attachment #1: Type: text/plain, Size: 5516 bytes --]

On 13.04.2021 15:17, Andrew Cooper wrote:
> Do you have actual numbers from these experiments?

Attached is the collected raw output from a number of systems.

>  I've seen your patch
> from the thread, but at a minimum its missing some hunks adding new
> CPUID bits.

It's not missing hunks - these additions are in a prereq patch that
I meant to post together with whatever this analysis would lead to.
If you think I should submit the prereqs ahead of time, I can of
course do so.

>  I do worry however whether the testing is likely to be
> realistic for non-idle scenarios.

Of course it's not going to be - in non-idle scenarios we'll always
be somewhere in the middle. Therefore I wanted to have numbers at
the edges (hot and cold cache respectively), as any other numbers
are going to be much harder to obtain in a way that they would
actually be meaningful (and hence reasonably stable).

> It is very little surprise that AVX-512 on Skylake is poor.  The
> frequency hit from using %zmm is staggering.  IceLake is expected to be
> better, but almost certainly won't exceed REP MOVSB, which is optimised
> in microcode for the data width of the CPU.

Right, much like AVX has improved but didn't get anywhere near
REP MOVS.

> For memset(), please don't move in the direction of memcpy().  memcpy()
> is problematic because the common case is likely to be a multiple of 8
> bytes, meaning that we feed 0 into the REP MOVSB, and this a hit wanting
> avoiding.

And you say this despite me having pointed out that REP STOSL may
be faster in a number of cases? Or do you mean to suggest we should
branch around the trailing REP {MOV,STO}SB?

>  The "Fast Zero length $FOO" bits on future parts indicate
> when passing %ecx=0 is likely to be faster than branching around the
> invocation.

IOW down the road we could use alternatives patching to remove such
branches. But this of course is only if we don't end up using
exclusively REP MOVSB / REP STOSB there anyway, as you seem to be
suggesting ...

> With ERMS/etc, our logic should be a REP MOVSB/STOSB only, without any
> cleverness about larger word sizes.  The Linux forms do this fairly well
> already, and probably better than Xen, although there might be some room
> for improvement IMO.

... here.

As to the Linux implementations - for memcpy_erms() I don't think
I see any room for improvement in the function itself. We could do
alternatives patching somewhat differently (and I probably would).
For memset_erms() the tiny bit of improvement over Linux's code
that I would see is to avoid the partial register access when
loading %al. But to be honest - in both cases I wouldn't have
bothered looking at their code anyway, if you hadn't pointed me
there.
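
Just to illustrate what I mean (sketch only, illustrative name):
zero-extend the fill value instead of Linux's MOVB into %al, e.g.

    void *memset_erms_sketch(void *dst, int c, unsigned long n)
    {
        void *d = dst;

        /* Full-width zero-extending load avoids the partial %al write. */
        asm volatile ( "movzbl %b2, %%eax\n\t"
                       "rep stosb"
                       : "+D" (d), "+c" (n)
                       : "q" (c)
                       : "rax", "memory" );

        return dst;
    }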

> It is worth nothing that we have extra variations of memset/memcpy where
> __builtin_memcpy() gets expanded inline, and the result is a
> compiler-chosen sequence, and doesn't hit any of our optimised
> sequences.  I'm not sure what to do about this, because there is surely
> a larger win from the cases which can be turned into a single mov, or an
> elided store/copy, than using a potentially inefficient sequence in the
> rare cases.  Maybe there is room for a fine-tuning option to say "just
> call memset() if you're going to expand it inline".

You mean "just call memset() instead of expanding it inline"?

If the inline expansion is merely REP STOS, I'm not sure we'd
actually gain anything from keeping the compiler from expanding it
inline. But if the inline construct was more complicated (as I
observe e.g. in map_vcpu_info() with gcc 10), then it would likely
be nice if there was such a control. I'll take note to see if I
can find anything.

But this isn't relevant for {clear,copy}_page().

> For all set/copy operations, whether you want non-temporal or not
> depends on when/where the lines are next going to be consumed.  Page
> scrubbing in idle context is the only example I can think of where we
> aren't plausibly going to consume the destination imminently.  Even
> clear/copy page in a hypercall doesn't want to be non-temporal, because
> chances are good that the vcpu is going to touch the page on return.

I'm afraid the situation isn't as black-and-white. Take HAP or
IOMMU page table allocations, for example: They need to clear the
full page, yes. But often this is just to then insert one single
entry, i.e. re-use exactly one of the cache lines. Or take initial
population of guest RAM: The larger the guest, the less likely it
is for every individual page to get accessed again before its
contents get evicted from the caches. Judging from what Ankur said,
once we get to around L3 capacity, MOVNT / CLZERO may be preferable
there.

I think in cases where we don't know how the page is going to be
used subsequently, we ought to favor latency over cache pollution
avoidance. But in cases where we know the subsequent usage pattern,
we may want to direct scrubbing / zeroing accordingly. Yet of
course it's not very helpful that there's no way to avoid
polluting caches and still have reasonably low latency, so using
some heuristics may be unavoidable.

And of course another goal of mine would be to avoid double zeroing
of pages: When scrubbing uses clear_page() anyway, there's no point
in the caller then calling clear_page() again. IMO, just like we
have xzalloc(), we should also have MEMF_zero. Internally the page
allocator can know whether a page was already scrubbed, and it
does know for sure whether scrubbing means zeroing.

Jan

[-- Attachment #2: xen-clear-page.txt --]
[-- Type: text/plain, Size: 18093 bytes --]

Aorus (Skylake):

(XEN) erms=1 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=32k l2=1024k
(XEN) L1 w/o flush:
(XEN)  pre=5aa sse2=17c8 post=466
(XEN)  pre=302 stosb=544 post=6f2
(XEN)  pre=2f6 stosl=4de post=500
(XEN)  pre=308 stosq=4bc post=4b6
(XEN)  pre=300 avx=14d4 post=2fa
(XEN)  pre=2ea avx512=11ca post=300
(XEN)  pre=32c sse2=1620 post=330
(XEN)  pre=326 stosb=55a post=4b0
(XEN)  pre=332 stosl=4f2 post=4a2
(XEN)  pre=336 stosq=4ec post=47c
(XEN)  pre=332 avx=14f4 post=324
(XEN)  pre=3a2 avx512=1204 post=35c
(XEN)  pre=322 sse2=1606 post=330
(XEN)  pre=324 stosb=564 post=466
(XEN)  pre=31e stosl=4f8 post=49c
(XEN)  pre=322 stosq=4fa post=3e0
(XEN)  pre=340 avx=14f6 post=328
(XEN)  pre=326 avx512=120c post=322
(XEN) L1 w/ flush:
(XEN)  pre=2e4 sse2=c00 post=3e6
(XEN)  pre=34c stosb=916 post=722
(XEN)  pre=358 stosl=908 post=7b4
(XEN)  pre=360 stosq=a72 post=732
(XEN)  pre=33e avx=b3c post=33c
(XEN)  pre=348 avx512=a38 post=342
(XEN)  pre=342 sse2=c24 post=33e
(XEN)  pre=34e stosb=998 post=77c
(XEN)  pre=352 stosl=910 post=6e4
(XEN)  pre=356 stosq=94c post=74a
(XEN)  pre=334 avx=b44 post=332
(XEN)  pre=36e avx512=bca post=336
(XEN)  pre=356 sse2=c1a post=336
(XEN)  pre=35c stosb=92a post=6f0
(XEN)  pre=32e stosl=970 post=864
(XEN)  pre=358 stosq=94c post=756
(XEN)  pre=344 avx=b4c post=326
(XEN)  pre=34c avx512=a5c post=372
(XEN) L2 w/o flush:
(XEN)  pre=15f7c sse2=2eff8 post=c272
(XEN)  pre=cf8c stosb=cbf6 post=c6a4
(XEN)  pre=ce5c stosl=cc7e post=c6bc
(XEN)  pre=d3b6 stosq=7f5e6 post=d898
(XEN)  pre=cf56 avx=2d7de post=be1a
(XEN)  pre=cfe6 avx512=349c6 post=caf8
(XEN)  pre=dcee sse2=2f93e post=c97e
(XEN)  pre=dd6e stosb=d000 post=d102
(XEN)  pre=dad0 stosl=d034 post=d12e
(XEN)  pre=db00 stosq=d0ee post=d0b2
(XEN)  pre=dabc avx=2dec8 post=c830
(XEN)  pre=dc04 avx512=2dbbe post=c8aa
(XEN)  pre=db74 sse2=2f8e6 post=c89e
(XEN)  pre=dd4c stosb=d0a6 post=d16c
(XEN)  pre=da6c stosl=cfd0 post=d388
(XEN)  pre=d8c8 stosq=d054 post=d0b4
(XEN)  pre=db2e avx=2de78 post=cb3c
(XEN)  pre=d9ea avx512=2d9d6 post=c8f0
(XEN) L2 w/ flush:
(XEN)  pre=16000 sse2=16cf2 post=bfc4
(XEN)  pre=1604c stosb=12ab8 post=c66c
(XEN)  pre=16054 stosl=12624 post=c7a6
(XEN)  pre=16008 stosq=127b4 post=c54e
(XEN)  pre=15f7c avx=15a98 post=bd50
(XEN)  pre=16046 avx512=15760 post=13c52
(XEN)  pre=15f8a sse2=16dc0 post=bfb8
(XEN)  pre=15fb4 stosb=1293a post=c6da
(XEN)  pre=15f7c stosl=12672 post=c574
(XEN)  pre=15fee stosq=1245e post=c6fe
(XEN)  pre=15fc8 avx=15aae post=c01c
(XEN)  pre=1608c avx512=1ca32 post=c9ce
(XEN)  pre=15fba sse2=16cdc post=c076
(XEN)  pre=15ffe stosb=12992 post=c9b0
(XEN)  pre=16050 stosl=1290a post=c53e
(XEN)  pre=16002 stosq=129a6 post=c540
(XEN)  pre=15f98 avx=159ee post=bc50
(XEN)  pre=15fca avx512=159bc post=13d9a


Rome:

(XEN) erms=0 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=32k l2=512k
(XEN) L1 w/o flush:
(XEN)  pre=4c4 sse2=eec post=384
(XEN)  pre=35c stosb=230 post=398
(XEN)  pre=35c stosl=230 post=3d4
(XEN)  pre=35c stosq=258 post=410
(XEN)  pre=370 avx=dd4 post=370
(XEN)  pre=370 clzero=758 post=35c
(XEN)  pre=35c sse2=e10 post=370
(XEN)  pre=35c stosb=21c post=370
(XEN)  pre=35c stosl=230 post=3fc
(XEN)  pre=35c stosq=21c post=3ac
(XEN)  pre=35c avx=d98 post=35c
(XEN)  pre=35c clzero=758 post=35c
(XEN)  pre=35c sse2=e24 post=35c
(XEN)  pre=35c stosb=21c post=3d4
(XEN)  pre=35c stosl=21c post=3d4
(XEN)  pre=370 stosq=21c post=3ac
(XEN)  pre=35c avx=d84 post=35c
(XEN)  pre=35c clzero=758 post=370
(XEN) L1 w/ flush:
(XEN)  pre=438 sse2=a50 post=35c
(XEN)  pre=438 stosb=d34 post=398
(XEN)  pre=438 stosl=d0c post=384
(XEN)  pre=438 stosq=aa0 post=384
(XEN)  pre=44c avx=924 post=370
(XEN)  pre=44c clzero=5f0 post=370
(XEN)  pre=438 sse2=a50 post=35c
(XEN)  pre=438 stosb=c30 post=398
(XEN)  pre=44c stosl=d20 post=3c0
(XEN)  pre=438 stosq=b04 post=370
(XEN)  pre=438 avx=938 post=370
(XEN)  pre=44c clzero=6f4 post=35c
(XEN)  pre=438 sse2=a3c post=35c
(XEN)  pre=44c stosb=adc post=384
(XEN)  pre=438 stosl=aa0 post=3c0
(XEN)  pre=44c stosq=a3c post=370
(XEN)  pre=438 avx=924 post=35c
(XEN)  pre=438 clzero=5c8 post=370
(XEN) L2 w/o flush:
(XEN)  pre=670c sse2=fe88 post=108ec
(XEN)  pre=6e28 stosb=24cc post=14500
(XEN)  pre=7120 stosl=2468 post=14d0c
(XEN)  pre=7490 stosq=247c post=1507c
(XEN)  pre=7a6c avx=fc6c post=119cc
(XEN)  pre=72b0 clzero=73f0 post=118b4
(XEN)  pre=7184 sse2=fdfc post=11e2c
(XEN)  pre=6f04 stosb=247c post=14b90
(XEN)  pre=7288 stosl=2530 post=15054
(XEN)  pre=75d0 stosq=24a4 post=15b30
(XEN)  pre=6fe0 avx=fc94 post=11864
(XEN)  pre=7198 clzero=74cc post=11d50
(XEN)  pre=751c sse2=fdfc post=121b0
(XEN)  pre=7350 stosb=24cc post=15360
(XEN)  pre=6e64 stosl=24b8 post=14f00
(XEN)  pre=7738 stosq=2440 post=14a8c
(XEN)  pre=6f90 avx=fcf8 post=11bc0
(XEN)  pre=729c clzero=747c post=11ae4
(XEN) L2 w/ flush:
(XEN)  pre=580c sse2=a870 post=10554
(XEN)  pre=5744 stosb=9c7c post=152ac
(XEN)  pre=5924 stosl=9a24 post=15c48
(XEN)  pre=56cc stosq=9df8 post=157fc
(XEN)  pre=5898 avx=a640 post=10338
(XEN)  pre=5974 clzero=69dc post=10dec
(XEN)  pre=5be0 sse2=a870 post=10ba8
(XEN)  pre=57a8 stosb=9ed4 post=15a40
(XEN)  pre=594c stosl=9d6c post=16198
(XEN)  pre=5438 stosq=9dd0 post=15860
(XEN)  pre=57d0 avx=a49c post=10b80
(XEN)  pre=52bc clzero=69dc post=f938
(XEN)  pre=56e0 sse2=ab54 post=10b08
(XEN)  pre=5654 stosb=9f88 post=1584c
(XEN)  pre=5654 stosl=a014 post=14ab4
(XEN)  pre=58c0 stosq=9a38 post=15dc4
(XEN)  pre=57a8 avx=a640 post=10c0c
(XEN)  pre=5618 clzero=69dc post=10554


Precision 7810 (Haswell):

(XEN) erms=1 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=32k l2=256k
(XEN) L1 w/o flush:
(XEN)  pre=618 sse2=1324 post=41c
(XEN)  pre=3c4 stosb=6fc post=74c
(XEN)  pre=3ac stosl=6c4 post=728
(XEN)  pre=39c stosq=6b0 post=720
(XEN)  pre=3ac avx=df4 post=3e4
(XEN)  pre=38c sse2=f4c post=3a8
(XEN)  pre=38c stosb=6e4 post=748
(XEN)  pre=390 stosl=698 post=6f8
(XEN)  pre=380 stosq=6ac post=6ec
(XEN)  pre=3a4 avx=e28 post=3a8
(XEN)  pre=384 sse2=f50 post=374
(XEN)  pre=398 stosb=6ec post=6d4
(XEN)  pre=380 stosl=69c post=700
(XEN)  pre=3b8 stosq=698 post=6cc
(XEN)  pre=394 avx=e64 post=390
(XEN) L1 w/ flush:
(XEN)  pre=49c sse2=109c post=380
(XEN)  pre=480 stosb=1c08 post=864
(XEN)  pre=4d0 stosl=1bc8 post=820
(XEN)  pre=488 stosq=1bb8 post=834
(XEN)  pre=3ac avx=ddc post=388
(XEN)  pre=498 sse2=ef8 post=384
(XEN)  pre=474 stosb=1cb0 post=85c
(XEN)  pre=4a4 stosl=1bc4 post=85c
(XEN)  pre=47c stosq=1bcc post=828
(XEN)  pre=480 avx=df0 post=38c
(XEN)  pre=498 sse2=f08 post=370
(XEN)  pre=480 stosb=1ed4 post=880
(XEN)  pre=47c stosl=1bb0 post=848
(XEN)  pre=48c stosq=1ba0 post=850
(XEN)  pre=488 avx=de4 post=394
(XEN) L2 w/o flush:
(XEN)  pre=6450 sse2=7f78 post=39c8
(XEN)  pre=5478 stosb=3ab8 post=4b74
(XEN)  pre=4f68 stosl=3978 post=4d84
(XEN)  pre=4ca0 stosq=395c post=4e60
(XEN)  pre=52b4 avx=7974 post=3c84
(XEN)  pre=4fa8 sse2=7f24 post=3a80
(XEN)  pre=5118 stosb=3ad8 post=4e18
(XEN)  pre=4df0 stosl=3908 post=4ce8
(XEN)  pre=5028 stosq=396c post=4ef0
(XEN)  pre=5110 avx=7968 post=3ba4
(XEN)  pre=5088 sse2=7f20 post=3b1c
(XEN)  pre=4db8 stosb=3908 post=4ec4
(XEN)  pre=4eb4 stosl=3a00 post=4c00
(XEN)  pre=4f90 stosq=3970 post=4d98
(XEN)  pre=4f3c avx=7950 post=3a78
(XEN) L2 w/ flush:
(XEN)  pre=6380 sse2=786c post=3948
(XEN)  pre=6400 stosb=10478 post=4740
(XEN)  pre=6430 stosl=10564 post=46cc
(XEN)  pre=6430 stosq=10608 post=46c4
(XEN)  pre=6498 avx=7548 post=3978
(XEN)  pre=6418 sse2=7868 post=3934
(XEN)  pre=6350 stosb=10988 post=4798
(XEN)  pre=6410 stosl=10508 post=4678
(XEN)  pre=63dc stosq=105a8 post=46fc
(XEN)  pre=6500 avx=7564 post=39d0
(XEN)  pre=63b0 sse2=7890 post=397c
(XEN)  pre=648c stosb=10868 post=47f0
(XEN)  pre=64a0 stosl=106f4 post=46b4
(XEN)  pre=646c stosq=10468 post=4734
(XEN)  pre=63ec avx=75c4 post=3938


Dinar:

(XEN) erms=0 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=16k l2=2048k
(XEN) L1 w/o flush:
(XEN)  pre=7e6 sse2=1c06 post=79d
(XEN)  pre=70a stosb=668 post=84f
(XEN)  pre=6dc stosl=676 post=83f
(XEN)  pre=6cf stosq=65b post=872
(XEN)  pre=6e0 avx=1a84 post=706
(XEN)  pre=709 sse2=19aa post=6ce
(XEN)  pre=6b7 stosb=601 post=844
(XEN)  pre=6e8 stosl=613 post=85e
(XEN)  pre=6a1 stosq=614 post=824
(XEN)  pre=6b9 avx=1a66 post=695
(XEN)  pre=6e2 sse2=199b post=6af
(XEN)  pre=6e7 stosb=602 post=839
(XEN)  pre=6cc stosl=61b post=845
(XEN)  pre=6ad stosq=607 post=815
(XEN)  pre=6ac avx=1a81 post=693
(XEN) L1 w/ flush:
(XEN)  pre=804 sse2=c48 post=6da
(XEN)  pre=7ca stosb=e16 post=82b
(XEN)  pre=7a3 stosl=ef0 post=81e
(XEN)  pre=7d7 stosq=dde post=829
(XEN)  pre=7ae avx=1562 post=6c0
(XEN)  pre=7c9 sse2=c3a post=6d8
(XEN)  pre=7ec stosb=db0 post=82b
(XEN)  pre=7f0 stosl=e3e post=84d
(XEN)  pre=7f1 stosq=de8 post=827
(XEN)  pre=7dd avx=157a post=6bd
(XEN)  pre=7d2 sse2=c49 post=6c4
(XEN)  pre=7a4 stosb=dfe post=848
(XEN)  pre=7ce stosl=e8c post=831
(XEN)  pre=7b3 stosq=daa post=81d
(XEN)  pre=7f8 avx=156b post=6d0
(XEN) L2 w/o flush:
(XEN)  pre=5e24f sse2=7ff69 post=40af6
(XEN)  pre=3c515 stosb=4ddc7 post=9f3bf
(XEN)  pre=3cfb9 stosl=4da3c post=9efcb
(XEN)  pre=3bc5c stosq=4dbd3 post=9ec1c
(XEN)  pre=3c927 avx=a6cc0 post=42aa1
(XEN)  pre=3cf6d sse2=7fe95 post=4223d
(XEN)  pre=3c55f stosb=4e035 post=9f25d
(XEN)  pre=3cd63 stosl=4dd8b post=9f14f
(XEN)  pre=3b8d3 stosq=4de1f post=9f050
(XEN)  pre=3c66f avx=a6cad post=43886
(XEN)  pre=3c990 sse2=7feb9 post=42a6d
(XEN)  pre=3c1a0 stosb=4dd45 post=9f04a
(XEN)  pre=3d0ae stosl=4de64 post=9f02b
(XEN)  pre=3c0ae stosq=4d9dc post=9edb8
(XEN)  pre=3d0b4 avx=a6c97 post=41e67
(XEN) L2 w/ flush:
(XEN)  pre=39194 sse2=55efd post=3a2a9
(XEN)  pre=391cf stosb=5a8bc post=95a1d
(XEN)  pre=3913c stosl=5a5a7 post=8fced
(XEN)  pre=3938b stosq=5a68b post=967d4
(XEN)  pre=38232 avx=9d328 post=3a4fe
(XEN)  pre=393a6 sse2=56027 post=3a2fe
(XEN)  pre=3917a stosb=59f3f post=9518a
(XEN)  pre=390c2 stosl=5a0f3 post=951bc
(XEN)  pre=3922e stosq=5a7f6 post=952db
(XEN)  pre=39443 avx=9d407 post=3a4c4
(XEN)  pre=38635 sse2=55fb8 post=3a557
(XEN)  pre=38237 stosb=5a2fb post=92f3a
(XEN)  pre=3914e stosl=5a8e5 post=8bb48
(XEN)  pre=39058 stosq=5a5dc post=96726
(XEN)  pre=3913c avx=9d33d post=3a2d1


Romley (Sandybridge):

(XEN) erms=0 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=32k l2=256k
(XEN) L1 w/o flush:
(XEN)  pre=954 sse2=2958 post=798
(XEN)  pre=792 stosb=e7c post=af2
(XEN)  pre=732 stosl=b70 post=b28
(XEN)  pre=768 stosq=bdc post=ac2
(XEN)  pre=74a avx=26ac post=750
(XEN)  pre=774 sse2=27d2 post=708
(XEN)  pre=738 stosb=e4c post=ada
(XEN)  pre=714 stosl=b22 post=a98
(XEN)  pre=732 stosq=b34 post=ac2
(XEN)  pre=714 avx=2730 post=714
(XEN)  pre=714 sse2=27d8 post=70e
(XEN)  pre=72c stosb=e3a post=ab0
(XEN)  pre=714 stosl=b04 post=a74
(XEN)  pre=732 stosq=b04 post=a92
(XEN)  pre=714 avx=4fc8 post=714
(XEN) L1 w/ flush:
(XEN)  pre=7c8 sse2=2784 post=708
(XEN)  pre=72c stosb=2100 post=ca8
(XEN)  pre=80a stosl=1ed2 post=c1e
(XEN)  pre=7f2 stosq=2052 post=c90
(XEN)  pre=714 avx=2652 post=714
(XEN)  pre=7d4 sse2=2772 post=732
(XEN)  pre=7c8 stosb=2466 post=be2
(XEN)  pre=828 stosl=2004 post=c72
(XEN)  pre=7d4 stosq=20b2 post=c96
(XEN)  pre=81c avx=2682 post=714
(XEN)  pre=7d4 sse2=2754 post=72c
(XEN)  pre=7c8 stosb=2358 post=bca
(XEN)  pre=828 stosl=1ecc post=c48
(XEN)  pre=7c8 stosq=20b8 post=c00
(XEN)  pre=81c avx=26f4 post=714
(XEN) L2 w/o flush:
(XEN)  pre=9cf6 sse2=14b9e post=5706
(XEN)  pre=7ce0 stosb=6f00 post=74a6
(XEN)  pre=78ea stosl=5e26 post=79c8
(XEN)  pre=7926 stosq=5ec2 post=7848
(XEN)  pre=7920 avx=1410c post=5c70
(XEN)  pre=7bde sse2=14a06 post=5dea
(XEN)  pre=7ab2 stosb=6dda post=78c0
(XEN)  pre=7a6a stosl=5f34 post=792c
(XEN)  pre=7752 stosq=6054 post=7bfc
(XEN)  pre=7974 avx=14172 post=5de4
(XEN)  pre=7a76 sse2=14a54 post=5dc0
(XEN)  pre=77d6 stosb=6cd8 post=779a
(XEN)  pre=774c stosl=5dcc post=7c38
(XEN)  pre=788a stosq=5e62 post=7a04
(XEN)  pre=7722 avx=16aca post=5e2c
(XEN) L2 w/ flush:
(XEN)  pre=9cea sse2=14172 post=571e
(XEN)  pre=9c3c stosb=113e2 post=6d50
(XEN)  pre=9d56 stosl=10926 post=6ca8
(XEN)  pre=9ca2 stosq=10950 post=6db6
(XEN)  pre=9d44 avx=13b06 post=5700
(XEN)  pre=9df8 sse2=141cc post=56a6
(XEN)  pre=9cc0 stosb=112a4 post=6ca8
(XEN)  pre=9d50 stosl=109c8 post=6ca2
(XEN)  pre=9c84 stosq=10a10 post=6cf0
(XEN)  pre=9c84 avx=13b30 post=56e8
(XEN)  pre=9cde sse2=141ea post=579c
(XEN)  pre=9c7e stosb=11370 post=6c2a
(XEN)  pre=9d44 stosl=108de post=6c3c
(XEN)  pre=9bf4 stosq=1096e post=6ccc
(XEN)  pre=9c7e avx=13b18 post=56ac


Westmere:

(XEN) erms=0 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=32k l2=256k
(XEN) L1 w/o flush:
(XEN)  pre=1184 sse2=2058 post=c60
(XEN)  pre=ad4 stosb=b60 post=1a24
(XEN)  pre=9d4 stosl=874 post=1348
(XEN)  pre=9e8 stosq=8d4 post=dd0
(XEN)  pre=9dc sse2=1dfc post=9e8
(XEN)  pre=9e8 stosb=a6c post=da4
(XEN)  pre=9d4 stosl=854 post=dd4
(XEN)  pre=9e8 stosq=8a4 post=d3c
(XEN)  pre=9d8 sse2=1e1c post=9ec
(XEN)  pre=9e8 stosb=a44 post=cc8
(XEN)  pre=9d4 stosl=81c post=d0c
(XEN)  pre=9ec stosq=810 post=cc8
(XEN) L1 w/ flush:
(XEN)  pre=b18 sse2=196c post=a84
(XEN)  pre=b08 stosb=15b8 post=116c
(XEN)  pre=b10 stosl=1440 post=163c
(XEN)  pre=a48 stosq=13d8 post=13b4
(XEN)  pre=b1c sse2=199c post=a3c
(XEN)  pre=bb8 stosb=15c4 post=12e8
(XEN)  pre=b0c stosl=1324 post=1430
(XEN)  pre=a48 stosq=135c post=12c4
(XEN)  pre=b1c sse2=199c post=a3c
(XEN)  pre=b18 stosb=1818 post=1320
(XEN)  pre=b10 stosl=1324 post=11bc
(XEN)  pre=a48 stosq=135c post=122c
(XEN) L2 w/o flush:
(XEN)  pre=8e20 sse2=f490 post=504c
(XEN)  pre=77a4 stosb=7804 post=6854
(XEN)  pre=778c stosl=7280 post=636c
(XEN)  pre=7594 stosq=7234 post=60c8
(XEN)  pre=70bc sse2=f3c4 post=55e0
(XEN)  pre=7014 stosb=77e8 post=5f68
(XEN)  pre=73f8 stosl=7264 post=62b8
(XEN)  pre=72ec stosq=7208 post=62fc
(XEN)  pre=6d80 sse2=f370 post=51a0
(XEN)  pre=6e34 stosb=7804 post=5f84
(XEN)  pre=7058 stosl=723c post=5fb8
(XEN)  pre=6f1c stosq=725c post=6188
(XEN) L2 w/ flush:
(XEN)  pre=8e48 sse2=cbc4 post=5034
(XEN)  pre=8d5c stosb=999c post=58d0
(XEN)  pre=8da0 stosl=912c post=590c
(XEN)  pre=8c10 stosq=8f80 post=5a0c
(XEN)  pre=8e10 sse2=cbd0 post=5030
(XEN)  pre=8cb0 stosb=9878 post=5960
(XEN)  pre=8de4 stosl=9060 post=58e4
(XEN)  pre=8c0c stosq=8fa0 post=5a10
(XEN)  pre=8d4c sse2=cbd0 post=502c
(XEN)  pre=8cf8 stosb=9834 post=58f0
(XEN)  pre=8de4 stosl=90d0 post=58e4
(XEN)  pre=8c0c stosq=9178 post=5998


Latitude E6410 (Sandybridge):

(XEN) erms=0 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=32k l2=256k
(XEN) L1 w/o flush:
(XEN)  pre=68d sse2=3c06 post=460
(XEN)  pre=41f stosb=8a0 post=823
(XEN)  pre=413 stosl=6ae post=789
(XEN)  pre=413 stosq=6e3 post=78f
(XEN)  pre=413 sse2=3989 post=410
(XEN)  pre=422 stosb=81d post=771
(XEN)  pre=413 stosl=675 post=77d
(XEN)  pre=3f9 stosq=667 post=6fb
(XEN)  pre=437 sse2=38b7 post=416
(XEN)  pre=407 stosb=802 post=727
(XEN)  pre=407 stosl=65e post=754
(XEN)  pre=404 stosq=65b post=6ef
(XEN) L1 w/ flush:
(XEN)  pre=5b4 sse2=20ca post=433
(XEN)  pre=55f stosb=15a2 post=861
(XEN)  pre=565 stosl=1252 post=861
(XEN)  pre=559 stosq=1444 post=84d
(XEN)  pre=57c sse2=21ae post=436
(XEN)  pre=55f stosb=157e post=897
(XEN)  pre=56d stosl=1255 post=83b
(XEN)  pre=657 stosq=1282 post=86d
(XEN)  pre=565 sse2=21bd post=43a
(XEN)  pre=57c stosb=153d post=88d
(XEN)  pre=56b stosl=1247 post=87c
(XEN)  pre=573 stosq=1258 post=87f
(XEN) L2 w/o flush:
(XEN)  pre=602b sse2=1d4d4 post=3669
(XEN)  pre=4a6c stosb=4b79 post=44b4
(XEN)  pre=4976 stosl=4383 post=48d0
(XEN)  pre=4d95 stosq=435d post=47ba
(XEN)  pre=4bf3 sse2=1d333 post=39f7
(XEN)  pre=4bed stosb=4b3c post=4671
(XEN)  pre=5003 stosl=435d post=4de8
(XEN)  pre=4f0a stosq=4377 post=4874
(XEN)  pre=4d1e sse2=1d368 post=3e6e
(XEN)  pre=4f25 stosb=4b4a post=47a5
(XEN)  pre=4abf stosl=4316 post=47cc
(XEN)  pre=4f19 stosq=4351 post=48bb
(XEN) L2 w/ flush:
(XEN)  pre=60cb sse2=10310 post=3672
(XEN)  pre=60ce stosb=956c post=436b
(XEN)  pre=603d stosl=8a70 post=438f
(XEN)  pre=5fe1 stosq=876d post=442f
(XEN)  pre=60f8 sse2=103dc post=36aa
(XEN)  pre=6010 stosb=94db post=436e
(XEN)  pre=60fe stosl=8a7c post=43a7
(XEN)  pre=605b stosq=876d post=4473
(XEN)  pre=6093 sse2=10485 post=36b9
(XEN)  pre=604c stosb=93c4 post=43b3
(XEN)  pre=60b3 stosl=8c03 post=435f
(XEN)  pre=607e stosq=895c post=43fc


Tulsa (Fam0f Xeon (7100?)):

(XEN) erms=0 fsrm=0 fzrm=0 fsrs=0 fsrcs=0 l1d=16k l2=1024k
(XEN) L1 w/o flush:
(XEN)  pre=caf sse2=3cd4 post=b39
(XEN)  pre=b28 stosb=192b post=1485
(XEN)  pre=b9f stosl=d7b post=d37
(XEN)  pre=b28 stosq=c6b post=c8d
(XEN)  pre=b17 sse2=3223 post=ae4
(XEN)  pre=a8f stosb=bd2 post=b4a
(XEN)  pre=a8f stosl=aa0 post=af5
(XEN)  pre=ab1 stosq=a8f post=be3
(XEN)  pre=ac2 sse2=3212 post=ae4
(XEN)  pre=a8f stosb=bc1 post=ae4
(XEN)  pre=aa0 stosl=a6d post=ad3
(XEN)  pre=aa0 stosq=a6d post=ae4
(XEN) L1 w/ flush:
(XEN)  pre=b06 sse2=628c post=c27
(XEN)  pre=ae4 stosb=958c post=14eb
(XEN)  pre=b17 stosl=959d post=16a5
(XEN)  pre=b06 stosq=9669 post=15d9
(XEN)  pre=ae4 sse2=6127 post=bc1
(XEN)  pre=a7e stosb=9636 post=15b7
(XEN)  pre=aa0 stosl=92e4 post=1474
(XEN)  pre=a8f stosq=95e1 post=161d
(XEN)  pre=ab1 sse2=62bf post=c05
(XEN)  pre=a8f stosb=96e0 post=180a
(XEN)  pre=a8f stosl=9702 post=15a6
(XEN)  pre=aa0 stosq=9438 post=15d9
(XEN) L2 w/o flush:
(XEN)  pre=5342d sse2=d49ed post=21c8c
(XEN)  pre=25498 stosb=69d4b post=315c5
(XEN)  pre=249b4 stosl=6982e post=3166f
(XEN)  pre=2470c stosq=69e28 post=30bbe
(XEN)  pre=23f25 sse2=d4536 post=1fb14
(XEN)  pre=23e26 stosb=6ad2a post=30a6a
(XEN)  pre=23fcf stosl=68ed1 post=2fd22
(XEN)  pre=23e48 stosq=69be6 post=308e3
(XEN)  pre=23e9d sse2=d4459 post=20c9c
(XEN)  pre=23f69 stosb=6a70e post=30c9b
(XEN)  pre=24035 stosl=69069 post=302e9
(XEN)  pre=258fa stosq=69aa3 post=30cbd
(XEN) L2 w/ flush:
(XEN)  pre=263cd sse2=1306a2 post=21bd1
(XEN)  pre=265fe stosb=27aeb2 post=3177f
(XEN)  pre=26466 stosl=27f516 post=311c9
(XEN)  pre=26004 stosq=27cb40 post=3153d
(XEN)  pre=25641 sse2=13031d post=21b16
(XEN)  pre=26411 stosb=27f57c post=31ecd
(XEN)  pre=262f0 stosl=27b5bc post=315a3
(XEN)  pre=25f49 stosq=27b974 post=312a6
(XEN)  pre=25520 sse2=1310ed post=21b38
(XEN)  pre=2617a stosb=27d107 post=314d7
(XEN)  pre=261ad stosl=27bd81 post=309f3
(XEN)  pre=25ff3 stosq=27dff8 post=3164d


* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-14  8:12   ` Jan Beulich
@ 2021-04-15 16:21     ` Andrew Cooper
  2021-04-21 13:55       ` Jan Beulich
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Cooper @ 2021-04-15 16:21 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, xen-devel

On 14/04/2021 09:12, Jan Beulich wrote:
> On 13.04.2021 15:17, Andrew Cooper wrote:
>> Do you have actual numbers from these experiments?
> Attached is the collected raw output from a number of systems.

Wow, Tulsa is vintage.  Is that new enough to have nonstop_tsc?

It's also quite possibly old enough to fail Linux's REP_GOOD check,
which is something we're missing in Xen.

>>   I've seen your patch
>> from the thread, but at a minimum its missing some hunks adding new
>> CPUID bits.
> It's not missing hunks - these additions are in a prereq patch that
> I meant to post together with whatever this analysis would lead to.
> If you think I should submit the prereqs ahead of time, I can of
> course do so.

Well - it's necessary for anyone wanting to compile your patch.

All the bits seem like they're ones we ought to mirror through to guests
irrespective of optimisations taken in Xen, so I don't see a problem
with such a patch going in stand-alone.

>>   I do worry however whether the testing is likely to be
>> realistic for non-idle scenarios.
> Of course it's not going to be - in non-idle scenarios we'll always
> be somewhere in the middle. Therefore I wanted to have numbers at
> the edges (hot and cold cache respectively), as any other numbers
> are going to be much harder to obtain in a way that they would
> actually be meaningful (and hence reasonably stable).

In non-idle scenarios, all numbers can easily be worse across the board
than as measured.

Cacheline snoops, and in particular repeated snoops during the
clear/copy operation, will cause far worse overheads than simply being
cache-cold to begin with.

>> It is very little surprise that AVX-512 on Skylake is poor.  The
>> frequency hit from using %zmm is staggering.  IceLake is expected to be
>> better, but almost certainly won't exceed REP MOVSB, which is optimised
>> in microcode for the data width of the CPU.
> Right, much like AVX has improved but didn't get anywhere near
> REP MOVS.

The other thing I forgot to mention is the legacy/VEX pipeline stall
penalty, which will definitely dwarf short operations, and on some CPUs
doesn't even amortise itself over a 4k operation.

Whether a vector algorithm suffers a stall or not typically depends on
the instructions last executed in guest context.

Furthermore, while on the current CPU you might manage to get a vector
algorithm to be faster than an integer one, you will be forcing a
frequency drop on every other CPU in the socket, and the net hit to the
system can be worse than just using the integer algorithm to begin with.


IMO, vector optimised algorithms are a minefield we don't want to go
wandering in, particularly as ERMS is common these days, and here to stay.

>> For memset(), please don't move in the direction of memcpy().  memcpy()
>> is problematic because the common case is likely to be a multiple of 8
>> bytes, meaning that we feed 0 into the REP MOVSB, and this is a hit wanting
>> avoiding.
> And you say this despite me having pointed out that REP STOSL may
> be faster in a number of cases? Or do you mean to suggest we should
> branch around the trailing REP {MOV,STO}SB?
>
>>   The "Fast Zero length $FOO" bits on future parts indicate
>> when passing %ecx=0 is likely to be faster than branching around the
>> invocation.
> IOW down the road we could use alternatives patching to remove such
> branches. But this of course is only if we don't end up using
> exclusively REP MOVSB / REP STOSB there anyway, as you seem to be
> suggesting ...
>
>> With ERMS/etc, our logic should be a REP MOVSB/STOSB only, without any
>> cleverness about larger word sizes.  The Linux forms do this fairly well
>> already, and probably better than Xen, although there might be some room
>> for improvement IMO.
> ... here.
>
> As to the Linux implementations - for memcpy_erms() I don't think
> I see any room for improvement in the function itself. We could do
> alternatives patching somewhat differently (and I probably would).
> For memset_erms() the tiny bit of improvement over Linux'es code
> that I would see is to avoid the partial register access when
> loading %al. But to be honest - in both cases I wouldn't have
> bothered looking at their code anyway, if you hadn't pointed me
> there.

Answering multiple of the points together.

Yes, the partial register access on %al was one thing I spotted, and
movzbl would be an improvement.  The alternatives are a bit weird, but
they're best as they are IMO.  It makes a useful enough difference to
backtraces/etc, and unconditional jmp's are about as close to free as
you can get these days.
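
As a concrete illustration (my sketch, not the Linux or Xen code; it
assumes size_t is in scope), the REP STOSB core can take the fill byte
as an already zero-extended value, so the compiler can emit a MOVZBL
rather than a partial write to %al:

    static inline void *memset_erms_sketch(void *s, int c, size_t n)
    {
        void *ret = s;

        /* Zero-extending the fill byte here avoids the partial %al access. */
        asm volatile ( "rep stosb"
                       : "+D" (s), "+c" (n)
                       : "a" ((unsigned int)(unsigned char)c)
                       : "memory" );

        return ret;
    }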

On an ERMS system, we want to use REP MOVSB unilaterally.  It is my
understanding that it is faster across the board than any algorithm
variation trying to use wider accesses.

For non-ERMS systems, the split MOVSQ/MOVSB is still a win, but my
expectation is that conditionally jumping over the latter MOVSB would be
a win.
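
A hedged sketch of that shape (my illustration, not the current Xen
memcpy(); size_t assumed in scope), with the conditional jump over the
trailing REP MOVSB written out:

    static void *memcpy_movs_sketch(void *dst, const void *src, size_t n)
    {
        void *ret = dst;
        size_t quads = n >> 3, rest = n & 7;

        asm volatile ( "rep movsq"
                       : "+D" (dst), "+S" (src), "+c" (quads)
                       :: "memory" );

        /*
         * On non-ERMS parts, branching over a zero-length REP MOVSB is
         * expected to be cheaper than executing it with a zero count.
         */
        if ( rest )
            asm volatile ( "rep movsb"
                           : "+D" (dst), "+S" (src), "+c" (rest)
                           :: "memory" );

        return ret;
    }

On hardware advertising the "fast zero length" bits discussed below, the
if() could in principle be patched out again via alternatives.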

The "Fast zero length" CPUID bits don't exist for no reason, and while
passing 0 into memcpy/cmp() is exceedingly rare - possibly non-existent
- and not worth optimising, passing a multiple of 8 in probably is worth
optimising.  (Obviously, this depends on the underlying mem*() functions
seeing a multiple of 8 for a meaningful number of their inputs, but I'd
expect this to be the case).

>> It is worth noting that we have extra variations of memset/memcpy where
>> __builtin_memcpy() gets expanded inline, and the result is a
>> compiler-chosen sequence, and doesn't hit any of our optimised
>> sequences.  I'm not sure what to do about this, because there is surely
>> a larger win from the cases which can be turned into a single mov, or an
>> elided store/copy, than using a potentially inefficient sequence in the
>> rare cases.  Maybe there is room for a fine-tuning option to say "just
>> call memset() if you're going to expand it inline".
> You mean "just call memset() instead of expanding it inline"?

I think what I really mean is "if the result of optimising memset() is
going to result in a REP instruction, call memset() instead".

You want the compiler to do conversion to single mov's/etc, but what you
don't want is ...

> If the inline expansion is merely REP STOS, I'm not sure we'd
> actually gain anything from keeping the compiler from expanding it
> inline. But if the inline construct was more complicated (as I
> observe e.g. in map_vcpu_info() with gcc 10), then it would likely
> be nice if there was such a control. I'll take note to see if I
> can find anything.

... this.  What GCC currently expands inline is a REP MOVS{L,Q}, with
the first and final element done manually ahead of the REP, presumably
for prefetching/pagewalk reasons.

The exact sequence varies due to the surrounding code, and while it's
probably a decent stab for -O2/3 on "any arbitrary 64bit CPU", it's not
helpful when we've got a system-optimised mem*() to hand.

> But this isn't relevant for {clear,copy}_page().
>
>> For all set/copy operations, whether you want non-temporal or not
>> depends on when/where the lines are next going to be consumed.  Page
>> scrubbing in idle context is the only example I can think of where we
>> aren't plausibly going to consume the destination imminently.  Even
>> clear/copy page in a hypercall doesn't want to be non-temporal, because
>> chances are good that the vcpu is going to touch the page on return.
> I'm afraid the situation isn't as black-and-white. Take HAP or
> IOMMU page table allocations, for example: They need to clear the
> full page, yes. But often this is just to then insert one single
> entry, i.e. re-use exactly one of the cache lines.

I consider this an orthogonal problem.  When we're not double-scrubbing
most memory Xen uses, most of this goes away.

Even if we do need to scrub a pagetable to use, we're never(?) complete
at the end of the scrub, and need to make further writes imminently. 
These never want non-temporal accesses, because you never want to write
into a recently-evicted line, and there's no plausible way that trying to
mix and match temporal and non-temporal stores is going to be a
worthwhile optimisation to try.

> Or take initial
> population of guest RAM: The larger the guest, the less likely it
> is for every individual page to get accessed again before its
> contents get evicted from the caches. Judging from what Ankur said,
> once we get to around L3 capacity, MOVNT / CLZERO may be preferable
> there.

Initial population of guests doesn't matter at all, because nothing
(outside of the single threaded toolstack process issuing the
construction hypercalls) is going to touch the pages until the VM is
unpaused.  The only async accesses I can think of are xenstored and
xenconsoled starting up, and those are after the RAM is populated.

In cases like this, current might be a good way of choosing between
temporal and non-temporal accesses.

As before, not double scrubbing will further improve things.

> I think in cases where we don't know how the page is going to be
> used subsequently, we ought to favor latency over cache pollution
> avoidance.

I broadly agree.  I think the cases where it's reasonably safe to use the
pollution-avoidance are fairly obvious, and there is a steep cost to
wrongly-using non-temporal accesses.

> But in cases where we know the subsequent usage pattern,
> we may want to direct scrubbing / zeroing accordingly. Yet of
> course it's not very helpful that there's no way to avoid
> polluting caches and still have reasonably low latency, so using
> some heuristics may be unavoidable.

I don't think any heuristics beyond current, or possibly
d->creation_finished are going to be worthwhile, but I think these alone
can net us a decent win.

> And of course another goal of mine would be to avoid double zeroing
> of pages: When scrubbing uses clear_page() anyway, there's no point
> in the caller then calling clear_page() again. IMO, just like we
> have xzalloc(), we should also have MEMF_zero. Internally the page
> allocator can know whether a page was already scrubbed, and it
> does know for sure whether scrubbing means zeroing.

I think we've discussed this before.  I'm in favour, but I'm absolutely
certain that that wants to be spelled MEMF_dirty (or equiv), so forgetting
it fails safe, and code which is using dirty allocations is clearly
identified and can be audited easily.

~Andrew




* Re: x86: memset() / clear_page() / page scrubbing
  2021-04-15 16:21     ` Andrew Cooper
@ 2021-04-21 13:55       ` Jan Beulich
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Beulich @ 2021-04-21 13:55 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, xen-devel

On 15.04.2021 18:21, Andrew Cooper wrote:
> On 14/04/2021 09:12, Jan Beulich wrote:
>> On 13.04.2021 15:17, Andrew Cooper wrote:
>>> Do you have actual numbers from these experiments?
>> Attached is the collected raw output from a number of systems.
> 
> Wow Tulsa is vintage.  Is that new enough to have nonstop_tsc ?

No.

>>> For memset(), please don't move in the direction of memcpy().  memcpy()
>>> is problematic because the common case is likely to be a multiple of 8
>>> bytes, meaning that we feed 0 into the REP MOVSB, and this is a hit wanting
>>> avoiding.
>> And you say this despite me having pointed out that REP STOSL may
>> be faster in a number of cases? Or do you mean to suggest we should
>> branch around the trailing REP {MOV,STO}SB?
>>
>>>   The "Fast Zero length $FOO" bits on future parts indicate
>>> when passing %ecx=0 is likely to be faster than branching around the
>>> invocation.
>> IOW down the road we could use alternatives patching to remove such
>> branches. But this of course is only if we don't end up using
>> exclusively REP MOVSB / REP STOSB there anyway, as you seem to be
>> suggesting ...
>>
>>> With ERMS/etc, our logic should be a REP MOVSB/STOSB only, without any
>>> cleverness about larger word sizes.  The Linux forms do this fairly well
>>> already, and probably better than Xen, although there might be some room
>>> for improvement IMO.
>> ... here.
>>
>> As to the Linux implementations - for memcpy_erms() I don't think
>> I see any room for improvement in the function itself. We could do
>> alternatives patching somewhat differently (and I probably would).
>> For memset_erms() the tiny bit of improvement over Linux'es code
>> that I would see is to avoid the partial register access when
>> loading %al. But to be honest - in both cases I wouldn't have
>> bothered looking at their code anyway, if you hadn't pointed me
>> there.
> 
> Answering multiple of the points together.
> 
> Yes, the partial register access on %al was one thing I spotted, and
> movzbl would be an improvement.  The alternatives are a bit weird, but
> they're best as they are IMO.  It makes a useful enough difference to
> backtraces/etc, and unconditional jmp's are about as close to free as
> you can get these days.
> 
> On an ERMS system, we want to use REP MOVSB unilaterally.  It is my
> understanding that it is faster across the board than any algorithm
> variation trying to use wider accesses.

Not according to the numbers I've collected. There are cases where
clearing a full page via REP STOS{L,Q} is (often just a little)
faster. Whether this also applies to MOVS I can't tell.
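
For reference, the REP STOSQ shape being compared against the MOVNTI
("sse2") variant is essentially the following (a sketch of mine, not
necessarily the code that produced the numbers above; PAGE_SIZE as
provided by the Xen headers):

    /* Minimal REP STOSQ based clear_page() sketch. */
    static void clear_page_stosq(void *page)
    {
        unsigned long count = PAGE_SIZE / sizeof(unsigned long);

        asm volatile ( "rep stosq"
                       : "+D" (page), "+c" (count)
                       : "a" (0UL)
                       : "memory" );
    }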

>>> It is worth noting that we have extra variations of memset/memcpy where
>>> __builtin_memcpy() gets expanded inline, and the result is a
>>> compiler-chosen sequence, and doesn't hit any of our optimised
>>> sequences.  I'm not sure what to do about this, because there is surely
>>> a larger win from the cases which can be turned into a single mov, or an
>>> elided store/copy, than using a potentially inefficient sequence in the
>>> rare cases.  Maybe there is room for a fine-tuning option to say "just
>>> call memset() if you're going to expand it inline".
>> You mean "just call memset() instead of expanding it inline"?
> 
> I think what I really mean is "if the result of optimising memset() is
> going to result in a REP instruction, call memset() instead".
> 
> You want the compiler to do conversion to single mov's/etc, but what you
> don't want is ...
> 
>> If the inline expansion is merely REP STOS, I'm not sure we'd
>> actually gain anything from keeping the compiler from expanding it
>> inline. But if the inline construct was more complicated (as I
>> observe e.g. in map_vcpu_info() with gcc 10), then it would likely
>> be nice if there was such a control. I'll take note to see if I
>> can find anything.
> 
> ... this.  What GCC currently expands inline is a REP MOVS{L,Q}, with
> the first and final element done manually ahead of the REP, presumably
> for prefetching/pagewalk reasons.

Not sure about the reasons, but the compiler doesn't always do it
like this - there are also cases of plain REP STOSQ. My initial
guess is that the splitting of the first and last elements happens when the
compiler couldn't prove the buffer is 8-byte aligned and a
multiple of 8 bytes in size.
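
For what it's worth, gcc's x86 backend does have a -mstringop-strategy=
option (plus finer grained -mmemcpy-strategy= / -mmemset-strategy= forms)
which looks like it could provide such a control, e.g. forcing out-of-line
calls with -mstringop-strategy=libcall. Whether the gcc versions we
support accept it, and whether the granularity is good enough to keep the
single-MOV conversions, would need checking.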

>>> For all set/copy operations, whether you want non-temporal or not
>>> depends on when/where the lines are next going to be consumed.  Page
>>> scrubbing in idle context is the only example I can think of where we
>>> aren't plausibly going to consume the destination imminently.  Even
>>> clear/copy page in a hypercall doesn't want to be non-temporal, because
>>> chances are good that the vcpu is going to touch the page on return.
>> I'm afraid the situation isn't as black-and-white. Take HAP or
>> IOMMU page table allocations, for example: They need to clear the
>> full page, yes. But often this is just to then insert one single
>> entry, i.e. re-use exactly one of the cache lines.
> 
> I consider this an orthogonal problem.  When we're not double-scrubbing
> most memory Xen uses, most of this goes away.
> 
> Even if we do need to scrub a pagetable to use, we're never(?) complete
> at the end of the scrub, and need to make further writes imminently. 

Right, but often to just one of the cache lines.

> These never want non-temporal accesses, because you never want to write
> into recently-evicted line, and there's no plausible way that trying to
> mix and match temporal and non-temporal stores is going to be a
> worthwhile optimisation to try.

Is a single MOV following (with some distance and with SFENCE in
between) a sequence of MOVNTI going to have an effect worse than
the same MOV trying to store to a cache line that's not in cache?
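
To make the pattern concrete, a hedged sketch of a MOVNTI based scrub
(loosely modelled on the "sse2" variant appearing in the numbers; loop
shape and names are mine):

    /*
     * Non-temporal page scrub sketch.  The trailing SFENCE orders the
     * MOVNTI stores ahead of any subsequent ordinary MOV to the page.
     */
    static void scrub_page_nt_sketch(void *page)
    {
        unsigned long *p = page;
        unsigned int i;

        for ( i = 0; i < PAGE_SIZE / sizeof(*p); i += 4 )
            asm volatile ( "movnti %1, (%0)\n\t"
                           "movnti %1, 8(%0)\n\t"
                           "movnti %1, 16(%0)\n\t"
                           "movnti %1, 24(%0)"
                           :: "r" (p + i), "r" (0UL)
                           : "memory" );

        asm volatile ( "sfence" ::: "memory" );
    }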

>> Or take initial
>> population of guest RAM: The larger the guest, the less likely it
>> is for every individual page to get accessed again before its
>> contents get evicted from the caches. Judging from what Ankur said,
>> once we get to around L3 capacity, MOVNT / CLZERO may be preferable
>> there.
> 
> Initial population of guests doesn't matter at all, because nothing
> (outside of the single threaded toolstack process issuing the
> construction hypercalls) is going to touch the pages until the VM is
> unpaused.  The only async accesses I can think of are xenstored and
> xenconsoled starting up, and those are after the RAM is populated.
> 
> In cases like this, current might be a good way of choosing between
> temporal and non-temporal accesses.
> 
> As before, not double scrubbing will further improve things.
> 
>> I think in cases where we don't know how the page is going to be
>> used subsequently, we ought to favor latency over cache pollution
>> avoidance.
> 
> I broadly agree.  I think the cases where it's reasonably safe to use the
> pollution-avoidance are fairly obvious, and there is a steep cost to
> wrongly-using non-temporal accesses.
> 
>> But in cases where we know the subsequent usage pattern,
>> we may want to direct scrubbing / zeroing accordingly. Yet of
>> course it's not very helpful that there's no way to avoid
>> polluting caches and still have reasonably low latency, so using
>> some heuristics may be unavoidable.
> 
> I don't think any heuristics beyond current, or possibly
> d->creation_finished are going to be worthwhile, but I think these alone
> can net us a decent win.
> 
>> And of course another goal of mine would be to avoid double zeroing
>> of pages: When scrubbing uses clear_page() anyway, there's no point
>> in the caller then calling clear_page() again. IMO, just like we
>> have xzalloc(), we should also have MEMF_zero. Internally the page
>> allocator can know whether a page was already scrubbed, and it
>> does know for sure whether scrubbing means zeroing.
> 
> I think we've discussed this before.  I'm in favour, but I'm absolutely
> certain that that wants to be spelled MEMF_dirty (or equiv), so forgetting
> it fails safe, and code which is using dirty allocations is clearly
> identified and can be audited easily.

Well, there's a difference between scrubbing and zeroing. We already
have MEMF_no_scrub. And we already force callers to think about
whether they want zeroed memory (outside of the page allocator), by
having both xmalloc() and xzalloc() (and their relatives). So while
for scrubbing I could see your point, I'm not sure we should force
everyone who doesn't need zeroed pages to pass MEMF_dirty (or
whatever the name, as I don't particularly like this one). It's quite
the other way around - right now no pages come out of the page
allocator in a known state content-wise. Parties presently calling
clear_page() right afterwards could easily, cleanly, and in a risk-
free manner be converted to use MEMF_zero.
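
To illustrate the kind of conversion I have in mind (MEMF_zero is
hypothetical here; alloc_domheap_page(), clear_domain_page(), and
page_to_mfn() are the existing interfaces):

    /* Today: allocate, then unconditionally zero. */
    static struct page_info *alloc_zeroed_page(struct domain *d,
                                               unsigned int memflags)
    {
        struct page_info *pg = alloc_domheap_page(d, memflags);

        if ( pg )
            clear_domain_page(page_to_mfn(pg));

        /*
         * With MEMF_zero the clear_domain_page() call would simply go away:
         *     pg = alloc_domheap_page(d, memflags | MEMF_zero);
         * and the allocator would zero only pages not already scrubbed
         * to zero.
         */
        return pg;
    }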

Jan



Thread overview: 9+ messages
2021-04-08 13:58 x86: memset() / clear_page() / page scrubbing Jan Beulich
2021-04-09  6:08 ` Ankur Arora
2021-04-09  6:38   ` Jan Beulich
2021-04-09 21:01     ` Ankur Arora
2021-04-12  9:15       ` Jan Beulich
2021-04-13 13:17 ` Andrew Cooper
2021-04-14  8:12   ` Jan Beulich
2021-04-15 16:21     ` Andrew Cooper
2021-04-21 13:55       ` Jan Beulich
