From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Jan Beulich <jbeulich@suse.com>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Subject: Re: x86: memset() / clear_page() / page scrubbing
Date: Thu, 15 Apr 2021 17:21:09 +0100
Message-ID: <843e5c66-65e9-3b0b-8bcb-ad1d7df89d78@citrix.com>
In-Reply-To: <213c3706-5296-4673-dae2-12f9056ed73b@suse.com>

On 14/04/2021 09:12, Jan Beulich wrote:
> On 13.04.2021 15:17, Andrew Cooper wrote:
>> Do you have actual numbers from these experiments?
> Attached is the collected raw output from a number of systems.

Wow, Tulsa is vintage.  Is that new enough to have nonstop_tsc?

It's also quite possibly old enough to fail Linux's REP_GOOD check,
which is something we're missing in Xen.

>>   I've seen your patch
>> from the thread, but at a minimum its missing some hunks adding new
>> CPUID bits.
> It's not missing hunks - these additions are in a prereq patch that
> I meant to post together with whatever this analysis would lead to.
> If you think I should submit the prereqs ahead of time, I can of
> course do so.

Well - it's necessary for anyone wanting to compile your patch.

All the bits seem like they're ones we ought to mirror through to guests
irrespective of optimisations taken in Xen, so I don't see a problem
with such a patch going in stand-alone.

>>   I do worry however whether the testing is likely to be
>> realistic for non-idle scenarios.
> Of course it's not going to be - in non-idle scenarios we'll always
> be somewhere in the middle. Therefore I wanted to have numbers at
> the edges (hot and cold cache respectively), as any other numbers
> are going to be much harder to obtain in a way that they would
> actually be meaningful (and hence reasonably stable).

In non-idle scenarios, all numbers can easily be worse across the board
than as measured.

Cacheline snoops, and in particular repeated snoops during the
clear/copy operation, will cause far worse overheads than simply being
cache-cold to begin with.

>> It is very little surprise that AVX-512 on Skylake is poor.  The
>> frequency hit from using %zmm is staggering.  IceLake is expected to be
>> better, but almost certainly won't exceed REP MOVSB, which is optimised
>> in microcode for the data width of the CPU.
> Right, much like AVX has improved but didn't get anywhere near
> REP MOVS.

The other thing I forgot to mention is the legacy/VEX pipeline stall
penalty, which will definitely dwarf short operations, and on some CPUs
doesn't even amortise itself over a 4k operation.

Whether a vector algorithm suffers a stall or not typically depends on
the instructions last executed in guest context.

Furthermore, while on the current CPU you might manage to get a vector
algorithm to be faster than an integer one, you will be forcing a
frequency drop on every other CPU in the socket, and the net hit to the
system can be worse than just using the integer algorithm to begin with.


IMO, vector optimised algorithms are a minefield we don't want to go
wandering in, particularly as ERMS is common these days, and here to stay.

>> For memset(), please don't move in the direction of memcpy().  memcpy()
>> is problematic because the common case is likely to be a multiple of 8
>> bytes, meaning that we feed 0 into the REP MOVSB, and this is a hit we
>> want to avoid.
> And you say this despite me having pointed out that REP STOSL may
> be faster in a number of cases? Or do you mean to suggest we should
> branch around the trailing REP {MOV,STO}SB?
>
>>   The "Fast Zero length $FOO" bits on future parts indicate
>> when passing %ecx=0 is likely to be faster than branching around the
>> invocation.
> IOW down the road we could use alternatives patching to remove such
> branches. But this of course is only if we don't end up using
> exclusively REP MOVSB / REP STOSB there anyway, as you seem to be
> suggesting ...
>
>> With ERMS/etc, our logic should be a REP MOVSB/STOSB only, without any
>> cleverness about larger word sizes.  The Linux forms do this fairly well
>> already, and probably better than Xen, although there might be some room
>> for improvement IMO.
> ... here.
>
> As to the Linux implementations - for memcpy_erms() I don't think
> I see any room for improvement in the function itself. We could do
> alternatives patching somewhat differently (and I probably would).
> For memset_erms() the tiny bit of improvement over Linux'es code
> that I would see is to avoid the partial register access when
> loading %al. But to be honest - in both cases I wouldn't have
> bothered looking at their code anyway, if you hadn't pointed me
> there.

Answering several of the points together.

Yes, the partial register access on %al was one thing I spotted, and
movzbl would be an improvement.  The alternatives are a bit weird, but
they're best as they are IMO.  It makes a useful enough difference to
backtraces/etc, and unconditional jmp's are about as close to free as
you can get these days.
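
To illustrate (untested, and the %r9 scratch choice is just for the
sketch), the ERMS memset() then ends up as little more than:

    mov    %rdi, %r9         /* preserve dest for the return value */
    movzbl %sil, %eax        /* fill byte, without a partial-register write */
    mov    %rdx, %rcx        /* count */
    rep stosb
    mov    %r9, %rax
    ret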

On an ERMS system, we want to use REP MOVSB unilaterally.  It is my
understanding that it is faster across the board than any algorithm
variation trying to use wider accesses.

For non-ERMS systems, the split MOVSQ/MOVSB is still a win, but my
expectation is that conditionally jumping over the trailing MOVSB would
be a further win.
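
i.e. for the copy side, the shape I have in mind (again untested, and
assuming the usual SysV argument registers):

    mov    %rdi, %rax        /* return value: the original dest */
    mov    %rdx, %rcx
    shr    $3, %rcx
    rep movsq                /* bulk of the copy, 8 bytes at a time */
    and    $7, %edx
    jz     1f                /* skip the trailing REP MOVSB for multiple-of-8 lengths */
    mov    %edx, %ecx
    rep movsb                /* 0-7 tail bytes */
1:  ret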

The "Fast zero length" CPUID bits don't exist for no reason, and while
passing 0 into memcpy/cmp() is exceedingly rare - possibly non-existent
- and not worth optimising for, passing in a multiple of 8 probably is
worth optimising for.  (Obviously, this depends on the underlying mem*()
functions seeing a multiple of 8 for a meaningful number of their
inputs, but I'd expect this to be the case.)
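
And as you say, down the road alternatives patching could drop the
branch.  Purely as an illustration - the feature name below is a
placeholder for whatever the prereq CPUID patch ends up calling the
fast-zero-length MOVSB bit - the tail of the copy sketch above could
eventually become:

    and    $7, %edx
    mov    %edx, %ecx
    /* NOP out the branch on parts where a zero-length REP MOVSB is cheap. */
    ALTERNATIVE "jz 1f", "", X86_FEATURE_FZRM
    rep movsb
1:  ret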

>> It is worth noting that we have extra variations of memset/memcpy where
>> __builtin_memcpy() gets expanded inline, and the result is a
>> compiler-chosen sequence, and doesn't hit any of our optimised
>> sequences.  I'm not sure what to do about this, because there is surely
>> a larger win from the cases which can be turned into a single mov, or an
>> elided store/copy, than using a potentially inefficient sequence in the
>> rare cases.  Maybe there is room for a fine-tuning option to say "just
>> call memset() if you're going to expand it inline".
> You mean "just call memset() instead of expanding it inline"?

I think what I really mean is "if optimising memset() is going to result
in a REP instruction, call memset() instead".

You want the compiler to do the conversion to single movs/etc, but what
you don't want is ...

> If the inline expansion is merely REP STOS, I'm not sure we'd
> actually gain anything from keeping the compiler from expanding it
> inline. But if the inline construct was more complicated (as I
> observe e.g. in map_vcpu_info() with gcc 10), then it would likely
> be nice if there was such a control. I'll take note to see if I
> can find anything.

... this.  What GCC currently expands inline is a REP MOVS{L,Q}, with
the first and final element done manually ahead of the REP, presumably
for prefetching/pagewalk reasons.

The exact sequence varies due to the surrounding code, and while it's
probably a decent stab for -O2/3 on "any arbitrary 64bit CPU", it's not
helpful when we've got a system-optimised mem*() to hand.
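
For what it's worth, the closest existing knob I'm aware of (an
assumption on my part - I haven't checked how it interacts with the tiny
constant-size cases which fold to plain MOVs) is GCC's string-op
strategy control, i.e. something like:

    # Force out-of-line calls for string-op expansions; would need evaluating.
    CFLAGS += -mstringop-strategy=libcall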

> But this isn't relevant for {clear,copy}_page().
>
>> For all set/copy operations, whether you want non-temporal or not
>> depends on when/where the lines are next going to be consumed.  Page
>> scrubbing in idle context is the only example I can think of where we
>> aren't plausibly going to consume the destination imminently.  Even
>> clear/copy page in a hypercall doesn't want to be non-temporal, because
>> chances are good that the vcpu is going to touch the page on return.
> I'm afraid the situation isn't as black-and-white. Take HAP or
> IOMMU page table allocations, for example: They need to clear the
> full page, yes. But often this is just to then insert one single
> entry, i.e. re-use exactly one of the cache lines.

I consider this an orthogonal problem.  When we're not double-scrubbing
most memory Xen uses, most of this goes away.

Even if we do need to scrub a pagetable before use, we're never(?) done
at the end of the scrub, and need to make further writes imminently.
These never want non-temporal accesses, because you never want to write
into a recently-evicted line, and there's no plausible way that trying
to mix and match temporal and non-temporal stores is going to be a
worthwhile optimisation.

> Or take initial
> population of guest RAM: The larger the guest, the less likely it
> is for every individual page to get accessed again before its
> contents get evicted from the caches. Judging from what Ankur said,
> once we get to around L3 capacity, MOVNT / CLZERO may be preferable
> there.

Initial population of guests doesn't matter at all, because nothing
(outside of the single-threaded toolstack process issuing the
construction hypercalls) is going to touch the pages until the VM is
unpaused.  The only async accesses I can think of are xenstored and
xenconsoled starting up, and those are after the RAM is populated.

In cases like this, current might be a good way of choosing between
temporal and non-temporal accesses.

As before, not double scrubbing will further improve things.
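
For the cases where cache-avoidance genuinely is justified (idle-context
scrubbing being the obvious one), the sequence in question is only
something like this sketch (assuming 4k pages, and MOVNTI rather than
CLZERO as the flavour of choice):

    xor    %eax, %eax
    mov    $512, %ecx        /* 4096 bytes, 8 per store */
1:  movnti %rax, (%rdi)      /* cache-bypassing 8-byte store */
    add    $8, %rdi
    dec    %ecx
    jnz    1b
    sfence                   /* order the NT stores before the page is handed on */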

> I think in cases where we don't know how the page is going to be
> used subsequently, we ought to favor latency over cache pollution
> avoidance.

I broadly agree.  I think the cases where it's reasonably safe to use
the pollution-avoidance are fairly obvious, and there is a steep cost to
wrongly using non-temporal accesses.

> But in cases where we know the subsequent usage pattern,
> we may want to direct scrubbing / zeroing accordingly. Yet of
> course it's not very helpful that there's no way to avoid
> polluting caches and still have reasonably low latency, so using
> some heuristics may be unavoidable.

I don't think any heuristics beyond current, or possibly
d->creation_finished, are going to be worthwhile, but I think these
alone can net us a decent win.
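
In code form, the entirety of what I'd consider reasonable is roughly
this (names made up for the sketch - clear_page_hot/_cold standing for
the regular and the cache-avoiding variants respectively):

    static void clear_page_heuristic(struct domain *d, void *va)
    {
        /*
         * Clearing on behalf of a running vcpu of a fully-created domain
         * is very likely to be consumed soon: stay in the cache.
         */
        if ( d == current->domain && d->creation_finished )
            clear_page_hot(va);
        else
            clear_page_cold(va);   /* e.g. the MOVNTI loop above */
    }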

> And of course another goal of mine would be to avoid double zeroing
> of pages: When scrubbing uses clear_page() anyway, there's no point
> in the caller then calling clear_page() again. IMO, just like we
> have xzalloc(), we should also have MEMF_zero. Internally the page
> allocator can know whether a page was already scrubbed, and it
> does know for sure whether scrubbing means zeroing.

I think we've discussed this before.  I'm in favour, but I'm absolutely
certain that it wants to be spelled MEMF_dirty (or equiv), so that
forgetting it fails safe, and code which is using dirty allocations is
clearly identified and can be audited easily.
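
i.e. at the call sites (spelling subject to bikeshedding, and the
allocator plumbing omitted):

    struct page_info *pg;

    /* Default: the allocator guarantees zeroed contents, reusing the
     * idle-scrub result whenever it already knows the page is zero. */
    pg = alloc_domheap_pages(d, 0, 0);

    /* Explicit, grep-able opt-out for callers which genuinely don't care. */
    pg = alloc_domheap_pages(d, 0, MEMF_dirty);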

~Andrew