On 13.04.2021 15:17, Andrew Cooper wrote:
> Do you have actual numbers from these experiments?

Attached is the collected raw output from a number of systems.

> I've seen your patch
> from the thread, but at a minimum it's missing some hunks adding new
> CPUID bits.

It's not missing hunks - these additions are in a prereq patch that I
meant to post together with whatever this analysis would lead to. If you
think I should submit the prereqs ahead of time, I can of course do so.

> I do worry however whether the testing is likely to be
> realistic for non-idle scenarios.

Of course it's not going to be - in non-idle scenarios we'll always be
somewhere in the middle. Therefore I wanted to have numbers at the edges
(hot and cold cache respectively), as any other numbers are going to be
much harder to obtain in a way that makes them actually meaningful (and
hence reasonably stable).

> It is very little surprise that AVX-512 on Skylake is poor.  The
> frequency hit from using %zmm is staggering.  IceLake is expected to be
> better, but almost certainly won't exceed REP MOVSB, which is optimised
> in microcode for the data width of the CPU.

Right, much like AVX has improved but didn't get anywhere near REP MOVS.

> For memset(), please don't move in the direction of memcpy().  memcpy()
> is problematic because the common case is likely to be a multiple of 8
> bytes, meaning that we feed 0 into the REP MOVSB, and this is a hit worth
> avoiding.

And you say this despite me having pointed out that REP STOSL may be
faster in a number of cases? Or do you mean to suggest we should branch
around the trailing REP {MOV,STO}SB?

> The "Fast Zero length $FOO" bits on future parts indicate
> when passing %ecx=0 is likely to be faster than branching around the
> invocation.

IOW down the road we could use alternatives patching to remove such
branches. But this of course is only if we don't end up using exclusively
REP MOVSB / REP STOSB there anyway, as you seem to be suggesting ...

> With ERMS/etc, our logic should be a REP MOVSB/STOSB only, without any
> cleverness about larger word sizes.  The Linux forms do this fairly well
> already, and probably better than Xen, although there might be some room
> for improvement IMO.

... here. As to the Linux implementations - for memcpy_erms() I don't
think I see any room for improvement in the function itself. We could do
alternatives patching somewhat differently (and I probably would). For
memset_erms() the tiny bit of improvement over Linux's code that I would
see is to avoid the partial register access when loading %al. But to be
honest - in both cases I wouldn't have bothered looking at their code
anyway, if you hadn't pointed me there.
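Just to illustrate what I mean (a rough sketch only, written here in C
with inline assembly rather than the .S form actually used, and with a
made-up function name):

#include <stddef.h>
#include <stdint.h>

/*
 * Illustration only, not actual Xen (or Linux) code: an ERMS-style
 * memset() consisting of nothing but REP STOSB.  Loading the fill value
 * as a zero-extended 32-bit quantity avoids the partial write of %al.
 */
static void *memset_erms_sketch(void *dst, int c, size_t n)
{
    void *ret = dst;
    unsigned int fill = (uint8_t)c; /* zero-extended, no partial-register write */

    asm volatile ( "rep stosb"
                   : "+D" (dst), "+c" (n)
                   : "a" (fill)
                   : "memory" );

    return ret;
}

(The real thing would of course be alternatives-patched assembly; the
above merely shows the partial-register point.)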
> It is worth noting that we have extra variations of memset/memcpy where
> __builtin_memcpy() gets expanded inline, and the result is a
> compiler-chosen sequence, and doesn't hit any of our optimised
> sequences.  I'm not sure what to do about this, because there is surely
> a larger win from the cases which can be turned into a single mov, or an
> elided store/copy, than using a potentially inefficient sequence in the
> rare cases.  Maybe there is room for a fine-tuning option to say "just
> call memset() if you're going to expand it inline".

You mean "just call memset() instead of expanding it inline"? If the
inline expansion is merely REP STOS, I'm not sure we'd actually gain
anything from keeping the compiler from expanding it inline. But if the
inline construct was more complicated (as I observe e.g. in
map_vcpu_info() with gcc 10), then it would likely be nice if there was
such a control. I'll take note to see if I can find anything. But this
isn't relevant for {clear,copy}_page().

> For all set/copy operations, whether you want non-temporal or not
> depends on when/where the lines are next going to be consumed.  Page
> scrubbing in idle context is the only example I can think of where we
> aren't plausibly going to consume the destination imminently.  Even
> clear/copy page in a hypercall doesn't want to be non-temporal, because
> chances are good that the vcpu is going to touch the page on return.

I'm afraid the situation isn't as black-and-white. Take HAP or IOMMU
page table allocations, for example: They need to clear the full page,
yes. But often this is just to then insert one single entry, i.e. re-use
exactly one of the cache lines. Or take initial population of guest RAM:
The larger the guest, the less likely it is for every individual page to
get accessed again before its contents get evicted from the caches.
Judging from what Ankur said, once we get to around L3 capacity,
MOVNT / CLZERO may be preferable there.

I think in cases where we don't know how the page is going to be used
subsequently, we ought to favor latency over cache pollution avoidance.
But in cases where we know the subsequent usage pattern, we may want to
direct scrubbing / zeroing accordingly. Yet of course it's not very
helpful that there's no way to avoid polluting caches and still have
reasonably low latency, so using some heuristics may be unavoidable.

And of course another goal of mine would be to avoid double zeroing of
pages: When scrubbing uses clear_page() anyway, there's no point in the
caller then calling clear_page() again. IMO, just like we have
xzalloc(), we should also have MEMF_zero. Internally the page allocator
can know whether a page was already scrubbed, and it does know for sure
whether scrubbing means zeroing.
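Something along these lines is what I'm thinking of on the allocator
side (purely illustrative - MEMF_zero doesn't exist yet, and
page_was_scrubbed() / scrub_implies_zero() are made-up names for state
the allocator would need to track):

/*
 * Purely illustrative sketch: skip the extra clear_page() when the
 * allocator already knows the page holds zeroes from scrubbing.
 * MEMF_zero, page_was_scrubbed(), and scrub_implies_zero() are
 * invented names.
 */
static void satisfy_memf_zero(struct page_info *pg, void *va,
                              unsigned int memflags)
{
    if ( !(memflags & MEMF_zero) )
        return;

    /* Scrubbing already wrote zeroes - nothing left to do. */
    if ( page_was_scrubbed(pg) && scrub_implies_zero() )
        return;

    clear_page(va);
}

Jan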