On 13.04.2021 15:17, Andrew Cooper wrote:
> Do you have actual numbers from these experiments?

Attached is the collected raw output from a number of systems.

> I've seen your patch
> from the thread, but at a minimum it's missing some hunks adding new
> CPUID bits.

It's not missing hunks - these additions are in a prereq patch that I
meant to post together with whatever this analysis would lead to. If you
think I should submit the prereqs ahead of time, I can of course do so.

> I do worry however whether the testing is likely to be
> realistic for non-idle scenarios.

Of course it's not going to be - in non-idle scenarios we'll always be
somewhere in the middle. Therefore I wanted to have numbers at the edges
(hot and cold cache respectively), as any other numbers are going to be
much harder to obtain in a way that makes them actually meaningful (and
hence reasonably stable).

> It is very little surprise that AVX-512 on Skylake is poor.  The
> frequency hit from using %zmm is staggering.  IceLake is expected to be
> better, but almost certainly won't exceed REP MOVSB, which is optimised
> in microcode for the data width of the CPU.

Right, much like AVX has improved but didn't get anywhere near REP MOVS.

> For memset(), please don't move in the direction of memcpy().  memcpy()
> is problematic because the common case is likely to be a multiple of 8
> bytes, meaning that we feed 0 into the REP MOVSB, and this is a hit worth
> avoiding.

And you say this despite me having pointed out that REP STOSL may be
faster in a number of cases? Or do you mean to suggest we should branch
around the trailing REP {MOV,STO}SB?

> The "Fast Zero length $FOO" bits on future parts indicate
> when passing %ecx=0 is likely to be faster than branching around the
> invocation.

IOW down the road we could use alternatives patching to remove such
branches. But this of course is only if we don't end up using exclusively
REP MOVSB / REP STOSB there anyway, as you seem to be suggesting ...

> With ERMS/etc, our logic should be a REP MOVSB/STOSB only, without any
> cleverness about larger word sizes.  The Linux forms do this fairly well
> already, and probably better than Xen, although there might be some room
> for improvement IMO.

... here. As to the Linux implementations - for memcpy_erms() I don't
think I see any room for improvement in the function itself. We could do
alternatives patching somewhat differently (and I probably would). For
memset_erms() the tiny bit of improvement over Linux's code that I would
see is to avoid the partial register access when loading %al. But to be
honest - in both cases I wouldn't have bothered looking at their code
anyway, if you hadn't pointed me there.
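Just to illustrate what I mean (a rough sketch only, written here in C
with inline assembly rather than the .S form actually used, and with a
made-up function name):

#include <stddef.h>
#include <stdint.h>

/*
 * Illustration only, not actual Xen (or Linux) code: an ERMS-style
 * memset() consisting of nothing but REP STOSB.  Loading the fill value
 * as a zero-extended 32-bit quantity avoids the partial write of %al.
 */
static void *memset_erms_sketch(void *dst, int c, size_t n)
{
    void *ret = dst;
    unsigned int fill = (uint8_t)c; /* zero-extended, no partial-register write */

    asm volatile ( "rep stosb"
                   : "+D" (dst), "+c" (n)
                   : "a" (fill)
                   : "memory" );

    return ret;
}

(The real thing would of course be alternatives-patched assembly; the
above merely shows the partial-register point.)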
> It is worth noting that we have extra variations of memset/memcpy where
> __builtin_memcpy() gets expanded inline, and the result is a
> compiler-chosen sequence, and doesn't hit any of our optimised
> sequences.  I'm not sure what to do about this, because there is surely
> a larger win from the cases which can be turned into a single mov, or an
> elided store/copy, than using a potentially inefficient sequence in the
> rare cases.  Maybe there is room for a fine-tuning option to say "just
> call memset() if you're going to expand it inline".

You mean "just call memset() instead of expanding it inline"? If the
inline expansion is merely REP STOS, I'm not sure we'd actually gain
anything from keeping the compiler from expanding it inline. But if the
inline construct was more complicated (as I observe e.g. in
map_vcpu_info() with gcc 10), then it would likely be nice if there was
such a control. I'll take note to see if I can find anything. But this
isn't relevant for {clear,copy}_page().

> For all set/copy operations, whether you want non-temporal or not
> depends on when/where the lines are next going to be consumed.  Page
> scrubbing in idle context is the only example I can think of where we
> aren't plausibly going to consume the destination imminently.  Even
> clear/copy page in a hypercall doesn't want to be non-temporal, because
> chances are good that the vcpu is going to touch the page on return.

I'm afraid the situation isn't as black-and-white. Take HAP or IOMMU
page table allocations, for example: They need to clear the full page,
yes. But often this is just to then insert one single entry, i.e. re-use
exactly one of the cache lines. Or take initial population of guest RAM:
The larger the guest, the less likely it is for every individual page to
get accessed again before its contents get evicted from the caches.
Judging from what Ankur said, once we get to around L3 capacity,
MOVNT / CLZERO may be preferable there.

I think in cases where we don't know how the page is going to be used
subsequently, we ought to favor latency over cache pollution avoidance.
But in cases where we know the subsequent usage pattern, we may want to
direct scrubbing / zeroing accordingly. Yet of course it's not very
helpful that there's no way to avoid polluting caches and still have
reasonably low latency, so using some heuristics may be unavoidable.

And of course another goal of mine would be to avoid double zeroing of
pages: When scrubbing uses clear_page() anyway, there's no point in the
caller then calling clear_page() again. IMO, just like we have
xzalloc(), we should also have MEMF_zero. Internally the page allocator
can know whether a page was already scrubbed, and it does know for sure
whether scrubbing means zeroing.
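Something along these lines is what I'm thinking of on the allocator
side (purely illustrative - MEMF_zero doesn't exist yet, and
page_was_scrubbed() / scrub_implies_zero() are made-up names for state
the allocator would need to track):

/*
 * Purely illustrative sketch: skip the extra clear_page() when the
 * allocator already knows the page holds zeroes from scrubbing.
 * MEMF_zero, page_was_scrubbed(), and scrub_implies_zero() are
 * invented names.
 */
static void satisfy_memf_zero(struct page_info *pg, void *va,
                              unsigned int memflags)
{
    if ( !(memflags & MEMF_zero) )
        return;

    /* Scrubbing already wrote zeroes - nothing left to do. */
    if ( page_was_scrubbed(pg) && scrub_implies_zero() )
        return;

    clear_page(va);
}

Jan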