Re: x86: memset() / clear_page() / page scrubbing

From: Ankur Arora <ankur.a.arora@oracle.com>
To: jbeulich@suse.com
Cc: andrew.cooper3@citrix.com, roger.pau@citrix.com,
	xen-devel@lists.xenproject.org
Subject: Re: x86: memset() / clear_page() / page scrubbing
Date: Thu,  8 Apr 2021 23:08:45 -0700	[thread overview]
Message-ID: <20210409060845.3503745-1-ankur.a.arora@oracle.com> (raw)
In-Reply-To: <0753c049-9572-c12a-c74f-7e2fac3f5a24@suse.com>

Hi Jan,

I'm working on somewhat related optimizations on Linux (clear_page(),
going in the opposite direction, from REP STOSB to MOVNT) and have
some comments/questions below.

(Discussion on v1 here:
https://lore.kernel.org/lkml/20201014083300.19077-1-ankur.a.arora@oracle.com/)

On 4/8/2021 6:58 AM, Jan Beulich wrote:
> All,
>
> since over the years we've been repeatedly talking of changing the
> implementation of these fundamental functions, I've taken some time
> to do some measurements (just for possible clear_page() alternatives
> to keep things manageable). I'm not sure I want to spend as much time
> subsequently on memcpy() / copy_page() (or more, because there are
> yet more combinations of arguments to consider), so for the moment I
> think the route we're going to pick here is going to more or less
> also apply to those.
>
> The present copy_page() is the way it is because of the desire to
> avoid disturbing the cache. The effect of REP STOS on the L1 cache
> (compared to the present use of MOVNTI) is more or less noticable on
> all hardware, and at least on Intel hardware more noticable when the
> cache starts out clean. For L2 the results are more mixed when
> comparing cache-clean and cache-filled cases, but the difference
> between MOVNTI and REP STOS remains or (at least on Zen2 and older
> Intel hardware) becomes more prominent.

Could you give me any pointers on the cache-effects on this? This
obviously makes sense but I couldn't come up with any benchmarks
which would show this in a straight-forward fashion.

>
> Otoh REP STOS, as was to be expected, in most cases has meaningfully
> lower latency than MOVNTI.
>
> Because I was curious I also included AVX (32-byte stores), AVX512
> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
> clear win except on the vendors' first generations implementing it
> (but I've left out any playing with CR0.TS, which is what I expect
> would take this out as an option), AVX512 isn't on Skylake (perhaps
> newer hardware does better). CLZERO has slightly higher impact on
> L1 than MOVNTI, but lower than REP STOS.

Could you elaborate on what kind of difference in L1 impact you are
talking about? Evacuation of cachelines?

> Its latency is between
> both when the caches are warm, and better than both when the caches
> are cold.
>
> Therefore I think that we want to distinguish page clearing (where
> we care about latency) from (background) page scrubbing (where I
> think the goal ought to be to avoid disturbing the caches). That
> would make it
> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>   synchronous scrubbing),
> - MOVNTI for scrub_page() (when done from idle context), unless
>   CLZERO is available.
> Whether in addition we should take into consideration activity of
> other (logical) CPUs sharing caches I don't know - this feels like
> it could get complex pretty quickly.

The one other case might be for ~L3 (or larger) regions. In my tests,
MOVNT/CLZERO is almost always better (the one exception being Skylake)
wrt both cache and latency for larger extents.

In the particular cases I was looking at (mmap+MAP_POPULATE and
page-fault path), that makes the choice of always using MOVNT/CLZERO
easy for GB pages, but fuzzier for 2MB pages.

Not sure if the large-page case is interesting for you though.

Thanks
Ankur

>
> For memset() we already simply use REP STOSB. I don't see a strong
> need to change that, but it may be worth to consider bringing it
> closer to memcpy() - try to do the main chunk with REP STOS{L,Q}.
> They perform somewhat better in a number of cases (including when
> ERMS is advertised, i.e. on my Haswell and Skylake, which isn't
> what I would have expected). We may want to put the whole thing in
> a .S file though, seeing that the C function right now consists of
> little more than an asm().
>
> For memcpy() I'm inclined to suggest that we simply use REP MOVSB
> on ERMS hardware, and stay with what we have everywhere else.
>
> copy_page() (or really copy_domain_page()) doesn't have many uses,
> so I'm not sure how worthwhile it is to do much optimization there.
> It might be an option to simply expand it to memcpy(), like Arm
> does.
>
> Looking forward, on CPUs having "Fast Short REP CMPSB/SCASB" we
> may want to figure out whether using these for strlen(), strcmp(),
> strchr(), memchr(), and/or memcmp() would be a win.
>
> Thoughts anyone, before I start creating actual patches?
>
> Jan
>