Re: [RFC] Improve memset

From: Borislav Petkov <bp@alien8.de>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	Rasmus Villemoes <mail@rasmusvillemoes.dk>,
	x86-ml <x86@kernel.org>, Andy Lutomirski <luto@kernel.org>,
	Josh Poimboeuf <jpoimboe@redhat.com>,
	lkml <linux-kernel@vger.kernel.org>
Subject: Re: [RFC] Improve memset
Date: Tue, 17 Sep 2019 10:15:49 +0200	[thread overview]
Message-ID: <20190917081549.GB12174@nazgul.tnic> (raw)
In-Reply-To: <CAHk-=wjdpJ+VapXfoZE8JRUfvMb8JrVTZe0=TDFYZ-ke+uqBOA@mail.gmail.com>

On Mon, Sep 16, 2019 at 10:25:25AM -0700, Linus Torvalds wrote:
> So the "inline constant sizes" case has advantages over and beyond the
> obvious ones. I suspect that a reasonable cut-off point is somethinig
> like "8*sizeof(long)". But look at things like "struct kstat" uses
> etc, the limit might actually be even higher than that.

Ok, sounds to me like we want to leave the small and build-time known
sizes to gcc to optimize. Especially if you want to clear only a subset
of the struct members and set the rest.

> Also note that while "rep stosb" is _reasonably_ good with current
> CPU's (ie roughly gen 8+), it's not so great a few generations ago
> (gen 6ish), and it can be absolutely horrid on older cores and/or
> atom. The limit for when it is a win ends up depending on whether I$
> footprint is an issue too, of course, but some of the bigger wins tend
> to happen when you have sizes >= 128.

Ok, so I have X86_FEATURE_ERMS set on this old IVB laptop so if gen8 and
newer do perform better, then we need to change our feature detection to
clear ERMS on those older generations and really leave it set only on
gen8+.

I'll change my benchmark to do explicit small-sized moves (instead of
using __builtin_memset) to see whether I can compare performance better.

Which also begs the question do we do __builtin_memset() at all or we
code those small-sized moves ourselves? We'll lose the advantage of the
partial struct members clearing but then we avoid the differences the
different compiler versions would introduce when the builtin is used.

And last but not least we should not leave readability somewhere along
the way: I'd prefer good performance and maintainable code than some
hypothetical max perf on some uarch but ugly and unreadable macro
mess...

> You can basically always beat "rep movs/stos" with hand-tuned AVX2/512
> code for specific cases if you don't look at I$ footprint and the cost
> of the AVX setup (and the cost of frequency changes, which often go
> hand-in-hand with the AVX use). So "rep movs/stos" is seldom
> _optimal_, but it tends to be "quite good" for modern CPU's with
> variable sizes that are in the 100+ byte range.

Right. If we did this, it would be a conditional in front of the REP;
STOS but that ain't worse than our current CALL+JMP.

Thx.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--