Re: Nasty clang loop unrolling..

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Philip Reames <philip@switchbackcompilers.com>
Cc: Craig Topper <craig.topper@sifive.com>,
	Nick Desaulniers <ndesaulniers@google.com>,
	 Nathan Chancellor <nathan@kernel.org>,
	clang-built-linux <clang-built-linux@googlegroups.com>,
	 llvm@lists.linux.dev
Subject: Re: Nasty clang loop unrolling..
Date: Sun, 29 Aug 2021 10:08:09 -0700
Message-ID: <CAHk-=wj+BdSAGfiJO2G8QzwLbg98mzCXF95s=-5k_gLR4evdnw@mail.gmail.com>
In-Reply-To: <9e517b5d-f0e5-240a-2e3c-5cc24eda601e@switchbackcompilers.com>

On Sat, Aug 28, 2021 at 6:50 PM Philip Reames
<philip@switchbackcompilers.com> wrote:
>
> Here's the IR resulting from the generic implementation from lib/string.c.

[ Again, note that this isn't really a function we care about in the
kernel. It came up mainly because I wanted to make sure it wasn't a
_total_ disaster, and the kernel ends up actually generally wanting
"small and simple" code because I$ misses are often one of the more
noticeable things.

  We have _very_ few loops with big loop counts in the kernel outside
of basically just some memory copies, and most of those are
handcrafted (often handcrafted C, but asm isn't unheard of). Most of
the time, the loops are all in user space, and then user space does a
system call that does something a small handful of times.

  So things like "loop over pathname lookup" are common, but the "loop"
is often just a couple of path components.

  And code size matters, often because the L1 I$ has been flushed by
the "real work" in user space, and so the kernel often has somewhat
cold caches (except for microbenchmarks, which lie). ]

That said:

> To me, the most interesting piece of this is not that we unrolled - it is the lowering of the select (e.g. the address manipulation).

Ok, so clang *can* turn the address generation into arithmetic (and
yes, I guess "cmp+sbb" is the much more idiomatic x86 generation, not
my odd "addb+adc"). Interesting.

It probably can go either way. The data dependency chain is likely
much worse than a well-predicted branch.
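
[ To make the two forms concrete, here is a minimal sketch. It is not
the code or IR from this thread: the function names are made up, and
the loop shape (copy bytes, stop advancing the source once a NUL has
been copied) is only an assumption about the kind of loop being
discussed. The first version advances the pointer under a branch, the
second folds the comparison result into the address arithmetic, which
is the sort of thing a compiler might lower to "cmp+sbb" on x86. ]

```c
#include <stddef.h>

/*
 * Branchy form (hypothetical name): advance src only while the byte
 * just copied is non-zero, so the rest of dest gets padded with NULs.
 */
char *copy_pad_branch(char *dest, const char *src, size_t count)
{
	char *tmp = dest;

	while (count) {
		if ((*tmp = *src) != 0)
			src++;
		tmp++;
		count--;
	}
	return dest;
}

/*
 * Arithmetic form (hypothetical name): same result, but the 0-or-1
 * outcome of the comparison is added to the pointer. There is no
 * branch on the data, but the next load address now depends on the
 * byte that was just loaded.
 */
char *copy_pad_arith(char *dest, const char *src, size_t count)
{
	char *tmp = dest;

	while (count) {
		char c = *src;

		*tmp = c;
		src += (c != 0);
		tmp++;
		count--;
	}
	return dest;
}
```

Both functions produce identical output. The difference is only in
whether the CPU sees a (usually well-predicted) branch or a longer
load-to-address dependency chain.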

So for the kernel, I suspect that the main issue is just that "one I$
line vs three I$ lines for the unrolled case".

Having looked at all the other cases where clang makes for bigger code
with loop unrolling, I'm getting the feeling that I just need to test
"-fno-unroll-loops" more.

We actually tried to use "-Os" with gcc because of the code size
issues. But it generated so much truly horribly expensive code (using
"rep movs" for small constant-sized copies, using divide instructions
because they were smaller than multiplies with reciprocals etc) that I
gave up on that.

In general, for the kernel, we tend to aim for "do all the serious
optimizations, but avoid stuff that blows up code size". Turning the
occasional constant divide (common for things like pointer differences
in C) into a reciprocal multiply is a good optimization: it makes the
code a few bytes bigger but easily much faster. But unrolling loops is
almost always a loss, because the loop counts are small, and the
overhead of the unrolling is simply bigger than the win.
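
[ As an illustration of the reciprocal-multiply point, here is a
minimal sketch for one specific case, unsigned 32-bit division by 10.
The function names are hypothetical, and the constant is the standard
one for this divisor (the ceiling of 2^35 / 10), not something taken
from this thread. A compiler doing "serious optimizations" would
typically generate the second form from the first by itself. ]

```c
#include <stdint.h>

/* What the source says: a divide by a compile-time constant. */
uint32_t div10_plain(uint32_t x)
{
	return x / 10;
}

/*
 * The same quotient without a divide instruction: multiply by
 * 0xCCCCCCCD == ceil(2^35 / 10) in 64 bits, then shift right by 35.
 * The rounding error stays below 1/10 for every 32-bit x, so the
 * truncated result matches x / 10 exactly.
 */
uint32_t div10_recip(uint32_t x)
{
	return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

The multiply-and-shift version costs a few more bytes for the
constant, but avoids the latency of a hardware divide.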

                  Linus