* Re: Nasty clang loop unrolling..
From: Nick Desaulniers @ 2021-08-28 20:29 UTC
  To: Linus Torvalds
  Cc: Nathan Chancellor, clang-built-linux, llvm, craig.topper, Philip Reames

(We're moving from clang-built-linux@googlegroups.com to
llvm@lists.linux.dev; sorry for the churn, but we think this will make
the archives more accessible via lore.kernel.org.)

On Sat, Aug 28, 2021 at 11:29 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So it turns out that s390 had a bug due to its own private 'strncpy()'
> being broken and not doing the insane thing that strncpy() is defined
> to do.

Like continuing to zero the rest of the buffer up to n?
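
For reference, a minimal sketch of what ISO C requires - hypothetical
function name, not our actual lib/string.c implementation:

   #include <stddef.h>

   /* Copy at most n bytes of src; if src is shorter than n, keep
    * writing NUL bytes until n bytes total have been written. */
   char *sketch_strncpy(char *dest, const char *src, size_t n)
   {
           size_t i = 0;

           for (; i < n && src[i] != '\0'; i++)
                   dest[i] = src[i];
           for (; i < n; i++)      /* the "insane" part: zero-fill the tail */
                   dest[i] = '\0';
           return dest;
   }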

>
> Which is fine - I understand exactly how that happens, and strncpy()
> is one of my least favorite functions.
>
> Anyway, I suggested that s390 just use the generic function we have,
> instead of implementing its own version, because nobody really cares,
> and the generic function is small and simple and "good enough". See
>
>     https://lore.kernel.org/lkml/CAHk-=wjhKNB_1a6wjjPh2PvMrtjVs=DgGY5uM2jq3WBBaYMyGQ@mail.gmail.com/
>
> for details, although they don't really matter for this email.
>
> What matters for this email is that writing that thing made me go "ok,
> just how good does that generic version look, compared to the old
> legacy 32-bit historical version that uses the x86 string instructions
> just because it literally goes all the way back to my learning the
> i386 and learning gcc inline asm".
>
> Because yes, that routine *literally* exists in that exact form in
> linux-0.01 - it's moved, and it has lost a "cld" instruction since we
> now have the rule that DF is always clear in the kernel. Our old
> 32-bit x86 'strncpy()' is horribly slow garbage, but also a historical
> gem from 30 years ago.
>
> But x86-64 doesn't do that, so I just built lib/string.c with clang,
> to see what it could do.
>
> Can clang do better than complete garbage written by a clueless person
> from three decades ago?
>
> The end result is not good.
>
> Clang decides to unroll that loop four times, in the process making
> the code 4x the size it should be, for absolutely zero gain.
>
> This is the whole function with #pragma nounroll (ie "sane"):
>
>    strncpy:
>            movq %rdi, %rax
>            testq        %rdx, %rdx
>            je   .LBB3_5
>            xorl %ecx, %ecx
>            jmp  .LBB3_2
>            .p2align     4, 0x90
>    .LBB3_4:
>            addq $1, %rcx
>            cmpq %rcx, %rdx
>            je   .LBB3_5
>    .LBB3_2:
>            movzbl       (%rsi), %edi
>            movb %dil, (%rax,%rcx)
>            testb        %dil, %dil
>            je   .LBB3_4
>            addq $1, %rsi
>            jmp  .LBB3_4
>    .LBB3_5:
>            retq
>
> and honestly, that's perfectly fine. It's very much what the code
> does. It's 44 bytes, it fits in one cacheline, it's not horrible. It's
> not what I would have done by hand, and clang seems a bit too eager to
> move the loop end test to the top of the loop, but whatever. I see
> nothing that makes me go "that's horrible".

For the loop test, I know that clang will "rotate" loops in an attempt
to have one canonical loop form. That way, passes don't have to check
for multiple different forms of loops if they're all in one form,
which reduces compile time and complexity in the compiler.  Does it
always produce optimal loops, and is that what is going on here?  I'm
not sure.
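
As a hypothetical source-level sketch of what the rotation does
(function names invented for illustration):

   /* Before rotation: a top-tested while loop. */
   void before(int (*cond)(void), void (*body)(void))
   {
           while (cond())
                   body();
   }

   /* After rotation: the exit test moves to the bottom of the loop,
    * guarded so the body is never entered when the trip count is
    * zero.  Later passes then see one canonical bottom-tested form. */
   void after(int (*cond)(void), void (*body)(void))
   {
           if (cond()) {
                   do {
                           body();
                   } while (cond());
           }
   }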

>
> Now, admittedly it's not particularly *smart* either - you could turn
> the conditional "branch over a single constant add" into a computed
> add instead, so the
>
>            testb        %dil, %dil
>            je   .LBB3_4
>            addq $1, %rsi
>            jmp  .LBB3_4
>
> could - for example - have been done as
>
>            addb $255,%dil
>            adcq $0, %rsi
>            jmp  .LBB3_4
>
> which could avoid some branch mispredicts.  And honestly then the code

Perhaps a peephole optimization we can add? (Does the adcq - add with
carry - rely on whether the previous addb wrapped around, i.e. on %dil
being non-zero setting the carry flag, replacing the testb+je pair?
Did I understand that correctly?)  Though we'd have to know that %dil
isn't used after taking the jump, since I think your transformed
version modifies %dil, so perhaps that can't be a peephole opt. Hmm.
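
At the C level, the transformation you describe is roughly the
branchless form below (a sketch with invented helper names; whether a
backend lowers it to addb+adcq, cmp+sbb, or something else is up to
the backend):

   /* Branchy form: the testb + je pair in the generated code above. */
   const char *advance_branchy(const char *src, char c)
   {
           if (c != '\0')
                   src++;
           return src;
   }

   /* Branchless form: the comparison yields 0 or 1 and is folded
    * straight into the pointer arithmetic, trading a possible
    * mispredict for a data dependency. */
   const char *advance_branchless(const char *src, char c)
   {
           return src + (c != '\0');
   }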

> that clang moved to the top should really have been at the bottom of
> the loop, but I don't know if it would matter. The above might look a
> bit more clever, but the data dependency might be worse if the branch
> predicts well. The branch behavior is bimodal - the loop starts out
> not taking that "je", and ends up taking it - so it has an almost
> guaranteed mispredict in the middle of the loop, but whatever. You win
> some, you lose some.
>
> ANYWAY.
>
> The above discussion is about *reasonable* code.
>
> What clang actually generates bears very little resemblance to either
> the above simple and short, or the "clever and one conditional branch
> shorter" version.
>
> What clang actually generates is this horror:
>
>    strncpy:
>            movq %rdi, %rax
>            testq        %rdx, %rdx
>            je   .LBB3_19
>            leaq -1(%rdx), %r8
>            movq %rdx, %r9
>            andq $3, %r9
>            je   .LBB3_2
>            xorl %edi, %edi
>            jmp  .LBB3_4
>            .p2align     4, 0x90
>    .LBB3_6:
>            addq $1, %rdi
>            cmpq %rdi, %r9
>            je   .LBB3_7
>    .LBB3_4:
>            movzbl       (%rsi), %ecx
>            movb %cl, (%rax,%rdi)
>            testb        %cl, %cl
>            je   .LBB3_6
>            addq $1, %rsi
>            jmp  .LBB3_6
>    .LBB3_7:
>            leaq (%rax,%rdi), %r9
>            subq %rdi, %rdx
>            cmpq $3, %r8
>            jb   .LBB3_19
>            jmp  .LBB3_9
>    .LBB3_2:
>            movq %rax, %r9
>            cmpq $3, %r8
>            jae  .LBB3_9
>    .LBB3_19:
>            retq
>    .LBB3_9:
>            xorl %edi, %edi
>            jmp  .LBB3_10
>            .p2align     4, 0x90
>    .LBB3_18:
>            addq $4, %rdi
>            cmpq %rdi, %rdx
>            je   .LBB3_19
>    .LBB3_10:
>            movzbl       (%rsi), %ecx
>            movb %cl, (%r9,%rdi)
>            testb        %cl, %cl
>            je   .LBB3_12
>            addq $1, %rsi
>    .LBB3_12:
>            movzbl       (%rsi), %ecx
>            movb %cl, 1(%r9,%rdi)
>            testb        %cl, %cl
>            je   .LBB3_14
>            addq $1, %rsi
>    .LBB3_14:
>            movzbl       (%rsi), %ecx
>            movb %cl, 2(%r9,%rdi)
>            testb        %cl, %cl
>            je   .LBB3_16
>            addq $1, %rsi
>    .LBB3_16:
>            movzbl       (%rsi), %ecx
>            movb %cl, 3(%r9,%rdi)
>            testb        %cl, %cl
>            je   .LBB3_18
>            addq $1, %rsi
>            jmp  .LBB3_18
>
> which is 170 bytes in size instead of the 44 bytes, so now it takes up
> three cachelines.
>
> Now, I don't know how common this is. Maybe this is the only place in
> the kernel where this unrolling case happens. But in general, loop
> unrolling in the kernel is a big mistake unless it's a very obvious
> case (ie a small constant full unroll makes perfect sense): if you see
>
>         for (i = 0; i < 4; i++)
>                 *p++ = *q++;
>
> then you should most definitely unroll that to
>
>         *p++ = *q++;
>         *p++ = *q++;
>         *p++ = *q++;
>         *p++ = *q++;
>
> because it's simply smaller and simpler to not have any conditionals
> at all, and just do four iterations statically.
>
> But the 'strncpy()' kind of unrolling is a mistake when kernel loops
> tend to have very low loop counts.
>
> As far as I know, gcc doesn't do any unrolling at -O2.

For clang, we will do limited unrolling at -O2, and very aggressive
unrolling at -O3; if a loop can be fully unrolled, we're likely to do
so at -O3, with a much smaller or more conservative unroll at -O2.  I
think I demonstrated that in this talk, if you have the time or are
more interested in introspecting the compiler (yeah, yeah, ain't
nobody got time for that): https://youtu.be/bUTXhcf_aNc?t=1415
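
For per-loop control - rather than the global -fno-unroll-loops - clang
also honors loop pragmas.  A small sketch (function name invented):

   #include <stddef.h>

   void copy_bytes(char *dst, const char *src, size_t n)
   {
           /* Disable unrolling for just this loop; the rest of the
            * translation unit keeps the normal -O2 heuristics.  The
            * shorter "#pragma nounroll" spelling works too. */
   #pragma clang loop unroll(disable)
           for (size_t i = 0; i < n; i++)
                   dst[i] = src[i];
   }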

My hypothesis here is that LLVM may not be considering -mno-sse2 and
friends (i.e. no floating point at all, please) that the kernel uses
when doing its simpler unrolling.  If the monstrosity looks more
compact with none of the -mno-sse2 and friends flags set, then that
would lend itself to that hypothesis.  IIRC, the middle end does loop
unrolling in a manner that is *not* machine agnostic; it has to know
the basic SIMD width of the target, since we'd generally like to
vectorize a loop after we've unrolled it, so the pass is aware of
specifics of the target machine (this is exceptional; I understand why
it's necessary, but middle-end optimizations are generally machine
agnostic).  Then later, once we get to actual machine code generation
for x86, we discover the constraint that we can't actually use any of
the SSE registers, and instead generate more verbose loop iterations
using GPRs.  Likely, the middle-end unroller needs to check that SSE2
wasn't disabled BEFORE thinking it has the green light to unroll a
loop 4x.  But it's just a hypothesis; I haven't validated it yet, and
I could be wildly wrong.

>
> What is the magic to make clang not do stupid things like this? I
> obviously know about that
>
>     #pragma nounroll
>
> but I don't want to mark various unimportant functions. I'd much
> rather have the default be "don't do stupid things", and then if we
> see a case where loop unrolling really matters, and it's important, we
> can mark *that* specially.
>
> Hmm?
>
>                 Linus



-- 
Thanks,
~Nick Desaulniers


* Re: Nasty clang loop unrolling..
From: Linus Torvalds @ 2021-08-28 22:42 UTC
  To: Nick Desaulniers
  Cc: Nathan Chancellor, clang-built-linux, llvm, craig.topper, Philip Reames

On Sat, Aug 28, 2021 at 1:29 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> > So it turns out that s390 had a bug due to its own private 'strncpy()'
> > being broken and not doing the insane thing that strncpy() is defined
> > to do.
>
> Like continuing to zero the rest of the buffer up to n?

Right.

> > This is the whole function with #pragma nounroll (ie "sane"):
> > [ .. snip snip .. ]
> >
> > and honestly, that's perfectly fine. It's very much what the code
> > does. It's 44 bytes, it fits in one cacheline, it's not horrible. It's
> > not what I would have done by hand, and clang seems a bit too eager to
> > move the loop end test to the top of the loop, but whatever. I see
> > nothing that makes me go "that's horrible".
>
> For the loop test, I know that clang will "rotate" loops in an attempt
> to have one canonical loop form.

Yeah, I've seen it before. I think it makes the end result harder to
read and less intuitive, and it seems to generate more silly
branch-arounds, but I doubt that odd loop rotation behavior matters
much.

> > Now, admittedly it's not particularly *smart* either - you could turn
> > the conditional "branch over a single constant add" into a computed
> > add instead

Note that the loop rotation and the lack of turning a conditional into
a computed add are just small details - suggestions on how the code
generation could _actually_ have been improved, instead of how clang
actively made it worse.

And note how this function *really* doesn't matter from a performance
standpoint on its own. So again, the "arithmetic vs conditional" is
just a comment on the basic code generation; the worry is that the
kernel really doesn't want a compiler that blows up functions by this
kind of big factor for no gain.

Because the real question is how to turn off the bad behavior:

> > What clang actually generates bears very little resemblance to either
> > the above simple and short, or the "clever and one conditional branch
> > shorter" version.
> >
> > What clang actually generates is this horror:

because while it wouldn't matter if it were just this one case, I
suspect this is simply the one case I happened to look at.

Because I can find "-fno-unroll-loops" as an option, and it does fix
the problem.

But I am surprised that clang did this bogus thing at -O2, and I get
the feeling that there is some actual problem going on.

You say:

> For clang, we will do limited unrolling at -O2, and very aggressive
> unrolling at -O3; if a loop can be fully unrolled, we're likely to do
> so at -O3, with a much smaller or more conservative unroll at -O2.

so clearly there are different heuristics for unrolling than just
"don't unroll at all". And I wouldn't want to disable the *sane* kind
of unrolling of small constant loop counts.

I tried compiling the whole kernel with -fno-unroll-loops, and I do
see a number of functions that change in size quite noticeably.

I may have done something wrong, but "do_check" (the bpf verifier)
went from 45002 bytes to 33632 bytes when I asked clang not to unroll
loops.

A few functions grew, so there may be some other code generation issue
going on (i.e. maybe it then inlines differently too).

My comparison was truly stupid:

  # generate the 'size' file
   objdump -j .text --syms ~/vmlinux |
       grep '^fff' | cut -c32- |
       while read i name; do echo $name $((0x0$i)); done > size

   # then, using 'size' and 'newsize' files, show
   # the function size differences
   join size newsize | while read name old new;
         do echo $((new-old)) $name; done | sort -n | less -S

so it's not like I did any real analysis. I checked a few surprising
cases, and to pick another example, "__mmdrop()" has this code:

        for (i = 0; i < NR_MM_COUNTERS; i++) {
                long x = atomic_long_read(&mm->rss_stat.count[i]);

                if (unlikely(x))
                        pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
                                 mm, resident_page_types[i], x);
        }

and clang will just duplicate the code NR_MM_COUNTERS (four) times.
Which is what I'd _ask_ for, if the loop body was just one or two
instructions, but when it's a debug function with a function call...

So it does look like loop unrolling can make a noticeable size
difference, and I suspect it's all basically a complete waste.

How bad is -fno-unroll-loops? Is there some subtler way to avoid the
bad cases? I can find the option when I google for it, but I can find
nothing actually useful about when unrolling is enabled, or what the
heuristics at the different optimization levels are.

Is there some way to enable only the truly obvious loop unrolling -
the "this doesn't actually expand the code, because the loop body is
pretty much the same size as the conditional" case?

         Linus


* Re: Nasty clang loop unrolling..
From: Linus Torvalds @ 2021-08-29 17:08 UTC
  To: Philip Reames
  Cc: Craig Topper, Nick Desaulniers, Nathan Chancellor,
	clang-built-linux, llvm

On Sat, Aug 28, 2021 at 6:50 PM Philip Reames
<philip@switchbackcompilers.com> wrote:
>
> Here's the IR resulting from the generic implementation in lib/string.c.

[ Again, note that this isn't really a function we care about in the
kernel. It came up mainly because I wanted to make sure it wasn't a
_total_ disaster, and the kernel ends up actually generally wanting
"small and simple" code, because I$ misses are often one of the more
noticeable things.

  We have _very_ few loops with big loop counts in the kernel outside
of basically just some memory copies, and most of those are
handcrafted (often handcrafted C, but asm isn't unheard of). Most of
the time, the loops are all in user space, and then user space does a
system call that does something a small handful of times,

  So things like "loop over pathname lookup" is common, but the "loop"
is often just a couple of path components.

  And code size matters, often because the L1 I$ has been flushed by
the "real work" in user space, and so the kernel often has somewhat
cold caches (except for microbenchmarks, which lie). ]

That said:

> To me, the most interesting piece of this is not that we unrolled - it is the lowering of the select (e.g. the address manipulation).

Ok, so clang *can* turn the address generation into arithmetic (and
yes, I guess "cmp+sbb" is the much more idiomatic x86 generation, not
my odd "addb+adc"). Interesting.

It probably can go either way. The data dependency chain is likely
much worse than a well-predicted branch.

So for the kernel, I suspect that the main issue is just that "one I$
line vs three I$ lines for the unrolled case".

Having looked at all the other cases where clang makes for bigger code
with loop unrolling, I'm getting the feeling that I just need to test
"-fno-unroll-loops" more.

We actually tried to use "-Os" with gcc because of the code size
issues. But it generated so much truly horribly expensive code (using
"rep movs" for small constant-sized copies, using divide instructions
because they were smaller than multiplies with reciprocals, etc.) that I
gave up on that.

In general, for the kernel, we tend to aim for "do all the serious
optimizations, but avoid stuff that blows up code size". Turning the
occasional constant divide (common for things like pointer differences
in C) into a reciprocal multiply is a good optimization: it makes the
code a few bytes bigger but easily much faster. But unrolling loops is
almost always a loss, because the loop counts are small, and the
overhead of the unrolling is simply bigger than the win.
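
To make the divide case concrete, here is a sketch (struct and field
names invented) of where those constant divides come from:

   #include <stddef.h>

   /* sizeof(struct entry) is 24 on a typical 64-bit ABI: a constant,
    * non-power-of-two divisor for the pointer subtraction below. */
   struct entry {
           long key;
           long val;
           char tag;
   };

   /* Pointer subtraction divides the byte distance by
    * sizeof(struct entry).  With a constant divisor, the compiler can
    * emit a multiply by a precomputed "magic" reciprocal plus shifts
    * instead of a much slower div instruction, at the cost of a few
    * extra bytes of code. */
   ptrdiff_t entries_between(const struct entry *lo, const struct entry *hi)
   {
           return hi - lo;
   }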

                  Linus

^ permalink raw reply	[flat|nested] 3+ messages in thread
