From: David Laight <David.Laight@ACULAB.COM>
To: 'Segher Boessenkool' <segher@kernel.crashing.org>
Cc: 'Rasmus Villemoes' <rasmus.villemoes@prevas.dk>,
	Christophe Leroy <christophe.leroy@csgroup.eu>,
	"linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>,
	Paul Mackerras <paulus@samba.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH] powerpc/vdso32: Add missing _restgpr_31_x to fix build failure
Date: Tue, 16 Mar 2021 09:35:26 +0000
Message-ID: <e2493e6aaa454604a10dd811a369d104@AcuMS.aculab.com>
In-Reply-To: <20210315235947.GD16691@gate.crashing.org>

From: Segher Boessenkool
> Sent: 16 March 2021 00:00
...
> > Although you may need to disable loop unrolling (often dubious at best)
> > and either force or disable some function inlining.
>
> The cases where GCC does loop unrolling at -O2 always help quite a lot.
> Or, do you have a counter-example? We'd love to see one.

The real problem with loop unrolling is that quite often a modern
out-of-order superscalar processor actually has 'spare' execution
cycles where the loop control can be done 'for free'.

Sometimes you do need to unroll (or interleave) a couple of times
to get enough spare execution cycles.

But the unrolled loop has to read a lot more code into cache,
so unless the code is 'hot cache' (which is usually arranged for
benchmarking) those delays apply as well.
The larger code footprint also displaces other code.

My real annoyance with gcc is unrolling (and vectorizing) loops
that I know are never executed as many times as even one copy
of the unrolled loop.

As an example, Intel (Ivy Bridge onwards) cpus execute the following
code (the middle of the ip checksum) at 8 bytes/clock.
(Limited by the carry flag.)
It just doesn't need any further unrolling.

+ "10: jecxz 20f\n"
+ " adc (%[buff], %[len]), %[sum_0]\n"
+ " adc 8(%[buff], %[len]), %[sum_1]\n"
+ " lea 32(%[len]), %[len_tmp]\n"
+ " adc 16(%[buff], %[len]), %[sum_0]\n"
+ " adc 24(%[buff], %[len]), %[sum_1]\n"
+ " mov %[len_tmp], %[len]\n"
+ " jmp 10b\n"

Annoyingly that loop is slow on my 8-core Atom.

The existing code only does 4 bytes/clock on Intel cpus prior to
either Broadwell or Haswell (forgotten which) in spite of much
more unrolling.

> And yup, inlining is hard. GCC's heuristics there are very good
> nowadays, but any single decision has big effects. Doing the important
> spots manually (always_inline or noinline) has good payoff.

Latest inline gripe was a function replicated about 20 times when
the non-inline version was a register load and 'tail call'.
The inlining is just bloat.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
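For readers who do not follow x86 asm, the checksum loop quoted in the
message above can be sketched in portable C. This is only an illustration
under stated assumptions, not the kernel code: the names csum_sketch,
csum_ref, add_carry64, fold64 and csum_demo_matches are hypothetical, the
length is assumed to be a multiple of 32, and each accumulator wraps its
own carry instead of sharing one carry flag across the adc chain (the
folded result is the same, since a carry out of 64 bits is worth 1 in
ones'-complement arithmetic). It says nothing about bytes/clock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add with end-around carry: the C analogue of one 'adc' step. */
static uint64_t add_carry64(uint64_t sum, uint64_t w)
{
    sum += w;
    return sum + (sum < w);     /* wrap the carry back in */
}

/* Fold a 64-bit ones'-complement sum down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* 32 bytes per iteration, two accumulators, as in the quoted loop.
 * Hypothetical sketch: len must be a multiple of 32. */
uint16_t csum_sketch(const unsigned char *buf, size_t len)
{
    uint64_t sum0 = 0, sum1 = 0, w[4];

    assert(len % 32 == 0);
    for (size_t i = 0; i < len; i += 32) {
        memcpy(w, buf + i, sizeof(w));  /* unaligned-safe loads */
        sum0 = add_carry64(sum0, w[0]);
        sum1 = add_carry64(sum1, w[1]);
        sum0 = add_carry64(sum0, w[2]);
        sum1 = add_carry64(sum1, w[3]);
    }
    return fold64(add_carry64(sum0, sum1));
}

/* Plain 16-bit reference (native byte order), for comparison only. */
uint16_t csum_ref(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;
    uint16_t h;

    for (size_t i = 0; i + 1 < len; i += 2) {
        memcpy(&h, buf + i, 2);
        sum += h;
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return (uint16_t)sum;
}

/* Returns 1 if both versions agree on a fixed test pattern.
 * len must be a multiple of 32 and at most 256. */
int csum_demo_matches(size_t len)
{
    unsigned char buf[256];

    assert(len <= sizeof(buf));
    for (size_t i = 0; i < sizeof(buf); i++)
        buf[i] = (unsigned char)(i * 37u + 11u);
    return csum_sketch(buf, len) == csum_ref(buf, len);
}
```

The two accumulators keep consecutive adds independent of each other,
which mirrors the alternating sum_0/sum_1 operands in the asm; whether a
compiler turns this C into an adc chain is not guaranteed.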