* CONFIG_OPTIMIZE_INLINING fun
From: Hugh Dickins @ 2008-11-14 18:13 UTC
  To: Ingo Molnar; +Cc: Christoph Lameter, Nick Piggin, linux-kernel

I'm wondering whether we need this patch: though perhaps it doesn't
matter, since OPTIMIZE_INLINING is already under kernel hacking and
defaulted off and there expressly for gathering feedback...

--- 2.6.28-rc4/arch/x86/Kconfig.debug	2008-10-24 09:27:47.000000000 +0100
+++ linux/arch/x86/Kconfig.debug	2008-11-14 16:26:15.000000000 +0000
@@ -302,6 +302,7 @@ config CPA_DEBUG
 
 config OPTIMIZE_INLINING
 	bool "Allow gcc to uninline functions marked 'inline'"
+	depends on !CC_OPTIMIZE_FOR_SIZE
 	help
 	  This option determines if the kernel forces gcc to inline the functions
 	  developers have marked 'inline'. Doing so takes away freedom from gcc to

I've been building with CC_OPTIMIZE_FOR_SIZE=y and OPTIMIZE_INLINING=y
for a while, but I've now taken OPTIMIZE_INLINING off, after noticing
the 83 " Page" and 202 constant_test_bit functions in my System.map:
it appears that the functions in include/linux/page-flags.h (perhaps
others I've not noticed) make OPTIMIZE_INLINING behave very stupidly
when CC_OPTIMIZE_FOR_SIZE is on (and somewhat even when off).

Those constant_test_bit()s show up noticeably in the profile of my
swapping load on an oldish P4 Xeon 2*HT: the average system time for an
iteration is 63.3 seconds when running a kernel built with both options
on, but 49.2 seconds when kernel is built with only CC_OPTIMIZE_FOR_SIZE.
I've not put much effort into timing my newer machines: I think
there's a visible but lesser effect.

That was with the gcc 4.2.1 from openSUSE 10.3.  I've since tried the
gcc 4.3.2 from Ubuntu 8.10: which is much better on the " Page"s, only
6 of them - PageUptodate() reasonable though PagePrivate() mysterious;
but still 130 constant_test_bits, which I'd guess are the worst of it,
containing an idiv.  Hmm, with the 4.3.2, I get 77 constant_test_bits
with OPTIMIZE_INLINING on but CC_OPTIMIZE_FOR_SIZE off: that's worse
than 4.2.1, which only gave me 5 of them.  So, the patch above won't
help much then.
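
For illustration, a minimal sketch (not kernel code) of where that
idiv comes from: nr is a signed int, so nr / 32 and nr % 32 cannot be
reduced to a plain shift and mask - signed division rounds toward zero
while an arithmetic shift rounds toward minus infinity - and when
optimizing for size gcc falls back to an actual divide:

int test_bit_signed(int nr, const unsigned long *addr)
{
	/* nr may be negative as far as gcc knows: idiv at -Os,
	   or a shift-plus-fixup sequence at -O2 */
	return (addr[nr / 32] >> (nr % 32)) & 1;
}

int test_bit_unsigned(unsigned int nr, const unsigned long *addr)
{
	/* unsigned: nr / 32 is shr $5, nr % 32 is and $31 */
	return (addr[nr / 32] >> (nr % 32)) & 1;
}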

You'll be amused to see the asm for this example from mm/swap_state.c
(I was intending to downgrade these BUG_ONs to VM_BUG_ONs anyway, but
this example makes that seem highly desirable):

void __delete_from_swap_cache(struct page *page)
{
	BUG_ON(!PageLocked(page));
	BUG_ON(!PageSwapCache(page));
	BUG_ON(PageWriteback(page));
	BUG_ON(PagePrivate(page));

	radix_tree_delete(&swapper_space.page_tree, page_private(page));
and let's break it off there.

Here's the nice asm 4.2.1 gives with just CONFIG_CC_OPTIMIZE_FOR_SIZE=y
(different machine, this one a laptop with CONFIG_VMSPLIT_2G_OPT=y):

78173430 <__delete_from_swap_cache>:
78173430:	55                   	push   %ebp
78173431:	89 e5                	mov    %esp,%ebp
78173433:	53                   	push   %ebx
78173434:	89 c3                	mov    %eax,%ebx
78173436:	8b 00                	mov    (%eax),%eax
78173438:	a8 01                	test   $0x1,%al
7817343a:	74 45                	je     78173481 <__delete_from_swap_cache+0x51>
7817343c:	66 85 c0             	test   %ax,%ax
7817343f:	79 53                	jns    78173494 <__delete_from_swap_cache+0x64>
78173441:	f6 c4 10             	test   $0x10,%ah
78173444:	75 4a                	jne    78173490 <__delete_from_swap_cache+0x60>
78173446:	f6 c4 08             	test   $0x8,%ah
78173449:	75 3a                	jne    78173485 <__delete_from_swap_cache+0x55>
7817344b:	8b 53 0c             	mov    0xc(%ebx),%edx
7817344e:	b8 a4 9b 51 78       	mov    $0x78519ba4,%eax
78173453:	e8 f8 83 0d 00       	call   7824b850 <radix_tree_delete>

And here is what you get when you add in CONFIG_OPTIMIZE_INLINING=y:

7815eda4 <constant_test_bit>:
7815eda4:	55                   	push   %ebp
7815eda5:	b9 20 00 00 00       	mov    $0x20,%ecx
7815edaa:	89 e5                	mov    %esp,%ebp
7815edac:	53                   	push   %ebx
7815edad:	89 d3                	mov    %edx,%ebx
7815edaf:	99                   	cltd   
7815edb0:	f7 f9                	idiv   %ecx
7815edb2:	8b 04 83             	mov    (%ebx,%eax,4),%eax
7815edb5:	89 d1                	mov    %edx,%ecx
7815edb7:	5b                   	pop    %ebx
7815edb8:	5d                   	pop    %ebp
7815edb9:	d3 e8                	shr    %cl,%eax
7815edbb:	83 e0 01             	and    $0x1,%eax
7815edbe:	c3                   	ret    

7815edbf <PageLocked>:
7815edbf:	55                   	push   %ebp
7815edc0:	89 c2                	mov    %eax,%edx
7815edc2:	89 e5                	mov    %esp,%ebp
7815edc4:	31 c0                	xor    %eax,%eax
7815edc6:	e8 d9 ff ff ff       	call   7815eda4 <constant_test_bit>
7815edcb:	5d                   	pop    %ebp
7815edcc:	c3                   	ret    

7815edcd <PagePrivate>:
7815edcd:	55                   	push   %ebp
7815edce:	89 c2                	mov    %eax,%edx
7815edd0:	89 e5                	mov    %esp,%ebp
7815edd2:	b8 0b 00 00 00       	mov    $0xb,%eax
7815edd7:	e8 c8 ff ff ff       	call   7815eda4 <constant_test_bit>
7815eddc:	5d                   	pop    %ebp
7815eddd:	c3                   	ret    

7815edde <PageSwapCache>:
7815edde:	55                   	push   %ebp
7815eddf:	89 c2                	mov    %eax,%edx
7815ede1:	89 e5                	mov    %esp,%ebp
7815ede3:	b8 0f 00 00 00       	mov    $0xf,%eax
7815ede8:	e8 b7 ff ff ff       	call   7815eda4 <constant_test_bit>
7815eded:	5d                   	pop    %ebp
7815edee:	c3                   	ret    

[ unrelated functions ]

7815eecf <__delete_from_swap_cache>:
7815eecf:	55                   	push   %ebp
7815eed0:	89 e5                	mov    %esp,%ebp
7815eed2:	53                   	push   %ebx
7815eed3:	89 c3                	mov    %eax,%ebx
7815eed5:	e8 e5 fe ff ff       	call   7815edbf <PageLocked>
7815eeda:	85 c0                	test   %eax,%eax
7815eedc:	75 04                	jne    7815eee2 <__delete_from_swap_cache+0x13>
7815eede:	0f 0b                	ud2a   
7815eee0:	eb fe                	jmp    7815eee0 <__delete_from_swap_cache+0x11>
7815eee2:	89 d8                	mov    %ebx,%eax
7815eee4:	e8 f5 fe ff ff       	call   7815edde <PageSwapCache>
7815eee9:	85 c0                	test   %eax,%eax
7815eeeb:	75 04                	jne    7815eef1 <__delete_from_swap_cache+0x22>
7815eeed:	0f 0b                	ud2a   
7815eeef:	eb fe                	jmp    7815eeef <__delete_from_swap_cache+0x20>
7815eef1:	89 da                	mov    %ebx,%edx
7815eef3:	b8 0c 00 00 00       	mov    $0xc,%eax
7815eef8:	e8 a7 fe ff ff       	call   7815eda4 <constant_test_bit>
7815eefd:	85 c0                	test   %eax,%eax
7815eeff:	74 04                	je     7815ef05 <__delete_from_swap_cache+0x36>
7815ef01:	0f 0b                	ud2a   
7815ef03:	eb fe                	jmp    7815ef03 <__delete_from_swap_cache+0x34>
7815ef05:	89 d8                	mov    %ebx,%eax
7815ef07:	e8 c1 fe ff ff       	call   7815edcd <PagePrivate>
7815ef0c:	85 c0                	test   %eax,%eax
7815ef0e:	74 04                	je     7815ef14 <__delete_from_swap_cache+0x45>
7815ef10:	0f 0b                	ud2a   
7815ef12:	eb fe                	jmp    7815ef12 <__delete_from_swap_cache+0x43>
7815ef14:	8b 53 0c             	mov    0xc(%ebx),%edx
7815ef17:	b8 04 16 49 78       	mov    $0x78491604,%eax
7815ef1c:	e8 6a 09 0b 00       	call   7820f88b <radix_tree_delete>

Fun, isn't it?  I particularly admire the way it's somehow managed
not to create a function for PageWriteback - aah, that'll be because
there are no other references to PageWriteback in that unit.  The
4.3.2 asm is much less amusing, but it still calls constant_test_bit()
each time from __delete_from_swap_cache().

The numbers I've given are all for x86_32: similar story on x86_64,
though I've not spent as much time on that, just noticed all the
" Page"s there and hurried to switch off its OPTIMIZE_INLINING too.

I do wonder whether there's some tweak we could make to page-flags.h
which would stop this nonsense.  Change the inline functions back to
macros?  I suspect that by itself wouldn't work, and my quick attempt
to try it failed abysmally to compile, I've not the cpp foo needed.
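
For reference, the old macro style looked roughly like this (an
approximation from memory of the pre-2.6.26 page-flags.h, not a
verbatim copy):

#define PageLocked(page)	test_bit(PG_locked, &(page)->flags)
#define PageSwapCache(page)	test_bit(PG_swapcache, &(page)->flags)
#define PagePrivate(page)	test_bit(PG_private, &(page)->flags)
#define PageWriteback(page)	test_bit(PG_writeback, &(page)->flags)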

A part of the problem may be that test_bit() etc. are designed for
arrays of unsigned longs, but page->flags is only the one unsigned long:
maybe gcc loses track of the optimizations available for that case when
CONFIG_OPTIMIZE_INLINING=y.
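
To spell that out: with a constant nr, everything in
constant_test_bit() ought to fold away at compile time.  A
hypothetical walkthrough, assuming PG_locked == 0 and the
constant_test_bit() body quoted in the patch further down this thread:

	constant_test_bit(PG_locked, &page->flags)
	  == ((1UL << (0 % BITS_PER_LONG)) &
	      (((unsigned long *)&page->flags)[0 / BITS_PER_LONG])) != 0
	  == (page->flags & 1UL) != 0

which is the single "test $0x1,%al" in the good 4.2.1 output above.
But gcc can only do that folding if it inlines the call so that nr is
visible as a constant; once it decides to emit constant_test_bit()
out of line, the division becomes real.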

Hah, I've just noticed the defaults in arch/x86/configs -
you might want to change those...

Hugh


* Re: CONFIG_OPTIMIZE_INLINING fun
From: Christoph Lameter @ 2008-11-14 18:53 UTC
  To: Hugh Dickins; +Cc: Ingo Molnar, Nick Piggin, linux-kernel

On Fri, 14 Nov 2008, Hugh Dickins wrote:

> --- 2.6.28-rc4/arch/x86/Kconfig.debug	2008-10-24 09:27:47.000000000 +0100
> +++ linux/arch/x86/Kconfig.debug	2008-11-14 16:26:15.000000000 +0000
> @@ -302,6 +302,7 @@ config CPA_DEBUG
>
>  config OPTIMIZE_INLINING
>  	bool "Allow gcc to uninline functions marked 'inline'"
> +	depends on !CC_OPTIMIZE_FOR_SIZE
>  	help
>  	  This option determines if the kernel forces gcc to inline the functions
>  	  developers have marked 'inline'. Doing so takes away freedom from gcc to
>

Maybe add some text explaining that this is some experimental gcc feature?

> You'll be amused to see the asm for this example from mm/swap_state.c
> (I was intending to downgrade these BUG_ONs to VM_BUG_ONs anyway, but
> this example makes that seem highly desirable):
>
> void __delete_from_swap_cache(struct page *page)
> {
> 	BUG_ON(!PageLocked(page));
> 	BUG_ON(!PageSwapCache(page));
> 	BUG_ON(PageWriteback(page));
> 	BUG_ON(PagePrivate(page));

Maybe checking all the conditionals in one BUG_ON is not such a bad idea
after all? 4 useless branches in a hotpath?
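
A combined check along those lines might look like this (a sketch,
not a tested patch):

	/* One conditional branch instead of four: but when it fires,
	   you can no longer tell which condition was violated. */
	BUG_ON(!PageLocked(page) || !PageSwapCache(page) ||
	       PageWriteback(page) || PagePrivate(page));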

> Here's the nice asm 4.2.1 gives with just CONFIG_CC_OPTIMIZE_FOR_SIZE=y
> (different machine, this one a laptop with CONFIG_VMSPLIT_2G_OPT=y):

Optimize for size is what I also usually have on.  Inline
"optimization" has never worked as far as I can tell, so I usually
have that off.

> I do wonder whether there's some tweak we could make to page-flags.h
> which would stop this nonsense.  Change the inline functions back to
> macros?  I suspect that by itself wouldn't work, and my quick attempt
> to try it failed abysmally to compile, I've not the cpp foo needed.

I think we want to get away from macros as much as possible.

> A part of the problem may be that test_bit() etc. are designed for
> arrays of unsigned longs, but page->flags is only the one unsigned long:
> maybe gcc loses track of the optimizations available for that case when
> CONFIG_OPTIMIZE_INLINING=y.
>
> Hah, I've just noticed the defaults in arch/x86/configs -
> you might want to change those...

Argh.. Yes.... CONFIG_OPTIMIZE_INLINING must only be enabled if the
compiler can actually create a benefit from optimizing the
inlining. That could be in gcc 4.5 or so I guess.


* Re: CONFIG_OPTIMIZE_INLINING fun
From: Hugh Dickins @ 2008-11-14 19:38 UTC
  To: Christoph Lameter; +Cc: Ingo Molnar, Nick Piggin, linux-kernel

On Fri, 14 Nov 2008, Christoph Lameter wrote:
> On Fri, 14 Nov 2008, Hugh Dickins wrote:
> > (I was intending to downgrade these BUG_ONs to VM_BUG_ONs anyway, but
> > this example makes that seem highly desirable):
> >
> > void __delete_from_swap_cache(struct page *page)
> > {
> > 	BUG_ON(!PageLocked(page));
> > 	BUG_ON(!PageSwapCache(page));
> > 	BUG_ON(PageWriteback(page));
> > 	BUG_ON(PagePrivate(page));
> 
> Maybe checking all the conditionals in one BUG_ON is not such a bad idea
> after all? 4 useless branches in a hotpath?

That is usually frowned on, since on those rare occasions you hit
the BUG_ON, you're left wondering which of the four cases it hit.
(I suppose it might show up in the stack trace or registers.)

What I was intending anyway, quite independently of the INLINING
issue, was changing those and some others to VM_BUG_ONs, which are
intended really for VM testers rather than for distros to turn on.
(Though perhaps Nick has shifted his position on that.)

And the last of them, the PagePrivate one, I was just going to
remove along with quite a few other BUG_ON(PagePrivate)s: they were
interesting at the time we started storing swpentry in page->private,
without setting PagePrivate, when before page->private had always(?)
contained a buffer pointer with PagePrivate set; but of no interest
nowadays, I'd say.

> > I do wonder whether there's some tweak we could make to page-flags.h
> > which would stop this nonsense.  Change the inline functions back to
> > macros?  I suspect that by itself wouldn't work, and my quick attempt
> > to try it failed abysmally to compile, I've not the cpp foo needed.
> 
> I think we want to get away from macros as much as possible.

That is indeed the orthodoxy.  I've never been so sold on it as most
(there are cases when inlines give superior typechecking, but often
the use of the macro will catch wrong types anyway).  But I suspect
it's irrelevant, that changing those functions to macros would not
actually have any effect on the problem - that's what we've often
been assured, anyway, that the compiler nowadays does inlines as
efficiently as the preprocessor does macros.  I do wonder though.

Hugh


* Re: CONFIG_OPTIMIZE_INLINING fun
From: Christoph Lameter @ 2008-11-14 19:46 UTC
  To: Hugh Dickins; +Cc: Ingo Molnar, Nick Piggin, linux-kernel

On Fri, 14 Nov 2008, Hugh Dickins wrote:

> What I was intending anyway, quite independently of the INLINING
> issue, was changing those and some others to VM_BUG_ONs, which are
> intended really for VM testers rather than for distros to turn on.
> (Though perhaps Nick has shifted his position on that.)

Some distros have a bad habit of turning these on for production releases.

> That is indeed the orthodoxy.  I've never been so sold on it as most
> (there are cases when inlines give superior typechecking, but often
> the use of the macro will catch wrong types anyway).  But I suspect
> it's irrelevant, that changing those functions to macros would not
> actually have any effect on the problem - that's what we've often
> been assured, anyway, that the compiler nowadays does inlines as
> efficiently as the preprocessor does macros.  I do wonder though.

Maybe try to compare it with an old kernel that still has the page flags
macros? That way we would have a testcase useful for bringing to the
attention of the gcc people.



* Re: CONFIG_OPTIMIZE_INLINING fun
From: Hugh Dickins @ 2008-11-15  0:19 UTC
  To: Christoph Lameter; +Cc: Ingo Molnar, Nick Piggin, linux-kernel

On Fri, 14 Nov 2008, Christoph Lameter wrote:
> On Fri, 14 Nov 2008, Hugh Dickins wrote:
> 
> > What I was intending anyway, quite independently of the INLINING
> > issue, was changing those and some others to VM_BUG_ONs, which are
> > intended really for VM testers rather than for distros to turn on.
> > (Though perhaps Nick has shifted his position on that.)
> 
> Some distros have a bad habit of turning these on for production releases.

Okay, then they'll get what they've asked for.  DEBUG_VM has
an outstandingly appropriate help text, Nick deserves a medal:

  Enable this to turn on extended checks in the virtual-memory system
  that may impact performance.

  If unsure, say N.

> > But I suspect
> > it's irrelevant, that changing those functions to macros would not
> > actually have any effect on the problem - that's what we've often
> > been assured, anyway, that the compiler nowadays does inlines as
> > efficiently as the preprocessor does macros.  I do wonder though.
> 
> Maybe try to compare it with an old kernel that still has the page flags
> macros? That way we would have a testcase useful for bringing to the
> attention of the gcc people.

I've now put a 2.6.25 #define-style page-flags.h into my 2.6.28-rc4
build tree, added in enough of the missing stuff to build my config,
and rebuilt with gccs 4.2.1 and 4.3.2.

You'll be relieved to hear that using the macros makes no difference:
well, of course it eliminates all those " Page" functions with 4.2.1,
hard for it not to; but leaves just as many constant_test_bit()s.

Hugh


* Re: CONFIG_OPTIMIZE_INLINING fun
From: Sam Ravnborg @ 2008-11-15  7:57 UTC
  To: Hugh Dickins; +Cc: Ingo Molnar, Christoph Lameter, Nick Piggin, linux-kernel

> 
> I do wonder whether there's some tweak we could make to page-flags.h
> which would stop this nonsense.  Change the inline functions back to
> macros?  I suspect that by itself wouldn't work, and my quick attempt
> to try it failed abysmally to compile, I've not the cpp foo needed.

Try declaring the inline functions as __always_inline.
That undoes the effect of CONFIG_OPTIMIZE_INLINING=y for those functions.
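
For one of the page-flags.h tests that would mean something like the
following (the expanded form of the TESTPAGEFLAG() macro, sketched by
hand rather than taken from the tree):

static __always_inline int PageLocked(struct page *page)
{
	return test_bit(PG_locked, &page->flags);
}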

	Sam


* Re: CONFIG_OPTIMIZE_INLINING fun
From: Andi Kleen @ 2008-11-15 13:21 UTC
  To: Hugh Dickins; +Cc: Ingo Molnar, Christoph Lameter, Nick Piggin, linux-kernel

Hugh Dickins <hugh@veritas.com> writes:

> I'm wondering whether we need this patch: though perhaps it doesn't
> matter, since OPTIMIZE_INLINING is already under kernel hacking and
> defaulted off and there expressly for gathering feedback...
>
> --- 2.6.28-rc4/arch/x86/Kconfig.debug	2008-10-24 09:27:47.000000000 +0100
> +++ linux/arch/x86/Kconfig.debug	2008-11-14 16:26:15.000000000 +0000
> @@ -302,6 +302,7 @@ config CPA_DEBUG
>  
>  config OPTIMIZE_INLINING
>  	bool "Allow gcc to uninline functions marked 'inline'"
> +	depends on !CC_OPTIMIZE_FOR_SIZE
>  	help
>  	  This option determines if the kernel forces gcc to inline the functions
>  	  developers have marked 'inline'. Doing so takes away freedom from gcc to
>
> I've been building with CC_OPTIMIZE_FOR_SIZE=y and OPTIMIZE_INLINING=y
> for a while, but I've now taken OPTIMIZE_INLINING off, after noticing
> the 83 " Page" and 202 constant_test_bit functions in my System.map:
> it appears that the functions in include/linux/page-flags.h (perhaps
> others I've not noticed) make OPTIMIZE_INLINING behave very stupidly
> when CC_OPTIMIZE_FOR_SIZE is on (and somewhat even when off).

I believe newer gcc 4.3/4.4 have improved heuristics that fix this
case, by better accounting for code that gets optimized away before
making inlining decisions.  But that doesn't help with older compilers.

The right fix imho is to mark all of these functions which require 
inlining as __always_inline. In theory this would be all of bitops.h
but my understanding is that for inlines with only a single
asm statement the inliner heuristics in gcc already work well.
The problem is only in more complex functions like __constant_test_bit.

Initial patch for bitops.h appended.

-Andi

---

Mark complex bitops.h inlines as __always_inline

Hugh Dickins noticed that, when the kernel is built for code size,
older gcc versions didn't inline some of the bitops.

Mark all complex x86 bitops that have more than a single
asm statement or two as always inline to avoid this problem.

Probably should be done for other architectures too.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 arch/x86/include/asm/bitops.h      |   15 +++++++++++----
 include/asm-generic/bitops/__ffs.h |    2 +-
 include/asm-generic/bitops/__fls.h |    2 +-
 include/asm-generic/bitops/fls.h   |    2 +-
 include/asm-generic/bitops/fls64.h |    4 ++--
 5 files changed, 16 insertions(+), 9 deletions(-)

Index: linux-2.6.28-rc4-test/arch/x86/include/asm/bitops.h
===================================================================
--- linux-2.6.28-rc4-test.orig/arch/x86/include/asm/bitops.h	2008-10-24 13:34:40.000000000 +0200
+++ linux-2.6.28-rc4-test/arch/x86/include/asm/bitops.h	2008-11-15 14:16:10.000000000 +0100
@@ -3,6 +3,9 @@
 
 /*
  * Copyright 1992, Linus Torvalds.
+ *
+ * Note: inlines with more than a single statement should be marked
+ * __always_inline to avoid problems with older gcc's inlining heuristics.
  */
 
 #ifndef _LINUX_BITOPS_H
@@ -53,7 +56,8 @@
  * Note that @nr may be almost arbitrarily large; this function is not
  * restricted to acting on a single-word quantity.
  */
-static inline void set_bit(unsigned int nr, volatile unsigned long *addr)
+static __always_inline void
+set_bit(unsigned int nr, volatile unsigned long *addr)
 {
 	if (IS_IMMEDIATE(nr)) {
 		asm volatile(LOCK_PREFIX "orb %1,%0"
@@ -90,7 +94,8 @@
  * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
  * in order to ensure changes are visible on other processors.
  */
-static inline void clear_bit(int nr, volatile unsigned long *addr)
+static __always_inline void
+clear_bit(int nr, volatile unsigned long *addr)
 {
 	if (IS_IMMEDIATE(nr)) {
 		asm volatile(LOCK_PREFIX "andb %1,%0"
@@ -196,7 +201,8 @@
  *
  * This is the same as test_and_set_bit on x86.
  */
-static inline int test_and_set_bit_lock(int nr, volatile unsigned long *addr)
+static __always_inline int
+test_and_set_bit_lock(int nr, volatile unsigned long *addr)
 {
 	return test_and_set_bit(nr, addr);
 }
@@ -292,7 +298,8 @@
 	return oldbit;
 }
 
-static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
+static __always_inline int
+constant_test_bit(int nr, const volatile unsigned long *addr)
 {
 	return ((1UL << (nr % BITS_PER_LONG)) &
 		(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
Index: linux-2.6.28-rc4-test/include/asm-generic/bitops/__ffs.h
===================================================================
--- linux-2.6.28-rc4-test.orig/include/asm-generic/bitops/__ffs.h	2006-04-03 16:06:13.000000000 +0200
+++ linux-2.6.28-rc4-test/include/asm-generic/bitops/__ffs.h	2008-11-15 14:18:01.000000000 +0100
@@ -9,7 +9,7 @@
  *
  * Undefined if no bit exists, so code should check against 0 first.
  */
-static inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long __ffs(unsigned long word)
 {
 	int num = 0;
 
Index: linux-2.6.28-rc4-test/include/asm-generic/bitops/__fls.h
===================================================================
--- linux-2.6.28-rc4-test.orig/include/asm-generic/bitops/__fls.h	2008-05-08 12:56:05.000000000 +0200
+++ linux-2.6.28-rc4-test/include/asm-generic/bitops/__fls.h	2008-11-15 14:18:12.000000000 +0100
@@ -9,7 +9,7 @@
  *
  * Undefined if no set bit exists, so code should check against 0 first.
  */
-static inline unsigned long __fls(unsigned long word)
+static __always_inline unsigned long __fls(unsigned long word)
 {
 	int num = BITS_PER_LONG - 1;
 
Index: linux-2.6.28-rc4-test/include/asm-generic/bitops/fls.h
===================================================================
--- linux-2.6.28-rc4-test.orig/include/asm-generic/bitops/fls.h	2006-04-03 16:06:13.000000000 +0200
+++ linux-2.6.28-rc4-test/include/asm-generic/bitops/fls.h	2008-11-15 14:17:34.000000000 +0100
@@ -9,7 +9,7 @@
  * Note fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32.
  */
 
-static inline int fls(int x)
+static __always_inline int fls(int x)
 {
 	int r = 32;
 
Index: linux-2.6.28-rc4-test/include/asm-generic/bitops/fls64.h
===================================================================
--- linux-2.6.28-rc4-test.orig/include/asm-generic/bitops/fls64.h	2008-05-08 12:56:05.000000000 +0200
+++ linux-2.6.28-rc4-test/include/asm-generic/bitops/fls64.h	2008-11-15 14:17:38.000000000 +0100
@@ -15,7 +15,7 @@
  * at position 64.
  */
 #if BITS_PER_LONG == 32
-static inline int fls64(__u64 x)
+static __always_inline int fls64(__u64 x)
 {
 	__u32 h = x >> 32;
 	if (h)
@@ -23,7 +23,7 @@
 	return fls(x);
 }
 #elif BITS_PER_LONG == 64
-static inline int fls64(__u64 x)
+static __always_inline int fls64(__u64 x)
 {
 	if (x == 0)
 		return 0;


-- 
ak@linux.intel.com


* Re: CONFIG_OPTIMIZE_INLINING fun
From: Hugh Dickins @ 2008-11-15 15:49 UTC
  To: Andi Kleen
  Cc: Ingo Molnar, Christoph Lameter, Nick Piggin, Sam Ravnborg, linux-kernel

On Sat, 15 Nov 2008, Andi Kleen wrote:
> Hugh Dickins <hugh@veritas.com> writes:
> 
> > I'm wondering whether we need this patch: though perhaps it doesn't
> > matter, since OPTIMIZE_INLINING is already under kernel hacking and
> > defaulted off and there expressly for gathering feedback...
> >
> > --- 2.6.28-rc4/arch/x86/Kconfig.debug	2008-10-24 09:27:47.000000000 +0100
> > +++ linux/arch/x86/Kconfig.debug	2008-11-14 16:26:15.000000000 +0000
> > @@ -302,6 +302,7 @@ config CPA_DEBUG
> >  
> >  config OPTIMIZE_INLINING
> >  	bool "Allow gcc to uninline functions marked 'inline'"
> > +	depends on !CC_OPTIMIZE_FOR_SIZE
> >  	help
> >  	  This option determines if the kernel forces gcc to inline the functions
> >  	  developers have marked 'inline'. Doing so takes away freedom from gcc to
> >
> > I've been building with CC_OPTIMIZE_FOR_SIZE=y and OPTIMIZE_INLINING=y
> > for a while, but I've now taken OPTIMIZE_INLINING off, after noticing
> > the 83 " Page" and 202 constant_test_bit functions in my System.map:
> > it appears that the functions in include/linux/page-flags.h (perhaps
> > others I've not noticed) make OPTIMIZE_INLINING behave very stupidly
> > when CC_OPTIMIZE_FOR_SIZE is on (and somewhat even when off).
> 
> I believe newer gcc 4.3/4.4 have improved heuristics that fix this
> case, by better accounting for code that gets optimized away before
> making inlining decisions.  But that doesn't help with older compilers.

gcc 4.3.2 wasn't doing well enough: it was still generating lots
of constant_test_bits, even without CONFIG_CC_OPTIMIZE_FOR_SIZE.
Mind you, I've just checked its __delete_from_swap_cache is okay,
doing efficient tests without calling anything else.  Maybe 4.3.2
is actually making a good judgement of when constant_test_bit is
appropriate (maybe: I don't know) - and the problem is rather that
we get so many copies of the same code, one per compilation unit.

I'd have liked to try 4.4, but seem unable to download what's needed
for it today: maybe someone else has a snapshot already and can say
whether it too puts lots of constant_test_bit lines in System.map
when the kernel is built with CONFIG_OPTIMIZE_INLINING=y (with and
without CONFIG_CC_OPTIMIZE_FOR_SIZE=y).

> 
> The right fix imho is to mark all of these functions which require 
> inlining as __always_inline. In theory this would be all of bitops.h
> but my understanding is that for inlines with only a single
> asm statement the inliner heuristics in gcc already work well.
> The problem is only in more complex functions like __constant_test_bit.

I did follow Sam's advice, and once I'd changed all the inlines
in include/linux/page-flags.h and include/linux/page_cgroup.h
and arch/x86/include/asm/bitops.h to __always_inline, then yes,
gcc 4.2.1 stopped giving me lots of constant_test_bits and " Page"
functions with CC_OPTIMIZE_FOR_SIZE=y and OPTIMIZE_INLINING=y, and
the code generated for __delete_from_swap_cache was reasonable.

But how do we know to stop there?  that's just as far as I'd
noticed.  It is quite conceivable that constant_test_bit poses the
biggest challenge to gcc, but there may be other bad examples too.

If we're reduced to adding __always_inline in lots of places,
doesn't that mean CONFIG_OPTIMIZE_INLINING just isn't working?
that we'd be better just to switch it off?  Because later on
some other improvement will be made, and we'll want the things
already marked __always_inline to become __usually_inline_but_
_not_if_new_gcc_knows_better, etc.  Our unceasing inline battles.

I'd have thought it better to leave the "inline"s as they are,
but retune the __GNUC__ version in compiler-gcc.h, and reflect
what's known in the advice in the OPTIMIZE_INLINING help text.
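
The mechanism referred to looks roughly like this in 2.6.28's
include/linux/compiler-gcc.h (quoted approximately, from memory):

#if !defined(CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING) || \
    !defined(CONFIG_OPTIMIZE_INLINING) || (__GNUC__ < 4)
/* force inlining unless the arch opted in and gcc is new enough */
# define inline	inline	__attribute__((always_inline))
#endif

Retuning the __GNUC__ test would mean, say, requiring gcc 4.4 or
newer before OPTIMIZE_INLINING is allowed to take effect.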

But I'm certainly no expert on this.

Hugh


* Re: CONFIG_OPTIMIZE_INLINING fun
From: Andi Kleen @ 2008-11-15 16:25 UTC
  To: Hugh Dickins
  Cc: Ingo Molnar, Christoph Lameter, Nick Piggin, Sam Ravnborg,
	linux-kernel, jh


> gcc 4.3.2 wasn't doing well enough: it was still generating lots
> of constant_test_bits, even without CONFIG_CC_OPTIMIZE_FOR_SIZE.

Hmm I thought the heuristics had been improved in 4.3, or maybe it
was 4.4 only.

> Mind you, I've just checked its __delete_from_swap_cache is okay,
> doing efficient tests without calling anything else.  Maybe 4.3.2
> is actually making a good judgement of when constant_test_bit is
> appropriate (maybe: I don't know) 

constant_test_bit is supposed to be optimized away entirely, so
ideally it should never end up out of line.  The problem with the
older gcc heuristics was that they just counted code size to
determine whether a function is suitable for inlining
or not, without checking how much of it can be
constant-evaluated first.

> - and the problem is rather that
> we get so many copies of the same code, one per compilation unit.
> 
> I'd have liked to try 4.4, but seem unable to download what's needed
> for it today: maybe someone else has a snapshot already and can say
> whether it too puts lots of constant_test_bit lines in System.map
> when the kernel is built with CONFIG_OPTIMIZE_INLINING=y (with and
> without CONFIG_CC_OPTIMIZE_FOR_SIZE=y).

I tried with the older gcc 4.4 snapshot I had around, but it
ICEed on setup_percpu.c unfortunately.

> I did follow Sam's advice, and once I'd changed all the inlines
> in include/linux/page-flags.h and include/linux/page_cgroup.h
> and arch/x86/include/asm/bitops.h to __always_inline, then yes,
> gcc 4.2.1 stopped giving me lots of constant_test_bits and " Page"
> functions with CC_OPTIMIZE_FOR_SIZE=y and OPTIMIZE_INLINING=y, and
> the code generated for __delete_from_swap_cache was reasonable.
> 
> But how do we know to stop there?  that's just as far as I'd
> noticed.  It is quite conceivable that constant_test_bit poses the
> biggest challenge to gcc, but there may be other bad examples too.

As long as there isn't significant code in the inline that
is expected to be optimized away, this shouldn't be needed.

BTW an alternative would be to also tweak the respective
thresholds in gcc using --param, but the problem is that
the behaviour of those changes between gcc versions, so I don't
think it's a good idea.

> If we're reduced to adding __always_inline in lots of places,
> doesn't that mean CONFIG_OPTIMIZE_INLINING just isn't working?

It's not working for the case when a lot of code in the inline
is expected to be optimized away (as is the case with constant_test_bit
which is expected to be 100% optimized). Otherwise it should work.

BTW that is why it is also not necessary to mark
all of bitops.h, only a few functions there (see my patch).


> that we'd be better just to switch it off?  Because later on
> some other improvement will be made, and we'll want the things
> already marked __always_inline to become __usually_inline_but_
> _not_if_new_gcc_knows_better, etc.  Our unceasing inline battles.
> 
> I'd have thought it better to leave the "inline"s as they are,
> but retune the __GNUC__ version in compiler-gcc.h, and reflect
> what's known in the advice in the OPTIMIZE_INLINING help text.

If we had a lot of inlines like that I would I agree with you,
but I think there are few enough of them that manually
marking them is better.

-Andi


* Re: CONFIG_OPTIMIZE_INLINING fun
From: Andi Kleen @ 2009-02-11  8:32 UTC
  To: Peter Cordes; +Cc: linux-kernel


Thanks for the detailed analysis.

It would probably be most productive if you could extract
compilable test cases and file them in gcc bugzilla, so that the gcc
developers can address these problems.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

