On Wednesday 23 July 2003 22:27, Christoph Hellwig wrote:
> On Wed, Jul 23, 2003 at 01:22:56PM -0700, David S. Miller wrote:
> > Drivers weren't audited much, and there's a lot of boneheaded
> > stuff in this area. But these should be mostly identical
> > to what would happen on the 2.4.x side
>
> Please read the original message again - he stated that every single
> module in fs/ got a lot bigger - if it gets smaller or at least the
> same size as 2.4 it's clearly a sign of inlines gone mad in the
> filesystem/VM code and we need to look at that. If not we have to look
> elsewhere.

In my humble opinion:

In 2.4.20 (m68knommu):
-------------------------------------------------------------------------
#define current _current_task
-------------------------------------------------------------------------

In 2.6.0-test1 (m68knommu):
-------------------------------------------------------------------------
#define current get_current()

static inline struct task_struct *get_current(void)
{
	return current_thread_info()->task;
}

static inline struct thread_info *current_thread_info(void)
{
	struct thread_info *ti;
	__asm__(
		"move.l %%sp, %0 \n\t"
		"and.l  %1, %0"
		: "=&d" (ti)
		: "d" (~(THREAD_SIZE - 1))
	);
	return ti;
}
-------------------------------------------------------------------------

The latter expands to:

   0:	movel #-8192,%d0
   6:	movel %sp,%d2
   8:	andl %d0,%d2
   a:	moveal %d2,%a1
   c:	moveal %a1@,%a0
   e:	moveal %a0@(92),%a0
  12:

It's a sequence of 6 instructions, 18 bytes long, clobbering 4 registers.
The compiler cannot see around it.

"current" is used very liberally all over the kernel, as in this code
snippet from fs/open.c:

	old_fsuid = current->fsuid;
	old_fsgid = current->fsgid;
	old_cap = current->cap_effective;

	current->fsuid = current->uid;
	current->fsgid = current->gid;

	if (current->uid)
		cap_clear(current->cap_effective);
	else
		current->cap_effective = current->cap_permitted;

This takes 18*11 = 198 bytes just for invoking the 'current' macro
so many times.
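For comparison, a common workaround (not proposed in the thread itself) is to cache `current` in a local pointer once per function, so the expensive expansion is emitted a single time instead of once per use. The struct and `get_current()` below are simplified stand-ins, not the real kernel definitions:

```c
#include <assert.h>

/* Simplified stand-in for the real kernel task_struct. */
struct task_struct {
	int fsuid, fsgid, uid, gid;
};

static struct task_struct the_task = { .fsuid = 1, .fsgid = 2, .uid = 0, .gid = 0 };

/* Stand-in for the expensive stack-masking expansion of get_current(). */
static struct task_struct *get_current(void)
{
	return &the_task;
}
#define current get_current()

/* Without caching, every `current` re-runs the expansion (11 times in
 * the fs/open.c snippet above); with a local, it runs exactly once. */
static void set_fs_ids_cached(void)
{
	struct task_struct *tsk = current;	/* expanded once */

	tsk->fsuid = tsk->uid;
	tsk->fsgid = tsk->gid;
}
```

On an arch where `current` is an 18-byte, 4-register sequence, this trades one pointer-sized stack slot for ten avoided expansions.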
Perhaps adding __attribute__((const)) on current_thread_info() and
get_current() would help eliminate some unnecessary accesses.

--
 // Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/ http://www.develer.com/

Please don't send Word attachments -
http://www.gnu.org/philosophy/no-word-attachments.html
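A minimal sketch of that suggestion, with simplified stand-ins for the kernel types. `__attribute__((const))` tells GCC the function's result depends on nothing but its arguments, so consecutive calls may be merged; whether the merge actually happens depends on the optimization level, but the observable result is the same either way:

```c
#include <assert.h>

struct task_struct { int uid; };
static struct task_struct the_task = { .uid = 42 };

/* `const` promises no reads of mutable global state and no side effects,
 * which licenses GCC to CSE repeated calls into one.  This is only safe
 * for get_current() because a task cannot change stacks mid-function. */
__attribute__((const))
static struct task_struct *get_current(void)
{
	return &the_task;
}

static int read_uid_twice(void)
{
	/* The compiler is free to emit a single call here. */
	return get_current()->uid + get_current()->uid;
}
```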
On Mer, 2003-07-23 at 23:35, Bernardo Innocenti wrote:
> It's a sequence of 6 instructions, 18 bytes long, clobbering 4 registers.
> The compiler cannot see around it.
> This takes 18*11 = 198 bytes just for invoking the 'current'
> macro so many times.
Unless you support SMP I'm not sure I understand why m68k nommu changed
from using a global for current_task ?
On Thursday 24 July 2003 00:37, Alan Cox wrote:
> On Mer, 2003-07-23 at 23:35, Bernardo Innocenti wrote:
> > It's a sequence of 6 instructions, 18 bytes long, clobbering 4 registers.
> > The compiler cannot see around it.
> > This takes 18*11 = 198 bytes just for invoking the 'current'
> > macro so many times.
>
> Unless you support SMP I'm not sure I understand why m68k nommu changed
> from using a global for current_task ?

The people who might know best are Greg and David from SnapGear.
I'm appending them to the Cc list.

But I noticed that most archs in 2.6 do it this way. Is it some kind
of flock effect? Things get changed in i386 and all the other archs
just follow... :-)

--
 // Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/ http://www.develer.com/
Jivin Bernardo Innocenti lays it down ...
> On Thursday 24 July 2003 00:37, Alan Cox wrote:
> > On Mer, 2003-07-23 at 23:35, Bernardo Innocenti wrote:
> > > It's a sequence of 6 instructions, 18 bytes long, clobbering 4 registers.
> > > The compiler cannot see around it.
> > > This takes 18*11 = 198 bytes just for invoking the 'current'
> > > macro so many times.
> >
> > Unless you support SMP I'm not sure I understand why m68k nommu changed
> > from using a global for current_task ?
>
> The people who might know best are Greg and David from SnapGear.
> I'm appending them to the Cc list.
>
> But I noticed that most archs in 2.6 do it this way. Is it some kind
> of flock effect? Things get changed in i386 and all the other archs
> just follow... :-)

It's a little this way for sure.

Back when I first did the 2.4 uClinux port, the m68k MMU code was
dedicating a register (a2) for current. I thought that was a bad idea
given how often you run out of registers on the 68k, and made it a
global. Because it was still effectively a pointer, the code size
change was not a factor. I just didn't want to give up a register.
So that is the 2.4 history, and it has served us well so far ;-)

On the 2.5/2.6 front, I think the change comes from the 8K (2 page) task
structure and everyone just masking the kernel stack pointer to get the
task pointer. Gerg would know for sure, he did the 2.5 work in this area.
We should easily be able to switch back to the current_task pointer with
a few small mods to entry.S.

A general comment on the use of inline throughout the kernel. Although
inlines may show gains on x86 platforms, they often perform worse on
embedded processors with limited cache, as well as adding size. I can't
see any way of coding around this, though. As long as x86 is the driving
influence, other platforms will just have to deal with it as best they
can.
Cheers,
Davidm

--
David McCullough, davidm@snapgear.com  Ph:+61 7 34352815  http://www.SnapGear.com
Custom Embedded Solutions + Security   Fx:+61 7 38913630  http://www.uCdot.org
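The stack-masking trick David mentions can be illustrated with a small user-space simulation (assuming C11 `aligned_alloc`; the names mirror the kernel, but none of this is the real implementation): with the thread_info at the base of an 8 KB-aligned stack block, masking any in-stack address with `~(THREAD_SIZE - 1)` recovers it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE 8192	/* the 8K (2 page) task/stack block */

struct task_struct { int pid; };

struct thread_info {
	struct task_struct *task;	/* pointer back to the owning task */
};

/* Recover the thread_info from any address inside the stack block. */
static struct thread_info *thread_info_from(uintptr_t sp)
{
	return (struct thread_info *)(sp & ~((uintptr_t)THREAD_SIZE - 1));
}

static struct task_struct demo_task = { .pid = 1 };

static struct task_struct *demo(void)
{
	/* Simulate an 8 KB-aligned kernel stack, thread_info at its base. */
	void *stack = aligned_alloc(THREAD_SIZE, THREAD_SIZE);
	assert(stack != NULL);
	struct thread_info *ti = stack;
	ti->task = &demo_task;

	/* Any address within the block masks back down to the base. */
	uintptr_t sp = (uintptr_t)stack + THREAD_SIZE - 64;
	struct task_struct *t = thread_info_from(sp)->task;

	free(stack);
	return t;
}
```

The appeal for SMP is that no global or per-CPU variable is needed: each CPU's stack pointer already encodes which task it is running.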
Bernardo Innocenti wrote:
> On Wednesday 23 July 2003 22:27, Christoph Hellwig wrote:
>
>> On Wed, Jul 23, 2003 at 01:22:56PM -0700, David S. Miller wrote:
>>> Drivers weren't audited much, and there's a lot of boneheaded
>>> stuff in this area. But these should be mostly identical
>>> to what would happen on the 2.4.x side
>>
>> Please read the original message again - he stated that every single
>> module in fs/ got a lot bigger - if it gets smaller or at least the
>> same size as 2.4 it's clearly a sign of inlines gone mad in the
>> filesystem/VM code and we need to look at that. If not we have to look
>> elsewhere.
>
> In my humble opinion:
>
> In 2.4.20 (m68knommu):
> -------------------------------------------------------------------------
> #define current _current_task
> -------------------------------------------------------------------------
>
> In 2.6.0-test1 (m68knommu):
> -------------------------------------------------------------------------
> static inline struct task_struct *get_current(void)
> {
[cut]
> }
> static inline struct thread_info *current_thread_info(void)
> {
[cut]
> }
> -------------------------------------------------------------------------
>
> This takes 18*11 = 198 bytes just for invoking the 'current'
> macro so many times.

Just curious: is there any way to tell one kind of 'inline' from the
other?

I mean the 'inline' which means "this has to be inlined or it will
break", versus the 'inline' which means "inline this please - it adds
only 10k of code bloat and improves performance in my suppa-puppa-bench
by 0.000001%!"

Strictly speaking - separate 'inline' into 'require_inline' and
'better_inline', so people who really care about image size can turn
'better_inline' into a no-op without harm to functionality.

Actually I saw real performance improvements on my Pentium MMX 133
(it has 16k i-cache + 16k d-cache, I believe) when I was cutting some
of the inlines out - and I'm not talking about (cache-poor) embedded
systems...
David McCullough wrote:
>
> A general comment on the use of inline throughout the kernel. Although
> they may show gains on x86 platforms, they often perform worse on
> embedded processors with limited cache, as well as adding size. I
> can't see any way of coding around this though. As long as x86 is the
> driving influence, other platforms will just have to deal with it as
> best they can.
>
Actually I'm a victim of over-inlining too. Or was, at least.
I was running some routers on old Pentiums. I remember an almost
dramatic drop in performance with newer kernels because of inlining in
net/*. But sure, on a Xeon P4 it boosts performance...
Here's my point: we have the classic situation of a mismatch between
representation and intentions.

The representation is 'inline', but the intentions are 'inline or it
will break' _and_ 'inline - it runs faster'.
These obviously should be separated.

Even more:
#define INLINE_LEVEL some_platform_specific_number
---------
#define inline0 inline_always
#if INLINE_LEVEL >= 1
# define inline1 inline_always
#else
# define inline1
#endif
...
#if INLINE_LEVEL >= N
# define inlineN inline_always
#else
# define inlineN
#endif
and so on, giving each platform a chance to influence the amount of
inlining. It would be better to put this into the config, with defaults
defined per platform.
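A hypothetical realization of the level scheme above - `inline_always` is rendered here with GCC's `always_inline` attribute, and the level is hard-coded rather than coming from a platform config as the proposal intends:

```c
#include <assert.h>

#define INLINE_LEVEL 1	/* would be platform-specific in the proposal */

/* GCC's always_inline requires the function also be declared inline. */
#define inline_always inline __attribute__((always_inline))

#define inline0 inline_always	/* level 0: inlined everywhere */

#if INLINE_LEVEL >= 1
# define inline1 inline_always	/* inlined on this platform */
#else
# define inline1		/* plain function: out of line */
#endif

#if INLINE_LEVEL >= 2
# define inline2 inline_always
#else
# define inline2
#endif

/* Example use: behavior is identical either way; only code layout
 * (and therefore image size and icache footprint) changes. */
static inline1 int add_one(int x)
{
	return x + 1;
}
```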
On Iau, 2003-07-24 at 06:06, David McCullough wrote:
> Back when I first did the 2.4 uClinux port, the m68k MMU code was
> dedicating a register (a2) for current. I thought that was a bad idea
> given how often you run out of registers on the 68k, and made it a

On some platforms a global register current was a win - I can't speak
for m68k - since current is used a lot.

> On the 2.5/2.6 front, I think the change comes from the 8K (2 page) task
> structure and everyone just masking the kernel stack pointer to get the
> task pointer. Gerg would know for sure, he did the 2.5 work in this area.
> We should be easily able to switch back to the current_task pointer with a
> few small mods to entry.S.

A lot of platforms went this way because "current" is hard to do right
on an SMP box. It's effectively per-CPU, and that means you either set
up the MMU to do per-CPU pages (via segments or tables), which is a
pain, or you do the stack trick. For uniprocessor a global still works
perfectly well.

> A general comment on the use of inline throughout the kernel. Although
> they may show gains on x86 platforms, they often perform worse on
> embedded processors with limited cache, as well as adding size. I

Code size for critical paths is getting more and more performance-critical
on x86 as well as on embedded CPU systems. 3GHz superscalar processors
lose a lot of clocks to a memory stall.
Jivin Ihar Philips Filipau lays it down ...
> David McCullough wrote:
> >
> > A general comment on the use of inline throughout the kernel. Although
> > they may show gains on x86 platforms, they often perform worse on
> > embedded processors with limited cache, as well as adding size. I
> > can't see any way of coding around this though. As long as x86 is
> > the driving influence, other platforms will just have to deal with
> > it as best they can.
>
> Actually I'm a victim of over-inlining too. Or was, at least.
> I was running some routers on old Pentiums. I remember an almost
> dramatic drop in performance with newer kernels because of inlining in
> net/*. But sure, on a Xeon P4 it boosts performance...
>
> Here's my point: we have the classic situation of a mismatch between
> representation and intentions.
>
> The representation is 'inline', but the intentions are 'inline or it
> will break' _and_ 'inline - it runs faster'.
> These obviously should be separated.

The biggest problem I see is that the inlines are generally done in
header files, and to stop them from inlining, you need to be able to
switch from an inline to a prototype in the header file. The code from
the header then needs to be added to a .o somewhere in the build for
the case where the inlines are stripped out.

Other than providing non-critical inlines either on or off, I can't see
the level approach working all that well. A combination of levels that
works well on a few platforms may not work well at all on another.
Still, just the ability to reduce the inlines would be very useful.

Cheers,
Davidm

> Even more:
>
> #define INLINE_LEVEL some_platform_specific_number
>
> ---------
>
> #define inline0 inline_always
>
> #if INLINE_LEVEL >= 1
> # define inline1 inline_always
> #else
> # define inline1
> #endif
> ...
> #if INLINE_LEVEL >= N
> # define inlineN inline_always
> #else
> # define inlineN
> #endif
>
> and so on, giving each platform a chance to influence the amount of
> inlining.
> It would be better to put this into the config, with defaults defined
> per platform.
Jivin Alan Cox lays it down ...
> On Iau, 2003-07-24 at 06:06, David McCullough wrote:
> > Back when I first did the 2.4 uClinux port, the m68k MMU code was
> > dedicating a register (a2) for current. I thought that was a bad idea
> > given how often you run out of registers on the 68k, and made it a
>
> On some platforms a global register current was a win, I can't speak for
> m68k - current is used a lot.

I'm sure that using a register for current was the right thing to do at
the time. One problem with a global register approach is that the more
inlining the code uses, the more likely the compiler is going to want
that extra register :-)

> > On the 2.5/2.6 front, I think the change comes from the 8K (2 page) task
> > structure and everyone just masking the kernel stack pointer to get the
> > task pointer. Gerg would know for sure, he did the 2.5 work in this area.
> > We should be easily able to switch back to the current_task pointer with a
> > few small mods to entry.S.
>
> A lot of platforms went this way because "current" is hard to do right
> on an SMP box. It's effectively per-CPU, and that means you either set
> up the MMU to do per-CPU pages (via segments or tables), which is a
> pain, or you do the stack trick. For uniprocessor a global still works
> perfectly well.

Sounds like something that can at least be made conditional on SMP.
I'll look into it for m68knommu, since it is more likely to care about
"size" than SMP.

> > A general comment on the use of inline throughout the kernel. Although
> > they may show gains on x86 platforms, they often perform worse on
> > embedded processors with limited cache, as well as adding size. I
>
> Code size for critical paths is getting more and more performance-critical
> on x86 as well as on embedded CPU systems. 3GHz superscalar processors
> lose a lot of clocks to a memory stall.

So should the trend be away from inlining, especially larger functions ?
I know on m68k some of the really simple inlines are actually smaller
as an inline than as a function call. But they have to be very simple,
or only used once.

Cheers,
Davidm
On Iau, 2003-07-24 at 13:04, David McCullough wrote:
> So should the trend be away from inlining, especially larger functions ?
>
> I know on m68k some of the really simple inlines are actually smaller as
> an inline than as a function call. But they have to be very simple, or
> only used once.
Cool. As for trends, well, there are two conflicting ones - fewer
inlines, but also more code, because of adding fast paths to cut down
the conditionals on normal sequences of execution.
On Thursday, Jul 24, 2003, at 06:28 US/Central, Alan Cox wrote:
>
> Code size for critical paths is getting more and more performance
> critical
> on x86 as well as on the embedded CPU systems. 3Ghz superscalar
> processors
> lose a lot of clocks to a memory stall.
So you're arguing for more inlining, because icache speculative
prefetch will pick up the inlined code?
Or you're arguing for less, because code like get_current() which is
called frequently could have a single copy living in icache?
--
Hollis Blanchard
IBM Linux Technology Center
On Iau, 2003-07-24 at 16:30, Hollis Blanchard wrote:
> So you're arguing for more inlining, because icache speculative
> prefetch will pick up the inlined code?

I'm arguing for short inlined fast paths and non-inlined unusual paths.

> Or you're arguing for less, because code like get_current() which is
> called frequently could have a single copy living in icache?

Depends how much the jump costs you.
On Thursday, Jul 24, 2003, at 14:37 US/Central, Alan Cox wrote:
> On Iau, 2003-07-24 at 16:30, Hollis Blanchard wrote:
>> So you're arguing for more inlining, because icache speculative
>> prefetch will pick up the inlined code?
>
> I'm arguing for short inlined fast paths and non inlined unusual
> paths.
>
>> Or you're arguing for less, because code like get_current() which is
>> called frequently could have a single copy living in icache?
>
> Depends how much the jump costs you.
And also how big your icache is, and maybe even cpu/bus ratio, etc...
which depend on the arch of course.
So as I saw Ihar suggest earlier in this thread, perhaps there should
be two inline directives: must_inline (for code whose correctness
depends on it) and could_help_performance_inline. Then different archs
could #define could_help_performance_inline as appropriate.
--
Hollis Blanchard
IBM Linux Technology Center
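The two directives Hollis describes could be sketched like this (hypothetical names throughout; `CONFIG_SMALL_FOOTPRINT` is an invented knob, not a real kernel config option):

```c
#include <assert.h>

/* Correctness-critical: must never become an out-of-line call. */
#define must_inline inline __attribute__((always_inline))

/* Performance hint: each arch (or a size-conscious config) can turn
 * this into a plain out-of-line function. */
#ifdef CONFIG_SMALL_FOOTPRINT
# define could_help_performance_inline		/* out of line: smaller image */
#else
# define could_help_performance_inline inline	/* let the compiler decide */
#endif

/* Example of code whose shape matters: the stack-masking sequence. */
static must_inline unsigned long mask_stack(unsigned long sp, unsigned long size)
{
	return sp & ~(size - 1);
}

/* Example of a pure speed hint; behavior is identical either way. */
static could_help_performance_inline int square(int x)
{
	return x * x;
}
```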
On 07.24, Hollis Blanchard wrote:
> On Thursday, Jul 24, 2003, at 14:37 US/Central, Alan Cox wrote:
>
> > On Iau, 2003-07-24 at 16:30, Hollis Blanchard wrote:
> >> So you're arguing for more inlining, because icache speculative
> >> prefetch will pick up the inlined code?
> >
> > I'm arguing for short inlined fast paths and non inlined unusual
> > paths.
> >
> >> Or you're arguing for less, because code like get_current() which is
> >> called frequently could have a single copy living in icache?
> >
> > Depends how much the jump costs you.
>
> And also how big your icache is, and maybe even cpu/bus ratio, etc...
> which depend on the arch of course.
>
> So as I saw Ihar suggest earlier in this thread, perhaps there should
> be two inline directives: must_inline (for code whose correctness
> depends on it) and could_help_performance_inline. Then different archs
> could #define could_help_performance_inline as appropriate.
>
Or you just define must_inline and let gcc inline the rest of the
'inlines' based on its own rules about function size, adjusting the
parameters to gcc to ensure (more or less) that what is inlined fits in
the cache of the processor one is building for...
(this can be hard; help from gcc hackers will be needed...)
--
J.A. Magallon <jamagallon@able.es> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.22-pre7-jam1m (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-0.6mdk))
On Thu, Jul 24, 2003 at 11:20:00PM +0200, J.A. Magallon wrote:
> Or you just define must_inline, and let gcc inline the rest of 'inlines',
> based on its own rule of functions size, adjusting the parameters
> to gcc to assure (more or less) that what is inlined fits in cache of
> the processor one is building for...
> (this can be hard, help from gcc hackers will be needed...)
IMO just a CONFIG_INLINE_FUNCTIONS will work: if you want to conserve
space to the detriment of speed, simply don't select this option;
otherwise you get speed but a big kernel.
-solca
On 24 July 2003 11:13, Ihar "Philips" Filipau wrote:
> I mean the 'inline' which means "this has to be inlined or it will
> break" and the 'inline' which means "inline this please - it adds only
> 10k of code bloat and improves performance in my suppa-puppa-bench by
> 0.000001%!"
>
> Strictly speaking - separate 'inline' into 'require_inline' and
> 'better_inline'.
> So people who really care about image size can turn 'better_inline'
> into a no-op without harm to functionality.
> Actually I saw real performance improvements on my Pentium MMX 133
> (it has 16k i-cache + 16k d-cache, I believe) when I was cutting some
> of the inlines out - and I'm not talking about (cache-poor) embedded
> systems...
Which inlines? Let the list know.
--
vda
On Thursday, Jul 24, 2003, at 23:22 US/Central, Otto Solares wrote:
> On Thu, Jul 24, 2003 at 11:20:00PM +0200, J.A. Magallon wrote:
>> Or you just define must_inline, and let gcc inline the rest of
>> 'inlines',
>> based on its own rule of functions size, adjusting the parameters
>> to gcc to assure (more or less) that what is inlined fits in cache of
>> the processor one is building for...
>> (this can be hard, help from gcc hackers will be needed...)
>
> IMO just a CONFIG_INLINE_FUNCTIONS will work, if you
> want to conserve space in detriment of speed simply
> don't select this option, else you have speed but
> a big kernel.
Inlines don't always help performance (depending on cache sizes, branch
penalties, frequency of code access...), but they do always increase
code size.
I believe the point Alan was trying to make is not that we should have
more or fewer inlines, but that we should have smarter inlines; i.e.,
don't just inline a function to "make it fast" - think about the
implications (and ideally measure it, though I think that becomes
problematic when so many other factors can affect the benefit of a
single inlined function). The specific example he gave was inlining
code on the fast path, while accepting branch/cache penalties for
non-inlined code on the slow path.
--
Hollis Blanchard
IBM Linux Technology Center
Hollis Blanchard wrote:
> I believe the point Alan was trying to make is not that we should have
> more or less inlines, but we should have smarter inlines. I.E. don't
> just inline a function to "make it fast"; think about the implications
> (and ideally measure it, though I think that becomes problematic when so
> many other factors can affect the benefit of a single inlined function).
> The specific example he gave was inlining code on the fast path, while
> accepting branch/cache penalties for non-inlined code on the slow path.
>
But you cannot make this kind of decision universally.
Some kind of compromise should be found between arch maintainers and
subsystem maintainers.

Or beat the GCC developers hard so they finally produce a good
optimizing compiler ;-)
Or ask all kernel developers to work one hour per week on GCC
optimization - I bet GCC will outperform everything else in the
industry in less than one year ;-)))
To recap: the source of the problem is not inlines; the problem is the
compiler, which cannot yet read our minds and generate the code we
expected it to generate.
P.S. Offtopic. As I see it, Linux & Linus have made their optimization
decision. Linux, after all, is a capitalist creation: whoever has more
money controls everything. The server market has more money - they do
more work on the kernel, and their systems are not that far from
developers' workstations - so Linux gets more and more
server/workstation oriented. This will fit the desktop market too - if
your computer was made to run WinXP AKA exp(bloat), it will be capable
of running any OS. Linus repeating 'small is beautiful' sounds more and
more like a crude joke...

As for the embedded market - it is already deeply forked and far, far
away from vanilla kernels... Vanilla is really not that relevant to the
real world...
In article <20030724120441.GC16168@beast>,
David McCullough <davidm@snapgear.com> wrote:

| So should the trend be away from inlining, especially larger functions ?
|
| I know on m68k some of the really simple inlines are actually smaller as
| an inline than as a function call. But they have to be very simple, or
| only used once.

Actually, I would think that the compiler would make the decision in a
perfect world. (no smiley) Clearly some programmers think the compiler
isn't aggressive enough about this, and that may be the root problem.
Certainly if the compiler makes the choice then -Os should avoid the
inline.

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
In article <3F1F9531.2050204@softhome.net>,
Ihar "Philips" Filipau <filia@softhome.net> wrote:

| Just curious.
|
| Is there any way to tell one kind of 'inline' from the other?
|
| I mean the 'inline' which means "this has to be inlined or it will
| break" and the 'inline' which means "inline this please - it adds only
| 10k of code bloat and improves performance in my suppa-puppa-bench by
| 0.000001%!"
|
| Strictly speaking - separate 'inline' into 'require_inline' and
| 'better_inline'.
| So people who really care about image size can turn 'better_inline'
| into a no-op without harm to functionality.
| Actually I saw real performance improvements on my Pentium MMX 133
| (it has 16k i-cache + 16k d-cache, I believe) when I was cutting some
| of the inlines out - and I'm not talking about (cache-poor) embedded
| systems...

Actually you have a very different CPU-to-memory bandwidth ratio than a
processor manufactured in this millennium. I use a system like that for
testing, but please don't optimize for it!

Speculation of the day: I suspect that on some laptops which run
seriously slower on battery, the CPU/memory speed changes enough that
you could see and measure better performance with a 'slow' and a 'fast'
kernel. Just speculation, since I'm sure the gain would be down in the
noise - one of those 'difference without a distinction' things.