* Re: Semaphore assembly-code bug
       [not found]                         ` <Pine.LNX.4.61.0410291631250.8616@twinlark.arctic.org.suse.lists.linux.kernel>
@ 2004-10-30  2:04                           ` Andi Kleen
  0 siblings, 0 replies; 99+ messages in thread
From: Andi Kleen @ 2004-10-30  2:04 UTC (permalink / raw)
  To: dean gaudet
  Cc: linux-os, Andreas Steinmetz, Richard Henderson, Andi Kleen,
	Andrew Morton, Jan Hubicka, linux-kernel, torvalds

dean gaudet <dean-list-linux-kernel@arctic.org> writes:
> 
> it's worse than that in general -- lea typically goes through the AGU 
> which has either less throughput or longer latency than the ALUs... 
> depending on which x86en.  it's 4 cycles for a lea on p4, vs. 1 for a pop.  
> it's 2 cycles for a lea on k8 vs. 1 for a pop.

On D stepping and later K8 the lea is 1 cycle latency because the
decoder optimizes the lea into an add.

-Andi


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
       [not found]                         ` <Pine.LNX.4.58.0410291133220.28839@ppc970.osdl.org.suse.lists.linux.kernel>
@ 2004-10-30  2:13                           ` Andi Kleen
  2004-10-30  9:28                             ` Denis Vlasenko
  0 siblings, 1 reply; 99+ messages in thread
From: Andi Kleen @ 2004-10-30  2:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds <torvalds@osdl.org> writes:

> Anyway, it's quite likely that for several CPU's the fastest sequence ends 
> up actually being 
> 
> 	movl 4(%esp),%ecx
> 	movl 8(%esp),%edx
> 	movl 12(%esp),%eax
> 	addl $16,%esp
> 
> which is also one of the biggest alternatives.

For K8 it should be the fastest way. K7 probably too.

-Andi

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-30  2:13                           ` Andi Kleen
@ 2004-10-30  9:28                             ` Denis Vlasenko
  2004-10-30 17:53                               ` Linus Torvalds
  0 siblings, 1 reply; 99+ messages in thread
From: Denis Vlasenko @ 2004-10-30  9:28 UTC (permalink / raw)
  To: Andi Kleen, Linus Torvalds; +Cc: linux-kernel

On Saturday 30 October 2004 05:13, Andi Kleen wrote:
> Linus Torvalds <torvalds@osdl.org> writes:
> 
> > Anyway, it's quite likely that for several CPU's the fastest sequence ends 
> > up actually being 
> > 
> > 	movl 4(%esp),%ecx
> > 	movl 8(%esp),%edx
> > 	movl 12(%esp),%eax
> > 	addl $16,%esp
> > 
> > which is also one of the biggest alternatives.
> 
> For K8 it should be the fastest way. K7 probably too.

Pity. I always loved 1 byte insns :)

/me hopes that K8 rev E or K9 will have optimized pop.
--
vda


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-30  9:28                             ` Denis Vlasenko
@ 2004-10-30 17:53                               ` Linus Torvalds
  2004-10-30 21:00                                 ` Denis Vlasenko
  2004-10-31  0:39                                 ` Semaphore assembly-code bug Andi Kleen
  0 siblings, 2 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-30 17:53 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: Andi Kleen, linux-kernel



On Sat, 30 Oct 2004, Denis Vlasenko wrote:
>
> On Saturday 30 October 2004 05:13, Andi Kleen wrote:
> > Linus Torvalds <torvalds@osdl.org> writes:
> > 
> > > Anyway, it's quite likely that for several CPU's the fastest sequence ends 
> > > up actually being 
> > > 
> > > 	movl 4(%esp),%ecx
> > > 	movl 8(%esp),%edx
> > > 	movl 12(%esp),%eax
> > > 	addl $16,%esp
> > > 
> > > which is also one of the biggest alternatives.
> > 
> > For K8 it should be the fastest way. K7 probably too.
> 
> Pity. I always loved 1 byte insns :)

I personally am a _huge_ believer in small code. 

The sequence

	popl %eax
	popl %ecx
	popl %edx
	popl %eax

is four bytes. In contrast, the three moves and an add is 15 bytes. That's 
almost 4 times as big.
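
As a rough check of those byte counts, here is a small sketch that spells out the standard 32-bit encodings of both sequences and measures them. The constant names are made up for illustration.

```python
# Machine-code bytes for the two sequences (32-bit x86, AT&T mnemonics in comments).
POP_SEQUENCE = bytes([
    0x58,                    # popl %eax
    0x59,                    # popl %ecx
    0x5A,                    # popl %edx
    0x58,                    # popl %eax
])
MOV_SEQUENCE = bytes([
    0x8B, 0x4C, 0x24, 0x04,  # movl 4(%esp),%ecx
    0x8B, 0x54, 0x24, 0x08,  # movl 8(%esp),%edx
    0x8B, 0x44, 0x24, 0x0C,  # movl 12(%esp),%eax
    0x83, 0xC4, 0x10,        # addl $16,%esp
])

# Difference in code size between the two alternatives.
extra = len(MOV_SEQUENCE) - len(POP_SEQUENCE)
print(len(POP_SEQUENCE), len(MOV_SEQUENCE), extra)  # 4 15 11
```

Each movl needs an opcode, a ModRM byte, a SIB byte (because the base is %esp) and an 8-bit displacement, which is where the four bytes per load come from.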

And size _does_ matter. The extra 11 bytes means that if you have six of
these sequences in your program, you are pretty much _guaranteed_ one more
icache miss from memory. That's a few hundred cycles these days.  
Considering that you _maybe_ won a cycle or two each time it was executed,
it's not at all clear that it's a win, except in benchmarks that have huge
repeat-rates. Real life doesn't usually have that. In many real-life
scenarios, repeat rates are in the tens or hundreds for most code...
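
The trade-off can be put into numbers with a back-of-the-envelope sketch. The 64-byte cache line and the 300-cycle miss penalty are assumptions chosen to match "a few hundred cycles"; neither figure is stated in the thread, and the function names are made up.

```python
CACHE_LINE_BYTES = 64      # assumption: a common line size, not stated in the thread
EXTRA_BYTES_PER_SITE = 11  # 15-byte mov/add sequence minus 4-byte pop sequence

def extra_bytes(sites):
    """Additional code bytes when `sites` copies use the longer sequence."""
    return sites * EXTRA_BYTES_PER_SITE

def break_even_runs(miss_penalty_cycles, cycles_saved_per_run):
    """Executions needed before the saved cycles repay one extra icache miss."""
    return miss_penalty_cycles / cycles_saved_per_run

print(extra_bytes(6))           # 66, just over one 64-byte line
print(break_even_runs(300, 2))  # 150.0
print(break_even_runs(300, 1))  # 300.0
```

Under these assumptions the longer sequence only pays off once the code runs a few hundred times per miss, which is the repeat-rate range in question.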

And that's ignoring things like disk load times etc.

Sadly, the situation is often one where when you actually do all the 
performance testing, you artificially increase the repeat-rates hugely: 
you run the same program a thousand times in order to get a good profile, 
and you keep it in the cache all the time. So performance analysis often 
doesn't actually _see_ the downsides.

			Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-30 17:53                               ` Linus Torvalds
@ 2004-10-30 21:00                                 ` Denis Vlasenko
  2004-10-30 21:14                                   ` code bloat [was Re: Semaphore assembly-code bug] Lee Revell
  2004-10-31  0:39                                 ` Semaphore assembly-code bug Andi Kleen
  1 sibling, 1 reply; 99+ messages in thread
From: Denis Vlasenko @ 2004-10-30 21:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, linux-kernel

On Saturday 30 October 2004 20:53, Linus Torvalds wrote:
> > > > 	movl 4(%esp),%ecx
> > > > 	movl 8(%esp),%edx
> > > > 	movl 12(%esp),%eax
> > > > 	addl $16,%esp
> > > > 
> > > > which is also one of the biggest alternatives.
> > > 
> > > For K8 it should be the fastest way. K7 probably too.
> > 
> > Pity. I always loved 1 byte insns :)
> 
> I personally am a _huge_ believer in small code. 

Thankfully you are not alone - a horde of uclibc/dietlibc/busybox
users shares these views. Also see http://smarden.org/pape/

> The sequence
> 
> 	popl %eax
> 	popl %ecx
> 	popl %edx
> 	popl %eax
> 
> is four bytes. In contrast, the three moves and an add is 15 bytes. That's 
> almost 4 times as big.
> 
> And size _does_ matter. The extra 11 bytes means that if you have six of
> these sequences in your program, you are pretty much _guaranteed_ one more
> icache miss from memory. That's a few hundred cycles these days.  
> Considering that you _maybe_ won a cycle or two each time it was executed,
> it's not at all clear that it's a win, except in benchmarks that have huge
> repeat-rates. Real life doesn't usually have that. In many real-life
> scenarios, repeat rates are in the tens or hundreds for most code...

If only glibc / X / KDE / OpenOffice (ugggh) people could hear you more...

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
15364 root      15   0 38008  26M 28496 S     0,0 10,8   0:57   0 kmail
20022 root      16   0 40760  24M 23920 S     0,1 10,0   0:04   0 mozilla-bin
 1627 root      14  -1 71064  19M 53192 S <   0,1  7,9   3:16   0 X
 1700 root      15   0 25348  16M 23508 S     0,1  6,5   0:46   0 kdeinit
 3578 root      15   0 24032  14M 21524 S     0,5  5,8   0:23   0 konsole
--
vda


^ permalink raw reply	[flat|nested] 99+ messages in thread

* code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 21:00                                 ` Denis Vlasenko
@ 2004-10-30 21:14                                   ` Lee Revell
  2004-10-30 22:11                                     ` Denis Vlasenko
  2004-10-31  6:37                                     ` Jan Engelhardt
  0 siblings, 2 replies; 99+ messages in thread
From: Lee Revell @ 2004-10-30 21:14 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: Linus Torvalds, Andi Kleen, linux-kernel

On Sun, 2004-10-31 at 00:00 +0300, Denis Vlasenko wrote:
> If only glibc / X / KDE / OpenOffice (ugggh) people could hear you more...
> 
>   PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
> 15364 root      15   0 38008  26M 28496 S     0,0 10,8   0:57   0 kmail
> 20022 root      16   0 40760  24M 23920 S     0,1 10,0   0:04   0 mozilla-bin
>  1627 root      14  -1 71064  19M 53192 S <   0,1  7,9   3:16   0 X
>  1700 root      15   0 25348  16M 23508 S     0,1  6,5   0:46   0 kdeinit
>  3578 root      15   0 24032  14M 21524 S     0,5  5,8   0:23   0 konsole

Wow. evolution is now more bloated than kmail.

 1424 rlrevell  15   0  125m  47m  29m S  7.8 10.1   1:41.78 evolution
 1508 rlrevell  15   0 92432  30m  29m S  0.0  6.4   0:14.15 mozilla-bin
 1090 root      16   0 55676  18m  40m S 24.8  3.9   0:46.98 XFree86
 1379 rlrevell  15   0 33776  16m  18m S  0.3  3.5   0:06.65 nautilus
 1377 rlrevell  15   0 19392  11m  15m S  0.0  2.5   0:03.29 gnome-panel
 1458 rlrevell  16   0 28188  11m  15m S  3.9  2.5   0:10.44 gnome-terminal
 1307 rlrevell  15   0 20828  11m  17m S  0.0  2.4   0:03.08 gnome-settings-

Lee


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 21:14                                   ` code bloat [was Re: Semaphore assembly-code bug] Lee Revell
@ 2004-10-30 22:11                                     ` Denis Vlasenko
  2004-10-30 22:25                                       ` Lee Revell
  2004-10-30 22:27                                       ` Tim Hockin
  2004-10-31  6:37                                     ` Jan Engelhardt
  1 sibling, 2 replies; 99+ messages in thread
From: Denis Vlasenko @ 2004-10-30 22:11 UTC (permalink / raw)
  To: Lee Revell; +Cc: Linus Torvalds, Andi Kleen, linux-kernel

On Sunday 31 October 2004 00:14, Lee Revell wrote:
> On Sun, 2004-10-31 at 00:00 +0300, Denis Vlasenko wrote:
> > If only glibc / X / KDE / OpenOffice (ugggh) people could hear you more...
> > 
> >   PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
> > 15364 root      15   0 38008  26M 28496 S     0,0 10,8   0:57   0 kmail
> > 20022 root      16   0 40760  24M 23920 S     0,1 10,0   0:04   0 mozilla-bin
> >  1627 root      14  -1 71064  19M 53192 S <   0,1  7,9   3:16   0 X
> >  1700 root      15   0 25348  16M 23508 S     0,1  6,5   0:46   0 kdeinit
> >  3578 root      15   0 24032  14M 21524 S     0,5  5,8   0:23   0 konsole
> 
> Wow. evolution is now more bloated than kmail.
> 
>  1424 rlrevell  15   0  125m  47m  29m S  7.8 10.1   1:41.78 evolution
>  1508 rlrevell  15   0 92432  30m  29m S  0.0  6.4   0:14.15 mozilla-bin
>  1090 root      16   0 55676  18m  40m S 24.8  3.9   0:46.98 XFree86
>  1379 rlrevell  15   0 33776  16m  18m S  0.3  3.5   0:06.65 nautilus
>  1377 rlrevell  15   0 19392  11m  15m S  0.0  2.5   0:03.29 gnome-panel
>  1458 rlrevell  16   0 28188  11m  15m S  3.9  2.5   0:10.44 gnome-terminal
>  1307 rlrevell  15   0 20828  11m  17m S  0.0  2.4   0:03.08 gnome-settings-

Well, I can try to compile packages with different options
for size, I can link against small libc, but I feel this
does not solve the problem: the code itself is bloated...

I am not a code genius, but want to help.

Hmm probably some bloat-detection tools would be helpful,
like "show me source_lines/object_size ratios of functions in
this ELF object file". Those with low ratio are suspects of
excessive inlining etc.
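
A minimal sketch of such a tool, assuming the symbol sizes come from `nm -S` style output (address, size, type, name, with hex fields) and that per-function source line counts are collected separately. The function names and the sample data are made up for illustration.

```python
def parse_nm_sizes(nm_output):
    """Parse `nm -S` style lines: '<address> <size> <type> <name>' (hex fields)."""
    sizes = {}
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # symbols without a size have only three fields
        _addr, size_hex, sym_type, name = fields
        if sym_type in ("T", "t"):  # text (code) symbols only
            sizes[name] = int(size_hex, 16)
    return sizes

def lines_per_byte(sizes, source_lines):
    """Return (ratio, name) pairs, lowest source_lines/object_size first."""
    ratios = [(source_lines[name] / size, name)
              for name, size in sizes.items()
              if name in source_lines and size > 0]
    return sorted(ratios)

# Made-up sample data, for illustration only.
SAMPLE_NM = """\
08048400 00000040 T small_helper
08048440 00000800 T suspicious_inlined
08048c40 00000004 D some_data
"""
SAMPLE_LINES = {"small_helper": 16, "suspicious_inlined": 32}

for ratio, name in lines_per_byte(parse_nm_sizes(SAMPLE_NM), SAMPLE_LINES):
    print(f"{name}: {ratio:.3f} source lines per byte")
```

Functions at the top of the list produce the most object code per source line and are the first candidates to inspect.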

More ideas, anyone?
--
vda


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 22:11                                     ` Denis Vlasenko
@ 2004-10-30 22:25                                       ` Lee Revell
  2004-10-31 14:06                                         ` Diego Calleja
  2004-10-30 22:27                                       ` Tim Hockin
  1 sibling, 1 reply; 99+ messages in thread
From: Lee Revell @ 2004-10-30 22:25 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: Linus Torvalds, Andi Kleen, linux-kernel

On Sun, 2004-10-31 at 01:11 +0300, Denis Vlasenko wrote:
> Well, I can try to compile packages with different options
> for size, I can link against small libc, but I feel this
> does not solve the problem: the code itself is bloated...
> 
> I am not a code genius, but want to help.
> 
> Hmm probably some bloat-detection tools would be helpful,
> like "show me source_lines/object_size ratios of functions in
> this ELF object file". Those with low ratio are suspects of
> excessive inlining etc.
> 
> More ideas, anyone?

I agree it's a hard problem.  Right now there is massive pressure on
Linux application developers to add features to catch up with MS and
Apple.  This inevitably leads to bloat: we all know that efficiency is
the first thing to go out the window in that situation, and the problem
is exacerbated by the wide availability of fast machines.  It's an old,
depressing story...

That being said it would indeed be nice if we had more tools to quantify
bloat. 

Lee


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 22:11                                     ` Denis Vlasenko
  2004-10-30 22:25                                       ` Lee Revell
@ 2004-10-30 22:27                                       ` Tim Hockin
  2004-10-30 22:44                                         ` Jeff Garzik
                                                           ` (2 more replies)
  1 sibling, 3 replies; 99+ messages in thread
From: Tim Hockin @ 2004-10-30 22:27 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: Lee Revell, Linus Torvalds, Andi Kleen, linux-kernel

On Sun, Oct 31, 2004 at 01:11:07AM +0300, Denis Vlasenko wrote:
> I am not a code genius, but want to help.
> 
> Hmm probably some bloat-detection tools would be helpful,
> like "show me source_lines/object_size ratios of functions in
> this ELF object file". Those with low ratio are suspects of
> excessive inlining etc.

The problem with apps of this sort is the multiple layers of abstraction.

Xlib, GLib, GTK, GNOME, Pango, XML, etc.

No one wants to duplicate effort (rightly so).  Each of these libs tries
to do EVERY POSSIBLE thing.  They all end up bloated.  Then you have to
link them all in.  You end up bloated.  Then it is very easy to rely on
those libs for EVERYTHING, rather than actually thinking.

So you end up with the mindset of, for example, "if it's text it's XML".
You have to parse everything as XML, when simple parsers would be tons
faster and simpler and smaller.
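
To give a sense of scale, a simple parser for a line-oriented `key = value` text format fits in a dozen lines. The format and the function name are hypothetical, chosen only to illustrate the point.

```python
def parse_config(text):
    """Parse 'key = value' lines; '#' starts a comment; blank lines are skipped."""
    settings = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"line {lineno}: expected 'key = value'")
        settings[key.strip()] = value.strip()
    return settings

print(parse_config("name = fbdev\n# a comment\ndepth=16  # bits\n"))
```

There is no tree, no escaping rules and no schema, which is exactly why it stays small. It is also why it only suits data that is flat to begin with.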

Bloat is caused by feature creep at every layer, not just the app.

Youck.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 22:27                                       ` Tim Hockin
@ 2004-10-30 22:44                                         ` Jeff Garzik
  2004-10-30 22:50                                           ` Tim Hockin
  2004-10-31 20:15                                           ` Theodore Ts'o
  2004-10-30 23:13                                         ` Denis Vlasenko
  2004-10-31  6:49                                         ` Jan Engelhardt
  2 siblings, 2 replies; 99+ messages in thread
From: Jeff Garzik @ 2004-10-30 22:44 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Denis Vlasenko, Lee Revell, Linus Torvalds, Andi Kleen, linux-kernel

Tim Hockin wrote:
> So you end up with the mindset of, for example, "if it's text it's XML".
> You have to parse everything as XML, when simple parsers would be tons
> faster and simpler and smaller.


hehehe.  One of the reasons why I like XML is that you don't have to 
keep cloning new parsers.

	Jeff



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 23:13                                         ` Denis Vlasenko
@ 2004-10-30 22:45                                           ` Alan Cox
  2004-10-31  1:21                                             ` Z Smith
  2004-10-30 23:20                                           ` [OT] " Lee Revell
  2004-10-30 23:28                                           ` Tim Hockin
  2 siblings, 1 reply; 99+ messages in thread
From: Alan Cox @ 2004-10-30 22:45 UTC (permalink / raw)
  To: Denis Vlasenko
  Cc: Tim Hockin, Lee Revell, Linus Torvalds, Andi Kleen,
	Linux Kernel Mailing List

The gnome/gtk folks know they have a lot of code bloat, and know how to
shave about 10Mb off the desktop size already. What they don't have is
enough hands and brains to do this and the other stuff that is pressing.
So if the desktop stuff is annoying you join gnome-love or whatever the
kde equivalent is 8)

Alan


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 22:44                                         ` Jeff Garzik
@ 2004-10-30 22:50                                           ` Tim Hockin
  2004-10-31 20:15                                           ` Theodore Ts'o
  1 sibling, 0 replies; 99+ messages in thread
From: Tim Hockin @ 2004-10-30 22:50 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Denis Vlasenko, Lee Revell, Linus Torvalds, Andi Kleen, linux-kernel

On Sat, Oct 30, 2004 at 06:44:10PM -0400, Jeff Garzik wrote:
> Tim Hockin wrote:
> >So you end up with the mindset of, for example, "if it's text it's XML".
> >You have to parse everything as XML, when simple parsers would be tons
> >faster and simpler and smaller.
> 
> 
> hehehe.  One of the reasons why I like XML is that you don't have to 
> keep cloning new parsers.

I'm fine with XML, when it makes sense.  In fact, I wrote an XML parser.
It's blazingly fast.  But it doesn't try to do everything for everyone.
It does just as much as I needed.  And when I need XML, I don't have any
problem linking it in.  It's only a couple hundred lines of C.

What irks me is best demonstrated by this:

At OLS last year or the year before, at a talk about DBUS, someone asked
about the DBUS protocol.  When told that it was binary, they asked if
there was any advantage to that over text.  The reply "We didn't want to
link an XML parser in".

Now, I am fine with not wanting to add bloat.  But umm, the question was
about TEXT, not XML.  They are not the same thing.  Not all text should be
XML.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 23:20                                           ` [OT] " Lee Revell
@ 2004-10-30 22:52                                             ` Alan Cox
  2004-10-31  1:09                                               ` Ken Moffat
  2004-10-31  0:48                                             ` Andi Kleen
  1 sibling, 1 reply; 99+ messages in thread
From: Alan Cox @ 2004-10-30 22:52 UTC (permalink / raw)
  To: Lee Revell
  Cc: Denis Vlasenko, Tim Hockin, Linus Torvalds, Andi Kleen,
	Linux Kernel Mailing List

On Sun, 2004-10-31 at 00:20, Lee Revell wrote:
> I think very few application developers understand the point Linus made
> - that bigger code IS slower code due to cache misses.  If this were
> widely understood we would be in pretty good shape.

On my laptop both Openoffice and gnome are measurably faster if you
build the lot with -Os (except a couple of image libs)


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 22:27                                       ` Tim Hockin
  2004-10-30 22:44                                         ` Jeff Garzik
@ 2004-10-30 23:13                                         ` Denis Vlasenko
  2004-10-30 22:45                                           ` Alan Cox
                                                             ` (2 more replies)
  2004-10-31  6:49                                         ` Jan Engelhardt
  2 siblings, 3 replies; 99+ messages in thread
From: Denis Vlasenko @ 2004-10-30 23:13 UTC (permalink / raw)
  To: Tim Hockin; +Cc: Lee Revell, Linus Torvalds, Andi Kleen, linux-kernel

On Sunday 31 October 2004 01:27, Tim Hockin wrote:
> On Sun, Oct 31, 2004 at 01:11:07AM +0300, Denis Vlasenko wrote:
> > I am not a code genius, but want to help.
> > 
> > Hmm probably some bloat-detection tools would be helpful,
> > like "show me source_lines/object_size ratios of functions in
> > this ELF object file". Those with low ratio are suspects of
> > excessive inlining etc.
> 
> The problem with apps of this sort is the multiple layers of abstraction.
> 
> Xlib, GLib, GTK, GNOME, Pango, XML, etc.

I think it makes sense to start from lower layers first:

Kernel team is reasonably aware of the bloat danger.

glibc is worse, but thanks to heroic actions of Eric Andersen
we have mostly feature complete uclibc, 4 times (!)
smaller than glibc.

Xlib, GLib.... - didn't look into them apart from cases
when they do not build or in bug hunting sessions.
Quick data point: glib-1.2.10 is 1/2 of uclibc in size.
glib-2.2.2 is 2 times uclibc. x4 growth :(

> No one wants to duplicate effort (rightly so).  Each of these libs tries
> to do EVERY POSSIBLE thing.  They all end up bloated.  Then you have to
> link them all in.  You end up bloated.  Then it is very easy to rely on
> those libs for EVERYTHING, rather than actually thinking.
> 
> So you end up with the mindset of, for example, "if it's text it's XML".
> You have to parse everything as XML, when simple parsers would be tons
> faster and simpler and smaller.
> 
> Bloat is caused by feature creep at every layer, not just the app.

I actually tried to convince maintainers of one package
that their code is needlessly complex. I did send patches
to remedy that a bit while fixing real bugs. Rejected.
Bugs were planned to be fixed by adding more code.
I've lost all hope on that case.

I guess this is a reason why bloat problems tend to be solved
by a rewrite from scratch. I could name quite a few cases:

glibc -> dietlibc,uclibc
coreutils -> busybox
named -> djbdns
inetd -> daemontools+ucspi-tcp
sendmail -> qmail
syslogd -> socklog (http://smarden.org/socklog/)

It's sort of frightening that someone will need to
rewrite Xlib or, say, OpenOffice :(
--
vda


^ permalink raw reply	[flat|nested] 99+ messages in thread

* [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 23:13                                         ` Denis Vlasenko
  2004-10-30 22:45                                           ` Alan Cox
@ 2004-10-30 23:20                                           ` Lee Revell
  2004-10-30 22:52                                             ` Alan Cox
  2004-10-31  0:48                                             ` Andi Kleen
  2004-10-30 23:28                                           ` Tim Hockin
  2 siblings, 2 replies; 99+ messages in thread
From: Lee Revell @ 2004-10-30 23:20 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: Tim Hockin, Linus Torvalds, Andi Kleen, linux-kernel

On Sun, 2004-10-31 at 02:13 +0300, Denis Vlasenko wrote:
> It's sort of frightening that someone will need to
> rewrite Xlib or, say, OpenOffice :(

I think very few application developers understand the point Linus made
- that bigger code IS slower code due to cache misses.  If this were
widely understood we would be in pretty good shape.

Lee


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 23:13                                         ` Denis Vlasenko
  2004-10-30 22:45                                           ` Alan Cox
  2004-10-30 23:20                                           ` [OT] " Lee Revell
@ 2004-10-30 23:28                                           ` Tim Hockin
  2004-10-31  2:04                                             ` Michael Clark
  2 siblings, 1 reply; 99+ messages in thread
From: Tim Hockin @ 2004-10-30 23:28 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: Lee Revell, Linus Torvalds, Andi Kleen, linux-kernel

On Sun, Oct 31, 2004 at 02:13:37AM +0300, Denis Vlasenko wrote:
> > Bloat is caused by feature creep at every layer, not just the app.
> 
> I actually tried to convince maintainers of one package
> that their code is needlessly complex. I did send patches
> to remedy that a bit while fixing real bugs. Rejected.
> Bugs were planned to be fixed by adding more code.
> I've lost all hope on that case.

See, there is an ego problem, too.  If you rewrite my code, it means
you're better than I am.  Rejected.

Features win over efficiency.  Seriously, look at glibc.  Have you ever
tried to fix a bug in it?  Holy CRAP is that horrible code.  Each chunk of
code itself is OK (though it abuses macros so thoroughly I hesitate to
call it C code).  But it tries to support every architecture x every OS.
You know what?  I don't CARE if the glibc code compiles on HPUX or not.
HPUX has its own libc.

> I guess this is a reason why bloat problems tend to be solved
> by a rewrite from scratch. I could name quite a few cases:

From-scratch is a huge risk.  But yeah, sometimes it has to be.

> It's sort of frightening that someone will need to
> rewrite Xlib or, say, OpenOffice :(

Never gonna happen.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-30 17:53                               ` Linus Torvalds
  2004-10-30 21:00                                 ` Denis Vlasenko
@ 2004-10-31  0:39                                 ` Andi Kleen
  2004-10-31  1:43                                   ` Linus Torvalds
  1 sibling, 1 reply; 99+ messages in thread
From: Andi Kleen @ 2004-10-31  0:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Denis Vlasenko, Andi Kleen, linux-kernel

> I personally am a _huge_ believer in small code. 
> 
> The sequence
> 
> 	popl %eax
> 	popl %ecx
> 	popl %edx
> 	popl %eax
> 
> is four bytes. In contrast, the three moves and an add is 15 bytes. That's 
> almost 4 times as big.

Using the long stack setup code was found to be a significant
win when enough registers were saved (several percent in real benchmarks) 
on K8 gcc. It sped up all function calls considerably because it 
eliminates several stalls for each function entry/exit.  The popls
will all depend on each other because of their implicit reference
to esp.  
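
The dependency argument can be illustrated with a toy model. It assumes every instruction takes one cycle, unlimited issue width and full register renaming, so only true read-after-write dependencies cost time. None of this describes a real K8 pipeline, and the names are made up.

```python
# Toy dependency model: unit latency, unlimited issue width, renamed registers.
# Only read-after-write dependencies delay an instruction.

def critical_path(instructions):
    """Return the cycle in which the last instruction completes."""
    ready = {}   # register -> cycle at which its newest value is available
    finish = 0
    for reads, writes in instructions:
        start = max((ready.get(r, 0) for r in reads), default=0)
        done = start + 1
        for r in writes:
            ready[r] = done
        finish = max(finish, done)
    return finish

# popl reads %esp to find the data and writes the incremented %esp back.
POPS = [({"esp"}, {"esp", "eax"}),
        ({"esp"}, {"esp", "ecx"}),
        ({"esp"}, {"esp", "edx"}),
        ({"esp"}, {"esp", "eax"})]

# The movl loads only read %esp; the single addl updates it at the end.
MOVS = [({"esp"}, {"ecx"}),
        ({"esp"}, {"edx"}),
        ({"esp"}, {"eax"}),
        ({"esp"}, {"esp"})]

print(critical_path(POPS))  # 4
print(critical_path(MOVS))  # 1
```

In this model the four pops form a serial chain through %esp, while the loads and the add can all start at once. A CPU that handles stack-pointer updates specially would break that chain, and the model would no longer apply.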

Yes, it bloats the code, but function calls happen so often that having them
faster is really noticeable. 

The K8 has quite big caches and is not decoding limited, so it 
wasn't too bad a tradeoff there.

Ideally you would want to only do it on hot functions and optimize
rarely called functions for code size, but that would require profile 
feedback, which is often not feasible (JITs have an advantage here).

Unfortunately I don't think it is practically feasible for the kernel because
we rely on being able to recreate the same vmlinux images for debugging.
[It's a pity actually because modern compilers do a lot better
with profile feedback] 

On P4 on the other hand it doesn't help at all and only makes
the code bigger. I did it by hand in the x86-64 syscall
code too (that was before there was EM64T, but I still think it was a 
good idea). Perhaps AMD adds special hardware in some future CPU that
also makes it unnecessary, but currently it's like this and it helps.

-Andi


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 23:20                                           ` [OT] " Lee Revell
  2004-10-30 22:52                                             ` Alan Cox
@ 2004-10-31  0:48                                             ` Andi Kleen
  1 sibling, 0 replies; 99+ messages in thread
From: Andi Kleen @ 2004-10-31  0:48 UTC (permalink / raw)
  To: Lee Revell
  Cc: Denis Vlasenko, Tim Hockin, Linus Torvalds, Andi Kleen, linux-kernel

On Sat, Oct 30, 2004 at 07:20:04PM -0400, Lee Revell wrote:
> On Sun, 2004-10-31 at 02:13 +0300, Denis Vlasenko wrote:
> > It's sort of frightening that someone will need to
> > rewrite Xlib or, say, OpenOffice :(
> 
> I think very few application developers understand the point Linus made
> - that bigger code IS slower code due to cache misses.  If this were
> widely understood we would be in pretty good shape.

It's true in some cases, but not true in others. Don't make it your
gospel. 

-Andi

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 22:52                                             ` Alan Cox
@ 2004-10-31  1:09                                               ` Ken Moffat
  2004-10-31  2:42                                                 ` Tim Connors
  2004-10-31 14:44                                                 ` Alan Cox
  0 siblings, 2 replies; 99+ messages in thread
From: Ken Moffat @ 2004-10-31  1:09 UTC (permalink / raw)
  To: Alan Cox
  Cc: Lee Revell, Denis Vlasenko, Tim Hockin, Linus Torvalds,
	Andi Kleen, Linux Kernel Mailing List

On Sat, 30 Oct 2004, Alan Cox wrote:

> On Sun, 2004-10-31 at 00:20, Lee Revell wrote:
> > I think very few application developers understand the point Linus made
> > - that bigger code IS slower code due to cache misses.  If this were
> > widely understood we would be in pretty good shape.
>
> On my laptop both Openoffice and gnome are measurably faster if you
> build the lot with -Os (except a couple of image libs)
>

Depends how much of gnome you use.  I used to swear by -Os for
non-toolchain stuff, but in the end I got bitten by gnumeric on x86.
http://bugs.gnome.org/show_bug.cgi?id=128834 is similar, but in my case
opening *any* spreadsheet would cause gnumeric to segfault (gcc-3.3
series).  Add in the time spent rebuilding gnome before I found this bug
report, and adding extra parts of gnome just in case I missed something,
and the time to load it is irrelevant.  Since then I've had an anecdotal
report that -Os is known to cause problems with gnome.  I s'pose people
will say it serves me right for doing my initial testing on ppc which
didn't have this problem ;)  The point is that -Os is *much* less tested
than -O2 at the moment.

Ken
-- 
 the first time as tragedy, the second time as farce


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 22:45                                           ` Alan Cox
@ 2004-10-31  1:21                                             ` Z Smith
  2004-10-31  2:47                                               ` Jim Nelson
  2004-10-31 15:19                                               ` Alan Cox
  0 siblings, 2 replies; 99+ messages in thread
From: Z Smith @ 2004-10-31  1:21 UTC (permalink / raw)
  Cc: Linux Kernel Mailing List

Alan Cox wrote:

> So if the desktop stuff is annoying you join gnome-love or whatever the
> kde equivalent is 8)

Or join me in my effort to limit bloat. Why use an X server
that uses 15-30 megs of RAM when you can use FBUI which is 25 kilobytes
of code with very minimal kmallocing?

home.comcast.net/~plinius/fbui.html

Zack Smith
Bloat Liberation Front


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-31  0:39                                 ` Semaphore assembly-code bug Andi Kleen
@ 2004-10-31  1:43                                   ` Linus Torvalds
  2004-10-31  2:04                                     ` Andi Kleen
  0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-10-31  1:43 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Denis Vlasenko, linux-kernel



On Sun, 31 Oct 2004, Andi Kleen wrote:
> 
> Using the long stack setup code was found to be a significant
> win when enough registers were saved (several percent in real benchmarks) 
> on K8 gcc. 

For _what_?

Real applications, or SpecInt?

The fact is, SpecInt is not very interesting, because it has almost _zero_
icache footprint, and it has generally big repeat-rates, and to make
matters worse, you are allowed to (and everybody does) warm up the caches by
running before you actually do the benchmark run.

_None_ of these are realistic for real life workloads. 

>  It speeds up all function calls considerably because it 
> eliminates several stalls for each function entry/exit. 

.. it shaves off a few cycles in the cached case, yes.

> The popls will all depend on each other because of their implicit
> reference to esp.

Which is only true on moderately stupid CPU's. Two pop's don't _really_ 
depend on each other in any real sense, and there are CPU's that will 
happily dual-issue them, or at least not stall in between (ie the pop's 
will happily keep the memory unit 100% busy).

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 23:28                                           ` Tim Hockin
@ 2004-10-31  2:04                                             ` Michael Clark
  0 siblings, 0 replies; 99+ messages in thread
From: Michael Clark @ 2004-10-31  2:04 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Denis Vlasenko, Lee Revell, Linus Torvalds, Andi Kleen, linux-kernel

On 10/31/04 07:28, Tim Hockin wrote:
> On Sun, Oct 31, 2004 at 02:13:37AM +0300, Denis Vlasenko wrote:
> 
>>>Bloat is caused by feature creep at every layer, not just the app.
>>
>>I actually tried to convince maintainers of one package
>>that their code is needlessly complex. I did send patches
>>to remedy that a bit while fixing real bugs. Rejected.
>>Bugs were planned to be fixed by adding more code.
>>I've lost all hope on that case.
> 
> 
> See, there is an ego problem, too.  If you rewrite my code, it means
> you're better than I am.  Rejected.
> 
> Features win over efficiency.  Seriously, look at glibc.  Have you ever
> tried to fix a bug in it?  Holy CRAP is that horrible code.  Each chunk of
> code itself is OK (though it abuses macros so thoroughly I hesitate to
> call it C code).  But it tries to support every architecture x every OS.
> You know what?  I don't CARE if the glibc code compiles on HPUX or not.
> HPUX has its own libc.
> 
> 
>>I guess this is a reason why bloat problems tend to be solved
>>by a rewrite from scratch. I could name quite a few cases:
> 
> 
> From-scratch is a huge risk.  But yeah, sometimes it has to be.
> 
> 
>>It's sort of frightening that someone will need to
>>rewrite Xlib or, say, OpenOffice :(

Well, the xlib rewrite is happening (XCB/XCL).
One of the reasons cited is the size of xlib.

   http://www.freedesktop.org/Software/xcb

~mc

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-31  1:43                                   ` Linus Torvalds
@ 2004-10-31  2:04                                     ` Andi Kleen
  0 siblings, 0 replies; 99+ messages in thread
From: Andi Kleen @ 2004-10-31  2:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, Denis Vlasenko, linux-kernel

On Sat, Oct 30, 2004 at 06:43:21PM -0700, Linus Torvalds wrote:
> 
> 
> On Sun, 31 Oct 2004, Andi Kleen wrote:
> > 
> > Using the long stack setup code was found to be a significant
> > win when enough registers were saved (several percent in real benchmarks) 
> > on K8 gcc. 
> 
> For _what_?
> 
> Real applications, or SpecInt?

iirc gcc itself was faster (the modern one,  not the old version in SpecInt) 

KDE startup ended up being faster too, but that may have been due to other
improvements too.

This was all tested on CPUs with very large caches (1MB L2), you
can pack a lot of code into that.

Also when people benchmark -m64 code compared to -m32 they often
see large improvements on AMD64 (as long as the code isn't long or pointer
memory bound), and I suspect at least part of that can be explained
by the -m64 gcc defaulting to the long function prologues.

Another example of larger code usually being better is x87 vs SSE2 floating
point math. 

> The fact is, SpecInt is not very interesting, because it has almost _zero_
> icache footprint, and it has generally big repeat-rates, and to make

I don't think it's generally true. one counter example is the gcc subtest
in SpecInt. 


> >  It speeds up all function calls considerably because it 
> > eliminates several stalls for each function entry/exit. 
> 
> .. it shaves off a few cycles in the cached case, yes.

I would expect it to help in the uncached case too because
the CPU does very aggressive prefetching of code. Once 
it gets started on a function it will fetch it very quickly.

> 
> > The popls will all depend on each other because of their implicit
> > reference to esp.
> 
> Which is only true on moderately stupid CPU's. Two pop's don't _really_ 

I don't see the K8 as a stupid CPU.

> depend on each other in any real sense, and there are CPU's that will 
> happily dual-issue them, or at least not stall in between (ie the pop's 
> will happily keep the memory unit 100% busy).

Yes, there are. And there are others that don't.

-Andi

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31  1:09                                               ` Ken Moffat
@ 2004-10-31  2:42                                                 ` Tim Connors
  2004-10-31  4:45                                                   ` Paul
  2004-10-31 14:44                                                 ` Alan Cox
  1 sibling, 1 reply; 99+ messages in thread
From: Tim Connors @ 2004-10-31  2:42 UTC (permalink / raw)
  To: Ken Moffat; +Cc: Linux Kernel Mailing List

Ken Moffat <ken@kenmoffat.uklinux.net> said on Sun, 31 Oct 2004 01:09:54 +0000 (GMT):
> and the time to load it is irrelevant.  Since then I've had an anecdotal
> report that -Os is known to cause problems with gnome.  I s'pose people
> will say it serves me right for doing my initial testing on ppc which
> didn't have this problem ;)  The point is that -Os is *much* less tested
> than -O2 at the moment.

Because people suck, and don't use it and hence don't test it.

Ie, test it!

I can't, because I prefer to stay away from gnome instead.

-- 
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
"Warning: Do not look into laser with remaining eye" -- a physics experiment
"Press emergency laser shutdown button with remaining hand" -- J.D.Baldwin @ ASR

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31  1:21                                             ` Z Smith
@ 2004-10-31  2:47                                               ` Jim Nelson
  2004-10-31 15:19                                               ` Alan Cox
  1 sibling, 0 replies; 99+ messages in thread
From: Jim Nelson @ 2004-10-31  2:47 UTC (permalink / raw)
  To: Z Smith; +Cc: Linux Kernel Mailing List

Z Smith wrote:
> Alan Cox wrote:
> 
>> So if the desktop stuff is annoying you join gnome-love or whatever the
>> kde equivalent is 8)
> 
> 
> Or join me in my effort to limit bloat. Why use an X server
> that uses 15-30 megs of RAM when you can use FBUI which is 25 kilobytes
> of code with very minimal kmallocing?
> 
> home.comcast.net/~plinius/fbui.html
> 
> Zack Smith
> Bloat Liberation Front
> 

Because some of us use remote X clients on big iron with an X server on your 
desktop.  IIRC (been a long time since my CAD classes), a whole bunch of FEA and 
CAE/CAD applications worked this way.

There is a lot more flexibility inherent in user-space compared to kernel-space. 
You can use PAM, Kerberos, and a whole host of other security devices that would 
be difficult to implement efficiently in kernel-space.

Dude, that's a cool hack, but just about everything you did could be done with 
svgalib and the input core interface.  The advantage to svgalib is that if that 
interface dies, you can recover the machine pretty easily, whereas kernel panics 
are a bit more disruptive.

Still - it would be a nifty add-on for POS terminals, etc., just not the kind of 
thing I'd expect to see in the kernel anytime soon.  Once 2.7 is started, see if 
people are more receptive.  Take the time to flesh it out, get some more people on 
board, see if Sourceforge will host the project, and lose the advertising campaign 
- that's not likely to win any friends or supporters around here.

I don't mean to be harsh, but c'mon - "Bloat Liberation Front" - err... okaaay...

Good luck,

Jim

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31  2:42                                                 ` Tim Connors
@ 2004-10-31  4:45                                                   ` Paul
  0 siblings, 0 replies; 99+ messages in thread
From: Paul @ 2004-10-31  4:45 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Tim Connors <tconnors+linuxkernel1099190446@astro.swin.edu.au>, on Sun Oct 31, 2004 [01:42:34 PM] said:
> Ken Moffat <ken@kenmoffat.uklinux.net> said on Sun, 31 Oct 2004 01:09:54 +0000 (GMT):
> > and the time to load it is irrelevant.  Since then I've had an anecdotal
> > report that -Os is known to cause problems with gnome.  I s'pose people
> > will say it serves me right for doing my initial testing on ppc which
> > didn't have this problem ;)  The point is that -Os is *much* less tested
> > than -O2 at the moment.
> 
> Because people suck, and don't use it and hence don't test it.
> 
> Ie, test it!
> 
> I can't, because I prefer to stay away from gnome instead.
> 

	Hi;

	I've been using -Os as my default compile flag under
Gentoo for probably over 2 years now. Haven't noted any real
problems, and that's nearly 3 gig of compressed source code
compiled on what is just my current system image.
	(Well, I might suck a little because I haven't done any
benchmarks or comparisons as to the actual benefits of doing
so. Also, I use fvwm ;)

Paul
set@pobox.com

> -- 
> TimC -- http://astronomy.swin.edu.au/staff/tconnors/

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 21:14                                   ` code bloat [was Re: Semaphore assembly-code bug] Lee Revell
  2004-10-30 22:11                                     ` Denis Vlasenko
@ 2004-10-31  6:37                                     ` Jan Engelhardt
  1 sibling, 0 replies; 99+ messages in thread
From: Jan Engelhardt @ 2004-10-31  6:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: Denis Vlasenko, Linus Torvalds

>> If only glibc / X / KDE / OpenOffice (ugggh) people could hear you more...
>>
>>   PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
>> 15364 root      15   0 38008  26M 28496 S     0,0 10,8   0:57   0 kmail
>> 20022 root      16   0 40760  24M 23920 S     0,1 10,0   0:04   0 mozilla-bin
>>  1627 root      14  -1 71064  19M 53192 S <   0,1  7,9   3:16   0 X
>>  1700 root      15   0 25348  16M 23508 S     0,1  6,5   0:46   0 kdeinit
>>  3578 root      15   0 24032  14M 21524 S     0,5  5,8   0:23   0 konsole

Heh, and guess what: the people in #kde (irc.freenode.net for example) deny
that it's their fault with the statement "bah, that's shared libraries"!
Whether that's a lie or not, or a semi-lie, I'm quite sure that libdcop, libmcop
and all the other crap that's running make it almost impossible to run even on
a Duron-800 w/256.

>Wow. evolution is now more bloated than kmail.
>
> 1424 rlrevell  15   0  125m  47m  29m S  7.8 10.1   1:41.78 evolution
> 1508 rlrevell  15   0 92432  30m  29m S  0.0  6.4   0:14.15 mozilla-bin
> 1090 root      16   0 55676  18m  40m S 24.8  3.9   0:46.98 XFree86
> 1379 rlrevell  15   0 33776  16m  18m S  0.3  3.5   0:06.65 nautilus
> 1377 rlrevell  15   0 19392  11m  15m S  0.0  2.5   0:03.29 gnome-panel
> 1458 rlrevell  16   0 28188  11m  15m S  3.9  2.5   0:10.44 gnome-terminal
> 1307 rlrevell  15   0 20828  11m  17m S  0.0  2.4   0:03.08 gnome-settings-

Gnome is no better. (Flamewar: I like ICEWM)

The only thing more bloated is the X server itself when it runs with the
proprietary NV GL core:

USER   PID MEM%    VSZ   RSZ STAT START   TIME COMMAND
root  5220  7.8 417872 20220 SL   08:37   0:03 X -noliste[...]


Jan Engelhardt
-- 
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, www.gwdg.de

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 22:27                                       ` Tim Hockin
  2004-10-30 22:44                                         ` Jeff Garzik
  2004-10-30 23:13                                         ` Denis Vlasenko
@ 2004-10-31  6:49                                         ` Jan Engelhardt
  2004-10-31 21:09                                           ` Z Smith
  2004-11-01 15:17                                           ` Lee Revell
  2 siblings, 2 replies; 99+ messages in thread
From: Jan Engelhardt @ 2004-10-31  6:49 UTC (permalink / raw)
  Cc: linux-kernel

>> Hmm probably some bloat-detection tools would be helpful,
>> like "show me source_lines/object_size ratios of functions in
>> this ELF object file". Those with low ratio are suspects of
>> excessive inlining etc.

Hm, I've got a (very simple) line determining utility,
http://linux01.org:2222/f/UHXT/bin/sourcefuncsize
maybe someone can pipe it together with ls -l or whatever.

>The problem with apps of this sort is the multiple layers of abstraction.
>
>Xlib, GLib, GTK, GNOME, Pango, XML, etc.

At least they know one thing: that thou should not stuff everything into one
.so but multiple ones (if it's a lot). That /may/ reduce the size-in-memory,
because not all .so's need to be loaded. OTOH, most apps load /all/ anyway.
Heh, there we go.

>Bloat is caused by feature creep at every layer, not just the app.

I sense Java and C# being the best example.


Z Smith wrote:
>Or join me in my effort to limit bloat. Why use an X server
>that uses 15-30 megs of RAM when you can use FBUI which is 25 kilobytes
>of code with very minimal kmallocing?

FBUI does not have 3d acceleration?

Ken Moffat wrote:
>>The point is that -Os is *much* less tested
>>than -O2 at the moment.

>Because people suck, and don't use it and hence don't test it.

I doubt even the -O2-only-people use gprof/gcov frequently. :(



Jan Engelhardt
-- 
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, www.gwdg.de

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 22:25                                       ` Lee Revell
@ 2004-10-31 14:06                                         ` Diego Calleja
  2004-10-31 20:53                                           ` Z Smith
  0 siblings, 1 reply; 99+ messages in thread
From: Diego Calleja @ 2004-10-31 14:06 UTC (permalink / raw)
  To: Lee Revell; +Cc: vda, torvalds, ak, linux-kernel

El Sat, 30 Oct 2004 18:25:38 -0400 Lee Revell <rlrevell@joe-job.com> escribió:

> I agree it's a hard problem.  Right now there is massive pressure on
> Linux application developers to add features to catch up with MS and
> Apple.  This inevitably leads to bloat, we all know that efficiency is

I don't think it's so bad (ie: it could be _worse_)

There's some work going on to fix some "bloat problems" too. For example,
the x.org people are working on a sort of xlib complement/replacement (I
don't know its real purpose), xcb, which should help latency and code
size. Composite itself is a nice way of avoiding apps redrawing their
windows all the time. KDE "speed" is much better than a year
ago, and gnome 2.8 is also somewhat "faster" (compare nautilus in gnome 2.6
vs the one in 2.8). Openoffice 2.0 will also have some "performance
improvements" (see http://development.openoffice.org/releases/q-concept.html#4.1.3.Performance|outline
and http://development.openoffice.org/releases/q-concept.html#3.1.3.Performance|outline)


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31  1:09                                               ` Ken Moffat
  2004-10-31  2:42                                                 ` Tim Connors
@ 2004-10-31 14:44                                                 ` Alan Cox
  1 sibling, 0 replies; 99+ messages in thread
From: Alan Cox @ 2004-10-31 14:44 UTC (permalink / raw)
  To: Ken Moffat
  Cc: Lee Revell, Denis Vlasenko, Tim Hockin, Linus Torvalds,
	Andi Kleen, Linux Kernel Mailing List

On Sul, 2004-10-31 at 01:09, Ken Moffat wrote:
> and the time to load it is irrelevant.  Since then I've had an anecdotal
> report that -Os is known to cause problems with gnome.  I s'pose people
> will say it serves me right for doing my initial testing on ppc which
> didn't have this problem ;)  The point is that -Os is *much* less tested
> than -O2 at the moment.

I've seen no real problems - x86-32 or x86-64, and my gnumeric appears
happy. Could be that the Red Hat gcc 3.3 has the relevant fixes already
in it from upstream I guess.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31  1:21                                             ` Z Smith
  2004-10-31  2:47                                               ` Jim Nelson
@ 2004-10-31 15:19                                               ` Alan Cox
  2004-10-31 20:18                                                 ` Z Smith
  1 sibling, 1 reply; 99+ messages in thread
From: Alan Cox @ 2004-10-31 15:19 UTC (permalink / raw)
  To: Z Smith; +Cc: Linux Kernel Mailing List

On Sul, 2004-10-31 at 01:21, Z Smith wrote:
> Alan Cox wrote:
> 
> > So if the desktop stuff is annoying you join gnome-love or whatever the
> > kde equivalent is 8)
> 
> Or join me in my effort to limit bloat. Why use an X server
> that uses 15-30 megs of RAM when you can use FBUI which is 25 kilobytes
> of code with very minimal kmallocing?

My X server seems to be running at about 4Mbytes, plus the frame buffer
mappings which make it appear a lot larger. I wouldn't be surprised if
half the 4Mb was pixmap cache too, maybe more.

I've helped write tiny UI kits (take a look at nanogui for example) but
they don't have the flexibility of X.

Alan


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-30 22:44                                         ` Jeff Garzik
  2004-10-30 22:50                                           ` Tim Hockin
@ 2004-10-31 20:15                                           ` Theodore Ts'o
  2004-10-31 20:21                                             ` Jeff Garzik
                                                               ` (2 more replies)
  1 sibling, 3 replies; 99+ messages in thread
From: Theodore Ts'o @ 2004-10-31 20:15 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel

On Sat, Oct 30, 2004 at 06:44:10PM -0400, Jeff Garzik wrote:
> Tim Hockin wrote:
> >So you end up with the mindset of, for example, "if it's text it's XML".
> >You have to parse everything as XML, when simple parsers would be tons
> >faster and simpler and smaller.
> 
> hehehe.  One of the reasons why I like XML is that you don't have to 
> keep cloning new parsers.

.... if you don't mind bloating your application:

% ls -l /usr/lib/libxml2.a
4224 -rw-r--r--  1 root root 4312536 Oct 19 21:55 /usr/lib/libxml2.a

						- Ted


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31 15:19                                               ` Alan Cox
@ 2004-10-31 20:18                                                 ` Z Smith
  2004-11-01 11:05                                                   ` Alan Cox
  0 siblings, 1 reply; 99+ messages in thread
From: Z Smith @ 2004-10-31 20:18 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linux Kernel Mailing List

Alan Cox wrote:

> My X server seems to be running at about 4Mbytes, plus the frame buffer
> mappings which make it appear a lot larger. I wouldn't be surprised if
> half the 4Mb was pixmap cache too, maybe more.

At first sight that sounds like a plausible explanation, however
the facts in my case suggest something else is going on:

My laptop's framebuffer is only 800x600x24bpp VESA, or 1406kB.
But look at what X is doing:

root       632  6.1 17.5 22024 16440 ?       S    12:05   0:17 X :0
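
(As a rough check of the 1406kB figure above: a packed frame takes
width x height x bytes per pixel. The sketch below is illustrative only.
fb_bytes is a made-up helper, not something from X or the kernel, and it
assumes a packed layout with no per-line padding.)

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes in one packed frame, assuming no line padding. */
size_t fb_bytes(size_t width, size_t height, size_t bpp)
{
    assert(bpp % 8 == 0);          /* only whole-byte pixel formats handled */
    return width * height * (bpp / 8);
}
```

fb_bytes(800, 600, 24) gives 1440000 bytes, which is 1406 when divided by
1024 and truncated, matching the figure quoted above. The RSS column in the
ps line (16440, in kB) is more than ten times that.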

The more apps in use, the more memory is used, but at the moment
I've only got xterm, rxvt, thunderbird, xclock and xload. My wm is
blackbox which is using 5 megs.

Also, just curious but why would memory-mapped I/O be counted
in the memory usage anyway? Shouldn't there be a separate number
for framebuffer memory and the like?

> I've helped write tiny UI kits (take a look at nanogui for example) but
> they don't have the flexibility of X.

In my experience, most of the flexibility is not necessary for
97% of what I do, yet it evidently costs a lot in memory usage
and speed.

Zack

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31 20:15                                           ` Theodore Ts'o
@ 2004-10-31 20:21                                             ` Jeff Garzik
  2004-10-31 21:06                                             ` Jan Engelhardt
  2004-11-01 11:27                                             ` Alan Cox
  2 siblings, 0 replies; 99+ messages in thread
From: Jeff Garzik @ 2004-10-31 20:21 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-kernel

Theodore Ts'o wrote:
> On Sat, Oct 30, 2004 at 06:44:10PM -0400, Jeff Garzik wrote:
> 
>>Tim Hockin wrote:
>>
>>>So you end up with the mindset of, for example, "if it's text it's XML".
>>>You have to parse everything as XML, when simple parsers would be tons
>>>faster and simpler and smaller.
>>
>>hehehe.  One of the reasons why I like XML is that you don't have to 
>>keep cloning new parsers.
> 
> 
> .... if you don't mind bloating your application:
> 
> % ls -l /usr/lib/libxml2.a
> 4224 -rw-r--r--  1 root root 4312536 Oct 19 21:55 /usr/lib/libxml2.a

GLib's is a lot smaller :)

	Jeff




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31 14:06                                         ` Diego Calleja
@ 2004-10-31 20:53                                           ` Z Smith
  2004-10-31 23:35                                             ` Rogério Brito
  2004-11-01 14:48                                             ` Diego Calleja
  0 siblings, 2 replies; 99+ messages in thread
From: Z Smith @ 2004-10-31 20:53 UTC (permalink / raw)
  To: Diego Calleja; +Cc: linux-kernel

Diego Calleja wrote:

> I don't think it's so bad (ie: it could be _worse_)

But not everyone can tolerate today's level of bloat.

Imagine a small charity in a rural town in Bolivia or
Colorado. They have no budget for computers and no one
is offering donations. A local person put Linux on their 200 MHz
system after Windows crashed and the Windows CD couldn't
be found, but he can't put KDE or Gnome on it as well because
that would bring it to a crawl. The only way to make the
computer usable is to install an old distribution of Linux
from 1998 which has Netscape 4 but no office app. Eventually
they will give up on the computer and just throw it out,
because they can't wait forever for programmers to write
non-bloated software to make good use of their system.
The machine ends up at a landfill where it leaches chemicals
into the local water supply.

Zack

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31 20:15                                           ` Theodore Ts'o
  2004-10-31 20:21                                             ` Jeff Garzik
@ 2004-10-31 21:06                                             ` Jan Engelhardt
  2004-11-01 11:27                                             ` Alan Cox
  2 siblings, 0 replies; 99+ messages in thread
From: Jan Engelhardt @ 2004-10-31 21:06 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jeff Garzik, linux-kernel

>.... if you don't mind bloating your application:
>
>% ls -l /usr/lib/libxml2.a
>4224 -rw-r--r--  1 root root 4312536 Oct 19 21:55 /usr/lib/libxml2.a

Whoa. Bet its creator compiled with -g -O2 rather than -g0 -O2. And with
-static instead of with <dynamic>. Yay look at this:

22:06 io:~ # l /usr/lib/libxml2.so -L
#SUSE# -rwxr-xr-x  1 root root 1145089 Apr  6  2004 /usr/lib/libxml2.so

4x smaller!
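
(For the record, the ratio of the two sizes listed above works out to a bit
under 4. A trivial check follows, with size_ratio as a made-up helper. The
two numbers are the byte counts from the ls output, and they come from files
on two different systems, so this is only a rough comparison.)

```c
/* Hypothetical helper: how many times larger a is than b (b must be non-zero). */
double size_ratio(long long a, long long b)
{
    return (double)a / (double)b;
}
```

size_ratio(4312536, 1145089) comes to roughly 3.77.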



Jan Engelhardt
-- 
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, www.gwdg.de

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31  6:49                                         ` Jan Engelhardt
@ 2004-10-31 21:09                                           ` Z Smith
  2004-10-31 21:13                                             ` Jan Engelhardt
  2004-11-01 15:17                                           ` Lee Revell
  1 sibling, 1 reply; 99+ messages in thread
From: Z Smith @ 2004-10-31 21:09 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: linux-kernel

Jan Engelhardt wrote:

> FBUI does not have 3d acceleration?

The problem is 3d non-acceleration i.e. VESA and VGA
would still have to be supported. I'm no 3d expert but
I think there must be some software-based 3d function
that would require using floating point, which isn't allowed
in the kernel.

Also, might not software 3d open the kernel up to
patent issues?

Zachary Smith

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31 21:09                                           ` Z Smith
@ 2004-10-31 21:13                                             ` Jan Engelhardt
  2004-10-31 21:48                                               ` Z Smith
  2004-11-01 11:29                                               ` Alan Cox
  0 siblings, 2 replies; 99+ messages in thread
From: Jan Engelhardt @ 2004-10-31 21:13 UTC (permalink / raw)
  To: Z Smith; +Cc: linux-kernel

>> FBUI does not have 3d acceleration?
>
>The problem is 3d non-acceleration i.e. VESA and VGA
>would still have to be supported. I'm no 3d expert but
>I think there must be some software-based 3d function
>that would require using floating point, which isn't allowed
>in the kernel.
>
>Also, might not software 3d open the kernel up to
>patent issues?

Whatever you do, 3D at the software level is slow, even with a fast comp.
See MESA.



Jan Engelhardt
-- 
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, www.gwdg.de

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31 21:13                                             ` Jan Engelhardt
@ 2004-10-31 21:48                                               ` Z Smith
  2004-11-01 11:29                                               ` Alan Cox
  1 sibling, 0 replies; 99+ messages in thread
From: Z Smith @ 2004-10-31 21:48 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: linux-kernel

Jan Engelhardt wrote:

>>Also, might not software 3d open the kernel up to
>>patent issues?
> 
> Whatever you do, 3D at the software level is slow, even with a fast comp.
> See MESA.

Well it might be nice to add support for hardware 3-D, once 2-D
is mature. In fact I imagine it could be very convenient for
some people.

ZS

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31 20:53                                           ` Z Smith
@ 2004-10-31 23:35                                             ` Rogério Brito
  2004-11-01  1:20                                               ` Z Smith
  2004-11-01 14:48                                             ` Diego Calleja
  1 sibling, 1 reply; 99+ messages in thread
From: Rogério Brito @ 2004-10-31 23:35 UTC (permalink / raw)
  To: Z Smith; +Cc: Diego Calleja, linux-kernel

Z Smith wrote:
> But not everyone can tolerate today's level of bloat.
> 
> Imagine a small charity in a rural town in Bolivia or
> Colorado. They have no budget for computers and no one
> is offering donations.

Well, let me jump into this thread. I don't live in Bolivia or Colorado, 
but I do live in Brazil.

The fastest computer that I have at my disposal is this one with a Duron 
600MHz processor. My father uses a Pentium MMX 200MHz with 64MB of RAM. 
Unfortunately, for financial reasons, I don't see us upgrading our 
computers too soon.

It is nice to read Alan Cox saying that the Gnome team can make Gnome 
use less memory in the future. I'm anxiously looking forward to that. In 
the mean time, I will be using fluxbox and hoping that other parts of 
the system (libraries etc) don't grow too fast for my computers.

I know plenty of people in the same situation that I am. Given the 
choice of purchasing a book for my education or upgrading my computer, I 
guess that I should spend money on the former.

And the same is true for many of my relatives and friends.


Rogério Brito.

-- 
Learn to quote e-mails decently at:
http://pub.tsn.dk/how-to-quote.php
http://learn.to/quote
http://www.xs4all.nl/~sbpoley/toppost.htm

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31 23:35                                             ` Rogério Brito
@ 2004-11-01  1:20                                               ` Z Smith
  0 siblings, 0 replies; 99+ messages in thread
From: Z Smith @ 2004-11-01  1:20 UTC (permalink / raw)
  To: Rogério Brito; +Cc: Diego Calleja, linux-kernel

Rogério Brito wrote:
> Z Smith wrote:

> The fastest computer that I have at my disposal is this one with a Duron 
> 600MHz processor. My father uses a Pentium MMX 200MHz with 64MB of RAM. 
> Unfortunately, for financial reasons, I don't see us upgrading our 
> computers too soon.

It seems that as time goes by, more and more people are
coming to be financially limited. In some cases the cause
is clearly the IMF / World Bank / WTO triad.

Some casual reading:
http://www.gregpalast.com/printerfriendly.cfm?artid=96

Zack

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31 20:18                                                 ` Z Smith
@ 2004-11-01 11:05                                                   ` Alan Cox
  0 siblings, 0 replies; 99+ messages in thread
From: Alan Cox @ 2004-11-01 11:05 UTC (permalink / raw)
  To: Z Smith; +Cc: Linux Kernel Mailing List

On Sul, 2004-10-31 at 20:18, Z Smith wrote:
> My laptop's framebuffer is only 800x600x24bpp VESA, or 1406kB.
> But look at what X is doing:

X has the frame buffer mapped as reported by VESA sizing not the 
minimal for the mode. (Think about RandR and you'll see why)

> root       632  6.1 17.5 22024 16440 ?       S    12:05   0:17 X :0
> 
> The more apps in use, the more memory is used, but at the moment
> I've only got xterm, rxvt, thunderbird, xclock and xload. My wm is
> blackbox which is using 5 megs.

Mostly shared with the other apps, you did remember to divide each page
by the number of users ?
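
A minimal sketch of that division, in C. Everything here is hypothetical:
the users[] array (how many processes map each resident page) is assumed to
be known already, and proportional_kb is a made-up name. As noted further
down, /proc may not expose enough information to fill such an array in.

```c
#include <stddef.h>

/*
 * Charge each resident page to a process in proportion to how many
 * processes map it: a page shared by N users costs page_kb / N.
 * users[i] is the (assumed known) number of mappers of page i.
 */
double proportional_kb(const unsigned *users, size_t npages, unsigned page_kb)
{
    double total = 0.0;

    for (size_t i = 0; i < npages; i++) {
        if (users[i] > 0)              /* skip bogus zero counts */
            total += (double)page_kb / users[i];
    }
    return total;
}
```

With two private pages and two pages shared four ways, at 4 kB per page,
this charges 4 + 4 + 1 + 1 = 10 kB rather than the 16 kB a plain resident
count would show.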

> Also, just curious but why would memory-mapped I/O be counted
> in the memory usage anyway? Shouldn't there be a separate number
> for framebuffer memory and the like?

Actually there is probably not enough information in /proc to do the
maths on it. The kernel itself has a clear idea which vma's are not
backed by ram in the usual sense as they are marked VM_IO.

> > I've helped write tiny UI kits (take a look at nanogui for example) but
> > they don't have the flexibility of X.
> 
> In my experience, most of the flexibility is not necessary for
> 97% of what I do, yet it evidently costs a lot in memory usage
> and speed.

So my X server is 1Mb larger because I can run networked apps and play
bzflag. Suits me as a tradeoff - I'm not saying it always is the right
decision - nanogui works well in restricted environments like video
recorders for example.



* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31 20:15                                           ` Theodore Ts'o
  2004-10-31 20:21                                             ` Jeff Garzik
  2004-10-31 21:06                                             ` Jan Engelhardt
@ 2004-11-01 11:27                                             ` Alan Cox
  2004-11-01 13:40                                               ` Denis Vlasenko
  2 siblings, 1 reply; 99+ messages in thread
From: Alan Cox @ 2004-11-01 11:27 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jeff Garzik, Linux Kernel Mailing List

On Sun, 2004-10-31 at 20:15, Theodore Ts'o wrote:
> .... if you don't mind bloating your application:
> 
> % ls -l /usr/lib/libxml2.a
> 4224 -rw-r--r--  1 root root 4312536 Oct 19 21:55 /usr/lib/libxml2.a

Except that
1. The file size has nothing to do with the binary size as it is full of
symbols and maybe debug info
2. Most of the pages of libxml2.so don't get paged in by a typical
application
3. If you have existing apps using it then its cost to you is nearly
zero because it's already loaded.

libxml2 is a very complete validating all singing all dancing XML
parser. There are small non-validating parsers without every conceivable
glue interface that come down to about 10K.




* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31 21:13                                             ` Jan Engelhardt
  2004-10-31 21:48                                               ` Z Smith
@ 2004-11-01 11:29                                               ` Alan Cox
  2004-11-01 12:36                                                 ` Jan Engelhardt
  1 sibling, 1 reply; 99+ messages in thread
From: Alan Cox @ 2004-11-01 11:29 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Z Smith, Linux Kernel Mailing List

On Sun, 2004-10-31 at 21:13, Jan Engelhardt wrote:
> Whatever you do, 3D at the software level is slow, even with a fast comp.
> See MESA.

If you are willing to lose a few bits of OpenGL you can do 3D pretty
fast in software for gaming. Take a look at stuff like TinyGL



* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-11-01 11:29                                               ` Alan Cox
@ 2004-11-01 12:36                                                 ` Jan Engelhardt
  0 siblings, 0 replies; 99+ messages in thread
From: Jan Engelhardt @ 2004-11-01 12:36 UTC (permalink / raw)
  To: Alan Cox; +Cc: Z Smith, Linux Kernel Mailing List

>> Whatever you do, 3D at the software level is slow, even with a fast comp.
>> See MESA.
>
>If you are willing to lose a few bits of OpenGL you can do 3D pretty
>fast in software for gaming. Take a look at stuff like TinyGL

Ok, you're right. But to be honest, it does not need to be GL. Just look at
UnrealTournament (runs fine on a PII W98 w/233MHz, in software mode!)



Jan Engelhardt
-- 
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, www.gwdg.de


* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-11-01 11:27                                             ` Alan Cox
@ 2004-11-01 13:40                                               ` Denis Vlasenko
  2004-11-01 23:04                                                 ` Alan Cox
  0 siblings, 1 reply; 99+ messages in thread
From: Denis Vlasenko @ 2004-11-01 13:40 UTC (permalink / raw)
  To: Alan Cox, Theodore Ts'o; +Cc: Jeff Garzik, Linux Kernel Mailing List

On Monday 01 November 2004 13:27, Alan Cox wrote:
> 2. Most of the pages of libxml2.so don't get paged in by a typical
> application

This assumes that 'needed' functions are close together.
This can easily not be the case, so you end up using only
a fraction of each fetched page's content.

Also, this argument tends to defend library growth.
"It's mostly unused, don't worry". What if that
is not true? How do you compare the RAM footprint
of the new versus the old lib in this case?
Just believe that it didn't get worse?

This can't be checked easily:
even a -static compile can fail to help.
glibc produces a nearly 400kb executable for

int main() { return 0; }

because init code uses printf on error paths and
that pulls i18n in. How many kilobytes of that
really run - who knows...
--
vda



* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31 20:53                                           ` Z Smith
  2004-10-31 23:35                                             ` Rogério Brito
@ 2004-11-01 14:48                                             ` Diego Calleja
  2004-11-01 15:09                                               ` [OT] " Russell Miller
  1 sibling, 1 reply; 99+ messages in thread
From: Diego Calleja @ 2004-11-01 14:48 UTC (permalink / raw)
  To: Z Smith; +Cc: linux-kernel

On Sun, 31 Oct 2004 12:53:21 -0800 Z Smith <plinius@comcast.net> wrote:

> But not everyone can tolerate today's level of bloat.

Sadly it's true, but on the other hand I haven't seen anything like gnome/kde
that doesn't eat lots of resources (mac os x and XP are no better; beos was
better, they say), which makes me think that building a desktop environment
without eating lots of resources is not easy. Well, and your project is also
bloat in some ways... it's small and all that, but putting a graphics system
inside the kernel is one of the best definitions of "bloat" you can find...


* [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-11-01 14:48                                             ` Diego Calleja
@ 2004-11-01 15:09                                               ` Russell Miller
  0 siblings, 0 replies; 99+ messages in thread
From: Russell Miller @ 2004-11-01 15:09 UTC (permalink / raw)
  To: linux-kernel

On Monday 01 November 2004 08:48, Diego Calleja wrote:

> Sadly it's true, but on the other hand I haven't seen anything like
> gnome/kde that doesn't eat lots of resources (mac os x and XP are no
> better; beos was better, they say)

Part of the problem with KDE is the QT library underneath it all.  QT 4 is 
supposed to be leaner and faster.  The KDE folks seem to be trying pretty 
hard to reduce bloat whenever possible.  But when you have software that's 
expected to have the kitchen sink, it's especially challenging to reduce the 
footprint while keeping all of the functionality.

I use openbox on my laptop.  It's nothing near KDE in terms of functionality, 
but it also runs reasonably snappy on a Pentium 266, so I can't complain too 
much.

So far I'm pretty glad that the linux kernel developers have resisted putting 
graphics calls and routines into the kernel.  It slows things down a bit, but 
I'd like to think you guys have learned from MS's mistakes.  IMO one of the 
biggest mistakes they ever made was to pollute the NT kernel with the 
graphics subsystem.  That said, FBUI looks like an interesting add-on 
project.

Enough of my off topic ranting...

--Russell

-- 

Russell Miller - rmiller@duskglow.com - Le Mars, IA
Duskglow Consulting - Helping companies just like you to succeed for ~ 10 yrs.
http://www.duskglow.com - 712-546-5886


* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-10-31  6:49                                         ` Jan Engelhardt
  2004-10-31 21:09                                           ` Z Smith
@ 2004-11-01 15:17                                           ` Lee Revell
  2004-11-01 16:56                                             ` Kristian Høgsberg
  1 sibling, 1 reply; 99+ messages in thread
From: Lee Revell @ 2004-11-01 15:17 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: linux-kernel, xorg

On Sun, 2004-10-31 at 07:49 +0100, Jan Engelhardt wrote:
> Z Smith wrote:
> >Or join me in my effort to limit bloat. Why use an X server
> >that uses 15-30 megs of RAM when you can use FBUI which is 25 kilobytes
> >of code with very minimal kmallocing?
> 
> FBUI does not have 3d acceleration?

Um I don't think chucking X is the answer.  The problem is that it's
embarrassingly slow compared to any modern GUI.  If the display were as
snappy as WinXP I don't care if it's 200MB.  On my desktop I constantly
see windows redrawing every freaking widget in situations where XP would
just blit from an offscreen buffer or something.

Anyway please keep replies off LKML and on the Xorg list...

Lee



* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-11-01 15:17                                           ` Lee Revell
@ 2004-11-01 16:56                                             ` Kristian Høgsberg
  0 siblings, 0 replies; 99+ messages in thread
From: Kristian Høgsberg @ 2004-11-01 16:56 UTC (permalink / raw)
  To: Discuss issues related to the xorg tree; +Cc: Jan Engelhardt, linux-kernel

Lee Revell wrote:
> On Sun, 2004-10-31 at 07:49 +0100, Jan Engelhardt wrote:
> 
>>Z Smith wrote:
>>
>>>Or join me in my effort to limit bloat. Why use an X server
>>>that uses 15-30 megs of RAM when you can use FBUI which is 25 kilobytes
>>>of code with very minimal kmallocing?
>>
>>FBUI does not have 3d acceleration?
> 
> 
> Um I don't think chucking X is the answer.  The problem is that it's
> embarrassingly slow compared to any modern GUI.  If the display were as
> snappy as WinXP I don't care if it's 200MB.  On my desktop I constantly
> see windows redrawing every freaking widget in situations where XP would
> just blit from an offscreen buffer or something.
> 
> Anyway please keep replies off LKML and on the Xorg list...

Actually, please keep replies off the Xorg list as well.

Thanks,
Kristian


* Re: code bloat [was Re: Semaphore assembly-code bug]
  2004-11-01 13:40                                               ` Denis Vlasenko
@ 2004-11-01 23:04                                                 ` Alan Cox
  0 siblings, 0 replies; 99+ messages in thread
From: Alan Cox @ 2004-11-01 23:04 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: Theodore Ts'o, Jeff Garzik, Linux Kernel Mailing List

On Mon, 2004-11-01 at 13:40, Denis Vlasenko wrote:
> This assumes that 'needed' functions are close together.
> This can easily not be the case, so you end up using only
> a fraction of each fetched page's content.

And gprof will help you sort that out; along with -ffunction-sections
you can do pretty fine-grained tidying.




* Re: Semaphore assembly-code bug
  2004-11-01 22:16                                     ` linux-os
  2004-11-01 22:26                                       ` Linus Torvalds
  2004-11-03  1:52                                       ` Horst von Brand
@ 2004-11-03 21:24                                       ` Bill Davidsen
  2 siblings, 0 replies; 99+ messages in thread
From: Bill Davidsen @ 2004-11-03 21:24 UTC (permalink / raw)
  To: linux-os
  Cc: Linus Torvalds, dean gaudet, Andreas Steinmetz,
	Kernel Mailing List, Richard Henderson, Andi Kleen,
	Andrew Morton, Jan Hubicka

linux-os wrote:

> You just don't get it. I, too, can make a so-called bench-mark
> that will "prove" something that's so incredibly invalid that
> it shouldn't even deserve an answer. However, because you
> are supposed to know what you are doing, I will give you
> an answer.
> 
> It is totally impossible to perform useful work with memory,
> i.e., popping the value of something from memory into a register,
> without incurring the cost of that memory access. It doesn't
> matter if the memory is in cache or if it needs to be read
> using the memory controller. Time is time and it never runs
> backwards. I spend most of my days with hardware logic analyzers
> looking at the memory accesses so I damn-well know what I
> am talking about. That memory-access takes a time-slot that
> something else can't use. You never get it back. It is gone
> forever. This is very important to understand. If you don't
> understand this, you can fall into the "black-magic" trap.
> 
> Modern CPUs make it easy for so-called software engineers to
> perceive so-called facts that are not, in fact, true. Because
> it is possible for the CPU to perform memory-access independent
> of instruction sequence (so-called parallel operations), it is
> possible to make bench-marks that prove nothing, but seem to
> show that a read from memory is free. It can never be free. It
> will eventually show up. It was just deferred. Of course, if
> your computer is just going to run that single bench-mark, then
> return to a prompt, you can readily become a victim of a very
> common error because there is now plenty of time available to
> just spin (or wait for an interrupt).
> 
> So, if you really want to make things fast, you keep your
> memory accesses to the absolute minimum. Popping something
> from the stack is the antithesis of what you want to do.
> 
> It's really amusing. Software development has devolved
> into some black magic where logic, mathematics, and
> physical testing no longer thrive.
> 
> Instead, we must listen to those who profess to know
> about this magic because of some innate enlightenment
> imparted to those favored few who are able to perceive
> the trueness of their intellectual perception without
> regard for contrary physical observations.
> 
> It's wonderful to not be bothered by tests, measurements,
> documentation, or other facts.
> 
> Wake up and don't be dragged into the black-magic trap.

The election is over, we can adopt a civil non-confrontational tone 
again... Linus is not always right, but like most people he responds 
better to "let me give you additional information" than "I know more 
than you, take my word for it."

In this case, I think Dick does have a point on memory to cache use. It 
appears from what little stuff I have here that with HT cache access is 
serialized, and that memory access, even to L1 cache, might under some 
circumstances be delayed. I won't guess if you would ever see that in 
practice.

Getting information out of noisy measurements is not easy, and while 
Dick is probably right that the lowest time is the "real" time, if the 
average is lower doing something else, isn't that what we want?

My response test reports low, high, average, median, and 90th percentile 
values, and depending on whether you want the best average, best 
typical, or worst case avoidance you might find any of them useful. Oh, 
and S.D. of the data to hint on how much you trust the results. I don't 
think any of the test programs produce the definitive result, and I see 
that results change depending on the CPU.
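
As an illustration of that kind of report, here is a minimal C sketch that summarizes a set of timing samples. It is not the response test mentioned above: the struct and function names are hypothetical, the 90th percentile uses the nearest-rank definition, and the S.D. is the population standard deviation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical summary of a set of timing samples (e.g. CPU clocks). */
struct timing_summary {
    double low, high, mean, median, p90, stddev;
};

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Square root by Newton's iteration, to avoid depending on libm. */
static double newton_sqrt(double v)
{
    double x = v;

    if (v <= 0.0)
        return 0.0;
    for (int i = 0; i < 60; i++)
        x = 0.5 * (x + v / x);
    return x;
}

/* Sorts samples[] in place. Returns 0 on success, -1 if n == 0. */
int summarize(double *samples, size_t n, struct timing_summary *out)
{
    double sum = 0.0, sq = 0.0;
    size_t rank;

    if (n == 0)
        return -1;
    qsort(samples, n, sizeof samples[0], cmp_double);

    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    out->mean = sum / n;
    for (size_t i = 0; i < n; i++)
        sq += (samples[i] - out->mean) * (samples[i] - out->mean);
    out->stddev = newton_sqrt(sq / n);      /* population S.D. */

    out->low = samples[0];
    out->high = samples[n - 1];
    out->median = (n % 2) ? samples[n / 2]
                          : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    rank = (9 * n + 9) / 10;                /* ceil(0.9 * n), nearest rank */
    out->p90 = samples[rank - 1];
    return 0;
}
```

For the ten samples 12, 8, 8, 12, 8, 12, 8, 88, 12, 8, a single outlier of 88 pulls the mean up to 17.6 while the low is 8, the median is 10 and the 90th percentile is 12. Which of those numbers matters depends on whether you care about the best case, the typical case, or the tail.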

I think there are a lot of things more deserving this level of 
consideration.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me


* Re: Semaphore assembly-code bug
  2004-11-01 22:16                                     ` linux-os
  2004-11-01 22:26                                       ` Linus Torvalds
@ 2004-11-03  1:52                                       ` Horst von Brand
  2004-11-03 21:24                                       ` Bill Davidsen
  2 siblings, 0 replies; 99+ messages in thread
From: Horst von Brand @ 2004-11-03  1:52 UTC (permalink / raw)
  To: linux-os
  Cc: Linus Torvalds, dean gaudet, Andreas Steinmetz,
	Kernel Mailing List, Richard Henderson, Andi Kleen,
	Andrew Morton, Jan Hubicka

linux-os <linux-os@chaos.analogic.com> said:

[...]

> Instead, we must listen to those who profess to know
> about this magic because of some innate enlightenment
> imparted to those favored few who are able to perceive
> the trueness of their intellectual perception without
> regard for contrary physical observations.

Right. Just go and tell that to somebody who actually designed one of the
competing CPU's innards. And who probably learnt nothing whatsoever about the
ones it was mimicking in the process.

> It's wonderful to not be bothered by tests, measurements,
> documentation, or other facts.

How true.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513


* Re: Semaphore assembly-code bug
  2004-11-02 16:06                                           ` Linus Torvalds
@ 2004-11-02 16:51                                             ` linux-os
  0 siblings, 0 replies; 99+ messages in thread
From: linux-os @ 2004-11-02 16:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka

On Tue, 2 Nov 2004, Linus Torvalds wrote:

>
>
> On Tue, 2 Nov 2004, Linus Torvalds wrote:
>>
>> Just change the incorrect "3" in <asm-i386/linkage.h> (or whatever, this
>> is from memory) back to a "0"
>
> .. or just use the current -bk snapshot, actually. I may not have x86 as
> my main desktop, but it's not like I had a really hard time finding one
> (like the laptop laying there right on top of the desk ;), so the fixed
> version got checked in already.
>
> 		Linus

Okay. I got linux-2.6.9 back up.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by John Ashcroft.
                  98.36% of all statistics are fiction.


* Re: Semaphore assembly-code bug
  2004-11-02 16:02                                         ` Linus Torvalds
@ 2004-11-02 16:06                                           ` Linus Torvalds
  2004-11-02 16:51                                             ` linux-os
  0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-11-02 16:06 UTC (permalink / raw)
  To: linux-os
  Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka



On Tue, 2 Nov 2004, Linus Torvalds wrote:
> 
> Just change the incorrect "3" in <asm-i386/linkage.h> (or whatever, this 
> is from memory) back to a "0"

.. or just use the current -bk snapshot, actually. I may not have x86 as 
my main desktop, but it's not like I had a really hard time finding one 
(like the laptop laying there right on top of the desk ;), so the fixed 
version got checked in already.

		Linus


* Re: Semaphore assembly-code bug
  2004-11-02 15:02                                       ` linux-os
@ 2004-11-02 16:02                                         ` Linus Torvalds
  2004-11-02 16:06                                           ` Linus Torvalds
  0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-11-02 16:02 UTC (permalink / raw)
  To: linux-os
  Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka



On Tue, 2 Nov 2004, linux-os wrote:
> 
> The patch you provided patched without any rejects. However,
> the system won't boot.

Yes, there was an incorrect change to the "asmlinkage" definition that I 
had played with before deciding to make just the semaphores be reg-arg, 
and that change made it into my original patch by mistake. I sent out a 
second message asking people to remove that part of the patch some time 
later, but..

> I patched Linux-2.6.9. Could you please review your patch?
> I will await the possibility of a simple typo that I can
> fix by hand before reverting.

Just change the incorrect "3" in <asm-i386/linkage.h> (or whatever, this 
is from memory):

	#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(3)))

back to a "0". Asmlinkage still uses stack-based parameter passing, which
I'd love to fix eventually (we've had bugs in that area too), but it is
just too much pain to do right now.
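
For readers unfamiliar with the attribute, here is a minimal C sketch of what the "3" versus "0" selects. The macro and function names are hypothetical and this is not the kernel's definition; only regparm itself is a real GCC attribute, and it only has an effect on 32-bit x86, so the sketch compiles it away elsewhere.

```c
/*
 * Hypothetical macro names. On 32-bit x86, regparm(n) asks GCC to pass
 * up to n integer arguments in EAX, EDX and ECX instead of on the stack.
 */
#if defined(__i386__)
#define ARGS_ON_STACK __attribute__((regparm(0)))
#define ARGS_IN_REGS  __attribute__((regparm(3)))
#else
#define ARGS_ON_STACK
#define ARGS_IN_REGS
#endif

/* Callee reads a, b, c from the caller's stack frame. */
ARGS_ON_STACK int sum3_stack(int a, int b, int c)
{
    return a + b + c;
}

/* Callee reads a, b, c from EAX, EDX, ECX (on i386). */
ARGS_IN_REGS int sum3_regs(int a, int b, int c)
{
    return a + b + c;
}
```

From C the two functions behave identically, because the compiler sees the attribute at both the call site and the definition. The trouble presumably starts when hand-written assembly pushes arguments on the stack and then calls a function that was compiled to expect them in registers, which would be consistent with the early boot failure reported above.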

		Linus


* Re: Semaphore assembly-code bug
  2004-11-01 21:46                                     ` Linus Torvalds
@ 2004-11-02 15:02                                       ` linux-os
  2004-11-02 16:02                                         ` Linus Torvalds
  0 siblings, 1 reply; 99+ messages in thread
From: linux-os @ 2004-11-02 15:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka


Linus,

The patch you provided patched without any rejects. However,
the system won't boot. It will not even get to
  "Uncompressing Linux". After the GRUB loader sign-on,
the interrupts just remain disabled (no caps-lock or num-lock
change on the keyboard).

I patched Linux-2.6.9. Could you please review your patch?
I will await the possibility of a simple typo that I can
fix by hand before reverting.


Cheers,
Dick Johnson
Penguin : Linux version 2.6.8 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by John Ashcroft.
                  98.36% of all statistics are fiction.


* Re: Semaphore assembly-code bug
  2004-11-01 21:40                                   ` Linus Torvalds
  2004-11-01 21:46                                     ` Linus Torvalds
  2004-11-01 22:16                                     ` linux-os
@ 2004-11-02  6:37                                     ` Chris Friesen
  2 siblings, 0 replies; 99+ messages in thread
From: Chris Friesen @ 2004-11-02  6:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-os, dean gaudet, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka

Linus Torvalds wrote:

> On Intel, if I recall correctly, rdtsc is totally serializing, so you're
> testing not just the instructions between the rdtsc's, but the length of
> the pipeline, and the time it takes for stuff around it to calm down.  

Actually, the Intel docs say that rdtsc is not serializing (specifically for the 
P6 series, but linked off the P4 section of the site) and their sample 
performance measuring code for the P4 shows it using a serializing instruction 
before the call to rdtsc.
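
A minimal C sketch of that pattern (a serializing instruction, then rdtsc), using GCC-style inline assembly. This is not Intel's sample code: the function names and the workload are hypothetical, the non-x86 branch exists only so the sketch compiles everywhere, and very old GCC versions may refuse the EBX constraint in 32-bit PIC builds.

```c
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
/* Execute CPUID (a serializing instruction), then read the TSC. */
static inline uint64_t serialized_rdtsc(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx, lo, hi;

    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;
    (void)edx;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
#else
/* Non-x86 fallback so the sketch still compiles; it measures nothing. */
static inline uint64_t serialized_rdtsc(void) { return 0; }
#endif

/* Hypothetical workload: enough work to dwarf the timer overhead. */
void sample_work(void)
{
    volatile unsigned counter = 0;

    for (unsigned i = 0; i < 1000; i++)
        counter++;
}

/* Smallest cycle count seen for one call of fn over `runs` attempts. */
uint64_t min_cycles(void (*fn)(void), unsigned runs)
{
    uint64_t best = UINT64_MAX;

    for (unsigned i = 0; i < runs; i++) {
        uint64_t t0 = serialized_rdtsc();
        fn();
        uint64_t t1 = serialized_rdtsc();

        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

The result still includes the cost of the CPUID and rdtsc pair itself, so it is only meaningful when the code under test runs for much longer than that overhead. Timing a three-instruction sequence this way mostly measures the timer.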

Chris


* Re: Semaphore assembly-code bug
  2004-11-01 23:14                                         ` linux-os
@ 2004-11-01 23:42                                           ` Linus Torvalds
  0 siblings, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-11-01 23:42 UTC (permalink / raw)
  To: linux-os
  Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka



On Mon, 1 Nov 2004, linux-os wrote:
> 
> No. You've just shown that you like to argue. I recall that you
> recently, like within the past 24 hours, supplied a patch that
> got rid of the time-consuming stack operations in your semaphore
> code. Remember, you changed it to pass parameters in registers.

... because that fixed a _bug_.

> Why would you bother if stack operations are free?

I didn't say that instructions are free. I just tried (unsuccessfully) to 
tell you that "lea" is not free either, and that "lea" has some serious 
problems on several setups, ranging from old cpu's (AGI stalls) to new 
CPU's (stack engine stalls). And that "pop" is often faster.

And you have been arguing against it despite the fact that I ended up 
writing a small test-program to show that it's true. It's a _stupid_ 
test-program, but the fact is, you only need a single test-case to prove 
some theory wrong.

Your theory that "lea" is somehow always cheaper than "pop" is wrong. 

> It's not a total focus. It's just necessary emphasis. Any work
> done by your computer, ultimately comes from and goes to memory.

Not so.

A lot of work is done in cache. Any access that doesn't change the state
of the cache is a no-op as far as the memory bus is concerned. Ie a store
to a cacheline that is already dirty is just a cache access, as is a load
from a cacheline that is already loaded.

This is especially true on x86 CPU's, where the lack of registers means 
that the core has been highly optimized for doing cached operations. 
Remember: a CPU is not some kind of abstract entity - it's a very 
practical piece of engineering that has been highly optimized for certain 
usage patterns.

And the fact is, "lea" on %esp is not a common usage pattern. Which is 
why, in practice, you will find CPU's that end up not optimizing for it. 
While "pop"+"pop" is a _very_ common pattern, and why existing CPU's 
do them efficiently.

		Linus


* Re: Semaphore assembly-code bug
  2004-11-01 22:26                                       ` Linus Torvalds
@ 2004-11-01 23:14                                         ` linux-os
  2004-11-01 23:42                                           ` Linus Torvalds
  0 siblings, 1 reply; 99+ messages in thread
From: linux-os @ 2004-11-01 23:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka

On Mon, 1 Nov 2004, Linus Torvalds wrote:

>
>
> On Mon, 1 Nov 2004, linux-os wrote:
>>
>> You just don't get it. I, too, can make a so-called bench-mark
>> that will "prove" something that's so incredibly invalid that
>> it shouldn't even deserve an answer.
>
> *Plonk*
>
> You've just shown that not only do you ignore well-educated people who
> tell you why pipelines can have trouble with "lea", you also ignore hard
> numbers.
>

No. You've just shown that you like to argue. I recall that you
recently, like within the past 24 hours, supplied a patch that
got rid of the time-consuming stack operations in your semaphore
code. Remember, you changed it to pass parameters in registers.

Why would you bother if stack operations are free? The fact is
that you know that even a single extra memory access (i.e., a
stack operation) is costly. You just don't want to admit that
(remember the original premise of this discussion) popping
into an unused register to level the stack, is NOT better than
adding to the stack-pointer or, as another learned engineer
advised, using LEA instead.

I simply wrote some code that showed that popping registers used
more CPU cycles than adding to the stack-pointer, and using
LEA instead of the ADD showed no difference. Of course I
was immediately overwhelmed by responses that the benchmark
was invalid, presumably because it wasn't written by somebody
else.

> Your total focus on a cached memory access as being somehow more expensive
> than anything else going in the CPU pipeline is sad.
>

It's not a total focus. It's just necessary emphasis. Any work
done by your computer, ultimately comes from and goes to memory.
Some is memory-mapped hardware "memory" some is simply RAM.
Managing those memory accesses is very important when it comes
to maximizing the work that your computer can do in a limited
period of time. Wasting memory-access time is something one
should not do if at all possible.

> But hey, I've run out of ways to show you wrong. If you believe the world
> is flat, that's your problem.
>
> 		Linus
>

No, the world is crooked, not flat.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by John Ashcroft.
                  98.36% of all statistics are fiction.


* Re: Semaphore assembly-code bug
  2004-11-01 22:16                                     ` linux-os
@ 2004-11-01 22:26                                       ` Linus Torvalds
  2004-11-01 23:14                                         ` linux-os
  2004-11-03  1:52                                       ` Horst von Brand
  2004-11-03 21:24                                       ` Bill Davidsen
  2 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-11-01 22:26 UTC (permalink / raw)
  To: linux-os
  Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka



On Mon, 1 Nov 2004, linux-os wrote:
> 
> You just don't get it. I, too, can make a so-called bench-mark
> that will "prove" something that's so incredibly invalid that
> it shouldn't even deserve an answer.

*Plonk*

You've just shown that not only do you ignore well-educated people who 
tell you why pipelines can have trouble with "lea", you also ignore hard 
numbers.

Your total focus on a cached memory access as being somehow more expensive
than anything else going in the CPU pipeline is sad.

But hey, I've run out of ways to show you wrong. If you believe the world 
is flat, that's your problem.

		Linus


* Re: Semaphore assembly-code bug
  2004-11-01 21:23                                   ` dean gaudet
@ 2004-11-01 22:22                                     ` linux-os
  0 siblings, 0 replies; 99+ messages in thread
From: linux-os @ 2004-11-01 22:22 UTC (permalink / raw)
  To: dean gaudet
  Cc: Linus Torvalds, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka

On Mon, 1 Nov 2004, dean gaudet wrote:

> On Mon, 1 Nov 2004, linux-os wrote:
>
>> On Mon, 1 Nov 2004, dean gaudet wrote:
>>
>>> On Sun, 31 Oct 2004, linux-os wrote:
>>>
>>>> Timer overhead = 88 CPU clocks
>>>> push 3, pop 3 = 12 CPU clocks
>>>> push 3, pop 2 = 12 CPU clocks
>>>> push 3, pop 1 = 12 CPU clocks
>>>> push 3, pop none using ADD = 8 CPU clocks
>>>> push 3, pop none using LEA = 8 CPU clocks
>>>> push 3, pop into same register = 12 CPU clocks
>>>
>>> your microbenchmark makes assumptions about rdtsc which haven't been valid
>>> since the days of the 486.  rdtsc has serializing aspects and overhead that
>>> you can't just eliminate by running it in a tight loop and subtracting out
>>> that "overhead".
>>>
>>
>> Wrong.
>
> if you were correct then i should be able to measure 1 cycle differences
> in sequences such as the following:
[SNIPPED...]

Who said? The resolution isn't even specified. Experimental
results with several different processors seem to show that
the resolution is about 4 cycles.

Script started on Mon 01 Nov 2004 04:48:04 PM EST
# ./tester
Timer overhead = 88 CPU clocks


1 nop = 4 CPU clocks
2 nops = 4 CPU clocks
3 nops = 4 CPU clocks
4 nops = 8 CPU clocks
5 nops = 8 CPU clocks
6 nops = 8 CPU clocks
7 nops = 8 CPU clocks
8 nops = 12 CPU clocks
# exit
Script done on Mon 01 Nov 2004 04:48:34 PM EST

Assembly :


nop8:	nop
nop7:	nop
nop6:	nop
nop5:	nop
nop4:	nop
nop3:	nop
nop2:	nop
nop1:	nop
 	ret

.global	nop1
.global	nop2
.global	nop3
.global	nop4
.global	nop5
.global	nop6
.global	nop7
.global	nop8


Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by John Ashcroft.
                  98.36% of all statistics are fiction.


* Re: Semaphore assembly-code bug
  2004-11-01 21:40                                   ` Linus Torvalds
  2004-11-01 21:46                                     ` Linus Torvalds
@ 2004-11-01 22:16                                     ` linux-os
  2004-11-01 22:26                                       ` Linus Torvalds
                                                         ` (2 more replies)
  2004-11-02  6:37                                     ` Chris Friesen
  2 siblings, 3 replies; 99+ messages in thread
From: linux-os @ 2004-11-01 22:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka

On Mon, 1 Nov 2004, Linus Torvalds wrote:

>
>
> On Mon, 1 Nov 2004, linux-os wrote:
>>
>> Wrong.
>>
>> (1)  The '486 didn't have the rdtsc instruction.
>> (2)  There are no 'serializing' or other black-magic aspects of
>> using the internal cycle-counter. That's exactly how you
>> can benchmark the execution time of accessible code sequences.
>
> Sorry, but you shouldn't argue with people who know more than you do. I
> know Dean, and he analyzes things for work, and does know what he is
> doing.
>
> "rdtsc" _does_ partly serialize things, and it's not even architecturally
> defined, so you'll find that it serializes things in different ways on
> different CPU's. You can't just do
>
> 	rdtsc
> 	...
> 	rdtsc
>
> and expect the stuff in between the rdtsc's to be timed exactly: some of
> it will overlap with the rdtsc's, some of it won't.
>
> On Intel, if I recall correctly, rdtsc is totally serializing, so you're
> testing not just the instructions between the rdtsc's, but the length of
> the pipeline, and the time it takes for stuff around it to calm down.
> Which is why two rdtsc's in sequence will show quite a lot of overhead on
> a P4 (something like 80 cycles).
>
> So you really want to do more operations in between the rdtsc's.
>
> Try the appended program. On a P4, the two sequences are the same for me
> (92 cycles, 80 cycles overhead), while on a Pentium M, the sequence of two
> popl's (57 cycles) is faster than the sequence of "lea+popl" (59 cycles)
> and the overhead is 47 cycles.
>
> So can you _please_ just admit that you were wrong? On a P4, the pop/pop
> is the same cost as lea/pop, and on a Pentium M the pop/pop is faster,
> according to this test. Your contention that "pop" has to be slower than
> "lea" is WRONG.
>
> 		Linus
>
> ----
> #include <stdio.h>
>
> #define PUSHEBX "pushl %%ebx\n\t"
> #define PUSHECX "pushl %%ecx\n\t"
> #define POPECX "popl %%ecx\n\t"
> #define POPEBX "popl %%ebx\n\t"
>
> #ifdef TEST_LEA
>
> #undef POPECX
> #define POPECX "leal 4(%%esp),%%esp\n\t"
>
> #endif
>
> #ifdef TEST_OVERHEAD
>
> #undef PUSHEBX
> #undef PUSHECX
> #undef POPEBX
> #undef POPECX
>
> #define PUSHEBX
> #define PUSHECX
> #define POPEBX
> #define POPECX
>
> #endif
>
> int main(void)
> {
> 	unsigned long start;
> 	unsigned long long end;
>
> 	asm volatile(
> 		PUSHEBX
> 		PUSHECX
> 		PUSHEBX
> 		PUSHECX
> 		PUSHEBX
> 		PUSHECX
> 		PUSHEBX
> 		PUSHECX
> 		PUSHEBX
> 		PUSHECX
> 		PUSHEBX
> 		PUSHECX
> 		PUSHEBX
> 		PUSHECX
> 		PUSHEBX
> 		PUSHECX
> 		"rdtsc\n\t"
> 		POPECX
> 		POPEBX
> 		POPECX
> 		POPEBX
> 		POPECX
> 		POPEBX
> 		POPECX
> 		POPEBX
> 		POPECX
> 		POPEBX
> 		POPECX
> 		POPEBX
> 		POPECX
> 		POPEBX
> 		POPECX
> 		POPEBX
> 		"movl %%eax,%%esi\n\t"
> 		"rdtsc"
> 		:"=A" (end), "=S" (start));
> 	printf("%ld cycles\n", (long) end-start);
> }
>

You just don't get it. I, too, can make a so-called bench-mark
that will "prove" something that's so incredibly invalid that
it shouldn't even deserve an answer. However, because you
are supposed to know what you are doing, I will give you
an answer.

It is totally impossible to perform useful work with memory,
i.e., popping the value of something from memory into a register,
without incurring the cost of that memory access. It doesn't
matter if the memory is in cache or if it needs to be read
using the memory controller. Time is time and it never runs
backwards. I spend most of my days with hardware logic analyzers
looking at the memory accesses so I damn-well know what I
am talking about. That memory-access takes a time-slot that
something else can't use. You never get it back. It is gone
forever. This is very important to understand. If you don't
understand this, you can fall into the "black-magic" trap.

Modern CPUs make it easy for so-called software engineers to
perceive so-called facts that are not, in fact, true. Because
it is possible for the CPU to perform memory-access independent
of instruction sequence (so-called parallel operations), it is
possible to make bench-marks that prove nothing, but seem to
show that a read from memory is free. It can never be free. It
will eventually show up. It was just deferred. Of course, if
your computer is just going to run that single bench-mark, then
return to a prompt, you can readily become a victim of a very
common error because there is now plenty of time available to
just spin (or wait for an interrupt).

So, if you really want to make things fast, you keep your
memory accesses to the absolute minimum. Popping something
from the stack is the antithesis of what you want to do.

It's really amusing. Software development has devolved
into some black magic where logic, mathematics, and
physical testing no longer thrive.

Instead, we must listen to those who profess to know
about this magic because of some innate enlightenment
imparted to those favored few who are able to perceive
the trueness of their intellectual perception without
regard for contrary physical observations.

It's wonderful to not be bothered by tests, measurements,
documentation, or other facts.

Wake up and don't be dragged into the black-magic trap.


Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by John Ashcroft.
                  98.36% of all statistics are fiction.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-11-01 21:40                                   ` Linus Torvalds
@ 2004-11-01 21:46                                     ` Linus Torvalds
  2004-11-02 15:02                                       ` linux-os
  2004-11-01 22:16                                     ` linux-os
  2004-11-02  6:37                                     ` Chris Friesen
  2 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-11-01 21:46 UTC (permalink / raw)
  To: linux-os
  Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka



On Mon, 1 Nov 2004, Linus Torvalds wrote:
> 
> So can you _please_ just admit that you were wrong? On a P4, the pop/pop 
> is the same cost as lea/pop, and on a Pentium M the pop/pop is faster, 
> according to this test. Your contention that "pop" has to be slower than 
> "lea" is WRONG. 

Btw, I'd like to emphasize "this test". Modern OoO CPU's are complex 
animals. They have pipeline quirks etc that just means that things depend 
on alignment, on code around it, and on register usage patterns of the 
instructions that you test _and_ the instructions around those 
instructions. So take any proof with a pinch of salt, because there are 
bound to be other circumstances where factors around the code just change 
the assumptions.

In short, any time you're looking at single cycle timings, you should be 
very aware of the fact that your measurements are suspect. The best way to 
avoid most of the problem is to never try to measure single cycles. 
Measure performance on a program, not on a single instruction.

			Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-11-01 20:52                                 ` linux-os
  2004-11-01 21:23                                   ` dean gaudet
@ 2004-11-01 21:40                                   ` Linus Torvalds
  2004-11-01 21:46                                     ` Linus Torvalds
                                                       ` (2 more replies)
  1 sibling, 3 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-11-01 21:40 UTC (permalink / raw)
  To: linux-os
  Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka



On Mon, 1 Nov 2004, linux-os wrote:
> 
> Wrong.
> 
> (1)  The '486 didn't have the rdtsc instruction.
> (2)  There are no 'serializing' or other black-magic aspects of
> using the internal cycle-counter. That's exactly how you
> can benchmark the execution time of accessible code sequences.

Sorry, but you shouldn't argue with people who know more than you do. I 
know Dean, and he analyzes things for work, and does know what he is 
doing. 

"rdtsc" _does_ partly serialize things, and it's not even architecturally 
defined, so you'll find that it serializes things in different ways on 
different CPU's. You can't just do

	rdtsc
	...
	rdtsc

and expect the stuff in between the rdtsc's to be timed exactly: some of 
it will overlap with the rdtsc's, some of it won't.

On Intel, if I recall correctly, rdtsc is totally serializing, so you're
testing not just the instructions between the rdtsc's, but the length of
the pipeline, and the time it takes for stuff around it to calm down.  
Which is why two rdtsc's in sequence will show quite a lot of overhead on
a P4 (something like 80 cycles).

So you really want to do more operations in between the rdtsc's.

Try the appended program. On a P4, the two sequences are the same for me
(92 cycles, 80 cycles overhead), while on a Pentium M, the sequence of two 
popl's (57 cycles) is faster than the sequence of "lea+popl" (59 cycles) 
and the overhead is 47 cycles.

So can you _please_ just admit that you were wrong? On a P4, the pop/pop 
is the same cost as lea/pop, and on a Pentium M the pop/pop is faster, 
according to this test. Your contention that "pop" has to be slower than 
"lea" is WRONG. 

		Linus

----
#include <stdio.h>

#define PUSHEBX "pushl %%ebx\n\t"
#define PUSHECX "pushl %%ecx\n\t"
#define POPECX "popl %%ecx\n\t"
#define POPEBX "popl %%ebx\n\t"

#ifdef TEST_LEA

#undef POPECX
#define POPECX "leal 4(%%esp),%%esp\n\t"

#endif

#ifdef TEST_OVERHEAD

#undef PUSHEBX
#undef PUSHECX
#undef POPEBX
#undef POPECX

#define PUSHEBX
#define PUSHECX
#define POPEBX
#define POPECX

#endif

int main(void)
{
	unsigned long start;
	unsigned long long end;

	asm volatile(
		PUSHEBX
		PUSHECX
		PUSHEBX
		PUSHECX
		PUSHEBX
		PUSHECX
		PUSHEBX
		PUSHECX
		PUSHEBX
		PUSHECX
		PUSHEBX
		PUSHECX
		PUSHEBX
		PUSHECX
		PUSHEBX
		PUSHECX
		"rdtsc\n\t"
		POPECX
		POPEBX
		POPECX
		POPEBX
		POPECX
		POPEBX
		POPECX
		POPEBX
		POPECX
		POPEBX
		POPECX
		POPEBX
		POPECX
		POPEBX
		POPECX
		POPEBX
		"movl %%eax,%%esi\n\t"
		"rdtsc"
		:"=A" (end), "=S" (start));
	printf("%ld cycles\n", (long) end-start);
}


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-11-01 20:52                                 ` linux-os
@ 2004-11-01 21:23                                   ` dean gaudet
  2004-11-01 22:22                                     ` linux-os
  2004-11-01 21:40                                   ` Linus Torvalds
  1 sibling, 1 reply; 99+ messages in thread
From: dean gaudet @ 2004-11-01 21:23 UTC (permalink / raw)
  To: linux-os
  Cc: Linus Torvalds, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka

On Mon, 1 Nov 2004, linux-os wrote:

> On Mon, 1 Nov 2004, dean gaudet wrote:
> 
> > On Sun, 31 Oct 2004, linux-os wrote:
> > 
> > > Timer overhead = 88 CPU clocks
> > > push 3, pop 3 = 12 CPU clocks
> > > push 3, pop 2 = 12 CPU clocks
> > > push 3, pop 1 = 12 CPU clocks
> > > push 3, pop none using ADD = 8 CPU clocks
> > > push 3, pop none using LEA = 8 CPU clocks
> > > push 3, pop into same register = 12 CPU clocks
> > 
> > your microbenchmark makes assumptions about rdtsc which haven't been valid
> > since the days of the 486.  rdtsc has serializing aspects and overhead that
> > you can't just eliminate by running it in a tight loop and subtracting out
> > that "overhead".
> > 
> 
> Wrong.

if you were correct then i should be able to measure 1 cycle differences
in sequences such as the following:

	rdtsc
	mov %eax,%edi
	shr $1,%ecx
	rdtsc

	rdtsc
	mov %eax,%edi
	shr $1,%ecx
	shr $1,%ecx
	rdtsc
...
	rdtsc
	mov %eax,%edi
	shr $1,%ecx
	shr $1,%ecx
	shr $1,%ecx
	shr $1,%ecx
	shr $1,%ecx
	shr $1,%ecx
	shr $1,%ecx
	shr $1,%ecx
	rdtsc

yet the attached program demonstrates that such measurements are
inaccurate.  the results should be a sequence of numbers increasing
by 1 each time.

p4 model 2:	80 80 84 84 84 84 84 84
p4 model 3:	120 120 120 120 120 120 120 128
p-m model 9:	47 46 47 48 49 50 56 57
k8:		5 5 5 5 5 5 5 5
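
The expectation stated above (each added `shr` should raise the reading by exactly one cycle) can be checked mechanically against the four rows of results. This is a small sketch using only the numbers in the table; `increases_by_one` and the `ideal` array are hypothetical names added for illustration and are not part of rdtsc-rounding.c.

```c
#include <stddef.h>

/*
 * Hypothetical check: returns 1 if each reading is exactly one more than
 * the previous one, which is what cycle-exact timing of 1..n dependent
 * "shr" instructions would produce; returns 0 otherwise.
 */
int increases_by_one(const unsigned *cycles, size_t n)
{
	size_t i;

	for (i = 1; i < n; ++i) {
		if (cycles[i] != cycles[i - 1] + 1)
			return 0;
	}
	return 1;
}

/* Readings copied from the table above. */
const unsigned p4_model2[8] = { 80, 80, 84, 84, 84, 84, 84, 84 };
const unsigned p4_model3[8] = { 120, 120, 120, 120, 120, 120, 120, 128 };
const unsigned pm_model9[8] = { 47, 46, 47, 48, 49, 50, 56, 57 };
const unsigned k8[8]        = { 5, 5, 5, 5, 5, 5, 5, 5 };

/* Made-up row showing what a cycle-exact result would look like. */
const unsigned ideal[8]     = { 1, 2, 3, 4, 5, 6, 7, 8 };
```

None of the four measured rows passes the check; only the made-up `ideal` row does. The p-m row comes closest, but it starts by going down (47 to 46) and jumps by 6 near the end.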

-dean

% gcc -O -o rdtsc-rounding rdtsc-rounding.c

rdtsc-rounding.c:

#include <stdio.h>
#include <stdint.h>

#define template(n) \
	static uint32_t foo##n(void) \
	{ \
		uint32_t start, done, trash1, trash2; \
 \
		__asm volatile( \
			"\n	rdtsc" \
			"\n	mov %%eax,%0" \
			x##n("\n	shr $1,%1") \
			"\n	rdtsc" \
			: "=&r" (start), "=&r" (trash1), "=&a" (done), "=&d" (trash2) \
		); \
		return done - start; \
	}

#define x1(x) x
#define x2(x) x x
#define x3(x) x x x
#define x4(x) x2(x) x2(x)
#define x5(x) x4(x) x
#define x6(x) x3(x2(x))
#define x7(x) x6(x) x
#define x8(x) x4(x2(x))

template(1)
template(2)
template(3)
template(4)
template(5)
template(6)
template(7)
template(8)

static uint32_t (*fn[9])(void) = {
	0, foo1, foo2, foo3, foo4, foo5, foo6, foo7, foo8
};


static uint32_t bench(uint32_t (*f)(void))
{
	uint32_t best;
	unsigned i;

	best = ~0;
	for (i = 0; i < 100000; ++i) {
		uint32_t cur = f();
		if (cur < best) {
			best = cur;
		}
	}
	return best;
}


int main(int argc, char **argv)
{
	unsigned i;

	for (i = 1; i < sizeof(fn)/sizeof(fn[0]); ++i) {
		printf("%u ", bench(fn[i]));
	}
	printf("\n");
	return 0;
}

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-11-01 20:23                               ` dean gaudet
@ 2004-11-01 20:52                                 ` linux-os
  2004-11-01 21:23                                   ` dean gaudet
  2004-11-01 21:40                                   ` Linus Torvalds
  0 siblings, 2 replies; 99+ messages in thread
From: linux-os @ 2004-11-01 20:52 UTC (permalink / raw)
  To: dean gaudet
  Cc: Linus Torvalds, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka

On Mon, 1 Nov 2004, dean gaudet wrote:

> On Sun, 31 Oct 2004, linux-os wrote:
>
>> Timer overhead = 88 CPU clocks
>> push 3, pop 3 = 12 CPU clocks
>> push 3, pop 2 = 12 CPU clocks
>> push 3, pop 1 = 12 CPU clocks
>> push 3, pop none using ADD = 8 CPU clocks
>> push 3, pop none using LEA = 8 CPU clocks
>> push 3, pop into same register = 12 CPU clocks
>
> your microbenchmark makes assumptions about rdtsc which haven't been valid 
> since the days of the 486.  rdtsc has serializing aspects and overhead that 
> you can't just eliminate by running it in a tight loop and subtracting out 
> that "overhead".
>

Wrong.

(1)  The '486 didn't have the rdtsc instruction.
(2)  There are no 'serializing' or other black-magic aspects of
using the internal cycle-counter. That's exactly how you
can benchmark the execution time of accessible code sequences.

> you have to run your inner loops at least a few thousand times between
> rdtsc invocations and divide it out to find out the average cost in order to 
> eliminate the problems associated with rdtsc.
>
> -dean
>

You never average the cycle-time. The cycle-time is absolute.
You need to remove the effect of interrupts when you measure
performance so you need to sample a few times and save the
lowest number. That's the number obtained when the testing interval
was not interrupted.

The provided code allows you to experiment. You can set the
TRIES count to 1. You will find that the results are noisy if
you are connected to an active network. Good results can be
obtained with it set to 4 if your computer is not being blasted
with lots of broadcast packets from M$ servers.
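
The "sample a few times and save the lowest number" approach can be sketched as follows. This is not the attached tester (the attachment is not reproduced in this thread); `sample_fn`, `lowest_sample` and `fake_measure` are hypothetical names, and `fake_measure` replays made-up values so the sketch runs on any machine.

```c
#include <stdint.h>

/* Hypothetical signature: one timed run, returning elapsed cycles. */
typedef uint32_t (*sample_fn)(void);

/*
 * Run the measurement `tries` times and keep the smallest result. A run
 * that was interrupted can only come out longer, never shorter, so the
 * minimum is the best estimate of an undisturbed run.
 */
uint32_t lowest_sample(sample_fn measure, unsigned tries)
{
	uint32_t best = UINT32_MAX;
	unsigned i;

	for (i = 0; i < tries; ++i) {
		uint32_t cur = measure();
		if (cur < best)
			best = cur;
	}
	return best;
}

/* Stand-in for a real timed run: replays a fixed, made-up series. */
uint32_t fake_measure(void)
{
	static const uint32_t series[4] = { 140, 12, 97, 12 };
	static unsigned next;

	return series[next++ % 4];
}
```

With `tries` set to 1 the result is whatever the single run happened to see, which is the noise described above. Taking the minimum filters out interrupted runs, but it does not remove any fixed cost or rounding of the counter itself.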

>

Of course you are not really interested in learning anything
about this are you?

Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by John Ashcroft.
                  98.36% of all statistics are fiction.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-11-01  1:31                             ` linux-os
  2004-11-01  5:49                               ` Linus Torvalds
@ 2004-11-01 20:23                               ` dean gaudet
  2004-11-01 20:52                                 ` linux-os
  1 sibling, 1 reply; 99+ messages in thread
From: dean gaudet @ 2004-11-01 20:23 UTC (permalink / raw)
  To: linux-os
  Cc: Linus Torvalds, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka

[-- Attachment #1: Type: TEXT/PLAIN, Size: 757 bytes --]

On Sun, 31 Oct 2004, linux-os wrote:

> Timer overhead = 88 CPU clocks
> push 3, pop 3 = 12 CPU clocks
> push 3, pop 2 = 12 CPU clocks
> push 3, pop 1 = 12 CPU clocks
> push 3, pop none using ADD = 8 CPU clocks
> push 3, pop none using LEA = 8 CPU clocks
> push 3, pop into same register = 12 CPU clocks

your microbenchmark makes assumptions about rdtsc which haven't been valid 
since the days of the 486.  rdtsc has serializing aspects and overhead 
that you can't just eliminate by running it in a tight loop and 
subtracting out that "overhead".

you have to run your inner loops at least a few thousand times between 
rdtsc invocations and divide it out to find out the average cost in order 
to eliminate the problems associated with rdtsc.
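
A minimal sketch of the "run it a few thousand times and divide" method, with the counter and the code under test passed in as function pointers. All names here (`counter_fn`, `body_fn`, `average_cost`, the `fake_*` helpers) are hypothetical, and the fake counter uses made-up costs (80 ticks per read, 3 per body) so the arithmetic can be followed without real hardware.

```c
#include <stdint.h>

/* Hypothetical hooks so the sketch does not depend on a particular CPU. */
typedef uint64_t (*counter_fn)(void);	/* reads a cycle counter, e.g. rdtsc */
typedef void (*body_fn)(void);		/* the code sequence under test */

/*
 * Time `iterations` back-to-back runs of `body` with one counter read on
 * each side, then divide. The fixed cost of the counter reads is spread
 * over all iterations instead of being subtracted out.
 */
double average_cost(counter_fn read_counter, body_fn body, unsigned iterations)
{
	uint64_t start, end;
	unsigned i;

	if (iterations == 0)
		return 0.0;

	start = read_counter();
	for (i = 0; i < iterations; ++i)
		body();
	end = read_counter();

	return (double)(end - start) / iterations;
}

/* Made-up model for demonstration: each read costs 80 ticks, each body 3. */
static uint64_t fake_now;

uint64_t fake_counter(void)
{
	fake_now += 80;
	return fake_now;
}

void fake_body(void)
{
	fake_now += 3;
}
```

In this model a single iteration reports 83 ticks, almost all of it counter overhead, while 10000 iterations report just over 3 per iteration. On real hardware the loop and the indirect call add their own cost, so the body would normally be unrolled or inlined.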

-dean

[-- Attachment #2: Type: APPLICATION/X-GZIP, Size: 6806 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-11-01  1:31                             ` linux-os
@ 2004-11-01  5:49                               ` Linus Torvalds
  2004-11-01 20:23                               ` dean gaudet
  1 sibling, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-11-01  5:49 UTC (permalink / raw)
  To: linux-os
  Cc: Andreas Steinmetz, Kernel Mailing List, Richard Henderson,
	Andi Kleen, Andrew Morton, Jan Hubicka



On Sun, 31 Oct 2004, linux-os wrote:
> 
> The attached file shows that the Intel Pentium 4 runs exactly as I
> described. Further, there is no difference in the CPU clocks used when
> adding a constant to the stack- pointer or using LEA.

Goodie. You found _one_ CPU that you think matters. One that doesn't even 
have the hardware that I've described. And you ignore all the other 
evidence. 

Good for you.

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 19:12                           ` Linus Torvalds
@ 2004-11-01  1:31                             ` linux-os
  2004-11-01  5:49                               ` Linus Torvalds
  2004-11-01 20:23                               ` dean gaudet
  0 siblings, 2 replies; 99+ messages in thread
From: linux-os @ 2004-11-01  1:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Steinmetz, Kernel Mailing List, Richard Henderson,
	Andi Kleen, Andrew Morton, Jan Hubicka

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2280 bytes --]

On Fri, 29 Oct 2004, Linus Torvalds wrote:

>
>
> On Fri, 29 Oct 2004, linux-os wrote:
>>
>> Linus, there is no way in hell that you are going to move
>> a value from memory into a register (pop ecx) faster than
>> you are going to do anything to the stack-pointer or
>> any other register.
>
> Sorry, but you're wrong.

I am not wrong.

I don't understand anything about your theoretical CPU
with the magic stack engine. Anything I can get my
hands on functions exactly as I described and exactly
as would be expected. We work with real hardware here
and I have to test it as part of my job.

And, FYI, I spend all my working time trying to get the
last iota of performance out of ix86 CPUS. Since I can
only read publicly available documentation, I have
to test code in actual operation.

The attached file shows that the Intel Pentium 4 runs
exactly as I described. Further, there is no difference in
the CPU clocks used when adding a constant to the stack-
pointer or using LEA.

It also shows that popping stack-data into the same register
twice, as you suggested, takes the same time as using a
different register.


Timer overhead = 88 CPU clocks
push 3, pop 3 = 12 CPU clocks
push 3, pop 2 = 12 CPU clocks
push 3, pop 1 = 12 CPU clocks
push 3, pop none using ADD = 8 CPU clocks
push 3, pop none using LEA = 8 CPU clocks
push 3, pop into same register = 12 CPU clocks

The code uses a separate assembly-language file so that
the 'C' compiler can't optimize-away what I am measuring.
It also saves and uses the shortest number of CPU cycles
so the code doesn't have to execute with the interrupts
OFF to get a stable reading.

>
> Learn about modern CPU's some day, and realize that cached accesses are
> fast, and pipeline stalls are relatively much more expensive.
>

That's what I do, and that's what I teach.

> Now, if it was uncached, you'd have a point.
>
> Also think about why
>
> 	call xxx
> 	jmp yy
>
> is often much faster than
>
> 	push $yy
> 	jmp xxx
>
> and other small interesting facts about how CPU's actually work these
> days.
>
> 		Linus
>

Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by John Ashcroft.
                  98.36% of all statistics are fiction.

[-- Attachment #2: Type: APPLICATION/x-gzip, Size: 6806 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 18:42                         ` Linus Torvalds
  2004-10-29 18:54                           ` Linus Torvalds
@ 2004-10-30  3:35                           ` Jeff Garzik
  1 sibling, 0 replies; 99+ messages in thread
From: Jeff Garzik @ 2004-10-30  3:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-os, Richard Henderson, Kernel Mailing List, Andi Kleen,
	Andrew Morton, Jan Hubicka

Linus Torvalds wrote:
> Anyway, it's quite likely that for several CPU's the fastest sequence ends 
> up actually being 
> 
> 	movl 4(%esp),%ecx
> 	movl 8(%esp),%edx
> 	movl 12(%esp),%eax
> 	addl $16,%esp
> 
> which is also one of the biggest alternatives.


That's how I'm coding the sparse "compiler backend"...  the mov's and 
add's tend to be tiny instructions (i-cache friendly), and you can often 
issue a bunch of them through multiple pipes/ports.

	Jeff



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 23:50                               ` dean gaudet
@ 2004-10-30  0:15                                 ` Linus Torvalds
  0 siblings, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-30  0:15 UTC (permalink / raw)
  To: dean gaudet
  Cc: Andreas Steinmetz, linux-os, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka



On Fri, 29 Oct 2004, dean gaudet wrote:
> 
> for p4 model 0 through 2 it was faster to avoid lea and shl and generate 
> code like:
> 
> 	add %ebx,%ebx
> 	add %ebx,%ebx
> 	add %ebx,%ebx
> 	add %ebx,%ebx

I think that is true only for the lea's that have a shifted input. The
weakness of the original P4 is its shifter, not lea itself. And for a
simple lea like 4(%esp), it's likely no worse than a regular "add", and
there lea has the advantage that you can put the result in another
register, which can be advantageous in other circumstances.

So lea actually _is_ useful for doing adds, in many cases. Of course, on
older CPU's you'll see the effect of the address generation adder being
one cycle "off" (earlier) the regular ALU execution unit, so lea often
causes AGI stalls.  I don't think this is an issue on the P6 or P4 because 
of how they actually end up implementing the lea in the regular ALU path. 

How the hell did we get to worrying about this in the first place?

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 19:40                             ` Andreas Steinmetz
  2004-10-29 19:56                               ` Linus Torvalds
@ 2004-10-29 23:50                               ` dean gaudet
  2004-10-30  0:15                                 ` Linus Torvalds
  1 sibling, 1 reply; 99+ messages in thread
From: dean gaudet @ 2004-10-29 23:50 UTC (permalink / raw)
  To: Andreas Steinmetz
  Cc: Linus Torvalds, linux-os, Kernel Mailing List, Richard Henderson,
	Andi Kleen, Andrew Morton, Jan Hubicka



On Fri, 29 Oct 2004, Andreas Steinmetz wrote:

> > On Fri, 29 Oct 2004, Andreas Steinmetz wrote:
> > > Sample quote from said manual (P/N 248966-05):
> > > "Use the lea instruction and the full range of addressing modes to do
> > > address calculation"
...
> Some more data from said manual (lea is better on P3 and the same as add on
> P4):

you really need to understand intel optimisation guides.  it helps to diff 
them over time to see the types of things that go in and out of fashion.

> I don't know about P4 internals but let me make some guess:
> There's lot of software around that needs to run on older processors where lea
> has quite some performance advantage. Thus I would guess that the P4 design
> respects this by handling lea x(esp),esp efficiently.

your guess is generally wrong... try measuring it.

for p4 model 0 through 2 it was faster to avoid lea and shl and generate 
code like:

	add %ebx,%ebx
	add %ebx,%ebx
	add %ebx,%ebx
	add %ebx,%ebx

which would complete in 2 cycles, compared to 4 cycles for lea or a shift.  
but that crap doesn't apply to any other x86 (except efficeon which 
notices this crud and converts it to its own optimal sequence).
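
For readers following the arithmetic: four register doublings multiply by 16, which is the same result as a single shift left by 4. The sketch below only shows that equivalence in C, with hypothetical function names; it says nothing about timing, and a compiler is free to emit the same instruction for both functions.

```c
#include <stdint.h>

/* Four doublings, mirroring the four "add %ebx,%ebx" instructions. */
uint32_t times16_by_adds(uint32_t x)
{
	x += x;
	x += x;
	x += x;
	x += x;
	return x;
}

/* The same result as a single shift, i.e. "shl $4,%ebx". */
uint32_t times16_by_shift(uint32_t x)
{
	return x << 4;
}
```

Both functions wrap modulo 2^32 in the same way, so they agree for every input, including values whose top bits are shifted out.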

p4 model 2 is probably way more common than p4 model 3 still.

you also need to be aware of k7/k8.  AMD has their own optimisation guide 
(i'm too lazy to find url/#).  but the important point for lea and AMD is 
that it is a 2 cycle latency operation, and add is 1 cycle.

but you know what?  we can talk about what the optimization guides say 
until we're blue... the only thing which matters is experience.  go 
measure it.  (i've measured a bazillion things like this.)

use pop, don't use lea to modify esp.

-dean

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 18:06                       ` Linus Torvalds
  2004-10-29 18:39                         ` linux-os
  2004-10-29 18:58                         ` Andreas Steinmetz
@ 2004-10-29 23:37                         ` dean gaudet
  2 siblings, 0 replies; 99+ messages in thread
From: dean gaudet @ 2004-10-29 23:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-os, Andreas Steinmetz, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka

On Fri, 29 Oct 2004, Linus Torvalds wrote:

> On Fri, 29 Oct 2004, linux-os wrote:
> > > with the following:
> > >
> > > leal 4(%esp),%esp
> > 
> > Probably so because I'm pretty certain that the 'pop' (a memory
> > access) is not going to be faster than a simple register operation.
> 
> Bzzt, wrong answer.
> 
> It's not "simple register operation". It's really about the fact that 
> modern CPU's are smarter - yet dumber - than you think. They do things 
> like speculate the value of %esp in order to avoid having to calculate it, 
> and it's entirely possible that "pop" is much faster, simply because I 
> guarantee you that a CPU will speculate %esp correctly across a "pop", but 
> the same is not necessarily true for "lea %esp".
> 
> Somebody should check what the Pentium M does. It might just notice that 
> "lea 4(%esp),%esp" is the same as "add 4 to esp", but it's entirely 
> possible that lea will confuse its stack engine logic and cause 
> stack-related address generation stalls..

it's worse than that in general -- lea typically goes through the AGU 
which has either less throughput or longer latency than the ALUs... 
depending on which x86en.  it's 4 cycles for a lea on p4, vs. 1 for a pop.  
it's 2 cycles for a lea on k8 vs. 1 for a pop.

use pop.

-dean

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 19:56                               ` Linus Torvalds
@ 2004-10-29 22:07                                 ` Jeff Garzik
  0 siblings, 0 replies; 99+ messages in thread
From: Jeff Garzik @ 2004-10-29 22:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Steinmetz, linux-os, Kernel Mailing List,
	Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka

Linus Torvalds wrote:
> 
> 	popl %eax
> 	popl %ecx
> 
> should take one cycle on a Pentium. I pretty much _guarantee_ that
> 
> 	lea 4(%esp),%esp
> 	popl %ecx
> 
> takes longer, since they have a data dependency on %esp that is hard to 
> break (the P4 trace-cache _may_ be able to break it, but the only CPU that 
> I think is likely to break it is actually the Transmeta CPU's, which did 
> that kind of thing by default and _will_ parallelise the two, and even 
> combine the stack offsetting into one single micro-op).


One of my favorite "optimizing for Pentium" docs is

	http://www.agner.org/assem/pentopt.pdf
		from
	http://www.agner.org/assem/

which is current through newer P4's AFAICS.

It notes on the P4 specifically that LEA is split into additions and 
shifts.  Not sure what it does on the P3, but I bet it generates more 
uops in addition to the data dependency.

	Jeff



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 19:20                     ` Linus Torvalds
  2004-10-29 19:26                       ` Linus Torvalds
@ 2004-10-29 21:03                       ` Linus Torvalds
  1 sibling, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 21:03 UTC (permalink / raw)
  To: linux-os
  Cc: Kernel Mailing List, Richard Henderson, Andi Kleen,
	Andrew Morton, Jan Hubicka



On Fri, 29 Oct 2004, Linus Torvalds wrote:
> 
> Here's a totally untested patch to make the semaphores use "fastcall" 
> instead of "asmlinkage"

Ok, I tested it, looked through the assembly code, and did a general size 
comparison. Everything looks good, and it should fix the problem that 
caused this discussion. Checked in.

The patch actually improves code generation by moving the failure case
argument generation _into_ the failure case: this makes the inline asm one
instruction longer, but it means that the fastpath is often one
instruction shorter. In fact, the fastpath is usually improved even _more_
than that, because gcc does sucketh at generating code that uses fixed
registers (ie the old code often caused gcc to first generate the value
into another register, and then _move_ it into %eax, rather than just
generating it into %eax in the first place).

My test-kernel shrunk by a whopping 2kB in size from this change.

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 19:40                             ` Andreas Steinmetz
@ 2004-10-29 19:56                               ` Linus Torvalds
  2004-10-29 22:07                                 ` Jeff Garzik
  2004-10-29 23:50                               ` dean gaudet
  1 sibling, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 19:56 UTC (permalink / raw)
  To: Andreas Steinmetz
  Cc: linux-os, Kernel Mailing List, Richard Henderson, Andi Kleen,
	Andrew Morton, Jan Hubicka



On Fri, 29 Oct 2004, Andreas Steinmetz wrote:
> 
> If you still believe in features I can't find any manufacturer 
> documentation for, well, you're Linus so it's your decision.

It's not that I'm Linus. It's that I am apparently better informed than
you are, and the numbers you are looking at are irrelevant. For example,
have you even _looked_ at the Pentium M stack engine documentation, which
is what this whole argument is all about?

And the documentation you look at is not relevant. For example, when you
look at the latency of "pop", who _cares_? That's the latency to use the
data, and has no meaning, since in this case we don't actually ever use
it. So what matters is other things entirely, like how well the 
instructions can run in parallel.

Try it. 

	popl %eax
	popl %ecx

should take one cycle on a Pentium. I pretty much _guarantee_ that

	lea 4(%esp),%esp
	popl %ecx

takes longer, since they have a data dependency on %esp that is hard to 
break (the P4 trace-cache _may_ be able to break it, but the only CPU that 
I think is likely to break it is actually the Transmeta CPU's, which did 
that kind of thing by default and _will_ parallelise the two, and even 
combine the stack offsetting into one single micro-op).

So my argument is that "popl" is smaller, and I doubt you can find a
machine where it's actually slower (most will take two cycles). And I am
pretty confident that I can find machines where it is faster (ie regular
Pentium).

Not that any of this matters, since there's a patch that makes all of this 
moot. If it works.

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 19:15                           ` Linus Torvalds
@ 2004-10-29 19:40                             ` Andreas Steinmetz
  2004-10-29 19:56                               ` Linus Torvalds
  2004-10-29 23:50                               ` dean gaudet
  0 siblings, 2 replies; 99+ messages in thread
From: Andreas Steinmetz @ 2004-10-29 19:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-os, Kernel Mailing List, Richard Henderson, Andi Kleen,
	Andrew Morton, Jan Hubicka

Linus Torvalds wrote:
> 
> On Fri, 29 Oct 2004, Andreas Steinmetz wrote:
> 
> 
>>Linus Torvalds wrote:
>>
>>>Somebody should check what the Pentium M does. It might just notice that 
>>>"lea 4(%esp),%esp" is the same as "add 4 to esp", but it's entirely 
>>>possible that lea will confuse its stack engine logic and cause 
>>>stack-related address generation stalls..
>>
>>Now especially Intel tells everybody in their Pentium Optimization 
>>manuals to *use* lea wherever possible as this operation doesn't depend 
>>on the ALU and is processed in other parts of the CPU.
>>
>>Sample quote from said manual (P/N 248966-05):
>>"Use the lea instruction and the full range of addressing modes to do 
>>address calculation"
> 
> 
> Does it say this about %esp?
> 
> The stack pointer is SPECIAL, guys. It's special exactly because there is
> potentially extra hardware in CPU's that track its value _independently_
> of the actual physical register.

It doesn't say anything about esp being specially treated by the 
underlying hardware as far as I can see. Thus either you know details 
about the cpu that are not publicly available or you're speculating about 
undocumented features.

Some more data from said manual (lea is better on P3 and the same as add 
on P4):

Instruction	Latency		Execution Unit
ADD/SUB:	0.5		ALU
POP		1.5		MEM_LOAD,ALU

Now, a P4 had two ALUs (Ports 0 and 1) but only one MEM_LOAD Unit (Port 
2). So after all you will be stalled more likely by an additional pop 
instruction than by lea/add. I don't know about P4 internals but let me 
make a guess: There's a lot of software around that needs to run on 
older processors where lea has quite some performance advantage. Thus I 
would guess that the P4 design respects this by handling lea x(esp),esp 
efficiently.

If you still believe in features I can't find any manufacturer 
documentation for, well, you're Linus so it's your decision.
-- 
Andreas Steinmetz                       SPAMmers use robotrap@domdv.de

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 19:20                     ` Linus Torvalds
@ 2004-10-29 19:26                       ` Linus Torvalds
  2004-10-29 21:03                       ` Linus Torvalds
  1 sibling, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 19:26 UTC (permalink / raw)
  To: linux-os
  Cc: Kernel Mailing List, Richard Henderson, Andi Kleen,
	Andrew Morton, Jan Hubicka



On Fri, 29 Oct 2004, Linus Torvalds wrote:
> 
> 
> Here's a totally untested patch to make the semaphores use "fastcall" 
> instead of "asmlinkage", and thus pass the argument in %eax instead of on 
> the stack. Does it work? I have no idea. If it does, it should fix the 
> particular bug that started this thread..

Oh, sorry, please remove this part, it was totally unintentional (I _told_ 
you this wasn't tested):

> --- 1.4/include/asm-i386/linkage.h	2004-10-16 18:24:37 -07:00
> +++ edited/include/asm-i386/linkage.h	2004-10-29 11:32:18 -07:00
> @@ -1,7 +1,7 @@
>  #ifndef __ASM_LINKAGE_H
>  #define __ASM_LINKAGE_H
>  
> -#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
> +#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(3)))
>  #define FASTCALL(x)	x __attribute__((regparm(3)))
>  #define fastcall	__attribute__((regparm(3)))
>  

We're not making all asmlinkage things fastcalls here, we're only doing 
the semaphores..

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 17:22                   ` linux-os
  2004-10-29 17:55                     ` Richard Henderson
@ 2004-10-29 19:20                     ` Linus Torvalds
  2004-10-29 19:26                       ` Linus Torvalds
  2004-10-29 21:03                       ` Linus Torvalds
  1 sibling, 2 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 19:20 UTC (permalink / raw)
  To: linux-os
  Cc: Kernel Mailing List, Richard Henderson, Andi Kleen,
	Andrew Morton, Jan Hubicka



Here's a totally untested patch to make the semaphores use "fastcall" 
instead of "asmlinkage", and thus pass the argument in %eax instead of on 
the stack. Does it work? I have no idea. If it does, it should fix the 
particular bug that started this thread..

			Linus

---
===== arch/i386/kernel/semaphore.c 1.10 vs edited =====
--- 1.10/arch/i386/kernel/semaphore.c	2004-04-12 10:53:59 -07:00
+++ edited/arch/i386/kernel/semaphore.c	2004-10-29 12:19:22 -07:00
@@ -49,12 +49,12 @@
  *    we cannot lose wakeup events.
  */
 
-asmlinkage void __up(struct semaphore *sem)
+fastcall void __up(struct semaphore *sem)
 {
 	wake_up(&sem->wait);
 }
 
-asmlinkage void __sched __down(struct semaphore * sem)
+fastcall void __sched __down(struct semaphore * sem)
 {
 	struct task_struct *tsk = current;
 	DECLARE_WAITQUEUE(wait, tsk);
@@ -91,7 +91,7 @@
 	tsk->state = TASK_RUNNING;
 }
 
-asmlinkage int __sched __down_interruptible(struct semaphore * sem)
+fastcall int __sched __down_interruptible(struct semaphore * sem)
 {
 	int retval = 0;
 	struct task_struct *tsk = current;
@@ -154,7 +154,7 @@
  * single "cmpxchg" without failure cases,
  * but then it wouldn't work on a 386.
  */
-asmlinkage int __down_trylock(struct semaphore * sem)
+fastcall int __down_trylock(struct semaphore * sem)
 {
 	int sleepers;
 	unsigned long flags;
@@ -183,9 +183,9 @@
  * need to convert that sequence back into the C sequence when
  * there is contention on the semaphore.
  *
- * %ecx contains the semaphore pointer on entry. Save the C-clobbered
- * registers (%eax, %edx and %ecx) except %eax when used as a return
- * value..
+ * %eax contains the semaphore pointer on entry. Save the C-clobbered
+ * registers (%eax, %edx and %ecx) except %eax which is either a return
+ * value or just clobbered..
  */
 asm(
 ".section .sched.text\n"
@@ -196,13 +196,11 @@
 	"pushl %ebp\n\t"
 	"movl  %esp,%ebp\n\t"
 #endif
-	"pushl %eax\n\t"
 	"pushl %edx\n\t"
 	"pushl %ecx\n\t"
 	"call __down\n\t"
 	"popl %ecx\n\t"
 	"popl %edx\n\t"
-	"popl %eax\n\t"
 #if defined(CONFIG_FRAME_POINTER)
 	"movl %ebp,%esp\n\t"
 	"popl %ebp\n\t"
@@ -257,13 +255,11 @@
 ".align 4\n"
 ".globl __up_wakeup\n"
 "__up_wakeup:\n\t"
-	"pushl %eax\n\t"
 	"pushl %edx\n\t"
 	"pushl %ecx\n\t"
 	"call __up\n\t"
 	"popl %ecx\n\t"
 	"popl %edx\n\t"
-	"popl %eax\n\t"
 	"ret"
 );
 
===== include/asm-i386/linkage.h 1.4 vs edited =====
--- 1.4/include/asm-i386/linkage.h	2004-10-16 18:24:37 -07:00
+++ edited/include/asm-i386/linkage.h	2004-10-29 11:32:18 -07:00
@@ -1,7 +1,7 @@
 #ifndef __ASM_LINKAGE_H
 #define __ASM_LINKAGE_H
 
-#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
+#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(3)))
 #define FASTCALL(x)	x __attribute__((regparm(3)))
 #define fastcall	__attribute__((regparm(3)))
 
===== include/asm-i386/semaphore.h 1.9 vs edited =====
--- 1.9/include/asm-i386/semaphore.h	2004-08-27 00:02:38 -07:00
+++ edited/include/asm-i386/semaphore.h	2004-10-29 12:06:48 -07:00
@@ -87,15 +87,15 @@
 	sema_init(sem, 0);
 }
 
-asmlinkage void __down_failed(void /* special register calling convention */);
-asmlinkage int  __down_failed_interruptible(void  /* params in registers */);
-asmlinkage int  __down_failed_trylock(void  /* params in registers */);
-asmlinkage void __up_wakeup(void /* special register calling convention */);
-
-asmlinkage void __down(struct semaphore * sem);
-asmlinkage int  __down_interruptible(struct semaphore * sem);
-asmlinkage int  __down_trylock(struct semaphore * sem);
-asmlinkage void __up(struct semaphore * sem);
+fastcall void __down_failed(void /* special register calling convention */);
+fastcall int  __down_failed_interruptible(void  /* params in registers */);
+fastcall int  __down_failed_trylock(void  /* params in registers */);
+fastcall void __up_wakeup(void /* special register calling convention */);
+
+fastcall void __down(struct semaphore * sem);
+fastcall int  __down_interruptible(struct semaphore * sem);
+fastcall int  __down_trylock(struct semaphore * sem);
+fastcall void __up(struct semaphore * sem);
 
 /*
  * This is ugly, but we want the default case to fall through.
@@ -111,12 +111,13 @@
 		"js 2f\n"
 		"1:\n"
 		LOCK_SECTION_START("")
-		"2:\tcall __down_failed\n\t"
+		"2:\tlea %0,%%eax\n\t"
+		"call __down_failed\n\t"
 		"jmp 1b\n"
 		LOCK_SECTION_END
 		:"=m" (sem->count)
-		:"c" (sem)
-		:"memory");
+		:
+		:"memory","ax");
 }
 
 /*
@@ -135,11 +136,12 @@
 		"xorl %0,%0\n"
 		"1:\n"
 		LOCK_SECTION_START("")
-		"2:\tcall __down_failed_interruptible\n\t"
+		"2:\tlea %1,%%eax\n\t"
+		"call __down_failed_interruptible\n\t"
 		"jmp 1b\n"
 		LOCK_SECTION_END
 		:"=a" (result), "=m" (sem->count)
-		:"c" (sem)
+		:
 		:"memory");
 	return result;
 }
@@ -159,11 +161,12 @@
 		"xorl %0,%0\n"
 		"1:\n"
 		LOCK_SECTION_START("")
-		"2:\tcall __down_failed_trylock\n\t"
+		"2:\tlea %1,%%eax\n\t"
+		"call __down_failed_trylock\n\t"
 		"jmp 1b\n"
 		LOCK_SECTION_END
 		:"=a" (result), "=m" (sem->count)
-		:"c" (sem)
+		:
 		:"memory");
 	return result;
 }
@@ -182,13 +185,14 @@
 		"jle 2f\n"
 		"1:\n"
 		LOCK_SECTION_START("")
-		"2:\tcall __up_wakeup\n\t"
+		"2:\tlea %0,%%eax\n\t"
+		"call __up_wakeup\n\t"
 		"jmp 1b\n"
 		LOCK_SECTION_END
 		".subsection 0\n"
 		:"=m" (sem->count)
-		:"c" (sem)
-		:"memory");
+		:
+		:"memory","ax");
 }
 
 #endif
===== include/linux/spinlock.h 1.32 vs edited =====
--- 1.32/include/linux/spinlock.h	2004-10-24 16:24:20 -07:00
+++ edited/include/linux/spinlock.h	2004-10-29 12:08:14 -07:00
@@ -27,7 +27,7 @@
         extra                                   \
         ".ifndef " LOCK_SECTION_NAME "\n\t"     \
         LOCK_SECTION_NAME ":\n\t"               \
-        ".endif\n\t"
+        ".endif\n"
 
 #define LOCK_SECTION_END                        \
         ".previous\n\t"

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 18:58                         ` Andreas Steinmetz
@ 2004-10-29 19:15                           ` Linus Torvalds
  2004-10-29 19:40                             ` Andreas Steinmetz
  0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 19:15 UTC (permalink / raw)
  To: Andreas Steinmetz
  Cc: linux-os, Kernel Mailing List, Richard Henderson, Andi Kleen,
	Andrew Morton, Jan Hubicka



On Fri, 29 Oct 2004, Andreas Steinmetz wrote:

> Linus Torvalds wrote:
> > Somebody should check what the Pentium M does. It might just notice that 
> > "lea 4(%esp),%esp" is the same as "add 4 to esp", but it's entirely 
> > possible that lea will confuse its stack engine logic and cause 
> > stack-related address generation stalls..
> 
> Now especially Intel tells everybody in their Pentium Optimization 
> manuals to *use* lea wherever possible as this operation doesn't depend 
> on the ALU and is processed in other parts of the CPU.
> 
> Sample quote from said manual (P/N 248966-05):
> "Use the lea instruction and the full range of addressing modes to do 
> address calculation"

Does it say this about %esp?

The stack pointer is SPECIAL, guys. It's special exactly because there is
potentially extra hardware in CPU's that track its value _independently_
of the actual physical register.

Just for fun, google for 'x86 "stack engine"', and you'll hit for example 
http://arstechnica.com/articles/paedia/cpu/pentium-m.ars/5 which talks 
about this and perhaps explains it in ways that I apparently haven't been 
able to.

			Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 18:39                         ` linux-os
@ 2004-10-29 19:12                           ` Linus Torvalds
  2004-11-01  1:31                             ` linux-os
  0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 19:12 UTC (permalink / raw)
  To: linux-os
  Cc: Andreas Steinmetz, Kernel Mailing List, Richard Henderson,
	Andi Kleen, Andrew Morton, Jan Hubicka



On Fri, 29 Oct 2004, linux-os wrote:
> 
> Linus, there is no way in hell that you are going to move
> a value from memory into a register (pop ecx) faster than
> you are going to do anything to the stack-pointer or
> any other register.

Sorry, but you're wrong.

Learn about modern CPU's some day, and realize that cached accesses are 
fast, and pipeline stalls are relatively much more expensive.

Now, if it was uncached, you'd have a point.

Also think about why

	call xxx
	jmp yy

is often much faster than

	push $yy
	jmp xxx

and other small interesting facts about how CPU's actually work these 
days.

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 18:06                       ` Linus Torvalds
  2004-10-29 18:39                         ` linux-os
@ 2004-10-29 18:58                         ` Andreas Steinmetz
  2004-10-29 19:15                           ` Linus Torvalds
  2004-10-29 23:37                         ` dean gaudet
  2 siblings, 1 reply; 99+ messages in thread
From: Andreas Steinmetz @ 2004-10-29 18:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-os, Kernel Mailing List, Richard Henderson, Andi Kleen,
	Andrew Morton, Jan Hubicka

Linus Torvalds wrote:
> Somebody should check what the Pentium M does. It might just notice that 
> "lea 4(%esp),%esp" is the same as "add 4 to esp", but it's entirely 
> possible that lea will confuse its stack engine logic and cause 
> stack-related address generation stalls..

Now especially Intel tells everybody in their Pentium Optimization 
manuals to *use* lea wherever possible as this operation doesn't depend 
on the ALU and is processed in other parts of the CPU.

Sample quote from said manual (P/N 248966-05):
"Use the lea instruction and the full range of addressing modes to do 
address calculation"

I would guess Intel would add caveats about such stalls in this manual 
if there would be any.
-- 
Andreas Steinmetz                       SPAMmers use robotrap@domdv.de

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 18:42                         ` Linus Torvalds
@ 2004-10-29 18:54                           ` Linus Torvalds
  2004-10-30  3:35                           ` Jeff Garzik
  1 sibling, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 18:54 UTC (permalink / raw)
  To: linux-os
  Cc: Richard Henderson, Kernel Mailing List, Andi Kleen,
	Andrew Morton, Jan Hubicka



On Fri, 29 Oct 2004, Linus Torvalds wrote:
> 
> Anyway, making "asmlinkage" imply "regparm(3)" would make the whole 
> discussion moot, so I'm wondering if anybody has the patches to try it 
> out? It requires pretty big changes to all the x86 asm code, but I do know 
> that people _had_ patches like that at least long ago (from when people 
> like Jan were playing with -mregparm=3 originally). Maybe some of them 
> still exist..

Looking at just doing this for the semaphore code, I hit on the fact that
we already do this for the rwsem's.. So changing just the regular
semaphore code to use "fastcall" should fix this particular bug, but I'm
still interested in hearing whether somebody has a patch for the system
calls and faults too?

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 18:17                       ` linux-os
@ 2004-10-29 18:42                         ` Linus Torvalds
  2004-10-29 18:54                           ` Linus Torvalds
  2004-10-30  3:35                           ` Jeff Garzik
  0 siblings, 2 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 18:42 UTC (permalink / raw)
  To: linux-os
  Cc: Richard Henderson, Kernel Mailing List, Andi Kleen,
	Andrew Morton, Jan Hubicka



On Fri, 29 Oct 2004, linux-os wrote:

> On Fri, 29 Oct 2004, Richard Henderson wrote:
> >
> > Also not necessarily correct.  Intel cpus special-case pop
> > instructions; two pops can be dual issued, whereas a different
> > kind of stack pointer manipulation will not.
> >
> 
> Then I guess the Intel documentation is incorrect, too.

Where?

It definitely depends on the CPU. Some CPU's dual-issue pops, some don't.

I think the Pentium can dual-issue, while the PPro/P4 does not. And AMD
has some other rules, and I think older ones dual-issue stack accesses
only if esp doesn't change. Haven't looked at K8 rules.

And Pentium M is to some degree more interesting than P4 and Ppro, because
it's apparently the architecture Intel is going forward with for the
future of x86, and it is an "improved PPro" core that has a special stack
engine, iirc.

Anyway, it's quite likely that for several CPU's the fastest sequence ends 
up actually being 

	movl 4(%esp),%ecx
	movl 8(%esp),%edx
	movl 12(%esp),%eax
	addl $16,%esp

which is also one of the biggest alternatives.

Anyway, making "asmlinkage" imply "regparm(3)" would make the whole 
discussion moot, so I'm wondering if anybody has the patches to try it 
out? It requires pretty big changes to all the x86 asm code, but I do know 
that people _had_ patches like that at least long ago (from when people 
like Jan were playing with -mregparm=3 originally). Maybe some of them 
still exist..

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 18:06                       ` Linus Torvalds
@ 2004-10-29 18:39                         ` linux-os
  2004-10-29 19:12                           ` Linus Torvalds
  2004-10-29 18:58                         ` Andreas Steinmetz
  2004-10-29 23:37                         ` dean gaudet
  2 siblings, 1 reply; 99+ messages in thread
From: linux-os @ 2004-10-29 18:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Steinmetz, Kernel Mailing List, Richard Henderson,
	Andi Kleen, Andrew Morton, Jan Hubicka

On Fri, 29 Oct 2004, Linus Torvalds wrote:

>
>
> On Fri, 29 Oct 2004, linux-os wrote:
>>> with the following:
>>>
>>> leal 4(%esp),%esp
>>
>> Probably so because I'm pretty certain that the 'pop' (a memory
>> access) is not going to be faster than a simple register operation.
>
> Bzzt, wrong answer.
>
> It's not "simple register operation". It's really about the fact that
> modern CPU's are smarter - yet dumber - than you think. They do things
> like speculate the value of %esp in order to avoid having to calculate it,
> and it's entirely possible that "pop" is much faster, simply because I
> guarantee you that a CPU will speculate %esp correctly across a "pop", but
> the same is not necessarily true for "lea %esp".
>
> Somebody should check what the Pentium M does. It might just notice that
> "lea 4(%esp),%esp" is the same as "add 4 to esp", but it's entirely
> possible that lea will confuse its stack engine logic and cause
> stack-related address generation stalls..
>
> 		Linus


Linus, there is no way in hell that you are going to move
a value from memory into a register (pop ecx) faster than
you are going to do anything to the stack-pointer or
any other register. The register operations operate
at the internal CPU clock-rate (GHz). The memory operations
operate at the front-side bus rate (MHz), and the data-
movement must actually occur before anything else can.
In other words, with stack operations, modern CPUs will
stall until the operation has completed.

Using the rdtsc, on this computer, both of the stack-pointer
additions (leal or add) take 6 +/- 2 clocks. The pop ecx
takes 12 +/- 3 clocks.

Things that should take only one clock, according to the
documentation, take 4 or 5 even when subtracting-out
the time necessary to do the rdtsc, because this machine
(and probably others) is very noisy, so all I can state
with certainty is that the pop from the stack takes longer.
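
A measurement of this kind could be sketched as below. This is not the
harness that produced the numbers above; the `sketch_*` names are
hypothetical, the cycle counter is read through the compiler builtin
`__builtin_ia32_rdtsc()` on x86 with `clock()` as an assumed fallback
elsewhere, and taking the minimum over many runs is one common way to
cope with a noisy machine:

```c
#include <stdint.h>
#include <time.h>

/* Read a tick counter: the TSC on x86, clock() as a fallback elsewhere. */
static uint64_t sketch_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
	return __builtin_ia32_rdtsc();
#else
	return (uint64_t)clock();
#endif
}

/* Empty body: timing it estimates the cost of the counter reads alone. */
static void sketch_empty(void)
{
}

/* Smallest tick delta seen over `runs` calls of fn(). */
static uint64_t sketch_min_ticks(void (*fn)(void), int runs)
{
	uint64_t best = UINT64_MAX;

	for (int i = 0; i < runs; i++) {
		uint64_t t0 = sketch_ticks();
		fn();
		uint64_t t1 = sketch_ticks();

		if (t1 - t0 < best)
			best = t1 - t0;
	}
	return best;
}
```

Subtracting `sketch_min_ticks(sketch_empty, n)` from the result for the
code under test gives a rough figure with the counter overhead removed.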


Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by John Ashcroft.
                  98.36% of all statistics are fiction.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 14:46                 ` Linus Torvalds
                                     ` (3 preceding siblings ...)
  2004-10-29 17:57                   ` Richard Henderson
@ 2004-10-29 18:37                   ` Gabriel Paubert
  4 siblings, 0 replies; 99+ messages in thread
From: Gabriel Paubert @ 2004-10-29 18:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-os, Kernel Mailing List, Richard Henderson, Andi Kleen,
	Andrew Morton, Jan Hubicka

On Fri, Oct 29, 2004 at 07:46:06AM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 29 Oct 2004, linux-os wrote:
> > 
> > Linus, please check this out.
> 
> Yes, I concur. However, I'd suggest changing the "addl $4,%esp" into a 
> "popl %ecx", which is smaller and apparently faster on some CPU's (ecx 
> obviously gets immediately overwritten by the next popl).

Rather popl %eax or popl %edx then, a basic and MMX Pentium 
cannot pair:

	popl %ecx
	popl %ecx

for the simple reason that two instructions that have the
same destination register can't be paired.
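
As a toy model of just that one rule (the real Pentium pairing rules have
more conditions; the types and names here are hypothetical and only for
illustration):

```c
#include <stdbool.h>

/* Hypothetical register tags, enough for the examples in this thread. */
enum sketch_reg { SK_EAX, SK_ECX, SK_EDX };

struct sketch_insn {
	const char *text;	/* assembly text, for readability only */
	enum sketch_reg dest;	/* register the instruction writes */
};

/* Models only the rule quoted above: same destination, no pairing. */
static bool sketch_same_dest_blocks_pairing(struct sketch_insn a,
					    struct sketch_insn b)
{
	return a.dest == b.dest;
}
```

Under this model "popl %ecx; popl %ecx" is blocked, while
"popl %eax; popl %ecx" is not.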

OTOH, the other arguments in this thread about reading memory or
not are a red herring. An additional memory read 
is cheap for data that is guaranteed to be in a cache line 
used by adjacent (in time) instructions.

Otherwise regparm(1) might even be better, movl %ecx,%eax is
the same size as push+pop, is faster, and may even reduce
stack usage by 4 bytes.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 18:18                     ` Linus Torvalds
@ 2004-10-29 18:35                       ` Richard Henderson
  0 siblings, 0 replies; 99+ messages in thread
From: Richard Henderson @ 2004-10-29 18:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, linux-os, Kernel Mailing List, Andrew Morton, Jan Hubicka

On Fri, Oct 29, 2004 at 11:18:33AM -0700, Linus Torvalds wrote:
> What happens if there are more arguments than three? It happens for 
> several system calls - does gcc still consider the stack part of the thing 
> to be owned by the callee?

Yes.


r~

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 15:11                   ` Andi Kleen
@ 2004-10-29 18:18                     ` Linus Torvalds
  2004-10-29 18:35                       ` Richard Henderson
  0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 18:18 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-os, Kernel Mailing List, Richard Henderson, Andrew Morton,
	Jan Hubicka



On Fri, 29 Oct 2004, Andi Kleen wrote:
>
> > Richard, Jan, Andi? Or does it already exist somewhere?
> 
> How about just using __attribute__((regparm(1)))  ?  Then the
> problem doesn't appear. 

Yes, we could use regparm for all assembly. Right now "asmlinkage" 
actually _disables_ regparm so that we always have the same calling 
convention for assembly regardless of whether the rest of the kernel is 
compiled with regparm or not, but we could certainly change that 

	#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))

to use "regparm(3)" instead. I guess it's stable these days, since we use 
it for FASTCALL() and friends too.
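
For illustration, the two conventions can be put side by side in a small
standalone file. The `sketch_*` macro names are hypothetical (they are not
the kernel's definitions), and since regparm is an i386-only GCC attribute
the macros are assumed to expand to nothing on other targets:

```c
/* regparm only exists on i386; elsewhere these macros expand to nothing. */
#if defined(__i386__)
#define sketch_fastcall   __attribute__((regparm(3)))
#define sketch_asmlinkage __attribute__((regparm(0)))
#else
#define sketch_fastcall
#define sketch_asmlinkage
#endif

/* On i386 the three arguments arrive in %eax, %edx and %ecx. */
sketch_fastcall int sketch_add3_regs(int a, int b, int c)
{
	return a + b + c;
}

/* On i386 all three arguments arrive on the stack. */
sketch_asmlinkage int sketch_add3_stack(int a, int b, int c)
{
	return a + b + c;
}
```

Both functions return the same value; only the place where the caller
puts the arguments differs, which is what hand-written assembly callers
have to agree with.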

> Would be faster too. It should work reliably on all supported compilers.

What happens if there are more arguments than three? It happens for 
several system calls - does gcc still consider the stack part of the thing 
to be owned by the callee?

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 17:55                     ` Richard Henderson
@ 2004-10-29 18:17                       ` linux-os
  2004-10-29 18:42                         ` Linus Torvalds
  0 siblings, 1 reply; 99+ messages in thread
From: linux-os @ 2004-10-29 18:17 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Linus Torvalds, Kernel Mailing List, Andi Kleen, Andrew Morton,
	Jan Hubicka

On Fri, 29 Oct 2004, Richard Henderson wrote:

> On Fri, Oct 29, 2004 at 01:22:52PM -0400, linux-os wrote:
>> Here's a version that uses `leal 4(esp), esp` to add
>> 4 to the stack-pointer. Since this 'address-calculation'
>> is done in a different portion of Intel CPUs....
>
> Incorrect, at least i686 and beyond.  These interpret to the
> same micro-ops.
>
>> The 'pop ecx' would access memory and, therefore be slower than
>> simple register operations.
>
> Also not necessarily correct.  Intel cpus special-case pop
> instructions; two pops can be dual issued, whereas a different
> kind of stack pointer manipulation will not.
>

Then I guess the Intel documentation is incorrect, too.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by John Ashcroft.
                  98.36% of all statistics are fiction.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 17:08                     ` linux-os
@ 2004-10-29 18:06                       ` Linus Torvalds
  2004-10-29 18:39                         ` linux-os
                                           ` (2 more replies)
  0 siblings, 3 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 18:06 UTC (permalink / raw)
  To: linux-os
  Cc: Andreas Steinmetz, Kernel Mailing List, Richard Henderson,
	Andi Kleen, Andrew Morton, Jan Hubicka



On Fri, 29 Oct 2004, linux-os wrote:
> > with the following:
> >
> > leal 4(%esp),%esp
> 
> Probably so because I'm pretty certain that the 'pop' (a memory
> access) is not going to be faster than a simple register operation.

Bzzt, wrong answer.

It's not "simple register operation". It's really about the fact that 
modern CPU's are smarter - yet dumber - than you think. They do things 
like speculate the value of %esp in order to avoid having to calculate it, 
and it's entirely possible that "pop" is much faster, simply because I 
guarantee you that a CPU will speculate %esp correctly across a "pop", but 
the same is not necessarily true for "lea %esp".

Somebody should check what the Pentium M does. It might just notice that 
"lea 4(%esp),%esp" is the same as "add 4 to esp", but it's entirely 
possible that lea will confuse its stack engine logic and cause 
stack-related address generation stalls..

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 14:46                 ` Linus Torvalds
                                     ` (2 preceding siblings ...)
  2004-10-29 17:22                   ` linux-os
@ 2004-10-29 17:57                   ` Richard Henderson
  2004-10-29 18:37                   ` Gabriel Paubert
  4 siblings, 0 replies; 99+ messages in thread
From: Richard Henderson @ 2004-10-29 17:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-os, Kernel Mailing List, Andi Kleen, Andrew Morton, Jan Hubicka

On Fri, Oct 29, 2004 at 07:46:06AM -0700, Linus Torvalds wrote:
> Btw, this is another case where we _really_ want "asmlinkage" to mean that
> the compiler does not own the argument stack. Is there any chance of
> getting a function attribute like that into future versions of gcc?  

Certainly we'd accept the feature, it's just a matter of 
doing the work.  


r~

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 17:22                   ` linux-os
@ 2004-10-29 17:55                     ` Richard Henderson
  2004-10-29 18:17                       ` linux-os
  2004-10-29 19:20                     ` Linus Torvalds
  1 sibling, 1 reply; 99+ messages in thread
From: Richard Henderson @ 2004-10-29 17:55 UTC (permalink / raw)
  To: linux-os
  Cc: Linus Torvalds, Kernel Mailing List, Andi Kleen, Andrew Morton,
	Jan Hubicka

On Fri, Oct 29, 2004 at 01:22:52PM -0400, linux-os wrote:
> Here's a version that uses `leal 4(esp), esp` to add
> 4 to the stack-pointer. Since this 'address-calculation'
> is done in a different portion of Intel CPUs....

Incorrect, at least i686 and beyond.  These interpret to the
same micro-ops.

> The 'pop ecx' would access memory and, therefore be slower than
> simple register operations.

Also not necessarily correct.  Intel cpus special-case pop
instructions; two pops can be dual issued, whereas a different
kind of stack pointer manipulation will not.


r~

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 14:46                 ` Linus Torvalds
  2004-10-29 15:11                   ` Andi Kleen
  2004-10-29 16:06                   ` Andreas Steinmetz
@ 2004-10-29 17:22                   ` linux-os
  2004-10-29 17:55                     ` Richard Henderson
  2004-10-29 19:20                     ` Linus Torvalds
  2004-10-29 17:57                   ` Richard Henderson
  2004-10-29 18:37                   ` Gabriel Paubert
  4 siblings, 2 replies; 99+ messages in thread
From: linux-os @ 2004-10-29 17:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kernel Mailing List, Richard Henderson, Andi Kleen,
	Andrew Morton, Jan Hubicka

On Fri, 29 Oct 2004, Linus Torvalds wrote:

>
>
> On Fri, 29 Oct 2004, linux-os wrote:
>>
>> Linus, please check this out.
>
> Yes, I concur. However, I'd suggest changing the "addl $4,%esp" into a
> "popl %ecx", which is smaller and apparently faster on some CPU's (ecx
> obviously gets immediately overwritten by the next popl).
>
> Btw, this is another case where we _really_ want "asmlinkage" to mean that
> the compiler does not own the argument stack. Is there any chance of
> getting a function attribute like that into future versions of gcc?
> Richard, Jan, Andi? Or does it already exist somewhere?
>
> 		Linus
>

Here's a version that uses `leal 4(esp), esp` to add
4 to the stack-pointer. Since this 'address-calculation'
is done in a different portion of Intel CPUs, there
is some parallel operation that can occur. The 'pop ecx'
would access memory and therefore be slower than
simple register operations.

FYI I'm running a kernel with this patch already.


--- linux-2.6.9/arch/i386/kernel/semaphore.c.orig	2004-10-29 13:00:17.961579368 -0400
+++ linux-2.6.9/arch/i386/kernel/semaphore.c	2004-10-29 13:03:35.046617888 -0400
@@ -198,9 +198,11 @@
  #endif
  	"pushl %eax\n\t"
  	"pushl %edx\n\t"
-	"pushl %ecx\n\t"
+	"pushl %ecx\n\t"		// Register to save
+	"pushl %ecx\n\t"		// Passed parameter
  	"call __down\n\t"
-	"popl %ecx\n\t"
+	"leal 0x04(%esp), %esp\t\n"	// Bypass corrupted parameter
+	"popl %ecx\n\t"			// Restore original
  	"popl %edx\n\t"
  	"popl %eax\n\t"
  #if defined(CONFIG_FRAME_POINTER)
@@ -220,9 +222,11 @@
  	"movl  %esp,%ebp\n\t"
  #endif
  	"pushl %edx\n\t"
-	"pushl %ecx\n\t"
+	"pushl %ecx\n\t"		// Save register
+	"pushl %ecx\n\t"		// Passed parameter
  	"call __down_interruptible\n\t"
-	"popl %ecx\n\t"
+	"leal 0x04(%esp), %esp\n\t"	// Bypass corrupted parameter
+	"popl %ecx\n\t"			// Restore register
  	"popl %edx\n\t"
  #if defined(CONFIG_FRAME_POINTER)
  	"movl %ebp,%esp\n\t"
@@ -241,9 +245,11 @@
  	"movl  %esp,%ebp\n\t"
  #endif
  	"pushl %edx\n\t"
-	"pushl %ecx\n\t"
+	"pushl %ecx\n\t"		// Save register
+	"pushl %ecx\n\t"		// Passed parameter
  	"call __down_trylock\n\t"
-	"popl %ecx\n\t"
+	"leal 0x04(%esp), %esp\n\t"	// Bypass corrupted parameter
+	"popl %ecx\n\t"			// Restore register
  	"popl %edx\n\t"
  #if defined(CONFIG_FRAME_POINTER)
  	"movl %ebp,%esp\n\t"
@@ -259,9 +265,11 @@
  "__up_wakeup:\n\t"
  	"pushl %eax\n\t"
  	"pushl %edx\n\t"
-	"pushl %ecx\n\t"
+	"pushl %ecx\n\t"		// Save register
+	"pushl %ecx\n\t"		// Passed parameter
  	"call __up\n\t"
-	"popl %ecx\n\t"
+	"leal 0x04(%esp), %esp\n\t"	// Bypass corrupted parameter
+	"popl %ecx\n\t"			// Restore register
  	"popl %edx\n\t"
  	"popl %eax\n\t"
  	"ret"



Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by John Ashcroft.
                  98.36% of all statistics are fiction.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 16:06                   ` Andreas Steinmetz
@ 2004-10-29 17:08                     ` linux-os
  2004-10-29 18:06                       ` Linus Torvalds
  0 siblings, 1 reply; 99+ messages in thread
From: linux-os @ 2004-10-29 17:08 UTC (permalink / raw)
  To: Andreas Steinmetz
  Cc: Linus Torvalds, Kernel Mailing List, Richard Henderson,
	Andi Kleen, Andrew Morton, Jan Hubicka

On Fri, 29 Oct 2004, Andreas Steinmetz wrote:

> Linus Torvalds wrote:
>> 
>> On Fri, 29 Oct 2004, linux-os wrote:
>> 
>>> Linus, please check this out.
>> 
>> 
>> Yes, I concur. However, I'd suggest changing the "addl $4,%esp" into a 
>> "popl %ecx", which is smaller and apparently faster on some CPU's (ecx 
>> obviously gets immediately overwritten by the next popl).
>
> Hmm, I didn't check the instruction length but modern CPUs usually work best 
> with the following:
>
> leal 4(%esp),%esp
>
> -- 
> Andreas Steinmetz                       SPAMmers use robotrap@domdv.de
>

Probably so because I'm pretty certain that the 'pop' (a memory
access) is not going to be faster than a simple register operation.

I'll make another patch and post it (if the machine will boot!)

Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by John Ashcroft.
                  98.36% of all statistics are fiction.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 14:46                 ` Linus Torvalds
  2004-10-29 15:11                   ` Andi Kleen
@ 2004-10-29 16:06                   ` Andreas Steinmetz
  2004-10-29 17:08                     ` linux-os
  2004-10-29 17:22                   ` linux-os
                                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 99+ messages in thread
From: Andreas Steinmetz @ 2004-10-29 16:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-os, Kernel Mailing List, Richard Henderson, Andi Kleen,
	Andrew Morton, Jan Hubicka

Linus Torvalds wrote:
> 
> On Fri, 29 Oct 2004, linux-os wrote:
> 
>>Linus, please check this out.
> 
> 
> Yes, I concur. However, I'd suggest changing the "addl $4,%esp" into a 
> "popl %ecx", which is smaller and apparently faster on some CPU's (ecx 
> obviously gets immediately overwritten by the next popl).

Hmm, I didn't check the instruction length but modern CPUs usually work 
best with the following:

leal 4(%esp),%esp

-- 
Andreas Steinmetz                       SPAMmers use robotrap@domdv.de

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 14:46                 ` Linus Torvalds
@ 2004-10-29 15:11                   ` Andi Kleen
  2004-10-29 18:18                     ` Linus Torvalds
  2004-10-29 16:06                   ` Andreas Steinmetz
                                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 99+ messages in thread
From: Andi Kleen @ 2004-10-29 15:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-os, Kernel Mailing List, Richard Henderson, Andrew Morton,
	Jan Hubicka

> Btw, this is another case where we _really_ want "asmlinkage" to mean that
> the compiler does not own the argument stack. Is there any chance of
> getting a function attribute like that into future versions of gcc?  
> Richard, Jan, Andi? Or does it already exist somewhere?

How about just using __attribute__((regparm(1)))  ?  Then the
problem doesn't appear. 
Would be faster too. It should work reliably on all supported compilers.
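
A minimal sketch of what that could look like, assuming GCC on i386.
The FASTARG macro, the cut-down struct semaphore and sem_count() are
hypothetical stand-ins, not the kernel's definitions. With regparm(1)
GCC passes the first integer argument in %eax instead of on the stack,
so there is no argument slot for the callee to overwrite; the assembly
wrappers would have to load %eax rather than push %ecx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct semaphore. */
struct semaphore {
    int count;
};

/* regparm is an i386-only GCC attribute; fall back to nothing elsewhere. */
#if defined(__i386__)
#define FASTARG __attribute__((regparm(1)))
#else
#define FASTARG
#endif

/* On i386 'sem' arrives in %eax, not in a stack slot owned by the callee. */
static FASTARG int sem_count(struct semaphore *sem)
{
    assert(sem != NULL);
    return sem->count;
}
```

On other architectures the macro expands to nothing, so the sketch
still compiles but no longer demonstrates the register-passing effect.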

-Andi


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Semaphore assembly-code bug
  2004-10-29 12:12               ` Semaphore assembly-code bug linux-os
@ 2004-10-29 14:46                 ` Linus Torvalds
  2004-10-29 15:11                   ` Andi Kleen
                                     ` (4 more replies)
  0 siblings, 5 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 14:46 UTC (permalink / raw)
  To: linux-os
  Cc: Kernel Mailing List, Richard Henderson, Andi Kleen,
	Andrew Morton, Jan Hubicka



On Fri, 29 Oct 2004, linux-os wrote:
> 
> Linus, please check this out.

Yes, I concur. However, I'd suggest changing the "addl $4,%esp" into a 
"popl %ecx", which is smaller and apparently faster on some CPU's (ecx 
obviously gets immediately overwritten by the next popl).
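
To put a number on "smaller", here is a small sketch listing the two
instructions as byte arrays. The byte values are the usual IA-32
encodings as I understand them (worth confirming with objdump on a real
build); the array and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* addl $4,%esp : opcode 0x83, ModRM 0xC4 (/0, %esp), 8-bit immediate */
static const unsigned char add_esp_4[] = { 0x83, 0xC4, 0x04 };

/* popl %ecx : single-byte opcode 0x58 + register number (ecx = 1) */
static const unsigned char pop_ecx[] = { 0x59 };

/* Code bytes saved per call site by using popl instead of addl. */
static size_t bytes_saved_by_pop(void)
{
    assert(sizeof pop_ecx <= sizeof add_esp_4);
    return sizeof add_esp_4 - sizeof pop_ecx;
}
```

This only measures code size; it says nothing about which form executes
faster on a given CPU.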

Btw, this is another case where we _really_ want "asmlinkage" to mean that
the compiler does not own the argument stack. Is there any chance of
getting a function attribute like that into future versions of gcc?  
Richard, Jan, Andi? Or does it already exist somewhere?

		Linus

--- saved for gcc people commentary ---
>
> asmlinkage void __up(struct semaphore *sem)
> {
>      wake_up(&sem->wait);
> }
> 
> This was from /usr/src/linux-2.6.9/arch/i386/kernel/semaphore.c
> In this case, the value of 'sem' is destroyed which means that
> certain assembly-language helper functions no longer work.
> 
> This was discovered by Aleksey Gorelov <Aleksey_Gorelov@Phoenix.com>.
> 
> This patch fixes it, but I think somebody may need to rework
> the semaphore code to eliminate the assembly because the newer
> compilers are not consistent in their handling of passed parameters
> so some assembly optimization may no longer be possible.
> 
> 
> --- linux-2.6.9/arch/i386/kernel/semaphore.c.orig	2004-08-14 01:36:56.000000000 -0400
> +++ linux-2.6.9/arch/i386/kernel/semaphore.c	2004-10-19 08:06:15.000000000 -0400
> @@ -198,9 +198,11 @@
>   #endif
>   	"pushl %eax\n\t"
>   	"pushl %edx\n\t"
> -	"pushl %ecx\n\t"
> +	"pushl %ecx\n\t"	// Register to save
> +	"pushl %ecx\n\t"	// Passed parameter
>   	"call __down\n\t"
> -	"popl %ecx\n\t"
> +	"addl $0x04, %esp\t\n"	// Bypass corrupted parameter
> +	"popl %ecx\n\t"		// Restore original
>   	"popl %edx\n\t"
>   	"popl %eax\n\t"
>   #if defined(CONFIG_FRAME_POINTER)
> @@ -220,9 +222,11 @@
>   	"movl  %esp,%ebp\n\t"
>   #endif
>   	"pushl %edx\n\t"
> -	"pushl %ecx\n\t"
> +	"pushl %ecx\n\t"	// Save register
> +	"pushl %ecx\n\t"	// Passed parameter
>   	"call __down_interruptible\n\t"
> -	"popl %ecx\n\t"
> +	"addl $0x04, %esp\n\t"	// Bypass corrupted parameter
> +	"popl %ecx\n\t"		// Restore register
>   	"popl %edx\n\t"
>   #if defined(CONFIG_FRAME_POINTER)
>   	"movl %ebp,%esp\n\t"
> @@ -241,9 +245,11 @@
>   	"movl  %esp,%ebp\n\t"
>   #endif
>   	"pushl %edx\n\t"
> -	"pushl %ecx\n\t"
> +	"pushl %ecx\n\t"		// Save register
> +	"pushl %ecx\n\t"		// Passed parameter
>   	"call __down_trylock\n\t"
> -	"popl %ecx\n\t"
> +	"addl $0x04, %esp\n\t"		// Bypass corrupted parameter
> +	"popl %ecx\n\t"			// Restore register
>   	"popl %edx\n\t"
>   #if defined(CONFIG_FRAME_POINTER)
>   	"movl %ebp,%esp\n\t"
> @@ -259,9 +265,11 @@
>   "__up_wakeup:\n\t"
>   	"pushl %eax\n\t"
>   	"pushl %edx\n\t"
> -	"pushl %ecx\n\t"
> +	"pushl %ecx\n\t"	// Save register
> +	"pushl %ecx\n\t"	// Passed parameter
>   	"call __up\n\t"
> -	"popl %ecx\n\t"
> +	"addl $0x04, %esp\n\t"	// Bypass corrupted parameter
> +	"popl %ecx\n\t"		// Restore register
>   	"popl %edx\n\t"
>   	"popl %eax\n\t"
>   	"ret"
> 
> 
> Cheers,
> Dick Johnson
> Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
>   Notice : All mail here is now cached for review by John Ashcroft.
>                   98.36% of all statistics are fiction.
> 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Semaphore assembly-code bug
  2004-10-20 11:49             ` Richard B. Johnson
@ 2004-10-29 12:12               ` linux-os
  2004-10-29 14:46                 ` Linus Torvalds
  0 siblings, 1 reply; 99+ messages in thread
From: linux-os @ 2004-10-29 12:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List


Linus, please check this out.

This 'C' compiler destroys parameters passed to functions
even though the code does not alter that parameter.

gcc (GCC) 3.3.3 20040412 (Red Hat Linux 3.3.3-7)
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The 'C' compiler is provided in a recent Fedora distribution.

For instance:

asmlinkage void __up(struct semaphore *sem)
{
     wake_up(&sem->wait);
}

This was from /usr/src/linux-2.6.9/arch/i386/kernel/semaphore.c
In this case, the value of 'sem' is destroyed which means that
certain assembly-language helper functions no longer work.
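
The failure mode can be sketched with a toy model in plain C (not the
real wrappers). Everything here is hypothetical: struct machine,
push(), pop() and callee_clobbers_arg() stand in for the i386 stack and
for a compiled callee that reuses its argument slot as scratch space.
The return address pushed by 'call' is left out to keep the model
short.

```c
#include <assert.h>
#include <stdint.h>

enum { STACK_WORDS = 16 };

/* Toy model of the state the wrapper touches; not real kernel code. */
struct machine {
    uint32_t ecx;                 /* holds the semaphore pointer on entry */
    uint32_t stack[STACK_WORDS];  /* grows downward, one slot per word */
    int esp;                      /* index of the top-of-stack slot */
};

static void push(struct machine *m, uint32_t v)
{
    assert(m->esp > 0);
    m->stack[--m->esp] = v;
}

static uint32_t pop(struct machine *m)
{
    assert(m->esp < STACK_WORDS);
    return m->stack[m->esp++];
}

/* Stand-in for a callee that overwrites its own argument slot. */
static void callee_clobbers_arg(struct machine *m)
{
    m->stack[m->esp] = 0xdeadbeefu;
}

/* Original wrapper: one slot is both the saved register and the argument. */
static uint32_t wrapper_original(uint32_t sem)
{
    struct machine m = { .ecx = sem, .esp = STACK_WORDS };
    push(&m, m.ecx);            /* pushl %ecx */
    callee_clobbers_arg(&m);    /* call __up */
    m.ecx = pop(&m);            /* popl %ecx reloads the clobbered slot */
    return m.ecx;
}

/* Patched wrapper: separate slots for the saved register and the argument. */
static uint32_t wrapper_patched(uint32_t sem)
{
    struct machine m = { .ecx = sem, .esp = STACK_WORDS };
    push(&m, m.ecx);            /* pushl %ecx (register to save) */
    push(&m, m.ecx);            /* pushl %ecx (passed parameter) */
    callee_clobbers_arg(&m);    /* call __up */
    m.esp += 1;                 /* addl $4,%esp skips the argument slot */
    m.ecx = pop(&m);            /* popl %ecx restores the saved copy */
    return m.ecx;
}
```

In this model wrapper_original() hands back the clobbered value, while
wrapper_patched() hands back the value it was given, which is the
behaviour the patch below is after.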

This was discovered by Aleksey Gorelov <Aleksey_Gorelov@Phoenix.com>.

This patch fixes it, but I think somebody may need to rework
the semaphore code to eliminate the assembly because the newer
compilers are not consistent in their handling of passed parameters
so some assembly optimization may no longer be possible.


--- linux-2.6.9/arch/i386/kernel/semaphore.c.orig	2004-08-14 01:36:56.000000000 -0400
+++ linux-2.6.9/arch/i386/kernel/semaphore.c	2004-10-19 08:06:15.000000000 -0400
@@ -198,9 +198,11 @@
  #endif
  	"pushl %eax\n\t"
  	"pushl %edx\n\t"
-	"pushl %ecx\n\t"
+	"pushl %ecx\n\t"	// Register to save
+	"pushl %ecx\n\t"	// Passed parameter
  	"call __down\n\t"
-	"popl %ecx\n\t"
+	"addl $0x04, %esp\t\n"	// Bypass corrupted parameter
+	"popl %ecx\n\t"		// Restore original
  	"popl %edx\n\t"
  	"popl %eax\n\t"
  #if defined(CONFIG_FRAME_POINTER)
@@ -220,9 +222,11 @@
  	"movl  %esp,%ebp\n\t"
  #endif
  	"pushl %edx\n\t"
-	"pushl %ecx\n\t"
+	"pushl %ecx\n\t"	// Save register
+	"pushl %ecx\n\t"	// Passed parameter
  	"call __down_interruptible\n\t"
-	"popl %ecx\n\t"
+	"addl $0x04, %esp\n\t"	// Bypass corrupted parameter
+	"popl %ecx\n\t"		// Restore register
  	"popl %edx\n\t"
  #if defined(CONFIG_FRAME_POINTER)
  	"movl %ebp,%esp\n\t"
@@ -241,9 +245,11 @@
  	"movl  %esp,%ebp\n\t"
  #endif
  	"pushl %edx\n\t"
-	"pushl %ecx\n\t"
+	"pushl %ecx\n\t"		// Save register
+	"pushl %ecx\n\t"		// Passed parameter
  	"call __down_trylock\n\t"
-	"popl %ecx\n\t"
+	"addl $0x04, %esp\n\t"		// Bypass corrupted parameter
+	"popl %ecx\n\t"			// Restore register
  	"popl %edx\n\t"
  #if defined(CONFIG_FRAME_POINTER)
  	"movl %ebp,%esp\n\t"
@@ -259,9 +265,11 @@
  "__up_wakeup:\n\t"
  	"pushl %eax\n\t"
  	"pushl %edx\n\t"
-	"pushl %ecx\n\t"
+	"pushl %ecx\n\t"	// Save register
+	"pushl %ecx\n\t"	// Passed parameter
  	"call __up\n\t"
-	"popl %ecx\n\t"
+	"addl $0x04, %esp\n\t"	// Bypass corrupted parameter
+	"popl %ecx\n\t"		// Restore register
  	"popl %edx\n\t"
  	"popl %eax\n\t"
  	"ret"


Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by John Ashcroft.
                  98.36% of all statistics are fiction.

^ permalink raw reply	[flat|nested] 99+ messages in thread
