* Re: Semaphore assembly-code bug
[not found] ` <Pine.LNX.4.61.0410291631250.8616@twinlark.arctic.org.suse.lists.linux.kernel>
@ 2004-10-30 2:04 ` Andi Kleen
0 siblings, 0 replies; 99+ messages in thread
From: Andi Kleen @ 2004-10-30 2:04 UTC (permalink / raw)
To: dean gaudet
Cc: linux-os, Andreas Steinmetz, Richard Henderson, Andi Kleen,
Andrew Morton, Jan Hubicka, linux-kernel, torvalds
dean gaudet <dean-list-linux-kernel@arctic.org> writes:
>
> it's worse than that in general -- lea typically goes through the AGU
> which has either less throughput or longer latency than the ALUs...
> depending on which x86en. it's 4 cycles for a lea on p4, vs. 1 for a pop.
> it's 2 cycles for a lea on k8 vs. 1 for a pop.
On D stepping and later K8 the lea is 1 cycle latency because the
decoder optimizes the lea into an add.
-Andi
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
[not found] ` <Pine.LNX.4.58.0410291133220.28839@ppc970.osdl.org.suse.lists.linux.kernel>
@ 2004-10-30 2:13 ` Andi Kleen
2004-10-30 9:28 ` Denis Vlasenko
0 siblings, 1 reply; 99+ messages in thread
From: Andi Kleen @ 2004-10-30 2:13 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
Linus Torvalds <torvalds@osdl.org> writes:
> Anyway, it's quite likely that for several CPU's the fastest sequence ends
> up actually being
>
> movl 4(%esp),%ecx
> movl 8(%esp),%edx
> movl 12(%esp),%eax
> addl $16,%esp
>
> which is also one of the biggest alternatives.
For K8 it should be the fastest way. K7 probably too.
-Andi
* Re: Semaphore assembly-code bug
2004-10-30 2:13 ` Andi Kleen
@ 2004-10-30 9:28 ` Denis Vlasenko
2004-10-30 17:53 ` Linus Torvalds
0 siblings, 1 reply; 99+ messages in thread
From: Denis Vlasenko @ 2004-10-30 9:28 UTC (permalink / raw)
To: Andi Kleen, Linus Torvalds; +Cc: linux-kernel
On Saturday 30 October 2004 05:13, Andi Kleen wrote:
> Linus Torvalds <torvalds@osdl.org> writes:
>
> > Anyway, it's quite likely that for several CPU's the fastest sequence ends
> > up actually being
> >
> > movl 4(%esp),%ecx
> > movl 8(%esp),%edx
> > movl 12(%esp),%eax
> > addl $16,%esp
> >
> > which is also one of the biggest alternatives.
>
> For K8 it should be the fastest way. K7 probably too.
Pity. I always loved 1 byte insns :)
/me hopes that K8 rev E or K9 will have optimized pop.
--
vda
* Re: Semaphore assembly-code bug
2004-10-30 9:28 ` Denis Vlasenko
@ 2004-10-30 17:53 ` Linus Torvalds
2004-10-30 21:00 ` Denis Vlasenko
2004-10-31 0:39 ` Semaphore assembly-code bug Andi Kleen
0 siblings, 2 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-30 17:53 UTC (permalink / raw)
To: Denis Vlasenko; +Cc: Andi Kleen, linux-kernel
On Sat, 30 Oct 2004, Denis Vlasenko wrote:
>
> On Saturday 30 October 2004 05:13, Andi Kleen wrote:
> > Linus Torvalds <torvalds@osdl.org> writes:
> >
> > > Anyway, it's quite likely that for several CPU's the fastest sequence ends
> > > up actually being
> > >
> > > movl 4(%esp),%ecx
> > > movl 8(%esp),%edx
> > > movl 12(%esp),%eax
> > > addl $16,%esp
> > >
> > > which is also one of the biggest alternatives.
> >
> > For K8 it should be the fastest way. K7 probably too.
>
> Pity. I always loved 1 byte insns :)
I personally am a _huge_ believer in small code.
The sequence
popl %eax
popl %ecx
popl %edx
popl %eax
is four bytes. In contrast, the three moves and an add is 15 bytes. That's
almost 4 times as big.
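[Editorial aside: the byte counts can be checked against the ia32 encodings. pop r32 is one byte; a mov with a disp8 off %esp needs opcode, ModRM, SIB byte and the displacement; an add of an imm8 to %esp is three bytes. A quick tally:]

```python
# Per-instruction sizes from the ia32 encoding rules:
#   popl %reg              -> 1 byte  (0x58 + reg)
#   movl disp8(%esp),%reg  -> 4 bytes (opcode, ModRM, SIB, disp8)
#   addl $imm8,%esp        -> 3 bytes (0x83 0xC4 imm8)
POP, MOV, ADD = 1, 4, 3

pop_seq = 4 * POP           # popl %eax/%ecx/%edx/%eax
mov_seq = 3 * MOV + ADD     # three movl plus addl $16,%esp

print(pop_seq, mov_seq)     # -> 4 15
```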
And size _does_ matter. The extra 11 bytes means that if you have six of
these sequences in your program, you are pretty much _guaranteed_ one more
icache miss from memory. That's a few hundred cycles these days.
Considering that you _maybe_ won a cycle or two each time it was executed,
it's not at all clear that it's a win, except in benchmarks that have huge
repeat-rates. Real life doesn't usually have that. In many real-life
scenarios, repeat rates are in the tens of hundreds for most code...
And that's ignoring things like disk load times etc.
Sadly, the situation is often one where when you actually do all the
performance testing, you artificially increase the repeat-rates hugely:
you run the same program a thousand times in order to get a good profile,
and you keep it in the cache all the time. So performance analysis often
doesn't actually _see_ the downsides.
Linus
* Re: Semaphore assembly-code bug
2004-10-30 17:53 ` Linus Torvalds
@ 2004-10-30 21:00 ` Denis Vlasenko
2004-10-30 21:14 ` code bloat [was Re: Semaphore assembly-code bug] Lee Revell
2004-10-31 0:39 ` Semaphore assembly-code bug Andi Kleen
1 sibling, 1 reply; 99+ messages in thread
From: Denis Vlasenko @ 2004-10-30 21:00 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andi Kleen, linux-kernel
On Saturday 30 October 2004 20:53, Linus Torvalds wrote:
> > > > movl 4(%esp),%ecx
> > > > movl 8(%esp),%edx
> > > > movl 12(%esp),%eax
> > > > addl $16,%esp
> > > >
> > > > which is also one of the biggest alternatives.
> > >
> > > For K8 it should be the fastest way. K7 probably too.
> >
> > Pity. I always loved 1 byte insns :)
>
> I personally am a _huge_ believer in small code.
Thankfully you are not alone - a horde of uclibc/dietlibc/busybox
users shares these views. Also see http://smarden.org/pape/
> The sequence
>
> popl %eax
> popl %ecx
> popl %edx
> popl %eax
>
> is four bytes. In contrast, the three moves and an add is 15 bytes. That's
> almost 4 times as big.
>
> And size _does_ matter. The extra 11 bytes means that if you have six of
> these sequences in your program, you are pretty much _guaranteed_ one more
> icache miss from memory. That's a few hundred cycles these days.
> Considering that you _maybe_ won a cycle or two each time it was executed,
> it's not at all clear that it's a win, except in benchmarks that have huge
> repeat-rates. Real life doesn't usually have that. In many real-life
> scenarios, repeat rates are in the tens of hundreds for most code...
If only glibc / X / KDE / OpenOffice (ugggh) people could hear you more...
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
15364 root 15 0 38008 26M 28496 S 0,0 10,8 0:57 0 kmail
20022 root 16 0 40760 24M 23920 S 0,1 10,0 0:04 0 mozilla-bin
1627 root 14 -1 71064 19M 53192 S < 0,1 7,9 3:16 0 X
1700 root 15 0 25348 16M 23508 S 0,1 6,5 0:46 0 kdeinit
3578 root 15 0 24032 14M 21524 S 0,5 5,8 0:23 0 konsole
--
vda
* code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 21:00 ` Denis Vlasenko
@ 2004-10-30 21:14 ` Lee Revell
2004-10-30 22:11 ` Denis Vlasenko
2004-10-31 6:37 ` Jan Engelhardt
0 siblings, 2 replies; 99+ messages in thread
From: Lee Revell @ 2004-10-30 21:14 UTC (permalink / raw)
To: Denis Vlasenko; +Cc: Linus Torvalds, Andi Kleen, linux-kernel
On Sun, 2004-10-31 at 00:00 +0300, Denis Vlasenko wrote:
> If only glibc / X / KDE / OpenOffice (ugggh) people could hear you more...
>
> PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
> 15364 root 15 0 38008 26M 28496 S 0,0 10,8 0:57 0 kmail
> 20022 root 16 0 40760 24M 23920 S 0,1 10,0 0:04 0 mozilla-bin
> 1627 root 14 -1 71064 19M 53192 S < 0,1 7,9 3:16 0 X
> 1700 root 15 0 25348 16M 23508 S 0,1 6,5 0:46 0 kdeinit
> 3578 root 15 0 24032 14M 21524 S 0,5 5,8 0:23 0 konsole
Wow. evolution is now more bloated than kmail.
1424 rlrevell 15 0 125m 47m 29m S 7.8 10.1 1:41.78 evolution
1508 rlrevell 15 0 92432 30m 29m S 0.0 6.4 0:14.15 mozilla-bin
1090 root 16 0 55676 18m 40m S 24.8 3.9 0:46.98 XFree86
1379 rlrevell 15 0 33776 16m 18m S 0.3 3.5 0:06.65 nautilus
1377 rlrevell 15 0 19392 11m 15m S 0.0 2.5 0:03.29 gnome-panel
1458 rlrevell 16 0 28188 11m 15m S 3.9 2.5 0:10.44 gnome-terminal
1307 rlrevell 15 0 20828 11m 17m S 0.0 2.4 0:03.08 gnome-settings-
Lee
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 21:14 ` code bloat [was Re: Semaphore assembly-code bug] Lee Revell
@ 2004-10-30 22:11 ` Denis Vlasenko
2004-10-30 22:25 ` Lee Revell
2004-10-30 22:27 ` Tim Hockin
2004-10-31 6:37 ` Jan Engelhardt
1 sibling, 2 replies; 99+ messages in thread
From: Denis Vlasenko @ 2004-10-30 22:11 UTC (permalink / raw)
To: Lee Revell; +Cc: Linus Torvalds, Andi Kleen, linux-kernel
On Sunday 31 October 2004 00:14, Lee Revell wrote:
> On Sun, 2004-10-31 at 00:00 +0300, Denis Vlasenko wrote:
> > If only glibc / X / KDE / OpenOffice (ugggh) people could hear you more...
> >
> > PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
> > 15364 root 15 0 38008 26M 28496 S 0,0 10,8 0:57 0 kmail
> > 20022 root 16 0 40760 24M 23920 S 0,1 10,0 0:04 0 mozilla-bin
> > 1627 root 14 -1 71064 19M 53192 S < 0,1 7,9 3:16 0 X
> > 1700 root 15 0 25348 16M 23508 S 0,1 6,5 0:46 0 kdeinit
> > 3578 root 15 0 24032 14M 21524 S 0,5 5,8 0:23 0 konsole
>
> Wow. evolution is now more bloated than kmail.
>
> 1424 rlrevell 15 0 125m 47m 29m S 7.8 10.1 1:41.78 evolution
> 1508 rlrevell 15 0 92432 30m 29m S 0.0 6.4 0:14.15 mozilla-bin
> 1090 root 16 0 55676 18m 40m S 24.8 3.9 0:46.98 XFree86
> 1379 rlrevell 15 0 33776 16m 18m S 0.3 3.5 0:06.65 nautilus
> 1377 rlrevell 15 0 19392 11m 15m S 0.0 2.5 0:03.29 gnome-panel
> 1458 rlrevell 16 0 28188 11m 15m S 3.9 2.5 0:10.44 gnome-terminal
> 1307 rlrevell 15 0 20828 11m 17m S 0.0 2.4 0:03.08 gnome-settings-
Well, I can try to compile packages with different options
for size, I can link against small libc, but I feel this
does not solve the problem: the code itself is bloated...
I am not a code genius, but want to help.
Hmm probably some bloat-detection tools would be helpful,
like "show me source_lines/object_size ratios of fonctions in
this ELF object file". Those with low ratio are suspects of
excessive inlining etc.
More ideas, anyone?
--
vda
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 22:11 ` Denis Vlasenko
@ 2004-10-30 22:25 ` Lee Revell
2004-10-31 14:06 ` Diego Calleja
2004-10-30 22:27 ` Tim Hockin
1 sibling, 1 reply; 99+ messages in thread
From: Lee Revell @ 2004-10-30 22:25 UTC (permalink / raw)
To: Denis Vlasenko; +Cc: Linus Torvalds, Andi Kleen, linux-kernel
On Sun, 2004-10-31 at 01:11 +0300, Denis Vlasenko wrote:
> Well, I can try to compile packages with different options
> for size, I can link against small libc, but I feel this
> does not solve the problem: the code itself is bloated...
>
> I am not a code genius, but want to help.
>
> Hmm probably some bloat-detection tools would be helpful,
> like "show me source_lines/object_size ratios of functions in
> this ELF object file". Those with low ratio are suspects of
> excessive inlining etc.
>
> More ideas, anyone?
I agree it's a hard problem. Right now there is massive pressure on
Linux application developers to add features to catch up with MS and
Apple. This inevitably leads to bloat; we all know that efficiency is
the first thing to go out the window in that situation, and the problem
is exacerbated by the wide availability of fast machines. It's an old,
depressing story...
That being said it would indeed be nice if we had more tools to quantify
bloat.
Lee
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 22:11 ` Denis Vlasenko
2004-10-30 22:25 ` Lee Revell
@ 2004-10-30 22:27 ` Tim Hockin
2004-10-30 22:44 ` Jeff Garzik
` (2 more replies)
1 sibling, 3 replies; 99+ messages in thread
From: Tim Hockin @ 2004-10-30 22:27 UTC (permalink / raw)
To: Denis Vlasenko; +Cc: Lee Revell, Linus Torvalds, Andi Kleen, linux-kernel
On Sun, Oct 31, 2004 at 01:11:07AM +0300, Denis Vlasenko wrote:
> I am not a code genius, but want to help.
>
> Hmm probably some bloat-detection tools would be helpful,
> like "show me source_lines/object_size ratios of functions in
> this ELF object file". Those with low ratio are suspects of
> excessive inlining etc.
The problem with apps of this sort is the multiple layers of abstraction.
Xlib, GLib, GTK, GNOME, Pango, XML, etc.
No one wants to duplicate effort (rightly so). Each of these libs tries
to do EVERY POSSIBLE thing. They all end up bloated. Then you have to
link them all in. You end up bloated. Then it is very easy to rely on
those libs for EVERYTHING, rather than actually thinking.
So you end up with the mindset of, for example, "if it's text it's XML".
You have to parse everything as XML, when simple parsers would be tons
faster and simpler and smaller.
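[Editorial aside: a toy illustration of that point, parsing the same record as plain `key=value` text and as XML; the record and its fields are invented for the example. The text path is a one-liner needing no library at all:]

```python
import xml.etree.ElementTree as ET

# The same tiny record as "key=value" text and as XML.
TEXT = "name=vda\nshell=/bin/sh\nuid=1000"
XML = "<user><name>vda</name><shell>/bin/sh</shell><uid>1000</uid></user>"

def parse_text(s):
    # A "simple parser": one split per line, no library needed.
    return dict(line.split("=", 1) for line in s.splitlines())

def parse_xml(s):
    root = ET.fromstring(s)
    return {child.tag: child.text for child in root}

# Both roads lead to the same dict; only one drags in a parser library.
assert parse_text(TEXT) == parse_xml(XML) == {
    "name": "vda", "shell": "/bin/sh", "uid": "1000"}
```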
Bloat is caused by feature creep at every layer, not just the app.
Youck.
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 22:27 ` Tim Hockin
@ 2004-10-30 22:44 ` Jeff Garzik
2004-10-30 22:50 ` Tim Hockin
2004-10-31 20:15 ` Theodore Ts'o
2004-10-30 23:13 ` Denis Vlasenko
2004-10-31 6:49 ` Jan Engelhardt
2 siblings, 2 replies; 99+ messages in thread
From: Jeff Garzik @ 2004-10-30 22:44 UTC (permalink / raw)
To: Tim Hockin
Cc: Denis Vlasenko, Lee Revell, Linus Torvalds, Andi Kleen, linux-kernel
Tim Hockin wrote:
> So you end up with the mindset of, for example, "if it's text it's XML".
> You have to parse everything as XML, when simple parsers would be tons
> faster and simpler and smaller.
hehehe. One of the reasons why I like XML is that you don't have to
keep cloning new parsers.
Jeff
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 23:13 ` Denis Vlasenko
@ 2004-10-30 22:45 ` Alan Cox
2004-10-31 1:21 ` Z Smith
2004-10-30 23:20 ` [OT] " Lee Revell
2004-10-30 23:28 ` Tim Hockin
2 siblings, 1 reply; 99+ messages in thread
From: Alan Cox @ 2004-10-30 22:45 UTC (permalink / raw)
To: Denis Vlasenko
Cc: Tim Hockin, Lee Revell, Linus Torvalds, Andi Kleen,
Linux Kernel Mailing List
The gnome/gtk folks know they have a lot of code bloat, and know how to
shave about 10Mb off the desktop size already. What they don't have is
enough hands and brains to do this and the other stuff that is pressing.
So if the desktop stuff is annoying you join gnome-love or whatever the
kde equivalent is 8)
Alan
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 22:44 ` Jeff Garzik
@ 2004-10-30 22:50 ` Tim Hockin
2004-10-31 20:15 ` Theodore Ts'o
1 sibling, 0 replies; 99+ messages in thread
From: Tim Hockin @ 2004-10-30 22:50 UTC (permalink / raw)
To: Jeff Garzik
Cc: Denis Vlasenko, Lee Revell, Linus Torvalds, Andi Kleen, linux-kernel
On Sat, Oct 30, 2004 at 06:44:10PM -0400, Jeff Garzik wrote:
> Tim Hockin wrote:
> >So you end up with the mindset of, for example, "if it's text it's XML".
> >You have to parse everything as XML, when simple parsers would be tons
> >faster and simpler and smaller.
>
>
> hehehe. One of the reasons why I like XML is that you don't have to
> keep cloning new parsers.
I'm fine with XML, when it makes sense. In fact, I wrote an XML parser.
It's blazingly fast. But it doesn't try to do everything for everyone.
It does just as much as I needed. And when I need XML, I don't have any
problem linking it in. It's only a couple hundred lines of C.
What irks me is best demonstrated by this:
At OLS last year or the year before, at a talk about DBUS, someone asked
about the DBUS protocol. When told that it was binary, they asked if
there was any advantage to that over text. The reply "We didn't want to
link an XML parser in".
Now, I am fine with not wanting to add bloat. But umm, the question was
about TEXT, not XML. They are not the same thing. Not all text should be
XML.
* Re: [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 23:20 ` [OT] " Lee Revell
@ 2004-10-30 22:52 ` Alan Cox
2004-10-31 1:09 ` Ken Moffat
2004-10-31 0:48 ` Andi Kleen
1 sibling, 1 reply; 99+ messages in thread
From: Alan Cox @ 2004-10-30 22:52 UTC (permalink / raw)
To: Lee Revell
Cc: Denis Vlasenko, Tim Hockin, Linus Torvalds, Andi Kleen,
Linux Kernel Mailing List
On Sul, 2004-10-31 at 00:20, Lee Revell wrote:
> I think very few application developers understand the point Linus made
> - that bigger code IS slower code due to cache misses. If this were
> widely understood we would be in pretty good shape.
On my laptop both Openoffice and gnome are measurably faster if you
build the lot with -Os (except a couple of image libs)
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 22:27 ` Tim Hockin
2004-10-30 22:44 ` Jeff Garzik
@ 2004-10-30 23:13 ` Denis Vlasenko
2004-10-30 22:45 ` Alan Cox
` (2 more replies)
2004-10-31 6:49 ` Jan Engelhardt
2 siblings, 3 replies; 99+ messages in thread
From: Denis Vlasenko @ 2004-10-30 23:13 UTC (permalink / raw)
To: Tim Hockin; +Cc: Lee Revell, Linus Torvalds, Andi Kleen, linux-kernel
On Sunday 31 October 2004 01:27, Tim Hockin wrote:
> On Sun, Oct 31, 2004 at 01:11:07AM +0300, Denis Vlasenko wrote:
> > I am not a code genius, but want to help.
> >
> > Hmm probably some bloat-detection tools would be helpful,
> > like "show me source_lines/object_size ratios of functions in
> > this ELF object file". Those with low ratio are suspects of
> > excessive inlining etc.
>
> The problem with apps of this sort is the multiple layers of abstraction.
>
> Xlib, GLib, GTK, GNOME, Pango, XML, etc.
I think it makes sense to start from lower layers first:
Kernel team is reasonably aware of the bloat danger.
glibc is worse, but thanks to heroic actions of Eric Andersen
we have mostly feature complete uclibc, 4 times (!)
smaller than glibc.
Xlib, GLib.... - didn't look into them apart from cases
when they do not build or in bug hunting sessions.
Quick data point: glib-1.2.10 is 1/2 of uclibc in size.
glib-2.2.2 is 2 times uclibc. x4 growth :(
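[The x4 figure is just the ratio of the two quoted sizes relative to uclibc:]

```python
glib_1_2_10 = 0.5   # glib-1.2.10 is 1/2 the size of uclibc
glib_2_2_2 = 2.0    # glib-2.2.2 is 2x the size of uclibc

growth = glib_2_2_2 / glib_1_2_10
print(growth)       # -> 4.0
```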
> No one wants to duplicate effort (rightly so). Each of these libs tries
> to do EVERY POSSIBLE thing. They all end up bloated. Then you have to
> link them all in. You end up bloated. Then it is very easy to rely on
> those libs for EVERYTHING, rather than actually thinking.
>
> So you end up with the mindset of, for example, "if it's text it's XML".
> You have to parse everything as XML, when simple parsers would be tons
> faster and simpler and smaller.
>
> Bloat is caused by feature creep at every layer, not just the app.
I actually tried to convince maintainers of one package
that their code is needlessly complex. I did send patches
to remedy that a bit while fixing real bugs. Rejected.
Bugs were planned to be fixed by adding more code.
I've lost all hope on that case.
I guess this is a reason why bloat problems tend to be solved
by rewrite from scratch. I could name quite a few cases:
glibc -> dietlibc,uclibc
coreutils -> busybox
named -> djbdns
inetd -> daemontools+ucspi-tcp
sendmail -> qmail
syslogd -> socklog (http://smarden.org/socklog/)
It's sort of frightening that someone will need to
rewrite Xlib or, say, OpenOffice :(
--
vda
* [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 23:13 ` Denis Vlasenko
2004-10-30 22:45 ` Alan Cox
@ 2004-10-30 23:20 ` Lee Revell
2004-10-30 22:52 ` Alan Cox
2004-10-31 0:48 ` Andi Kleen
2004-10-30 23:28 ` Tim Hockin
2 siblings, 2 replies; 99+ messages in thread
From: Lee Revell @ 2004-10-30 23:20 UTC (permalink / raw)
To: Denis Vlasenko; +Cc: Tim Hockin, Linus Torvalds, Andi Kleen, linux-kernel
On Sun, 2004-10-31 at 02:13 +0300, Denis Vlasenko wrote:
> It's sort of frightening that someone will need to
> rewrite Xlib or, say, OpenOffice :(
I think very few application developers understand the point Linus made
- that bigger code IS slower code due to cache misses. If this were
widely understood we would be in pretty good shape.
Lee
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 23:13 ` Denis Vlasenko
2004-10-30 22:45 ` Alan Cox
2004-10-30 23:20 ` [OT] " Lee Revell
@ 2004-10-30 23:28 ` Tim Hockin
2004-10-31 2:04 ` Michael Clark
2 siblings, 1 reply; 99+ messages in thread
From: Tim Hockin @ 2004-10-30 23:28 UTC (permalink / raw)
To: Denis Vlasenko; +Cc: Lee Revell, Linus Torvalds, Andi Kleen, linux-kernel
On Sun, Oct 31, 2004 at 02:13:37AM +0300, Denis Vlasenko wrote:
> > Bloat is caused by feature creep at every layer, not just the app.
>
> I actually tried to convince maintainers of one package
> that their code is needlessly complex. I did send patches
> to remedy that a bit while fixing real bugs. Rejected.
> Bugs were planned to be fixed by adding more code.
> I've lost all hope on that case.
See, there is an ego problem, too. If you rewrite my code, it means
you're better than I am. Rejected.
Features win over efficiency. Seriously, look at glibc. Have you ever
tried to fix a bug in it? Holy CRAP is that horrible code. Each chunk of
code itself is OK (though it abuses macros so thoroughly I hesitate to
call it C code). But it tries to support every architecture x every OS.
You know what? I don't CARE if the glibc code compiles on HPUX or not.
HPUX has its own libc.
> I guess this is a reason why bloat problems tend to be solved
> by rewrite from scratch. I could name quite a few cases:
From-scratch is a huge risk. But yeah, sometimes it has to be.
> It's sort of frightening that someone will need to
> rewrite Xlib or, say, OpenOffice :(
Never gonna happen.
* Re: Semaphore assembly-code bug
2004-10-30 17:53 ` Linus Torvalds
2004-10-30 21:00 ` Denis Vlasenko
@ 2004-10-31 0:39 ` Andi Kleen
2004-10-31 1:43 ` Linus Torvalds
1 sibling, 1 reply; 99+ messages in thread
From: Andi Kleen @ 2004-10-31 0:39 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Denis Vlasenko, Andi Kleen, linux-kernel
> I personally am a _huge_ believer in small code.
>
> The sequence
>
> popl %eax
> popl %ecx
> popl %edx
> popl %eax
>
> is four bytes. In contrast, the three moves and an add is 15 bytes. That's
> almost 4 times as big.
Using the long stack setup code was found to be a significant
win when enough registers were saved (several percent in real benchmarks)
on K8 gcc. It speeds up all function calls considerably because it
eliminates several stalls for each function entry/exit. The popls
will all depend on each other because of their implied reference
to esp.
Yes, it bloats the code, but function calls happen so often that having them
faster is really noticeable.
The K8 has quite big caches and is not decoding limited, so it
wasn't a too bad tradeoff there.
Ideally you would want to only do it on hot functions and optimize
rarely called functions for code size, but that would require profile
feedback, which is often not feasible (JITs have an advantage here).
Unfortunately I don't think it is practically feasible for the kernel because
we rely on being able to recreate the same vmlinux binaries for debugging.
[It's a pity actually because modern compilers do a lot better
with profile feedback]
On P4 on the other hand it doesn't help at all and only makes
the code bigger. I did it from hand in the x86-64 syscall
code too (that was before there was EM64T, but I still think it was a
good idea). Perhaps AMD adds special hardware in some future CPU that
also makes it unnecessary, but currently it's like this and it helps.
-Andi
* Re: [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 23:20 ` [OT] " Lee Revell
2004-10-30 22:52 ` Alan Cox
@ 2004-10-31 0:48 ` Andi Kleen
1 sibling, 0 replies; 99+ messages in thread
From: Andi Kleen @ 2004-10-31 0:48 UTC (permalink / raw)
To: Lee Revell
Cc: Denis Vlasenko, Tim Hockin, Linus Torvalds, Andi Kleen, linux-kernel
On Sat, Oct 30, 2004 at 07:20:04PM -0400, Lee Revell wrote:
> On Sun, 2004-10-31 at 02:13 +0300, Denis Vlasenko wrote:
> > It's sort of frightening that someone will need to
> > rewrite Xlib or, say, OpenOffice :(
>
> I think very few application developers understand the point Linus made
> - that bigger code IS slower code due to cache misses. If this were
> widely understood we would be in pretty good shape.
It's true in some cases, but not true in others. Don't make it your
gospel.
-Andi
* Re: [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 22:52 ` Alan Cox
@ 2004-10-31 1:09 ` Ken Moffat
2004-10-31 2:42 ` Tim Connors
2004-10-31 14:44 ` Alan Cox
0 siblings, 2 replies; 99+ messages in thread
From: Ken Moffat @ 2004-10-31 1:09 UTC (permalink / raw)
To: Alan Cox
Cc: Lee Revell, Denis Vlasenko, Tim Hockin, Linus Torvalds,
Andi Kleen, Linux Kernel Mailing List
On Sat, 30 Oct 2004, Alan Cox wrote:
> On Sul, 2004-10-31 at 00:20, Lee Revell wrote:
> > I think very few application developers understand the point Linus made
> > - that bigger code IS slower code due to cache misses. If this were
> > widely understood we would be in pretty good shape.
>
> On my laptop both Openoffice and gnome are measurably faster if you
> build the lot with -Os (except a couple of image libs)
>
Depends how much of gnome you use. I used to swear by -Os for
non-toolchain stuff, but in the end I got bitten by gnumeric on x86.
http://bugs.gnome.org/show_bug.cgi?id=128834 is similar, but in my case
opening *any* spreadsheet would cause gnumeric to segfault (gcc-3.3
series). Add in the time spent rebuilding gnome before I found this bug
report, and adding extra parts of gnome just in case I missed something,
and the time to load it is irrelevant. Since then I've had an anecdotal
report that -Os is known to cause problems with gnome. I s'pose people
will say it serves me right for doing my initial testing on ppc which
didn't have this problem ;) The point is that -Os is *much* less tested
than -O2 at the moment.
Ken
--
das eine Mal als Tragödie, das andere Mal als Farce
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 22:45 ` Alan Cox
@ 2004-10-31 1:21 ` Z Smith
2004-10-31 2:47 ` Jim Nelson
2004-10-31 15:19 ` Alan Cox
0 siblings, 2 replies; 99+ messages in thread
From: Z Smith @ 2004-10-31 1:21 UTC (permalink / raw)
Cc: Linux Kernel Mailing List
Alan Cox wrote:
> So if the desktop stuff is annoying you join gnome-love or whatever the
> kde equivalent is 8)
Or join me in my effort to limit bloat. Why use an X server
that uses 15-30 megs of RAM when you can use FBUI which is 25 kilobytes
of code with very minimal kmallocing?
home.comcast.net/~plinius/fbui.html
Zack Smith
Bloat Liberation Front
* Re: Semaphore assembly-code bug
2004-10-31 0:39 ` Semaphore assembly-code bug Andi Kleen
@ 2004-10-31 1:43 ` Linus Torvalds
2004-10-31 2:04 ` Andi Kleen
0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-10-31 1:43 UTC (permalink / raw)
To: Andi Kleen; +Cc: Denis Vlasenko, linux-kernel
On Sun, 31 Oct 2004, Andi Kleen wrote:
>
> Using the long stack setup code was found to be a significant
> win when enough registers were saved (several percent in real benchmarks)
> on K8 gcc.
For _what_?
Real applications, or SpecInt?
The fact is, SpecInt is not very interesting, because it has almost _zero_
icache footprint, and it has generally big repeat-rates, and to make
matters worse, you are allowed (and everybody does) warm up the caches by
running before you actually do the benchmark run.
_None_ of these are realistic for real life workloads.
> It speeds up all function calls considerably because it
> eliminates several stalls for each function entry/exit.
.. it shaves off a few cycles in the cached case, yes.
> The popls will all depend on each other because of their implied
> reference to esp.
Which is only true on moderately stupid CPU's. Two pop's don't _really_
depend on each other in any real sense, and there are CPU's that will
happily dual-issue them, or at least not stall in between (ie the pop's
will happily keep the memory unit 100% busy).
Linus
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 23:28 ` Tim Hockin
@ 2004-10-31 2:04 ` Michael Clark
0 siblings, 0 replies; 99+ messages in thread
From: Michael Clark @ 2004-10-31 2:04 UTC (permalink / raw)
To: Tim Hockin
Cc: Denis Vlasenko, Lee Revell, Linus Torvalds, Andi Kleen, linux-kernel
On 10/31/04 07:28, Tim Hockin wrote:
> On Sun, Oct 31, 2004 at 02:13:37AM +0300, Denis Vlasenko wrote:
>
>>>Bloat is caused by feature creep at every layer, not just the app.
>>
>>I actually tried to convince maintainers of one package
>>that their code is needlessly complex. I did send patches
>>to remedy that a bit while fixing real bugs. Rejected.
>>Bugs were planned to be fixed by adding more code.
>>I've lost all hope on that case.
>
>
> See, there is an ego problem, too. If you rewrite my code, it means
> you're better than I am. Rejected.
>
> Features win over efficiency. Seriously, look at glibc. Have you ever
> tried to fix a bug in it? Holy CRAP is that horrible code. Each chunk of
> code itself is OK (though it abuses macros so thoroughly I hesitate to
> call it C code). But it tries to support every architecture x every OS.
> You know what? I don't CARE if the glibc code compiles on HPUX or not.
> HPUX has its own libc.
>
>
>>I guess this is a reason why bloat problems tend to be solved
>>by rewrite from scratch. I could name quite a few cases:
>
>
> From-scratch is a huge risk. But yeah, sometimes it has to be.
>
>
>>It's sort of frightening that someone will need to
>>rewrite Xlib or, say, OpenOffice :(
Well, the xlib rewrite is happening (XCB/XCL).
One of the reasons cited is the size of xlib.
http://www.freedesktop.org/Software/xcb
~mc
* Re: Semaphore assembly-code bug
2004-10-31 1:43 ` Linus Torvalds
@ 2004-10-31 2:04 ` Andi Kleen
0 siblings, 0 replies; 99+ messages in thread
From: Andi Kleen @ 2004-10-31 2:04 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andi Kleen, Denis Vlasenko, linux-kernel
On Sat, Oct 30, 2004 at 06:43:21PM -0700, Linus Torvalds wrote:
>
>
> On Sun, 31 Oct 2004, Andi Kleen wrote:
> >
> > Using the long stack setup code was found to be a significant
> > win when enough registers were saved (several percent in real benchmarks)
> > on K8 gcc.
>
> For _what_?
>
> Real applications, or SpecInt?
iirc gcc itself was faster (the modern one, not the old version in SpecInt)
KDE startup ended up being faster too, but that may have been due to
other improvements as well.
This was all tested on CPUs with very large caches (1MB L2), you
can pack a lot of code into that.
Also when people benchmark -m64 code compared to -m32 they often
see large improvements on AMD64 (as long as the code isn't long or pointer
memory bound), and I suspect at least part of that can be explained
by the -m64 gcc defaulting to the long function prologues.
Another example of larger code usually being better is x87 vs SSE2 floating
point math.
> The fact is, SpecInt is not very interesting, because it has almost _zero_
> icache footprint, and it has generally big repeat-rates, and to make
I don't think it's generally true. One counterexample is the gcc subtest
in SpecInt.
> > It speeds up all function calls considerably because it
> > eliminates several stalls for each function entry/exit.
>
> .. it shaves off a few cycles in the cached case, yes.
I would expect it to help in the uncached case too because
the CPU does very aggressive prefetching of code. Once
it gets started on a function it will fetch it very quickly.
>
> > The popls will all depend on each other because of their implied
> > reference to esp.
>
> Which is only true on moderately stupid CPU's. Two pop's don't _really_
I don't see the K8 as a stupid CPU.
> depend on each other in any real sense, and there are CPU's that will
> happily dual-issue them, or at least not stall in between (ie the pop's
> will happily keep the memory unit 100% busy).
Yes, there are. And there are others that don't.
-Andi
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 1:09 ` Ken Moffat
@ 2004-10-31 2:42 ` Tim Connors
2004-10-31 4:45 ` Paul
2004-10-31 14:44 ` Alan Cox
1 sibling, 1 reply; 99+ messages in thread
From: Tim Connors @ 2004-10-31 2:42 UTC (permalink / raw)
To: Ken Moffat; +Cc: Linux Kernel Mailing List
Ken Moffat <ken@kenmoffat.uklinux.net> said on Sun, 31 Oct 2004 01:09:54 +0000 (GMT):
> and the time to load it is irrelevant. Since then I've had an anecdotal
> report that -Os is known to cause problems with gnome. I s'pose people
> will say it serves me right for doing my initial testing on ppc which
> didn't have this problem ;) The point is that -Os is *much* less tested
> than -O2 at the moment.
Because people suck, and don't use it and hence test it.
Ie, test it!
I can't, because I prefer to stay away from gnome instead.
--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
"Warning: Do not look into laser with remaining eye" -- a physics experiment
"Press emergency laser shutdown button with remaining hand" -- J.D.Baldwin @ ASR
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 1:21 ` Z Smith
@ 2004-10-31 2:47 ` Jim Nelson
2004-10-31 15:19 ` Alan Cox
1 sibling, 0 replies; 99+ messages in thread
From: Jim Nelson @ 2004-10-31 2:47 UTC (permalink / raw)
To: Z Smith; +Cc: Linux Kernel Mailing List
Z Smith wrote:
> Alan Cox wrote:
>
>> So if the desktop stuff is annoying you join gnome-love or whatever the
>> kde equivalent is 8)
>
>
> Or join me in my effort to limit bloat. Why use an X server
> that uses 15-30 megs of RAM when you can use FBUI which is 25 kilobytes
> of code with very minimal kmallocing?
>
> home.comcast.net/~plinius/fbui.html
>
> Zack Smith
> Bloat Liberation Front
>
Because some of us use remote X clients on big iron with an X server on the
desktop. IIRC (been a long time since my CAD classes), a whole bunch of FEA and
CAE/CAD applications worked this way.
There is a lot more flexibility inherent in user-space compared to kernel-space.
You can use PAM, Kerberos, and a whole host of other security devices that would
be difficult to implement efficiently in kernel-space.
Dude, that's a cool hack, but just about everything you did could be done with
svgalib and the input core interface. The advantage to svgalib is that if that
interface dies, you can recover the machine pretty easily, whereas kernel panics
are a bit more disruptive.
Still - it would be a nifty add-on for POS terminals, etc., just not the kind of
thing I'd expect to see in the kernel anytime soon. Once 2.7 is started, see if
people are more receptive. Take the time to flesh it out, get some more people on
board, see if Sourceforge will host the project, and lose the advertising campaign
- that's not likely to win any friends or supporters around here.
I don't mean to be harsh, but c'mon - "Bloat Liberation Front" - err... okaaay...
Good luck,
Jim
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 2:42 ` Tim Connors
@ 2004-10-31 4:45 ` Paul
0 siblings, 0 replies; 99+ messages in thread
From: Paul @ 2004-10-31 4:45 UTC (permalink / raw)
To: Linux Kernel Mailing List
Tim Connors <tconnors+linuxkernel1099190446@astro.swin.edu.au>, on Sun Oct 31, 2004 [01:42:34 PM] said:
> Ken Moffat <ken@kenmoffat.uklinux.net> said on Sun, 31 Oct 2004 01:09:54 +0000 (GMT):
> > and the time to load it is irrelevant. Since then I've had an anecdotal
> > report that -Os is known to cause problems with gnome. I s'pose people
> > will say it serves me right for doing my initial testing on ppc which
> > didn't have this problem ;) The point is that -Os is *much* less tested
> > than -O2 at the moment.
>
> Because people suck, and don't use it and hence test it.
>
> Ie, test it!
>
> I can't, because I prefer to stay away from gnome instead.
>
Hi;
I've been using -Os as my default compile flag under
Gentoo for probably over 2 years now. Haven't noted any real
problems, and that's nearly 3GB of compressed source code
compiled on what is just my current system image.
(Well, I might suck a little because I haven't done any
benchmarks or comparisons as to the actual benefits of doing
so. Also, I use fvwm;)
Paul
set@pobox.com
> --
> TimC -- http://astronomy.swin.edu.au/staff/tconnors/
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 21:14 ` code bloat [was Re: Semaphore assembly-code bug] Lee Revell
2004-10-30 22:11 ` Denis Vlasenko
@ 2004-10-31 6:37 ` Jan Engelhardt
1 sibling, 0 replies; 99+ messages in thread
From: Jan Engelhardt @ 2004-10-31 6:37 UTC (permalink / raw)
To: linux-kernel; +Cc: Denis Vlasenko, Linus Torvalds
>> If only glibc / X / KDE / OpenOffice (ugggh) people could hear you more...
>>
>> PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
>> 15364 root 15 0 38008 26M 28496 S 0,0 10,8 0:57 0 kmail
>> 20022 root 16 0 40760 24M 23920 S 0,1 10,0 0:04 0 mozilla-bin
>> 1627 root 14 -1 71064 19M 53192 S < 0,1 7,9 3:16 0 X
>> 1700 root 15 0 25348 16M 23508 S 0,1 6,5 0:46 0 kdeinit
>> 3578 root 15 0 24032 14M 21524 S 0,5 5,8 0:23 0 konsole
Heh, and guess what: the people in #kde (irc.freenode.net for example) deny
that it's their fault with the statement "bah, that's shared libraries"!
Whether that's a lie or not, or a semi-lie, I'm definitely sure that libdcop, libmcop
and every shitcrap that's running make it almost impossible to run even on a
Duron-800 w/256MB.
>Wow. evolution is now more bloated than kmail.
>
> 1424 rlrevell 15 0 125m 47m 29m S 7.8 10.1 1:41.78 evolution
> 1508 rlrevell 15 0 92432 30m 29m S 0.0 6.4 0:14.15 mozilla-bin
> 1090 root 16 0 55676 18m 40m S 24.8 3.9 0:46.98 XFree86
> 1379 rlrevell 15 0 33776 16m 18m S 0.3 3.5 0:06.65 nautilus
> 1377 rlrevell 15 0 19392 11m 15m S 0.0 2.5 0:03.29 gnome-panel
> 1458 rlrevell 16 0 28188 11m 15m S 3.9 2.5 0:10.44 gnome-terminal
> 1307 rlrevell 15 0 20828 11m 17m S 0.0 2.4 0:03.08 gnome-settings-
Gnome is no better. (Flamewar: I like ICEWM)
The only thing more bloated is the X server itself when it runs with the
proprietary NV GL core:
USER PID MEM% VSZ RSZ STAT START TIME COMMAND
root 5220 7.8 417872 20220 SL 08:37 0:03 X -noliste[...]
Jan Engelhardt
--
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, www.gwdg.de
^ permalink raw reply [flat|nested] 99+ messages in thread
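[Editor's note: the per-process numbers traded in this subthread come from top; the same ranking can be reproduced with plain ps. A minimal sketch, assuming a procps-style ps:]

```shell
# List processes sorted by resident set size, largest first.
# Note the same caveat as with top: RSS counts shared pages once
# per process, so summing the column overstates real memory use.
ps -eo rss,vsz,comm --sort=-rss | head -n 10
```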
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 22:27 ` Tim Hockin
2004-10-30 22:44 ` Jeff Garzik
2004-10-30 23:13 ` Denis Vlasenko
@ 2004-10-31 6:49 ` Jan Engelhardt
2004-10-31 21:09 ` Z Smith
2004-11-01 15:17 ` Lee Revell
2 siblings, 2 replies; 99+ messages in thread
From: Jan Engelhardt @ 2004-10-31 6:49 UTC (permalink / raw)
Cc: linux-kernel
>> Hmm probably some bloat-detection tools would be helpful,
>> like "show me source_lines/object_size ratios of functions in
>> this ELF object file". Those with low ratio are suspects of
>> excessive inlining etc.
Hm, I've got a (very simple) line determining utility,
http://linux01.org:2222/f/UHXT/bin/sourcefuncsize
maybe someone can pipe it together with ls -l or whatever.
>The problem with apps of this sort is the multiple layers of abstraction.
>
>Xlib, GLib, GTK, GNOME, Pango, XML, etc.
At least they know one thing: that thou should not stuff everything into one
.so but multiple ones (if it's a lot). That /may/ reduce the size-in-memory,
because not all .so's need to be loaded. OTOH, most apps load /all/ anyway.
Heh, there we go.
>Bloat is cause by feature creep at every layer, not just the app.
I sense Java and C# being the best example.
Z Smith wrote:
>Or join me in my effort to limit bloat. Why use an X server
>that uses 15-30 megs of RAM when you can use FBUI which is 25 kilobytes
>of code with very minimal kmallocing?
FBUI does not have 3d acceleration?
Ken Moffat wrote:
>>The point is that -Os is *much* less tested
>>than -O2 at the moment.
>Because people suck, and don't use it and hence test it.
I doubt even the -O2-only-people use gprof/gcov frequently. :(
Jan Engelhardt
--
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, www.gwdg.de
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 22:25 ` Lee Revell
@ 2004-10-31 14:06 ` Diego Calleja
2004-10-31 20:53 ` Z Smith
0 siblings, 1 reply; 99+ messages in thread
From: Diego Calleja @ 2004-10-31 14:06 UTC (permalink / raw)
To: Lee Revell; +Cc: vda, torvalds, ak, linux-kernel
On Sat, 30 Oct 2004 18:25:38 -0400 Lee Revell <rlrevell@joe-job.com> wrote:
> I agree it's a hard problem. Right now there is massive pressure on
> Linux application developers to add features to catch up with MS and
> Apple. This inevitably leads to bloat, we all know that efficiency is
I don't think it's so bad (ie: it could be _worse_).
There's some work going on to fix some "bloat problems" too; for example
the x.org people are working on a sort of xlib complement/replacement (i
don't know its real purpose), xcb, which should help latency and code
size. Composite itself is a nice way of keeping apps from redrawing their
windows all the time. KDE "speed" is much better than a year
ago, gnome 2.8 is also somewhat "faster" (compare nautilus in gnome 2.6
vs the one in 2.8). Openoffice 2.0 will also have some "performance
improvements" (see http://development.openoffice.org/releases/q-concept.html#4.1.3.Performance|outline
and http://development.openoffice.org/releases/q-concept.html#3.1.3.Performance|outline)
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 1:09 ` Ken Moffat
2004-10-31 2:42 ` Tim Connors
@ 2004-10-31 14:44 ` Alan Cox
1 sibling, 0 replies; 99+ messages in thread
From: Alan Cox @ 2004-10-31 14:44 UTC (permalink / raw)
To: Ken Moffat
Cc: Lee Revell, Denis Vlasenko, Tim Hockin, Linus Torvalds,
Andi Kleen, Linux Kernel Mailing List
On Sul, 2004-10-31 at 01:09, Ken Moffat wrote:
> and the time to load it is irrelevant. Since then I've had an anecdotal
> report that -Os is known to cause problems with gnome. I s'pose people
> will say it serves me right for doing my initial testing on ppc which
> didn't have this problem ;) The point is that -Os is *much* less tested
> than -O2 at the moment.
I've seen no real problems - x86-32 or x86-64, and my gnumeric appears
happy. Could be that the Red Hat gcc 3.3 has the relevant fixes already
in it from upstream I guess.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 1:21 ` Z Smith
2004-10-31 2:47 ` Jim Nelson
@ 2004-10-31 15:19 ` Alan Cox
2004-10-31 20:18 ` Z Smith
1 sibling, 1 reply; 99+ messages in thread
From: Alan Cox @ 2004-10-31 15:19 UTC (permalink / raw)
To: Z Smith; +Cc: Linux Kernel Mailing List
On Sul, 2004-10-31 at 01:21, Z Smith wrote:
> Alan Cox wrote:
>
> > So if the desktop stuff is annoying you join gnome-love or whatever the
> > kde equivalent is 8)
>
> Or join me in my effort to limit bloat. Why use an X server
> that uses 15-30 megs of RAM when you can use FBUI which is 25 kilobytes
> of code with very minimal kmallocing?
My X server seems to be running at about 4Mbytes, plus the frame buffer
mappings which make it appear a lot larger. I wouldn't be surprised if
half the 4Mb was pixmap cache too, maybe more.
I've helped write tiny UI kits (take a look at nanogui for example) but
they don't have the flexibility of X.
Alan
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-30 22:44 ` Jeff Garzik
2004-10-30 22:50 ` Tim Hockin
@ 2004-10-31 20:15 ` Theodore Ts'o
2004-10-31 20:21 ` Jeff Garzik
` (2 more replies)
1 sibling, 3 replies; 99+ messages in thread
From: Theodore Ts'o @ 2004-10-31 20:15 UTC (permalink / raw)
To: Jeff Garzik; +Cc: linux-kernel
On Sat, Oct 30, 2004 at 06:44:10PM -0400, Jeff Garzik wrote:
> Tim Hockin wrote:
> >So you end up with the mindset of, for example, "if it's text it's XML".
> >You have to parse everything as XML, when simple parsers would be tons
> >faster and simpler and smaller.
>
> hehehe. One of the reasons why I like XML is that you don't have to
> keep cloning new parsers.
.... if you don't mind bloating your application:
% ls -l /usr/lib/libxml2.a
4224 -rw-r--r-- 1 root root 4312536 Oct 19 21:55 /usr/lib/libxml2.a
- Ted
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 15:19 ` Alan Cox
@ 2004-10-31 20:18 ` Z Smith
2004-11-01 11:05 ` Alan Cox
0 siblings, 1 reply; 99+ messages in thread
From: Z Smith @ 2004-10-31 20:18 UTC (permalink / raw)
To: Alan Cox; +Cc: Linux Kernel Mailing List
Alan Cox wrote:
> My X server seems to be running at about 4Mbytes, plus the frame buffer
> mappings which make it appear a lot larger. I wouldn't be suprised if
> half the 4Mb was pixmap cache too, maybe more.
At first sight that sounds like a plausible explanation, however
the facts in my case suggest something else is going on:
My laptop's framebuffer is only 800x600x24bpp VESA, or 1406kB.
But look at what X is doing:
root 632 6.1 17.5 22024 16440 ? S 12:05 0:17 X :0
The more apps in use, the more memory is used, but at the moment
I've only got xterm, rxvt, thunderbird, xclock and xload. My wm is
blackbox which is using 5 megs.
Also, just curious but why would memory-mapped I/O be counted
in the memory usage anyway? Shouldn't there be a separate number
for framebuffer memory and the like?
> I've helped write tiny UI kits (take a look at nanogui for example) but
> they don't have the flexibility of X.
In my experience, most of the flexibility is not necessary for
97% of what I do, yet it evidently costs a lot in memory usage
and speed.
Zack
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 20:15 ` Theodore Ts'o
@ 2004-10-31 20:21 ` Jeff Garzik
2004-10-31 21:06 ` Jan Engelhardt
2004-11-01 11:27 ` Alan Cox
2 siblings, 0 replies; 99+ messages in thread
From: Jeff Garzik @ 2004-10-31 20:21 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-kernel
Theodore Ts'o wrote:
> On Sat, Oct 30, 2004 at 06:44:10PM -0400, Jeff Garzik wrote:
>
>>Tim Hockin wrote:
>>
>>>So you end up with the mindset of, for example, "if it's text it's XML".
>>>You have to parse everything as XML, when simple parsers would be tons
>>>faster and simpler and smaller.
>>
>>hehehe. One of the reasons why I like XML is that you don't have to
>>keep cloning new parsers.
>
>
> .... if you don't mind bloating your application:
>
> % ls -l /usr/lib/libxml2.a
> 4224 -rw-r--r-- 1 root root 4312536 Oct 19 21:55 /usr/lib/libxml2.a
GLib's is a lot smaller :)
Jeff
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 14:06 ` Diego Calleja
@ 2004-10-31 20:53 ` Z Smith
2004-10-31 23:35 ` Rogério Brito
2004-11-01 14:48 ` Diego Calleja
0 siblings, 2 replies; 99+ messages in thread
From: Z Smith @ 2004-10-31 20:53 UTC (permalink / raw)
To: Diego Calleja; +Cc: linux-kernel
Diego Calleja wrote:
> I don't think it's so bad (ie: it could be _worse_)
But not everyone can tolerate today's level of bloat.
Imagine a small charity in a rural town in Bolivia or
Colorado. They have no budget for computers and no one
is offering donations. A local person put Linux on their 200 MHz
system after Windows crashed and the Windows CD couldn't
be found, but he can't put KDE or Gnome on it as well because
that would bring it to a crawl. The only way to make the
computer usable is to install an old distribution of Linux
from 1998 which has Netscape 4 but no office app. Eventually
they will give up on the computer and just throw it out,
because they can't wait forever for programmers to write
non-bloated software to make good use of their system.
The machine ends up at a landfill where it leaches chemicals
into the local water supply.
Zack
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 20:15 ` Theodore Ts'o
2004-10-31 20:21 ` Jeff Garzik
@ 2004-10-31 21:06 ` Jan Engelhardt
2004-11-01 11:27 ` Alan Cox
2 siblings, 0 replies; 99+ messages in thread
From: Jan Engelhardt @ 2004-10-31 21:06 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Jeff Garzik, linux-kernel
>.... if you don't mind bloating your application:
>
>% ls -l /usr/lib/libxml2.a
>4224 -rw-r--r-- 1 root root 4312536 Oct 19 21:55 /usr/lib/libxml2.a
Whoa. Bet its creator compiled with -g -O2 rather than -g0 -O2. And with
-static instead of with <dynamic>. Yay, look at this:
22:06 io:~ # l /usr/lib/libxml2.so -L
#SUSE# -rwxr-xr-x 1 root root 1145089 Apr 6 2004 /usr/lib/libxml2.so
4x smaller!
Jan Engelhardt
--
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, www.gwdg.de
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 6:49 ` Jan Engelhardt
@ 2004-10-31 21:09 ` Z Smith
2004-10-31 21:13 ` Jan Engelhardt
2004-11-01 15:17 ` Lee Revell
1 sibling, 1 reply; 99+ messages in thread
From: Z Smith @ 2004-10-31 21:09 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: linux-kernel
Jan Engelhardt wrote:
> FBUI does not have 3d acceleration?
The problem is 3d non-acceleration, i.e. VESA and VGA
would still have to be supported. I'm no 3d expert but
I think there must be some software-based 3d functions that
would require using floating point, which isn't allowed
in the kernel.
Also, might not software 3d open the kernel up to
patent issues?
Zachary Smith
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 21:09 ` Z Smith
@ 2004-10-31 21:13 ` Jan Engelhardt
2004-10-31 21:48 ` Z Smith
2004-11-01 11:29 ` Alan Cox
0 siblings, 2 replies; 99+ messages in thread
From: Jan Engelhardt @ 2004-10-31 21:13 UTC (permalink / raw)
To: Z Smith; +Cc: linux-kernel
>> FBUI does not have 3d acceleration?
>
>The problem is 3d non-acceleration i.e. VESA and VGA
>would still have to be supported. I'm no 3d expert but
>I think there must be some software-based 3d function
>would require using floating point, which isn't allowed
>in the kernel.
>
>Also, might not software 3d open the kernel up to
>patent issues?
Whatever you do, 3D at the software level is slow, even with a fast comp.
See MESA.
Jan Engelhardt
--
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, www.gwdg.de
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 21:13 ` Jan Engelhardt
@ 2004-10-31 21:48 ` Z Smith
2004-11-01 11:29 ` Alan Cox
1 sibling, 0 replies; 99+ messages in thread
From: Z Smith @ 2004-10-31 21:48 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: linux-kernel
Jan Engelhardt wrote:
>>Also, might not software 3d open the kernel up to
>>patent issues?
>
> Whatever you do, 3D at the software level is slow, even with a fast comp.
> See MESA.
Well it might be nice to add support for hardware 3-D, once 2-D
is mature. In fact I imagine it could be very convenient for
some people.
ZS
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 20:53 ` Z Smith
@ 2004-10-31 23:35 ` Rogério Brito
2004-11-01 1:20 ` Z Smith
2004-11-01 14:48 ` Diego Calleja
1 sibling, 1 reply; 99+ messages in thread
From: Rogério Brito @ 2004-10-31 23:35 UTC (permalink / raw)
To: Z Smith; +Cc: Diego Calleja, linux-kernel
Z Smith wrote:
> But not everyone can tolerate today's level of bloat.
>
> Imagine a small charity in a rural town in Bolivia or
> Colorado. They have no budget for computers and no one
> is offering donations.
Well, let me jump into this thread. I don't live in Bolivia or Colorado,
but I do live in Brazil.
The fastest computer that I have at my disposal is this one with a Duron
600MHz processor. My father uses a Pentium MMX 200MHz with 64MB of RAM.
Unfortunately, for financial reasons, I don't see us upgrading our
computers too soon.
It is nice to read Alan Cox saying that the Gnome team can make Gnome
use less memory in the future. I'm anxiously looking forward to that. In
the mean time, I will be using fluxbox and hoping that other parts of
the system (libraries etc) don't grow too fast for my computers.
I know plenty of people in the same situation that I am. Given the
choice of purchasing a book for my education or upgrading my computer, I
guess that I should spend money on the former.
And the same is true for many of my relatives and friends.
Rogério Brito.
--
Learn to quote e-mails decently at:
http://pub.tsn.dk/how-to-quote.php
http://learn.to/quote
http://www.xs4all.nl/~sbpoley/toppost.htm
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 23:35 ` Rogério Brito
@ 2004-11-01 1:20 ` Z Smith
0 siblings, 0 replies; 99+ messages in thread
From: Z Smith @ 2004-11-01 1:20 UTC (permalink / raw)
To: Rogério Brito; +Cc: Diego Calleja, linux-kernel
Rogério Brito wrote:
> Z Smith wrote:
> The fastest computer that I have at my disposal is this one with a Duron
> 600MHz processor. My father uses a Pentium MMX 200MHz with 64MB of RAM.
> Unfortunately, for financial reasons, I don't see we upgrading our
> computers too soo.
It seems that as time goes by, more and more people are
coming to be financially limited. In some cases the cause
is clearly the IMF / World Bank / WTO triad.
Some casual reading:
http://www.gregpalast.com/printerfriendly.cfm?artid=96
Zack
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 20:18 ` Z Smith
@ 2004-11-01 11:05 ` Alan Cox
0 siblings, 0 replies; 99+ messages in thread
From: Alan Cox @ 2004-11-01 11:05 UTC (permalink / raw)
To: Z Smith; +Cc: Linux Kernel Mailing List
On Sul, 2004-10-31 at 20:18, Z Smith wrote:
> My laptop's framebuffer is only 800x600x24bpp VESA, or 1406kB.
> But look at what X is doing:
X has the frame buffer mapped at the size reported by VESA, not the
minimum for the mode. (Think about RandR and you'll see why.)
> root 632 6.1 17.5 22024 16440 ? S 12:05 0:17 X :0
>
> The more apps in use, the more memory is used, but at the moment
> I've only got xterm, rxvt, thunderbird, xclock and xload. My wm is
> blackbox which is using 5 megs.
Mostly shared with the other apps; you did remember to divide each page
by the number of users?
> Also, just curious but why would memory-mapped I/O be counted
> in the memory usage anyway? Shouldn't there be a separate number
> for framebuffer memory and the like?
Actually there is probably not enough information in /proc to do the
maths on it. The kernel itself has a clear idea which vma's are not
backed by ram in the usual sense as they are marked VM_IO.
> > I've helped write tiny UI kits (take a look at nanogui for example) but
> > they don't have the flexibility of X.
>
> In my experience, most of the flexibility is not necessary for
> 97% of what I do, yet it evidently costs a lot in memory usage
> and speed.
So my X server is 1Mb larger because I can run networked apps and play
bzflag. Suits me as a tradeoff - I'm not saying it always is the right
decision - nanogui works well in restricted environments like video
recorders for example.
^ permalink raw reply [flat|nested] 99+ messages in thread
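[Editor's note: the divide-each-shared-page accounting Alan describes later grew a kernel-side answer. On kernels new enough to expose Pss in smaps (well after this thread), the proportional figure can be summed directly. A sketch:]

```shell
# Sum a process's proportional set size (Pss): each shared page is
# divided by the number of processes mapping it, which is exactly the
# accounting suggested above. Assumes /proc/PID/smaps reports Pss.
pid=$$
awk '/^Pss:/ { kb += $2 } END { print kb " kB (proportional)" }' \
    "/proc/$pid/smaps"
```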
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 20:15 ` Theodore Ts'o
2004-10-31 20:21 ` Jeff Garzik
2004-10-31 21:06 ` Jan Engelhardt
@ 2004-11-01 11:27 ` Alan Cox
2004-11-01 13:40 ` Denis Vlasenko
2 siblings, 1 reply; 99+ messages in thread
From: Alan Cox @ 2004-11-01 11:27 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Jeff Garzik, Linux Kernel Mailing List
On Sul, 2004-10-31 at 20:15, Theodore Ts'o wrote:
> .... if you don't mind bloating your application:
>
> % ls -l /usr/lib/libxml2.a
> 4224 -rw-r--r-- 1 root root 4312536 Oct 19 21:55 /usr/lib/libxml2.a
Except that
1. The file size has nothing to do with the loaded size, as the file is full of
symbols and maybe debug info
2. Most of the pages of libxml2.so don't get paged in by a typical
application
3. If you have existing apps using it then its cost to you is nearly
zero because it's already loaded.
libxml2 is a very complete validating all singing all dancing XML
parser. There are small non-validating parsers without every conceivable
glue interface that come down to about 10K.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 21:13 ` Jan Engelhardt
2004-10-31 21:48 ` Z Smith
@ 2004-11-01 11:29 ` Alan Cox
2004-11-01 12:36 ` Jan Engelhardt
1 sibling, 1 reply; 99+ messages in thread
From: Alan Cox @ 2004-11-01 11:29 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Z Smith, Linux Kernel Mailing List
On Sul, 2004-10-31 at 21:13, Jan Engelhardt wrote:
> Whatever you do, 3D at the software level is slow, even with a fast comp.
> See MESA.
If you are willing to lose a few bits of OpenGL you can do 3D pretty
fast in software for gaming. Take a look at stuff like TinyGL
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-11-01 11:29 ` Alan Cox
@ 2004-11-01 12:36 ` Jan Engelhardt
0 siblings, 0 replies; 99+ messages in thread
From: Jan Engelhardt @ 2004-11-01 12:36 UTC (permalink / raw)
To: Alan Cox; +Cc: Z Smith, Linux Kernel Mailing List
>> Whatever you do, 3D at the software level is slow, even with a fast comp.
>> See MESA.
>
>If you are willing to lose a few bits of OpenGL you can do 3D pretty
>fast in software for gaming. Take a look at stuff like TinyGL
Ok, you're right. But to be honest, it does not need to be GL. Just look at
UnrealTournament (runs fine on a PII W98 w/233MHz, in software mode!)
Jan Engelhardt
--
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, www.gwdg.de
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-11-01 11:27 ` Alan Cox
@ 2004-11-01 13:40 ` Denis Vlasenko
2004-11-01 23:04 ` Alan Cox
0 siblings, 1 reply; 99+ messages in thread
From: Denis Vlasenko @ 2004-11-01 13:40 UTC (permalink / raw)
To: Alan Cox, Theodore Ts'o; +Cc: Jeff Garzik, Linux Kernel Mailing List
On Monday 01 November 2004 13:27, Alan Cox wrote:
> 2. Most of the pages of libxml2.so don't get paged in by a typical
> application
This assumes that the 'needed' functions are close together.
That can easily not be the case, so you end up using only
a fraction of each fetched page's content.
Also this argument tends to defend library growth.
"It's mostly unused, don't worry." What if that
is not true? How do you compare the RAM footprint
of the new versus the old lib in that case?
Just believe that it didn't get worse?
This can't be checked easily:
even a -static compile can fail to help.
glibc produces a nearly 400kb executable for
int main() { return 0; }
because the init code uses printf on error paths and
that pulls i18n in. How many kilobytes of it really
run - who knows...
--
vda
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 20:53 ` Z Smith
2004-10-31 23:35 ` Rogério Brito
@ 2004-11-01 14:48 ` Diego Calleja
2004-11-01 15:09 ` [OT] " Russell Miller
1 sibling, 1 reply; 99+ messages in thread
From: Diego Calleja @ 2004-11-01 14:48 UTC (permalink / raw)
To: Z Smith; +Cc: linux-kernel
On Sun, 31 Oct 2004 12:53:21 -0800 Z Smith <plinius@comcast.net> wrote:
> But not everyone can tolerate today's level of bloat.
Sadly it's true, but on the other hand I haven't seen anything like gnome/kde
that doesn't eat lots of resources (mac os x and XP are not better, beos was
better they say), which makes me think that building a desktop environment
without eating lots of resources is not easy. Well, and your project is also
bloat in some ways... it's small and all that, but putting a graphics system
inside the kernel is one of the best definitions of "bloat" you can find...
^ permalink raw reply [flat|nested] 99+ messages in thread
* [OT] Re: code bloat [was Re: Semaphore assembly-code bug]
2004-11-01 14:48 ` Diego Calleja
@ 2004-11-01 15:09 ` Russell Miller
0 siblings, 0 replies; 99+ messages in thread
From: Russell Miller @ 2004-11-01 15:09 UTC (permalink / raw)
To: linux-kernel
On Monday 01 November 2004 08:48, Diego Calleja wrote:
> Sadly it's true, but in the other hand I haven't seen something like
> gnome/kde which don't eats lots of resources (mac os x and XP are not
> better, beos was better they say)
Part of the problem with KDE is the QT library underneath it all. QT 4 is
supposed to be leaner and faster. The KDE folks seem to be trying pretty
hard to reduce bloat whenever possible. But when you have software that's
expected to have the kitchen sink, it's especially challenging to reduce the
footprint while keeping all of the functionality.
I use openbox on my laptop. It's nothing near KDE in terms of functionality,
but it also runs reasonably snappy on a Pentium 266, so I can't complain too
much.
So far I'm pretty glad that the linux kernel developers have resisted putting
graphics calls and routines into the kernel. It slows things down a bit, but
I'd like to think you guys have learned from MS's mistakes. IMO one of the
biggest mistakes they ever made was to pollute the NT kernel with the
graphics subsystem. That said, FBUI looks like an interesting add-on
project.
Enough of my off topic ranting...
--Russell
--
Russell Miller - rmiller@duskglow.com - Le Mars, IA
Duskglow Consulting - Helping companies just like you to succeed for ~ 10 yrs.
http://www.duskglow.com - 712-546-5886
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-10-31 6:49 ` Jan Engelhardt
2004-10-31 21:09 ` Z Smith
@ 2004-11-01 15:17 ` Lee Revell
2004-11-01 16:56 ` Kristian Høgsberg
1 sibling, 1 reply; 99+ messages in thread
From: Lee Revell @ 2004-11-01 15:17 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: linux-kernel, xorg
On Sun, 2004-10-31 at 07:49 +0100, Jan Engelhardt wrote:
> Z Smith wrote:
> >Or join me in my effort to limit bloat. Why use an X server
> >that uses 15-30 megs of RAM when you can use FBUI which is 25 kilobytes
> >of code with very minimal kmallocing?
>
> FBUI does not have 3d acceleration?
Um, I don't think chucking X is the answer. The problem is that it's
embarrassingly slow compared to any modern GUI. If the display were as
snappy as WinXP's I wouldn't care if it used 200MB. On my desktop I constantly
see windows redrawing every freaking widget in situations where XP would
just blit from an offscreen buffer or something.
Anyway please keep replies off LKML and on the Xorg list...
Lee
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-11-01 15:17 ` Lee Revell
@ 2004-11-01 16:56 ` Kristian Høgsberg
0 siblings, 0 replies; 99+ messages in thread
From: Kristian Høgsberg @ 2004-11-01 16:56 UTC (permalink / raw)
To: Discuss issues related to the xorg tree; +Cc: Jan Engelhardt, linux-kernel
Lee Revell wrote:
> On Sun, 2004-10-31 at 07:49 +0100, Jan Engelhardt wrote:
>
>>Z Smith wrote:
>>
>>>Or join me in my effort to limit bloat. Why use an X server
>>>that uses 15-30 megs of RAM when you can use FBUI which is 25 kilobytes
>>>of code with very minimal kmallocing?
>>
>>FBUI does not have 3d acceleration?
>
>
> Um, I don't think chucking X is the answer. The problem is that it's
> embarrassingly slow compared to any modern GUI. If the display were as
> snappy as WinXP I wouldn't care if it's 200MB. On my desktop I constantly
> see windows redrawing every freaking widget in situations where XP would
> just blit from an offscreen buffer or something.
>
> Anyway please keep replies off LKML and on the Xorg list...
Actually, please keep replies off the Xorg list as well.
Thanks,
Kristian
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: code bloat [was Re: Semaphore assembly-code bug]
2004-11-01 13:40 ` Denis Vlasenko
@ 2004-11-01 23:04 ` Alan Cox
0 siblings, 0 replies; 99+ messages in thread
From: Alan Cox @ 2004-11-01 23:04 UTC (permalink / raw)
To: Denis Vlasenko; +Cc: Theodore Ts'o, Jeff Garzik, Linux Kernel Mailing List
On Llu, 2004-11-01 at 13:40, Denis Vlasenko wrote:
> This assumes that 'needed' functions are close together.
> This can be easily not the case, so you end up using only
> a fraction of fetched page's content.
And gprof will help you sort that out, along with -ffunction-sections
you can do pretty fine grained tidying
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 22:16 ` linux-os
2004-11-01 22:26 ` Linus Torvalds
2004-11-03 1:52 ` Horst von Brand
@ 2004-11-03 21:24 ` Bill Davidsen
2 siblings, 0 replies; 99+ messages in thread
From: Bill Davidsen @ 2004-11-03 21:24 UTC (permalink / raw)
To: linux-os
Cc: Linus Torvalds, dean gaudet, Andreas Steinmetz,
Kernel Mailing List, Richard Henderson, Andi Kleen,
Andrew Morton, Jan Hubicka
linux-os wrote:
> You just don't get it. I, too, can make a so-called bench-mark
> that will "prove" something that's so incredibly invalid that
> it shouldn't even deserve an answer. However, because you
> are supposed to know what you are doing, I will give you
> an answer.
>
> It is totally impossible to perform useful work with memory,
> i.e., popping the value of something from memory into a register,
> without incurring the cost of that memory access. It doesn't
> matter if the memory is in cache or if it needs to be read
> using the memory controller. Time is time and it never runs
> backwards. I spend most of my days with hardware logic analyzers
> looking at the memory accesses so I damn-well know what I
> am talking about. That memory-access takes a time-slot that
> something else can't use. You never get it back. It is gone
> forever. This is very important to understand. If you don't
> understand this, you can fall into the "black-magic" trap.
>
> Modern CPUs make it easy for so-called software engineers to
> perceive so-called facts that are not, in fact, true. Because
> it is possible for the CPU to perform memory-access independent
> of instruction sequence (so-called parallel operations), it is
> possible to make bench-marks that prove nothing, but seem to
> show that a read from memory is free. It can never be free. It
> will eventually show up. It was just deferred. Of course, if
> your computer is just going to run that single bench-mark, then
> return to a prompt, you can readily become the victim of a very
> common error because there is now plenty of time available to
> just spin (or wait for an interrupt).
>
> So, if you really want to make things fast, you keep your
> memory accesses to the absolute minimum. Popping something
> from the stack is the antithesis of what you want to do.
>
> It's really amusing. Software development has devolved
> into some black magic where logic, mathematics, and
> physical testing no longer thrive.
>
> Instead, we must listen to those who profess to know
> about this magic because of some innate enlightenment
> imparted to those favored few who are able to perceive
> the trueness of their intellectual perception without
> regard for contrary physical observations.
>
> It's wonderful to not be bothered by tests, measurements,
> documentation, or other facts.
>
> Wake up and don't be dragged into the black-magic trap.
The election is over, we can adopt a civil non-confrontational tone
again... Linus is not always right, but like most people he responds
better to "let me give you additional information" than "I know more
than you, take my word for it."
In this case, I think Dick does have a point on memory to cache use. It
appears from what little stuff I have here that with HT cache access is
serialized, and that memory access, even to L1 cache, might under some
circumstances be delayed. I won't guess if you would ever see that in
practice.
Getting information out of noisy measurements is not easy, and while
Dick is probably right that the lowest time is the "real" time, if the
average is lower doing something else, isn't that what we want?
My response test reports low, high, average, median, and 90th percentile
values, and depending on whether you want the best average, best
typical, or worst case avoidance you might find any of them useful. Oh,
and S.D. of the data to hint on how much you trust the results. I don't
think any of the test programs produce the definitive result, and I see
that results change depending on the CPU.
I think there are a lot of things more deserving this level of
consideration.
--
-bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 22:16 ` linux-os
2004-11-01 22:26 ` Linus Torvalds
@ 2004-11-03 1:52 ` Horst von Brand
2004-11-03 21:24 ` Bill Davidsen
2 siblings, 0 replies; 99+ messages in thread
From: Horst von Brand @ 2004-11-03 1:52 UTC (permalink / raw)
To: linux-os
Cc: Linus Torvalds, dean gaudet, Andreas Steinmetz,
Kernel Mailing List, Richard Henderson, Andi Kleen,
Andrew Morton, Jan Hubicka
linux-os <linux-os@chaos.analogic.com> said:
[...]
> Instead, we must listen to those who profess to know
> about this magic because of some innate enlightenment
> imparted to those favored few who are able to perceive
> the trueness of their intellectual perception without
> regard for contrary physical observations.
Right. Just go and tell that to somebody who actually designed one of the
competing CPUs' innards. And who probably learnt nothing whatsoever from the
ones it was mimicking in the process.
> It's wonderful to not be bothered by tests, measurements,
> documentation, or other facts.
How true.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-02 16:06 ` Linus Torvalds
@ 2004-11-02 16:51 ` linux-os
0 siblings, 0 replies; 99+ messages in thread
From: linux-os @ 2004-11-02 16:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
On Tue, 2 Nov 2004, Linus Torvalds wrote:
>
>
> On Tue, 2 Nov 2004, Linus Torvalds wrote:
>>
>> Just change the incorrect "3" in <asm-i386/linkage.h> (or whatever, this
>> is from memory) back to a "0"
>
> .. or just use the current -bk snapshot, actually. I may not have x86 as
> my main desktop, but it's not like I had a really hard time finding one
> (like the laptop laying there right on top of the desk ;), so the fixed
> version got checked in already.
>
> Linus
Okay. I got linux-2.6.9 back up.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-02 16:02 ` Linus Torvalds
@ 2004-11-02 16:06 ` Linus Torvalds
2004-11-02 16:51 ` linux-os
0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-11-02 16:06 UTC (permalink / raw)
To: linux-os
Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
On Tue, 2 Nov 2004, Linus Torvalds wrote:
>
> Just change the incorrect "3" in <asm-i386/linkage.h> (or whatever, this
> is from memory) back to a "0"
.. or just use the current -bk snapshot, actually. I may not have x86 as
my main desktop, but it's not like I had a really hard time finding one
(like the laptop laying there right on top of the desk ;), so the fixed
version got checked in already.
Linus
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-02 15:02 ` linux-os
@ 2004-11-02 16:02 ` Linus Torvalds
2004-11-02 16:06 ` Linus Torvalds
0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-11-02 16:02 UTC (permalink / raw)
To: linux-os
Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
On Tue, 2 Nov 2004, linux-os wrote:
>
> The patch you provided patched without any rejects. However,
> the system won't boot.
Yes, there was an incorrect change to the "asmlinkage" definition that I
had played with before deciding to make just the semaphores be reg-arg,
and that change made it into my original patch by mistake. I sent out a
second message asking people to remove that part of the patch some time
later, but..
> I patched Linux-2.6.9. Could you please review your patch?
> I will await the possibility of a simple typo that I can
> fix by hand before reverting.
Just change the incorrect "3" in <asm-i386/linkage.h> (or whatever, this
is from memory):
#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(3)))
back to a "0". Asmlinkage still uses stack-based parameter passing, which
I'd love to fix eventually (we've had bugs in that area too), but it is
just too much pain to do right now.
Linus
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 21:46 ` Linus Torvalds
@ 2004-11-02 15:02 ` linux-os
2004-11-02 16:02 ` Linus Torvalds
0 siblings, 1 reply; 99+ messages in thread
From: linux-os @ 2004-11-02 15:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
Linus,
The patch you provided patched without any rejects. However,
the system won't boot. It will not even get to
"Uncompressing Linux". After the GRUB loader sign-on,
the interrupts just remain disabled (no caps-lock or num-lock
change on the keyboard).
I patched Linux-2.6.9. Could you please review your patch?
I will await the possibility of a simple typo that I can
fix by hand before reverting.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.8 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 21:40 ` Linus Torvalds
2004-11-01 21:46 ` Linus Torvalds
2004-11-01 22:16 ` linux-os
@ 2004-11-02 6:37 ` Chris Friesen
2 siblings, 0 replies; 99+ messages in thread
From: Chris Friesen @ 2004-11-02 6:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-os, dean gaudet, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
Linus Torvalds wrote:
> On Intel, if I recall correctly, rdtsc is totally serializing, so you're
> testing not just the instructions between the rdtsc's, but the length of
> the pipeline, and the time it takes for stuff around it to calm down.
Actually, the Intel docs say that rdtsc is not serializing (specifically for the
P6 series, but linked off the P4 section of the site) and their sample
performance measuring code for the P4 shows it using a serializing instruction
before the call to rdtsc.
Chris
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 23:14 ` linux-os
@ 2004-11-01 23:42 ` Linus Torvalds
0 siblings, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-11-01 23:42 UTC (permalink / raw)
To: linux-os
Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
On Mon, 1 Nov 2004, linux-os wrote:
>
> No. You've just shown that you like to argue. I recall that you
> recently, like within the past 24 hours, supplied a patch that
> got rid of the time-consuming stack operations in your semaphore
> code. Remember, you changed it to pass parameters in registers.
... because that fixed a _bug_.
> Why would you bother if stack operations are free?
I didn't say that instructions are free. I just tried (unsuccessfully) to
tell you that "lea" is not free either, and that "lea" has some serious
problems on several setups, ranging from old cpu's (AGI stalls) to new
CPU's (stack engine stalls). And that "pop" is often faster.
And you have been arguing against it despite the fact that I ended up
writing a small test-program to show that it's true. It's a _stupid_
test-program, but the fact is, you only need a single test-case to prove
some theory wrong.
Your theory that "lea" is somehow always cheaper than "pop" is wrong.
> It's not a total focus. It's just necessary emphasis. Any work
> done by your computer, ultimately comes from and goes to memory.
Not so.
A lot of work is done in cache. Any access that doesn't change the state
of the cache is a no-op as far as the memory bus is concerned. Ie a store
to a cacheline that is already dirty is just a cache access, as is a load
from a cacheline that is already loaded.
This is especially true on x86 CPU's, where the lack of registers means
that the core has been highly optimized for doing cached operations.
Remember: a CPU is not some kind of abstract entity - it's a very
practical piece of engineering that has been highly optimized for certain
usage patterns.
And the fact is, "lea" on %esp is not a common usage pattern. Which is
why, in practice, you will find CPU's that end up not optimizing for it.
While "pop"+"pop" is a _very_ common pattern, and why existing CPU's
do them efficiently.
Linus
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 22:26 ` Linus Torvalds
@ 2004-11-01 23:14 ` linux-os
2004-11-01 23:42 ` Linus Torvalds
0 siblings, 1 reply; 99+ messages in thread
From: linux-os @ 2004-11-01 23:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
On Mon, 1 Nov 2004, Linus Torvalds wrote:
>
>
> On Mon, 1 Nov 2004, linux-os wrote:
>>
>> You just don't get it. I, too, can make a so-called bench-mark
>> that will "prove" something that's so incredibly invalid that
>> it shouldn't even deserve an answer.
>
> *Plonk*
>
> You've just shown that not only do you ignore well-educated people who
> tell you why pipelines can have trouble with "lea", you also ignore hard
> numbers.
>
No. You've just shown that you like to argue. I recall that you
recently, like within the past 24 hours, supplied a patch that
got rid of the time-consuming stack operations in your semaphore
code. Remember, you changed it to pass parameters in registers.
Why would you bother if stack operations are free? The fact is
that you know that even a single extra memory access (i.e., a
stack operation) is costly. You just don't want to admit that
(remember the original premise of this discussion) popping
into an unused register to level the stack is NOT better than
adding to the stack-pointer or, as another learned engineer
advised, using LEA instead.
I simply wrote some code that showed that popping registers used
more CPU cycles than adding to the stack-pointer, and using
LEA instead of the ADD showed no difference. Of course I
was immediately overwhelmed by responses that the benchmark
was invalid, presumably because it wasn't written by somebody
else.
> Your total focus on a cached memory access as being somehow more expensive
> than anything else going in the CPU pipeline is sad.
>
It's not a total focus. It's just necessary emphasis. Any work
done by your computer, ultimately comes from and goes to memory.
Some is memory-mapped hardware "memory" some is simply RAM.
Managing those memory accesses is very important when it comes
to maximizing the work that your computer can do in a limited
period of time. Wasting memory-access time is something one
should not do if at all possible.
> But hey, I've run out of ways to show you wrong. If you believe the world
> is flat, that's your problem.
>
> Linus
>
No, the world is crooked, not flat.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 22:16 ` linux-os
@ 2004-11-01 22:26 ` Linus Torvalds
2004-11-01 23:14 ` linux-os
2004-11-03 1:52 ` Horst von Brand
2004-11-03 21:24 ` Bill Davidsen
2 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-11-01 22:26 UTC (permalink / raw)
To: linux-os
Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
On Mon, 1 Nov 2004, linux-os wrote:
>
> You just don't get it. I, too, can make a so-called bench-mark
> that will "prove" something that's so incredibly invalid that
> it shouldn't even deserve an answer.
*Plonk*
You've just shown that not only do you ignore well-educated people who
tell you why pipelines can have trouble with "lea", you also ignore hard
numbers.
Your total focus on a cached memory access as being somehow more expensive
than anything else going in the CPU pipeline is sad.
But hey, I've run out of ways to show you wrong. If you believe the world
is flat, that's your problem.
Linus
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 21:23 ` dean gaudet
@ 2004-11-01 22:22 ` linux-os
0 siblings, 0 replies; 99+ messages in thread
From: linux-os @ 2004-11-01 22:22 UTC (permalink / raw)
To: dean gaudet
Cc: Linus Torvalds, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
On Mon, 1 Nov 2004, dean gaudet wrote:
> On Mon, 1 Nov 2004, linux-os wrote:
>
>> On Mon, 1 Nov 2004, dean gaudet wrote:
>>
>>> On Sun, 31 Oct 2004, linux-os wrote:
>>>
>>>> Timer overhead = 88 CPU clocks
>>>> push 3, pop 3 = 12 CPU clocks
>>>> push 3, pop 2 = 12 CPU clocks
>>>> push 3, pop 1 = 12 CPU clocks
>>>> push 3, pop none using ADD = 8 CPU clocks
>>>> push 3, pop none using LEA = 8 CPU clocks
>>>> push 3, pop into same register = 12 CPU clocks
>>>
>>> your microbenchmark makes assumptions about rdtsc which haven't been valid
>>> since the days of the 486. rdtsc has serializing aspects and overhead that
>>> you can't just eliminate by running it in a tight loop and subtracting out
>>> that "overhead".
>>>
>>
>> Wrong.
>
> if you were correct then i should be able to measure 1 cycle differences
> in sequences such as the following:
[SNIPPED...]
Who said? The resolution isn't even specified. Experimental
results with several different processors seem to show that
the resolution is about 4 cycles.
Script started on Mon 01 Nov 2004 04:48:04 PM EST
# ./tester
Timer overhead = 88 CPU clocks
1 nop = 4 CPU clocks
2 nops = 4 CPU clocks
3 nops = 4 CPU clocks
4 nops = 8 CPU clocks
5 nops = 8 CPU clocks
6 nops = 8 CPU clocks
7 nops = 8 CPU clocks
8 nops = 12 CPU clocks
# exit
Script done on Mon 01 Nov 2004 04:48:34 PM EST
Assembly :
nop8: nop
nop7: nop
nop6: nop
nop5: nop
nop4: nop
nop3: nop
nop2: nop
nop1: nop
ret
.global nop1
.global nop2
.global nop3
.global nop4
.global nop5
.global nop6
.global nop7
.global nop8
Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 21:40 ` Linus Torvalds
2004-11-01 21:46 ` Linus Torvalds
@ 2004-11-01 22:16 ` linux-os
2004-11-01 22:26 ` Linus Torvalds
` (2 more replies)
2004-11-02 6:37 ` Chris Friesen
2 siblings, 3 replies; 99+ messages in thread
From: linux-os @ 2004-11-01 22:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
On Mon, 1 Nov 2004, Linus Torvalds wrote:
>
>
> On Mon, 1 Nov 2004, linux-os wrote:
>>
>> Wrong.
>>
>> (1) The '486 didn't have the rdtsc instruction.
>> (2) There are no 'serializing' or other black-magic aspects of
>> using the internal cycle-counter. That's exactly how you
>> can benchmark the execution time of accessible code sequences.
>
> Sorry, but you shouldn't argue with people who know more than you do. I
> know Dean, and he analyzes things for work, and does know what he is
> doing.
>
> "rdtsc" _does_ partly serialize things, and it's not even architecturally
> defined, so you'll find that it serializes things in different ways on
> different CPU's. You can't just do
>
> rdtsc
> ...
> rdtsc
>
> and expect the stuff in between the rdtsc's to be timed exactly: some of
> it will overlap with the rdtsc's, some of it won't.
>
> On Intel, if I recall correctly, rdtsc is totally serializing, so you're
> testing not just the instructions between the rdtsc's, but the length of
> the pipeline, and the time it takes for stuff around it to calm down.
> Which is why two rdtsc's in sequence will show quite a lot of overhead on
> a P4 (something like 80 cycles).
>
> So you really want to do more operations in between the rdtsc's.
>
> Try the appended program. On a P4, the two sequences are the same for me
> (92 cycles, 80 cycles overhead), while on a Pentium M, the sequence of two
> popl's (57 cycles) is faster than the sequence of "lea+popl" (59 cycles)
> and the overhead is 47 cycles.
>
> So can you _please_ just admit that you were wrong? On a P4, the pop/pop
> is the same cost as lea/pop, and on a Pentium M the pop/pop is faster,
> according to this test. Your contention that "pop" has to be slower than
> "lea" is WRONG.
>
> Linus
>
> ----
> #define PUSHEBX "pushl %%ebx\n\t"
> #define PUSHECX "pushl %%ecx\n\t"
> #define POPECX "popl %%ecx\n\t"
> #define POPEBX "popl %%ebx\n\t"
>
> #ifdef TEST_LEA
>
> #undef POPECX
> #define POPECX "leal 4(%%esp),%%esp\n\t"
>
> #endif
>
> #ifdef TEST_OVERHEAD
>
> #undef PUSHEBX
> #undef PUSHECX
> #undef POPEBX
> #undef POPECX
>
> #define PUSHEBX
> #define PUSHECX
> #define POPEBX
> #define POPECX
>
> #endif
>
> int main(void)
> {
> unsigned long start;
> unsigned long long end;
>
> asm volatile(
> PUSHEBX
> PUSHECX
> PUSHEBX
> PUSHECX
> PUSHEBX
> PUSHECX
> PUSHEBX
> PUSHECX
> PUSHEBX
> PUSHECX
> PUSHEBX
> PUSHECX
> PUSHEBX
> PUSHECX
> PUSHEBX
> PUSHECX
> "rdtsc\n\t"
> POPECX
> POPEBX
> POPECX
> POPEBX
> POPECX
> POPEBX
> POPECX
> POPEBX
> POPECX
> POPEBX
> POPECX
> POPEBX
> POPECX
> POPEBX
> POPECX
> POPEBX
> "movl %%eax,%%esi\n\t"
> "rdtsc"
> :"=A" (end), "=S" (start));
> printf("%ld cycles\n", (long) end-start);
> }
>
You just don't get it. I, too, can make a so-called bench-mark
that will "prove" something that's so incredibly invalid that
it shouldn't even deserve an answer. However, because you
are supposed to know what you are doing, I will give you
an answer.
It is totally impossible to perform useful work with memory,
i.e., popping the value of something from memory into a register,
without incurring the cost of that memory access. It doesn't
matter if the memory is in cache or if it needs to be read
using the memory controller. Time is time and it never runs
backwards. I spend most of my days with hardware logic analyzers
looking at the memory accesses so I damn-well know what I
am talking about. That memory-access takes a time-slot that
something else can't use. You never get it back. It is gone
forever. This is very important to understand. If you don't
understand this, you can fall into the "black-magic" trap.
Modern CPUs make it easy for so-called software engineers to
perceive so-called facts that are not, in fact, true. Because
it is possible for the CPU to perform memory-access independent
of instruction sequence (so-called parallel operations), it is
possible to make bench-marks that prove nothing, but seem to
show that a read from memory is free. It can never be free. It
will eventually show up. It was just deferred. Of course, if
your computer is just going to run that single bench-mark, then
return to a prompt, you can readily become the victim of a very
common error because there is now plenty of time available to
just spin (or wait for an interrupt).
So, if you really want to make things fast, you keep your
memory accesses to the absolute minimum. Popping something
from the stack is the antithesis of what you want to do.
It's really amusing. Software development has devolved
into some black magic where logic, mathematics, and
physical testing no longer thrive.
Instead, we must listen to those who profess to know
about this magic because of some innate enlightenment
imparted to those favored few who are able to perceive
the trueness of their intellectual perception without
regard for contrary physical observations.
It's wonderful to not be bothered by tests, measurements,
documentation, or other facts.
Wake up and don't be dragged into the black-magic trap.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 21:40 ` Linus Torvalds
@ 2004-11-01 21:46 ` Linus Torvalds
2004-11-02 15:02 ` linux-os
2004-11-01 22:16 ` linux-os
2004-11-02 6:37 ` Chris Friesen
2 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-11-01 21:46 UTC (permalink / raw)
To: linux-os
Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
On Mon, 1 Nov 2004, Linus Torvalds wrote:
>
> So can you _please_ just admit that you were wrong? On a P4, the pop/pop
> is the same cost as lea/pop, and on a Pentium M the pop/pop is faster,
> according to this test. Your contention that "pop" has to be slower than
> "lea" is WRONG.
Btw, I'd like to emphasize "this test". Modern OoO CPU's are complex
animals. They have pipeline quirks etc that just means that things depend
on alignment, on code around it, and on register usage patterns of the
instructions that you test _and_ the instructions around those
instructions. So take any proof with a pinch of salt, because there are
bound to be other circumstances where factors around the code just change
the assumptions.
In short, any time you're looking at single cycle timings, you should be
very aware of the fact that your measurements are suspect. The best way to
avoid most of the problem is to never try to measure single cycles.
Measure performance on a program, not on a single instruction.
Linus
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 20:52 ` linux-os
2004-11-01 21:23 ` dean gaudet
@ 2004-11-01 21:40 ` Linus Torvalds
2004-11-01 21:46 ` Linus Torvalds
` (2 more replies)
1 sibling, 3 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-11-01 21:40 UTC (permalink / raw)
To: linux-os
Cc: dean gaudet, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
On Mon, 1 Nov 2004, linux-os wrote:
>
> Wrong.
>
> (1) The '486 didn't have the rdtsc instruction.
> (2) There are no 'serializing' or other black-magic aspects of
using the internal cycle-counter. That's exactly how you
> can benchmark the execution time of accessible code sequences.
Sorry, but you shouldn't argue with people who know more than you do. I
know Dean, and he analyzes things for work, and does know what he is
doing.
"rdtsc" _does_ partly serialize things, and it's not even architecturally
defined, so you'll find that it serializes things in different ways on
different CPU's. You can't just do
rdtsc
...
rdtsc
and expect the stuff in between the rdtsc's to be timed exactly: some of
it will overlap with the rdtsc's, some of it won't.
On Intel, if I recall correctly, rdtsc is totally serializing, so you're
testing not just the instructions between the rdtsc's, but the length of
the pipeline, and the time it takes for stuff around it to calm down.
Which is why two rdtsc's in sequence will show quite a lot of overhead on
a P4 (something like 80 cycles).
So you really want to do more operations in between the rdtsc's.
Try the appended program. On a P4, the two sequences are the same for me
(92 cycles, 80 cycles overhead), while on a Pentium M, the sequence of two
popl's (57 cycles) is faster than the sequence of "lea+popl" (59 cycles)
and the overhead is 47 cycles.
So can you _please_ just admit that you were wrong? On a P4, the pop/pop
is the same cost as lea/pop, and on a Pentium M the pop/pop is faster,
according to this test. Your contention that "pop" has to be slower than
"lea" is WRONG.
Linus
----
#include <stdio.h>

#define PUSHEBX "pushl %%ebx\n\t"
#define PUSHECX "pushl %%ecx\n\t"
#define POPECX "popl %%ecx\n\t"
#define POPEBX "popl %%ebx\n\t"
#ifdef TEST_LEA
#undef POPECX
#define POPECX "leal 4(%%esp),%%esp\n\t"
#endif
#ifdef TEST_OVERHEAD
#undef PUSHEBX
#undef PUSHECX
#undef POPEBX
#undef POPECX
#define PUSHEBX
#define PUSHECX
#define POPEBX
#define POPECX
#endif
int main(void)
{
unsigned long start;
unsigned long long end;
asm volatile(
PUSHEBX
PUSHECX
PUSHEBX
PUSHECX
PUSHEBX
PUSHECX
PUSHEBX
PUSHECX
PUSHEBX
PUSHECX
PUSHEBX
PUSHECX
PUSHEBX
PUSHECX
PUSHEBX
PUSHECX
"rdtsc\n\t"
POPECX
POPEBX
POPECX
POPEBX
POPECX
POPEBX
POPECX
POPEBX
POPECX
POPEBX
POPECX
POPEBX
POPECX
POPEBX
POPECX
POPEBX
"movl %%eax,%%esi\n\t"
"rdtsc"
:"=A" (end), "=S" (start));
printf("%ld cycles\n", (long) (end - start));
return 0;
}
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 20:52 ` linux-os
@ 2004-11-01 21:23 ` dean gaudet
2004-11-01 22:22 ` linux-os
2004-11-01 21:40 ` Linus Torvalds
1 sibling, 1 reply; 99+ messages in thread
From: dean gaudet @ 2004-11-01 21:23 UTC (permalink / raw)
To: linux-os
Cc: Linus Torvalds, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
On Mon, 1 Nov 2004, linux-os wrote:
> On Mon, 1 Nov 2004, dean gaudet wrote:
>
> > On Sun, 31 Oct 2004, linux-os wrote:
> >
> > > Timer overhead = 88 CPU clocks
> > > push 3, pop 3 = 12 CPU clocks
> > > push 3, pop 2 = 12 CPU clocks
> > > push 3, pop 1 = 12 CPU clocks
> > > push 3, pop none using ADD = 8 CPU clocks
> > > push 3, pop none using LEA = 8 CPU clocks
> > > push 3, pop into same register = 12 CPU clocks
> >
> > your microbenchmark makes assumptions about rdtsc which haven't been valid
> > since the days of the 486. rdtsc has serializing aspects and overhead that
> > you can't just eliminate by running it in a tight loop and subtracting out
> > that "overhead".
> >
>
> Wrong.
if you were correct then i should be able to measure 1 cycle differences
in sequences such as the following:
rdtsc
mov %eax,%edi
shr $1,%ecx
rdtsc
rdtsc
mov %eax,%edi
shr $1,%ecx
shr $1,%ecx
rdtsc
...
rdtsc
mov %eax,%edi
shr $1,%ecx
shr $1,%ecx
shr $1,%ecx
shr $1,%ecx
shr $1,%ecx
shr $1,%ecx
shr $1,%ecx
shr $1,%ecx
rdtsc
yet the attached program demonstrates that such measurements are
inaccurate. the results should be a sequence of numbers increasing
by 1 each time.
p4 model 2: 80 80 84 84 84 84 84 84
p4 model 3: 120 120 120 120 120 120 120 128
p-m model 9: 47 46 47 48 49 50 56 57
k8: 5 5 5 5 5 5 5 5
-dean
% gcc -O -o rdtsc-rounding rdtsc-rounding.c
rdtsc-rounding.c:
#include <stdio.h>
#include <stdint.h>
#define template(n) \
static uint32_t foo##n(void) \
{ \
uint32_t start, done, trash1, trash2; \
\
__asm volatile( \
"\n rdtsc" \
"\n mov %%eax,%0" \
x##n("\n shr $1,%1") \
"\n rdtsc" \
: "=&r" (start), "=&r" (trash1), "=&a" (done), "=&d" (trash2) \
); \
return done - start; \
}
#define x1(x) x
#define x2(x) x x
#define x3(x) x x x
#define x4(x) x2(x) x2(x)
#define x5(x) x4(x) x
#define x6(x) x3(x2(x))
#define x7(x) x6(x) x
#define x8(x) x4(x2(x))
template(1)
template(2)
template(3)
template(4)
template(5)
template(6)
template(7)
template(8)
static uint32_t (*fn[9])(void) = {
0, foo1, foo2, foo3, foo4, foo5, foo6, foo7, foo8
};
static uint32_t bench(uint32_t (*f)(void))
{
uint32_t best;
unsigned i;
best = ~0;
for (i = 0; i < 100000; ++i) {
uint32_t cur = f();
if (cur < best) {
best = cur;
}
}
return best;
}
int main(int argc, char **argv)
{
unsigned i;
for (i = 1; i < sizeof(fn)/sizeof(fn[0]); ++i) {
printf("%u ", bench(fn[i]));
}
printf("\n");
return 0;
}
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 20:23 ` dean gaudet
@ 2004-11-01 20:52 ` linux-os
2004-11-01 21:23 ` dean gaudet
2004-11-01 21:40 ` Linus Torvalds
0 siblings, 2 replies; 99+ messages in thread
From: linux-os @ 2004-11-01 20:52 UTC (permalink / raw)
To: dean gaudet
Cc: Linus Torvalds, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
On Mon, 1 Nov 2004, dean gaudet wrote:
> On Sun, 31 Oct 2004, linux-os wrote:
>
>> Timer overhead = 88 CPU clocks
>> push 3, pop 3 = 12 CPU clocks
>> push 3, pop 2 = 12 CPU clocks
>> push 3, pop 1 = 12 CPU clocks
>> push 3, pop none using ADD = 8 CPU clocks
>> push 3, pop none using LEA = 8 CPU clocks
>> push 3, pop into same register = 12 CPU clocks
>
> your microbenchmark makes assumptions about rdtsc which haven't been valid
> since the days of the 486. rdtsc has serializing aspects and overhead that
> you can't just eliminate by running it in a tight loop and subtracting out
> that "overhead".
>
Wrong.
(1) The '486 didn't have the rdtsc instruction.
(2) There are no 'serializing' or other black-magic aspects of
using the internal cycle-counter. That's exactly how you
can benchmark the execution time of accessible code sequences.
> you have to run your inner loops at least a few thousand of times between
> rdtsc invocations and divide it out to find out the average cost in order to
> eliminate the problems associated with rdtsc.
>
> -dean
>
You never average the cycle-time. The cycle-time is absolute.
You need to remove the effect of interrupts when you measure
performance, so you need to sample a few times and save the
lowest number. That's the number obtained during a testing interval
that was not interrupted.
The provided code allows you to experiment. You can set the
TRIES count to 1. You will find that the results are noisy if
you are connected to an active network. Good results can be
obtained with it set to 4 if your computer is not being blasted
with lots of broadcast packets from M$ servers.
>
Of course you are not really interested in learning anything
about this, are you?
Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 1:31 ` linux-os
2004-11-01 5:49 ` Linus Torvalds
@ 2004-11-01 20:23 ` dean gaudet
2004-11-01 20:52 ` linux-os
1 sibling, 1 reply; 99+ messages in thread
From: dean gaudet @ 2004-11-01 20:23 UTC (permalink / raw)
To: linux-os
Cc: Linus Torvalds, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
[-- Attachment #1: Type: TEXT/PLAIN, Size: 757 bytes --]
On Sun, 31 Oct 2004, linux-os wrote:
> Timer overhead = 88 CPU clocks
> push 3, pop 3 = 12 CPU clocks
> push 3, pop 2 = 12 CPU clocks
> push 3, pop 1 = 12 CPU clocks
> push 3, pop none using ADD = 8 CPU clocks
> push 3, pop none using LEA = 8 CPU clocks
> push 3, pop into same register = 12 CPU clocks
your microbenchmark makes assumptions about rdtsc which haven't been valid
since the days of the 486. rdtsc has serializing aspects and overhead
that you can't just eliminate by running it in a tight loop and
subtracting out that "overhead".
you have to run your inner loops at least a few thousand of times between
rdtsc invocations and divide it out to find out the average cost in order
to eliminate the problems associated with rdtsc.
-dean
[-- Attachment #2: Type: APPLICATION/X-GZIP, Size: 6806 bytes --]
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-11-01 1:31 ` linux-os
@ 2004-11-01 5:49 ` Linus Torvalds
2004-11-01 20:23 ` dean gaudet
1 sibling, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-11-01 5:49 UTC (permalink / raw)
To: linux-os
Cc: Andreas Steinmetz, Kernel Mailing List, Richard Henderson,
Andi Kleen, Andrew Morton, Jan Hubicka
On Sun, 31 Oct 2004, linux-os wrote:
>
> The attached file shows that the Intel Pentium 4 runs exactly as I
> described. Further, there is no difference in the CPU clocks used when
> adding a constant to the stack- pointer or using LEA.
Goodie. You found _one_ CPU that you think matters. One that doesn't even
have the hardware that I've described. And you ignore all the other
evidence.
Good for you.
Linus
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-10-29 19:12 ` Linus Torvalds
@ 2004-11-01 1:31 ` linux-os
2004-11-01 5:49 ` Linus Torvalds
2004-11-01 20:23 ` dean gaudet
0 siblings, 2 replies; 99+ messages in thread
From: linux-os @ 2004-11-01 1:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andreas Steinmetz, Kernel Mailing List, Richard Henderson,
Andi Kleen, Andrew Morton, Jan Hubicka
[-- Attachment #1: Type: TEXT/PLAIN, Size: 2280 bytes --]
On Fri, 29 Oct 2004, Linus Torvalds wrote:
>
>
> On Fri, 29 Oct 2004, linux-os wrote:
>>
>> Linus, there is no way in hell that you are going to move
>> a value from memory into a register (pop ecx) faster than
>> you are going to do anything to the stack-pointer or
>> any other register.
>
> Sorry, but you're wrong.
I am not wrong.
I don't understand anything about your theoretical CPU
with the magic stack engine. Anything I can get my
hands on functions exactly as I described and exactly
as would be expected. We work with real hardware here
and I have to test it as part of my job.
And, FYI, I spend all my working time trying to get the
last iota of performance out of ix86 CPUs. Since I can
only read publicly available documentation, I have
to test code in actual operation.
The attached file shows that the Intel Pentium 4 runs
exactly as I described. Further, there is no difference in
the CPU clocks used when adding a constant to the stack-
pointer or using LEA.
It also shows that popping stack-data into the same register
twice, as you suggested, takes the same time as using a
different register.
Timer overhead = 88 CPU clocks
push 3, pop 3 = 12 CPU clocks
push 3, pop 2 = 12 CPU clocks
push 3, pop 1 = 12 CPU clocks
push 3, pop none using ADD = 8 CPU clocks
push 3, pop none using LEA = 8 CPU clocks
push 3, pop into same register = 12 CPU clocks
The code uses a separate assembly-language file so that
the 'C' compiler can't optimize away what I am measuring.
It also saves and uses the lowest CPU-cycle count observed,
so the code doesn't have to execute with interrupts
OFF to get a stable reading.
>
> Learn about modern CPU's some day, and realize that cached accesses are
> fast, and pipeline stalls are relatively much more expensive.
>
That's what I do, and that's what I teach.
> Now, if it was uncached, you'd have a point.
>
> Also think about why
>
> call xxx
> jmp yy
>
> is often much faster than
>
> push $yy
> jmp xxx
>
> and other small interesting facts about how CPU's actually work these
> days.
>
> Linus
>
Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.
[-- Attachment #2: Type: APPLICATION/x-gzip, Size: 6806 bytes --]
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-10-29 18:42 ` Linus Torvalds
2004-10-29 18:54 ` Linus Torvalds
@ 2004-10-30 3:35 ` Jeff Garzik
1 sibling, 0 replies; 99+ messages in thread
From: Jeff Garzik @ 2004-10-30 3:35 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-os, Richard Henderson, Kernel Mailing List, Andi Kleen,
Andrew Morton, Jan Hubicka
Linus Torvalds wrote:
> Anyway, it's quite likely that for several CPU's the fastest sequence ends
> up actually being
>
> movl 4(%esp),%ecx
> movl 8(%esp),%edx
> movl 12(%esp),%eax
> addl $16,%esp
>
> which is also one of the biggest alternatives.
That's how I'm coding the sparse "compiler backend"... the mov's and
add's tend to be tiny instructions (i-cache friendly), and you can often
issue a bunch of them through multiple pipes/ports.
Jeff
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-10-29 23:50 ` dean gaudet
@ 2004-10-30 0:15 ` Linus Torvalds
0 siblings, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-30 0:15 UTC (permalink / raw)
To: dean gaudet
Cc: Andreas Steinmetz, linux-os, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, dean gaudet wrote:
>
> for p4 model 0 through 2 it was faster to avoid lea and shl and generate
> code like:
>
> add %ebx,%ebx
> add %ebx,%ebx
> add %ebx,%ebx
> add %ebx,%ebx
I think that is true only for the lea's that have a shifted input. The
weakness of the original P4 is its shifter, not lea itself. And for a
simple lea like 4(%esp), it's likely no worse than a regular "add", and
there lea has the advantage that you can put the result in another
register, which can be advantageous in other circumstances.
So lea actually _is_ useful for doing adds, in many cases. Of course, on
older CPU's you'll see the effect of the address generation adder being
one cycle "off" (earlier) the regular ALU execution unit, so lea often
causes AGI stalls. I don't think this is an issue on the P6 or P4 because
of how they actually end up implementing the lea in the regular ALU path.
How the hell did we get to worrying about this in the first place?
Linus
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-10-29 19:40 ` Andreas Steinmetz
2004-10-29 19:56 ` Linus Torvalds
@ 2004-10-29 23:50 ` dean gaudet
2004-10-30 0:15 ` Linus Torvalds
1 sibling, 1 reply; 99+ messages in thread
From: dean gaudet @ 2004-10-29 23:50 UTC (permalink / raw)
To: Andreas Steinmetz
Cc: Linus Torvalds, linux-os, Kernel Mailing List, Richard Henderson,
Andi Kleen, Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, Andreas Steinmetz wrote:
> > On Fri, 29 Oct 2004, Andreas Steinmetz wrote:
> > > Sample quote from said manual (P/N 248966-05):
> > > "Use the lea instruction and the full range of addressing modes to do
> > > address calculation"
...
> Some more data from said manual (lea is better on P3 and the same as add on
> P4):
you really need to understand intel optimisation guides. it helps to diff
them over time to see the types of things that go in and out of fashion.
> I don't know about P4 internals but let me make some guess:
> There's lot of software around that needs to run on older processors where lea
> has quite some performance advantage. Thus I would guess that the P4 design
> respects this by handling lea x(esp),esp efficiently.
your guess is generally wrong... try measuring it.
for p4 model 0 through 2 it was faster to avoid lea and shl and generate
code like:
add %ebx,%ebx
add %ebx,%ebx
add %ebx,%ebx
add %ebx,%ebx
which would complete in 2 cycles, compared to 4 cycles for lea or a shift.
but that crap doesn't apply to any other x86 (except efficeon which
notices this crud and converts it to its own optimal sequence).
p4 model 2 is probably way more common than p4 model 3 still.
you also need to be aware of k7/k8. AMD has their own optimisation guide
(i'm too lazy to find url/#). but the important point for lea and AMD is
that it is a 2 cycle latency operation, and add is 1 cycle.
but you know what? we can talk about what the optimization guides say
until we're blue... the only thing which matters is experience. go
measure it. (i've measured a bazillion things like this.)
use pop, don't use lea to modify esp.
-dean
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-10-29 18:06 ` Linus Torvalds
2004-10-29 18:39 ` linux-os
2004-10-29 18:58 ` Andreas Steinmetz
@ 2004-10-29 23:37 ` dean gaudet
2 siblings, 0 replies; 99+ messages in thread
From: dean gaudet @ 2004-10-29 23:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-os, Andreas Steinmetz, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, Linus Torvalds wrote:
> On Fri, 29 Oct 2004, linux-os wrote:
> > > with the following:
> > >
> > > leal 4(%esp),%esp
> >
> > Probably so because I'm pretty certain that the 'pop' (a memory
> > access) is not going to be faster than a simple register operation.
>
> Bzzt, wrong answer.
>
> It's not "simple register operation". It's really about the fact that
> modern CPU's are smarter - yet dumber - then you think. They do things
> like speculate the value of %esp in order to avoid having to calculate it,
> and it's entirely possible that "pop" is much faster, simply because I
> guarantee you that a CPU will speculate %esp correctly across a "pop", but
> the same is not necessarily true for "lea %esp".
>
> Somebody should check what the Pentium M does. It might just notice that
> "lea 4(%esp),%esp" is the same as "add 4 to esp", but it's entirely
> possible that lea will confuse its stack engine logic and cause
> stack-related address generation stalls..
it's worse than that in general -- lea typically goes through the AGU
which has either less throughput or longer latency than the ALUs...
depending on which x86en. it's 4 cycles for a lea on p4, vs. 1 for a pop.
it's 2 cycles for a lea on k8 vs. 1 for a pop.
use pop.
-dean
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-10-29 19:56 ` Linus Torvalds
@ 2004-10-29 22:07 ` Jeff Garzik
0 siblings, 0 replies; 99+ messages in thread
From: Jeff Garzik @ 2004-10-29 22:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andreas Steinmetz, linux-os, Kernel Mailing List,
Richard Henderson, Andi Kleen, Andrew Morton, Jan Hubicka
Linus Torvalds wrote:
>
> popl %eax
> popl %ecx
>
> should take one cycle on a Pentium. I pretty much _guarantee_ that
>
> lea 4(%esp),%esp
> popl %ecx
>
> takes longer, since they have a data dependency on %esp that is hard to
> break (the P4 trace-cache _may_ be able to break it, but the only CPU that
> I think is likely to break it is actually the Transmeta CPU's, which did
> that kind of thing by default and _will_ parallelise the two, and even
> combine the stack offsetting into one single micro-op).
One of my favorite "optimizing for Pentium" docs is
http://www.agner.org/assem/pentopt.pdf
from
http://www.agner.org/assem/
which is current through newer P4's AFAICS.
It notes on the P4 specifically that LEA is split into additions and
shifts. Not sure what it does on the P3, but I bet it generates more
uops in addition to the data dependency.
Jeff
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-10-29 19:20 ` Linus Torvalds
2004-10-29 19:26 ` Linus Torvalds
@ 2004-10-29 21:03 ` Linus Torvalds
1 sibling, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 21:03 UTC (permalink / raw)
To: linux-os
Cc: Kernel Mailing List, Richard Henderson, Andi Kleen,
Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, Linus Torvalds wrote:
>
> Here's a totally untested patch to make the semaphores use "fastcall"
> instead of "asmlinkage"
Ok, I tested it, looked through the assembly code, and did a general size
comparison. Everything looks good, and it should fix the problem that
caused this discussion. Checked in.
The patch actually improves code generation by moving the failure case
argument generation _into_ the failure case: this makes the inline asm one
instruction longer, but it means that the fastpath is often one
instruction shorter. In fact, the fastpath is usually improved even _more_
than that, because gcc does sucketh at generating code that uses fixed
registers (ie the old code often caused gcc to first generate the value
into another register, and then _move_ it into %eax, rather than just
generating it into %eax in the first place).
My test-kernel shrunk by a whopping 2kB in size from this change.
Linus
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-10-29 19:40 ` Andreas Steinmetz
@ 2004-10-29 19:56 ` Linus Torvalds
2004-10-29 22:07 ` Jeff Garzik
2004-10-29 23:50 ` dean gaudet
1 sibling, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 19:56 UTC (permalink / raw)
To: Andreas Steinmetz
Cc: linux-os, Kernel Mailing List, Richard Henderson, Andi Kleen,
Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, Andreas Steinmetz wrote:
>
> If you still believe in features I can't find any manufacturer
> documentation for, well, you're Linus so it's your decision.
It's not that I'm Linus. It's that I am apparently better informed than
you are, and the numbers you are looking at are irrelevant. For example,
have you even _looked_ at the Pentium M stack engine documentation, which
is what this whole argument is all about?
And the documentation you look at is not relevant. For example, when you
look at the latency of "pop", who _cares_? That's the latency to use the
data, and has no meaning, since in this case we don't actually ever use
it. So what matters is other things entirely, like how well the
instructions can run in parallel.
Try it.
popl %eax
popl %ecx
should take one cycle on a Pentium. I pretty much _guarantee_ that
lea 4(%esp),%esp
popl %ecx
takes longer, since they have a data dependency on %esp that is hard to
break (the P4 trace-cache _may_ be able to break it, but the only CPU that
I think is likely to break it is actually the Transmeta CPU's, which did
that kind of thing by default and _will_ parallelise the two, and even
combine the stack offsetting into one single micro-op).
So my argument is that "popl" is smaller, and I doubt you can find a
machine where it's actually slower (most will take two cycles). And I am
pretty confident that I can find machines where it is faster (ie regular
Pentium).
Not that any of this matters, since there's a patch that makes all of this
moot. If it works.
Linus
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-10-29 19:15 ` Linus Torvalds
@ 2004-10-29 19:40 ` Andreas Steinmetz
2004-10-29 19:56 ` Linus Torvalds
2004-10-29 23:50 ` dean gaudet
0 siblings, 2 replies; 99+ messages in thread
From: Andreas Steinmetz @ 2004-10-29 19:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-os, Kernel Mailing List, Richard Henderson, Andi Kleen,
Andrew Morton, Jan Hubicka
Linus Torvalds wrote:
>
> On Fri, 29 Oct 2004, Andreas Steinmetz wrote:
>
>
>>Linus Torvalds wrote:
>>
>>>Somebody should check what the Pentium M does. It might just notice that
>>>"lea 4(%esp),%esp" is the same as "add 4 to esp", but it's entirely
>>>possible that lea will confuse its stack engine logic and cause
>>>stack-related address generation stalls..
>>
>>Now especially Intel tells everybody in their Pentium Optimization
>>manuals to *use* lea wherever possible as this operation doesn't depend
>>on the ALU and is processed in other parts of the CPU.
>>
>>Sample quote from said manual (P/N 248966-05):
>>"Use the lea instruction and the full range of addressing modes to do
>>address calculation"
>
>
> Does it say this about %esp?
>
> The stack pointer is SPECIAL, guys. It's special exactly because there is
> potentially extra hardware in CPU's that track its value _independently_
> of the actual physical register.
It doesn't say anything about esp being specially treated by the
underlying hardware as far as I can see. Thus either you know details
about the CPU not being publicly available or you're speculating about
undocumented features.
Some more data from said manual (lea is better on P3 and the same as add
on P4):
Instruction   Latency   Execution Unit
ADD/SUB       0.5       ALU
POP           1.5       MEM_LOAD, ALU
Now, a P4 had two ALUs (Ports 0 and 1) but only one MEM_LOAD Unit (Port
2). So, if anything, you are more likely to be stalled by an additional pop
instruction than by lea/add. I don't know about P4 internals but let me
make some guess: There's lot of software around that needs to run on
older processors where lea has quite some performance advantage. Thus I
would guess that the P4 design respects this by handling lea x(esp),esp
efficiently.
If you still believe in features I can't find any manufacturer
documentation for, well, you're Linus so it's your decision.
--
Andreas Steinmetz SPAMmers use robotrap@domdv.de
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-10-29 19:20 ` Linus Torvalds
@ 2004-10-29 19:26 ` Linus Torvalds
2004-10-29 21:03 ` Linus Torvalds
1 sibling, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 19:26 UTC (permalink / raw)
To: linux-os
Cc: Kernel Mailing List, Richard Henderson, Andi Kleen,
Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, Linus Torvalds wrote:
>
>
> Here's a totally untested patch to make the semaphores use "fastcall"
> instead of "asmlinkage", and thus pass the argument in %eax instead of on
> the stack. Does it work? I have no idea. If it does, it should fix the
> particular bug that started this thread..
Oh, sorry, please remove this part, it was totally unintentional (I _told_
you this wasn't tested):
> --- 1.4/include/asm-i386/linkage.h 2004-10-16 18:24:37 -07:00
> +++ edited/include/asm-i386/linkage.h 2004-10-29 11:32:18 -07:00
> @@ -1,7 +1,7 @@
> #ifndef __ASM_LINKAGE_H
> #define __ASM_LINKAGE_H
>
> -#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
> +#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(3)))
> #define FASTCALL(x) x __attribute__((regparm(3)))
> #define fastcall __attribute__((regparm(3)))
>
We're not making all asmlinkage things fastcalls here, we're only doing
the semaphores..
Linus
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-10-29 17:22 ` linux-os
2004-10-29 17:55 ` Richard Henderson
@ 2004-10-29 19:20 ` Linus Torvalds
2004-10-29 19:26 ` Linus Torvalds
2004-10-29 21:03 ` Linus Torvalds
1 sibling, 2 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 19:20 UTC (permalink / raw)
To: linux-os
Cc: Kernel Mailing List, Richard Henderson, Andi Kleen,
Andrew Morton, Jan Hubicka
Here's a totally untested patch to make the semaphores use "fastcall"
instead of "asmlinkage", and thus pass the argument in %eax instead of on
the stack. Does it work? I have no idea. If it does, it should fix the
particular bug that started this thread..
Linus
---
===== arch/i386/kernel/semaphore.c 1.10 vs edited =====
--- 1.10/arch/i386/kernel/semaphore.c 2004-04-12 10:53:59 -07:00
+++ edited/arch/i386/kernel/semaphore.c 2004-10-29 12:19:22 -07:00
@@ -49,12 +49,12 @@
* we cannot lose wakeup events.
*/
-asmlinkage void __up(struct semaphore *sem)
+fastcall void __up(struct semaphore *sem)
{
wake_up(&sem->wait);
}
-asmlinkage void __sched __down(struct semaphore * sem)
+fastcall void __sched __down(struct semaphore * sem)
{
struct task_struct *tsk = current;
DECLARE_WAITQUEUE(wait, tsk);
@@ -91,7 +91,7 @@
tsk->state = TASK_RUNNING;
}
-asmlinkage int __sched __down_interruptible(struct semaphore * sem)
+fastcall int __sched __down_interruptible(struct semaphore * sem)
{
int retval = 0;
struct task_struct *tsk = current;
@@ -154,7 +154,7 @@
* single "cmpxchg" without failure cases,
* but then it wouldn't work on a 386.
*/
-asmlinkage int __down_trylock(struct semaphore * sem)
+fastcall int __down_trylock(struct semaphore * sem)
{
int sleepers;
unsigned long flags;
@@ -183,9 +183,9 @@
* need to convert that sequence back into the C sequence when
* there is contention on the semaphore.
*
- * %ecx contains the semaphore pointer on entry. Save the C-clobbered
- * registers (%eax, %edx and %ecx) except %eax when used as a return
- * value..
+ * %eax contains the semaphore pointer on entry. Save the C-clobbered
+ * registers (%eax, %edx and %ecx) except %eax which is either a return
+ * value or just clobbered..
*/
asm(
".section .sched.text\n"
@@ -196,13 +196,11 @@
"pushl %ebp\n\t"
"movl %esp,%ebp\n\t"
#endif
- "pushl %eax\n\t"
"pushl %edx\n\t"
"pushl %ecx\n\t"
"call __down\n\t"
"popl %ecx\n\t"
"popl %edx\n\t"
- "popl %eax\n\t"
#if defined(CONFIG_FRAME_POINTER)
"movl %ebp,%esp\n\t"
"popl %ebp\n\t"
@@ -257,13 +255,11 @@
".align 4\n"
".globl __up_wakeup\n"
"__up_wakeup:\n\t"
- "pushl %eax\n\t"
"pushl %edx\n\t"
"pushl %ecx\n\t"
"call __up\n\t"
"popl %ecx\n\t"
"popl %edx\n\t"
- "popl %eax\n\t"
"ret"
);
===== include/asm-i386/linkage.h 1.4 vs edited =====
--- 1.4/include/asm-i386/linkage.h 2004-10-16 18:24:37 -07:00
+++ edited/include/asm-i386/linkage.h 2004-10-29 11:32:18 -07:00
@@ -1,7 +1,7 @@
#ifndef __ASM_LINKAGE_H
#define __ASM_LINKAGE_H
-#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
+#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(3)))
#define FASTCALL(x) x __attribute__((regparm(3)))
#define fastcall __attribute__((regparm(3)))
===== include/asm-i386/semaphore.h 1.9 vs edited =====
--- 1.9/include/asm-i386/semaphore.h 2004-08-27 00:02:38 -07:00
+++ edited/include/asm-i386/semaphore.h 2004-10-29 12:06:48 -07:00
@@ -87,15 +87,15 @@
sema_init(sem, 0);
}
-asmlinkage void __down_failed(void /* special register calling convention */);
-asmlinkage int __down_failed_interruptible(void /* params in registers */);
-asmlinkage int __down_failed_trylock(void /* params in registers */);
-asmlinkage void __up_wakeup(void /* special register calling convention */);
-
-asmlinkage void __down(struct semaphore * sem);
-asmlinkage int __down_interruptible(struct semaphore * sem);
-asmlinkage int __down_trylock(struct semaphore * sem);
-asmlinkage void __up(struct semaphore * sem);
+fastcall void __down_failed(void /* special register calling convention */);
+fastcall int __down_failed_interruptible(void /* params in registers */);
+fastcall int __down_failed_trylock(void /* params in registers */);
+fastcall void __up_wakeup(void /* special register calling convention */);
+
+fastcall void __down(struct semaphore * sem);
+fastcall int __down_interruptible(struct semaphore * sem);
+fastcall int __down_trylock(struct semaphore * sem);
+fastcall void __up(struct semaphore * sem);
/*
* This is ugly, but we want the default case to fall through.
@@ -111,12 +111,13 @@
"js 2f\n"
"1:\n"
LOCK_SECTION_START("")
- "2:\tcall __down_failed\n\t"
+ "2:\tlea %0,%%eax\n\t"
+ "call __down_failed\n\t"
"jmp 1b\n"
LOCK_SECTION_END
:"=m" (sem->count)
- :"c" (sem)
- :"memory");
+ :
+ :"memory","ax");
}
/*
@@ -135,11 +136,12 @@
"xorl %0,%0\n"
"1:\n"
LOCK_SECTION_START("")
- "2:\tcall __down_failed_interruptible\n\t"
+ "2:\tlea %1,%%eax\n\t"
+ "call __down_failed_interruptible\n\t"
"jmp 1b\n"
LOCK_SECTION_END
:"=a" (result), "=m" (sem->count)
- :"c" (sem)
+ :
:"memory");
return result;
}
@@ -159,11 +161,12 @@
"xorl %0,%0\n"
"1:\n"
LOCK_SECTION_START("")
- "2:\tcall __down_failed_trylock\n\t"
+ "2:\tlea %1,%%eax\n\t"
+ "call __down_failed_trylock\n\t"
"jmp 1b\n"
LOCK_SECTION_END
:"=a" (result), "=m" (sem->count)
- :"c" (sem)
+ :
:"memory");
return result;
}
@@ -182,13 +185,14 @@
"jle 2f\n"
"1:\n"
LOCK_SECTION_START("")
- "2:\tcall __up_wakeup\n\t"
+ "2:\tlea %0,%%eax\n\t"
+ "call __up_wakeup\n\t"
"jmp 1b\n"
LOCK_SECTION_END
".subsection 0\n"
:"=m" (sem->count)
- :"c" (sem)
- :"memory");
+ :
+ :"memory","ax");
}
#endif
===== include/linux/spinlock.h 1.32 vs edited =====
--- 1.32/include/linux/spinlock.h 2004-10-24 16:24:20 -07:00
+++ edited/include/linux/spinlock.h 2004-10-29 12:08:14 -07:00
@@ -27,7 +27,7 @@
extra \
".ifndef " LOCK_SECTION_NAME "\n\t" \
LOCK_SECTION_NAME ":\n\t" \
- ".endif\n\t"
+ ".endif\n"
#define LOCK_SECTION_END \
".previous\n\t"
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-10-29 18:58 ` Andreas Steinmetz
@ 2004-10-29 19:15 ` Linus Torvalds
2004-10-29 19:40 ` Andreas Steinmetz
0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 19:15 UTC (permalink / raw)
To: Andreas Steinmetz
Cc: linux-os, Kernel Mailing List, Richard Henderson, Andi Kleen,
Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, Andreas Steinmetz wrote:
> Linus Torvalds wrote:
> > Somebody should check what the Pentium M does. It might just notice that
> > "lea 4(%esp),%esp" is the same as "add 4 to esp", but it's entirely
> > possible that lea will confuse its stack engine logic and cause
> > stack-related address generation stalls..
>
> Now especially Intel tells everybody in their Pentium Optimization
> manuals to *use* lea wherever possible as this operation doesn't depend
> on the ALU and is processed in other parts of the CPU.
>
> Sample quote from said manual (P/N 248966-05):
> "Use the lea instruction and the full range of addressing modes to do
> address calculation"
Does it say this about %esp?
The stack pointer is SPECIAL, guys. It's special exactly because there is
potentially extra hardware in CPU's that track its value _independently_
of the actual physical register.
Just for fun, google for 'x86 "stack engine"', and you'll hit for example
http://arstechnica.com/articles/paedia/cpu/pentium-m.ars/5 which talks
about this and perhaps explains it in ways that I apparently haven't been
able to.
Linus
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-10-29 18:39 ` linux-os
@ 2004-10-29 19:12 ` Linus Torvalds
2004-11-01 1:31 ` linux-os
0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 19:12 UTC (permalink / raw)
To: linux-os
Cc: Andreas Steinmetz, Kernel Mailing List, Richard Henderson,
Andi Kleen, Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, linux-os wrote:
>
> Linus, there is no way in hell that you are going to move
> a value from memory into a register (pop ecx) faster than
> you are going to do anything to the stack-pointer or
> any other register.
Sorry, but you're wrong.
Learn about modern CPU's some day, and realize that cached accesses are
fast, and pipeline stalls are relatively much more expensive.
Now, if it was uncached, you'd have a point.
Also think about why
call xxx
jmp yy
is often much faster than
push $yy
jmp xxx
and other small interesting facts about how CPU's actually work these
days.
Linus
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Semaphore assembly-code bug
2004-10-29 18:06 ` Linus Torvalds
2004-10-29 18:39 ` linux-os
@ 2004-10-29 18:58 ` Andreas Steinmetz
2004-10-29 19:15 ` Linus Torvalds
2004-10-29 23:37 ` dean gaudet
2 siblings, 1 reply; 99+ messages in thread
From: Andreas Steinmetz @ 2004-10-29 18:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-os, Kernel Mailing List, Richard Henderson, Andi Kleen,
Andrew Morton, Jan Hubicka
Linus Torvalds wrote:
> Somebody should check what the Pentium M does. It might just notice that
> "lea 4(%esp),%esp" is the same as "add 4 to esp", but it's entirely
> possible that lea will confuse its stack engine logic and cause
> stack-related address generation stalls..
Intel in particular tells everybody in their Pentium Optimization
manuals to *use* lea wherever possible, as this operation doesn't depend
on the ALU and is processed in other parts of the CPU.
Sample quote from said manual (P/N 248966-05):
"Use the lea instruction and the full range of addressing modes to do
address calculation"
I would guess Intel would add caveats about such stalls in this manual
if there were any.
--
Andreas Steinmetz SPAMmers use robotrap@domdv.de
* Re: Semaphore assembly-code bug
2004-10-29 18:42 ` Linus Torvalds
@ 2004-10-29 18:54 ` Linus Torvalds
2004-10-30 3:35 ` Jeff Garzik
1 sibling, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 18:54 UTC (permalink / raw)
To: linux-os
Cc: Richard Henderson, Kernel Mailing List, Andi Kleen,
Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, Linus Torvalds wrote:
>
> Anyway, making "asmlinkage" imply "regparm(3)" would make the whole
> discussion moot, so I'm wondering if anybody has the patches to try it
> out? It requires pretty big changes to all the x86 asm code, but I do know
> that people _had_ patches like that at least long ago (from when people
like Jan were playing with -mregparm=3 originally). Maybe some of them
> still exist..
Looking at just doing this for the semaphore code, I hit on the fact that
we already do this for the rwsem's.. So changing just the regular
semaphore code to use "fastcall" should fix this particular bug, but I'm
still interested in hearing whether somebody has a patch for the system
calls and faults too?
Linus
* Re: Semaphore assembly-code bug
2004-10-29 18:17 ` linux-os
@ 2004-10-29 18:42 ` Linus Torvalds
2004-10-29 18:54 ` Linus Torvalds
2004-10-30 3:35 ` Jeff Garzik
0 siblings, 2 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 18:42 UTC (permalink / raw)
To: linux-os
Cc: Richard Henderson, Kernel Mailing List, Andi Kleen,
Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, linux-os wrote:
> On Fri, 29 Oct 2004, Richard Henderson wrote:
> >
> > Also not necessarily correct. Intel cpus special-case pop
> > instructions; two pops can be dual issued, whereas a different
> > kind of stack pointer manipulation will not.
> >
>
> Then I guess the Intel documentation is incorrect, too.
Where?
It definitely depends on the CPU. Some CPU's dual-issue pops, some don't.
I think the Pentium can dual-issue, while the PPro/P4 does not. And AMD
has some other rules, and I think older ones dual-issue stack accesses
only if esp doesn't change. Haven't looked at K8 rules.
And Pentium M is to some degree more interesting than P4 and Ppro, because
it's apparently the architecture Intel is going forward with for the
future of x86, and it is an "improved PPro" core that has a special stack
engine, iirc.
Anyway, it's quite likely that for several CPU's the fastest sequence ends
up actually being
movl 4(%esp),%ecx
movl 8(%esp),%edx
movl 12(%esp),%eax
addl $16,%esp
which is also one of the biggest alternatives.
Anyway, making "asmlinkage" imply "regparm(3)" would make the whole
discussion moot, so I'm wondering if anybody has the patches to try it
out? It requires pretty big changes to all the x86 asm code, but I do know
that people _had_ patches like that at least long ago (from when people
like Jan were playing with -mregparm=3 originally). Maybe some of them
still exist..
Linus
* Re: Semaphore assembly-code bug
2004-10-29 18:06 ` Linus Torvalds
@ 2004-10-29 18:39 ` linux-os
2004-10-29 19:12 ` Linus Torvalds
2004-10-29 18:58 ` Andreas Steinmetz
2004-10-29 23:37 ` dean gaudet
2 siblings, 1 reply; 99+ messages in thread
From: linux-os @ 2004-10-29 18:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andreas Steinmetz, Kernel Mailing List, Richard Henderson,
Andi Kleen, Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, Linus Torvalds wrote:
>
>
> On Fri, 29 Oct 2004, linux-os wrote:
>>> with the following:
>>>
>>> leal 4(%esp),%esp
>>
>> Probably so because I'm pretty certain that the 'pop' (a memory
>> access) is not going to be faster than a simple register operation.
>
> Bzzt, wrong answer.
>
> It's not "simple register operation". It's really about the fact that
> modern CPU's are smarter - yet dumber - than you think. They do things
> like speculate the value of %esp in order to avoid having to calculate it,
> and it's entirely possible that "pop" is much faster, simply because I
> guarantee you that a CPU will speculate %esp correctly across a "pop", but
> the same is not necessarily true for "lea %esp".
>
> Somebody should check what the Pentium M does. It might just notice that
> "lea 4(%esp),%esp" is the same as "add 4 to esp", but it's entirely
> possible that lea will confuse its stack engine logic and cause
> stack-related address generation stalls..
>
> Linus
Linus, there is no way in hell that you are going to move
a value from memory into a register (pop ecx) faster than
you are going to do anything to the stack-pointer or
any other register. The register operations operate
at the internal CPU clock-rate (GHz). The memory operations
operate at the front-side bus rate (MHz), and the data-
movement must actually occur before anything else can.
In other words, with stack operations, modern CPUs will
stall until the operation has completed.
Using the rdtsc, on this computer, both of the stack-pointer
additions (leal or add) take 6 +/- 2 clocks. The pop ecx
takes 12 +/- 3 clocks.
Things that should take only one clock, according to the
documentation, take 4 or 5 even when subtracting-out
the time necessary to do the rdtsc, because this machine
(and probably others) is very noisy, so all I can state
with certainty is that the pop from the stack takes longer.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.
* Re: Semaphore assembly-code bug
2004-10-29 14:46 ` Linus Torvalds
` (3 preceding siblings ...)
2004-10-29 17:57 ` Richard Henderson
@ 2004-10-29 18:37 ` Gabriel Paubert
4 siblings, 0 replies; 99+ messages in thread
From: Gabriel Paubert @ 2004-10-29 18:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-os, Kernel Mailing List, Richard Henderson, Andi Kleen,
Andrew Morton, Jan Hubicka
On Fri, Oct 29, 2004 at 07:46:06AM -0700, Linus Torvalds wrote:
>
>
> On Fri, 29 Oct 2004, linux-os wrote:
> >
> > Linus, please check this out.
>
> Yes, I concur. However, I'd suggest changing the "addl $4,%esp" into a
> "popl %ecx", which is smaller and apparently faster on some CPU's (ecx
> obviously gets immediately overwritten by the next popl).
Rather popl %eax or popl %edx then, a basic and MMX Pentium
cannot pair:
popl %ecx
popl %ecx
for the simple reason that two instructions that have the
same destination register can't be paired.
OTOH, the other arguments in this thread about whether memory is
read are a red herring. An additional memory read
is cheap for data that is guaranteed to be in a cache line
used by adjacent (in time) instructions.
Otherwise regparm(1) might even be better, movl %ecx,%eax is
the same size as push+pop, is faster, and may even reduce
stack usage by 4 bytes.
* Re: Semaphore assembly-code bug
2004-10-29 18:18 ` Linus Torvalds
@ 2004-10-29 18:35 ` Richard Henderson
0 siblings, 0 replies; 99+ messages in thread
From: Richard Henderson @ 2004-10-29 18:35 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andi Kleen, linux-os, Kernel Mailing List, Andrew Morton, Jan Hubicka
On Fri, Oct 29, 2004 at 11:18:33AM -0700, Linus Torvalds wrote:
> What happens if there are more arguments than three? It happens for
> several system calls - does gcc still consider the stack part of the thing
> to be owned by the callee?
Yes.
r~
* Re: Semaphore assembly-code bug
2004-10-29 15:11 ` Andi Kleen
@ 2004-10-29 18:18 ` Linus Torvalds
2004-10-29 18:35 ` Richard Henderson
0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 18:18 UTC (permalink / raw)
To: Andi Kleen
Cc: linux-os, Kernel Mailing List, Richard Henderson, Andrew Morton,
Jan Hubicka
On Fri, 29 Oct 2004, Andi Kleen wrote:
>
> > Richard, Jan, Andi? Or does it already exist somewhere?
>
> How about just using __attribute__((regparm(1))) ? Then the
> problem doesn't appear.
Yes, we could use regparm for all assembly. Right now "asmlinkage"
actually _disables_ regparm so that we always have the same calling
convention for assembly regardless of whether the rest of the kernel is
compiled with regparm or not, but we could certainly change that
#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
to use "regparm(3)" instead. I guess it's stable these days, since we use
it for FASTCALL() and friends too.
> Would be faster too. It should work reliably on all supported compilers.
What happens if there are more arguments than three? It happens for
several system calls - does gcc still consider the stack part of the thing
to be owned by the callee?
Linus
* Re: Semaphore assembly-code bug
2004-10-29 17:55 ` Richard Henderson
@ 2004-10-29 18:17 ` linux-os
2004-10-29 18:42 ` Linus Torvalds
0 siblings, 1 reply; 99+ messages in thread
From: linux-os @ 2004-10-29 18:17 UTC (permalink / raw)
To: Richard Henderson
Cc: Linus Torvalds, Kernel Mailing List, Andi Kleen, Andrew Morton,
Jan Hubicka
On Fri, 29 Oct 2004, Richard Henderson wrote:
> On Fri, Oct 29, 2004 at 01:22:52PM -0400, linux-os wrote:
>> Here's a version that uses `leal 4(esp), esp` to add
>> 4 to the stack-pointer. Since this 'address-calculation'
>> is done in a different portion of Intel CPUs....
>
> Incorrect, at least i686 and beyond. These decode to the
> same micro-ops.
>
>> The 'pop ecx' would access memory and, therefore be slower than
>> simple register operations.
>
> Also not necessarily correct. Intel cpus special-case pop
> instructions; two pops can be dual issued, whereas a different
> kind of stack pointer manipulation will not.
>
Then I guess the Intel documentation is incorrect, too.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.
* Re: Semaphore assembly-code bug
2004-10-29 17:08 ` linux-os
@ 2004-10-29 18:06 ` Linus Torvalds
2004-10-29 18:39 ` linux-os
` (2 more replies)
0 siblings, 3 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 18:06 UTC (permalink / raw)
To: linux-os
Cc: Andreas Steinmetz, Kernel Mailing List, Richard Henderson,
Andi Kleen, Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, linux-os wrote:
> > with the following:
> >
> > leal 4(%esp),%esp
>
> Probably so because I'm pretty certain that the 'pop' (a memory
> access) is not going to be faster than a simple register operation.
Bzzt, wrong answer.
It's not "simple register operation". It's really about the fact that
modern CPU's are smarter - yet dumber - then you think. They do things
like speculate the value of %esp in order to avoid having to calculate it,
and it's entirely possible that "pop" is much faster, simply because I
guarantee you that a CPU will speculate %esp correctly across a "pop", but
the same is not necessarily true for "lea %esp".
Somebody should check what the Pentium M does. It might just notice that
"lea 4(%esp),%esp" is the same as "add 4 to esp", but it's entirely
possible that lea will confuse its stack engine logic and cause
stack-related address generation stalls..
Linus
* Re: Semaphore assembly-code bug
2004-10-29 14:46 ` Linus Torvalds
` (2 preceding siblings ...)
2004-10-29 17:22 ` linux-os
@ 2004-10-29 17:57 ` Richard Henderson
2004-10-29 18:37 ` Gabriel Paubert
4 siblings, 0 replies; 99+ messages in thread
From: Richard Henderson @ 2004-10-29 17:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-os, Kernel Mailing List, Andi Kleen, Andrew Morton, Jan Hubicka
On Fri, Oct 29, 2004 at 07:46:06AM -0700, Linus Torvalds wrote:
> Btw, this is another case where we _really_ want "asmlinkage" to mean that
> the compiler does not own the argument stack. Is there any chance of
> getting a function attribute like that into future versions of gcc?
Certainly we'd accept the feature, it's just a matter of
doing the work.
r~
* Re: Semaphore assembly-code bug
2004-10-29 17:22 ` linux-os
@ 2004-10-29 17:55 ` Richard Henderson
2004-10-29 18:17 ` linux-os
2004-10-29 19:20 ` Linus Torvalds
1 sibling, 1 reply; 99+ messages in thread
From: Richard Henderson @ 2004-10-29 17:55 UTC (permalink / raw)
To: linux-os
Cc: Linus Torvalds, Kernel Mailing List, Andi Kleen, Andrew Morton,
Jan Hubicka
On Fri, Oct 29, 2004 at 01:22:52PM -0400, linux-os wrote:
> Here's a version that uses `leal 4(esp), esp` to add
> 4 to the stack-pointer. Since this 'address-calculation'
> is done in a different portion of Intel CPUs....
Incorrect, at least i686 and beyond. These decode to the
same micro-ops.
> The 'pop ecx' would access memory and, therefore be slower than
> simple register operations.
Also not necessarily correct. Intel cpus special-case pop
instructions; two pops can be dual issued, whereas a different
kind of stack pointer manipulation will not.
r~
* Re: Semaphore assembly-code bug
2004-10-29 14:46 ` Linus Torvalds
2004-10-29 15:11 ` Andi Kleen
2004-10-29 16:06 ` Andreas Steinmetz
@ 2004-10-29 17:22 ` linux-os
2004-10-29 17:55 ` Richard Henderson
2004-10-29 19:20 ` Linus Torvalds
2004-10-29 17:57 ` Richard Henderson
2004-10-29 18:37 ` Gabriel Paubert
4 siblings, 2 replies; 99+ messages in thread
From: linux-os @ 2004-10-29 17:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Kernel Mailing List, Richard Henderson, Andi Kleen,
Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, Linus Torvalds wrote:
>
>
> On Fri, 29 Oct 2004, linux-os wrote:
>>
>> Linus, please check this out.
>
> Yes, I concur. However, I'd suggest changing the "addl $4,%esp" into a
> "popl %ecx", which is smaller and apparently faster on some CPU's (ecx
> obviously gets immediately overwritten by the next popl).
>
> Btw, this is another case where we _really_ want "asmlinkage" to mean that
> the compiler does not own the argument stack. Is there any chance of
> getting a function attribute like that into future versions of gcc?
> Richard, Jan, Andi? Or does it already exist somewhere?
>
> Linus
>
Here's a version that uses `leal 4(esp), esp` to add
4 to the stack-pointer. Since this 'address-calculation'
is done in a different portion of Intel CPUs, there
is some parallel operation that can occur. The 'pop ecx'
would access memory and, therefore be slower than
simple register operations.
FYI I'm running a kernel with this patch already.
--- linux-2.6.9/arch/i386/kernel/semaphore.c.orig 2004-10-29 13:00:17.961579368 -0400
+++ linux-2.6.9/arch/i386/kernel/semaphore.c 2004-10-29 13:03:35.046617888 -0400
@@ -198,9 +198,11 @@
#endif
"pushl %eax\n\t"
"pushl %edx\n\t"
- "pushl %ecx\n\t"
+ "pushl %ecx\n\t" // Register to save
+ "pushl %ecx\n\t" // Passed parameter
"call __down\n\t"
- "popl %ecx\n\t"
+ "leal 0x04(%esp), %esp\n\t" // Bypass corrupted parameter
+ "popl %ecx\n\t" // Restore original
"popl %edx\n\t"
"popl %eax\n\t"
#if defined(CONFIG_FRAME_POINTER)
@@ -220,9 +222,11 @@
"movl %esp,%ebp\n\t"
#endif
"pushl %edx\n\t"
- "pushl %ecx\n\t"
+ "pushl %ecx\n\t" // Save register
+ "pushl %ecx\n\t" // Passed parameter
"call __down_interruptible\n\t"
- "popl %ecx\n\t"
+ "leal 0x04(%esp), %esp\n\t" // Bypass corrupted parameter
+ "popl %ecx\n\t" // Restore register
"popl %edx\n\t"
#if defined(CONFIG_FRAME_POINTER)
"movl %ebp,%esp\n\t"
@@ -241,9 +245,11 @@
"movl %esp,%ebp\n\t"
#endif
"pushl %edx\n\t"
- "pushl %ecx\n\t"
+ "pushl %ecx\n\t" // Save register
+ "pushl %ecx\n\t" // Passed parameter
"call __down_trylock\n\t"
- "popl %ecx\n\t"
+ "leal 0x04(%esp), %esp\n\t" // Bypass corrupted parameter
+ "popl %ecx\n\t" // Restore register
"popl %edx\n\t"
#if defined(CONFIG_FRAME_POINTER)
"movl %ebp,%esp\n\t"
@@ -259,9 +265,11 @@
"__up_wakeup:\n\t"
"pushl %eax\n\t"
"pushl %edx\n\t"
- "pushl %ecx\n\t"
+ "pushl %ecx\n\t" // Save register
+ "pushl %ecx\n\t" // Passed parameter
"call __up\n\t"
- "popl %ecx\n\t"
+ "leal 0x04(%esp), %esp\n\t" // Bypass corrupted parameter
+ "popl %ecx\n\t" // Restore register
"popl %edx\n\t"
"popl %eax\n\t"
"ret"
Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.
* Re: Semaphore assembly-code bug
2004-10-29 16:06 ` Andreas Steinmetz
@ 2004-10-29 17:08 ` linux-os
2004-10-29 18:06 ` Linus Torvalds
0 siblings, 1 reply; 99+ messages in thread
From: linux-os @ 2004-10-29 17:08 UTC (permalink / raw)
To: Andreas Steinmetz
Cc: Linus Torvalds, Kernel Mailing List, Richard Henderson,
Andi Kleen, Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, Andreas Steinmetz wrote:
> Linus Torvalds wrote:
>>
>> On Fri, 29 Oct 2004, linux-os wrote:
>>
>>> Linus, please check this out.
>>
>>
>> Yes, I concur. However, I'd suggest changing the "addl $4,%esp" into a
>> "popl %ecx", which is smaller and apparently faster on some CPU's (ecx
>> obviously gets immediately overwritten by the next popl).
>
> Hmm, I didn't check the instruction length but modern CPUs usually work best
> with the following:
>
> leal 4(%esp),%esp
>
> --
> Andreas Steinmetz SPAMmers use robotrap@domdv.de
>
Probably so because I'm pretty certain that the 'pop' (a memory
access) is not going to be faster than a simple register operation.
I'll make another patch and post it (if the machine will boot!)
Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.
* Re: Semaphore assembly-code bug
2004-10-29 14:46 ` Linus Torvalds
2004-10-29 15:11 ` Andi Kleen
@ 2004-10-29 16:06 ` Andreas Steinmetz
2004-10-29 17:08 ` linux-os
2004-10-29 17:22 ` linux-os
` (2 subsequent siblings)
4 siblings, 1 reply; 99+ messages in thread
From: Andreas Steinmetz @ 2004-10-29 16:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-os, Kernel Mailing List, Richard Henderson, Andi Kleen,
Andrew Morton, Jan Hubicka
Linus Torvalds wrote:
>
> On Fri, 29 Oct 2004, linux-os wrote:
>
>>Linus, please check this out.
>
>
> Yes, I concur. However, I'd suggest changing the "addl $4,%esp" into a
> "popl %ecx", which is smaller and apparently faster on some CPU's (ecx
> obviously gets immediately overwritten by the next popl).
Hmm, I didn't check the instruction length but modern CPUs usually work
best with the following:
leal 4(%esp),%esp
--
Andreas Steinmetz SPAMmers use robotrap@domdv.de
* Re: Semaphore assembly-code bug
2004-10-29 14:46 ` Linus Torvalds
@ 2004-10-29 15:11 ` Andi Kleen
2004-10-29 18:18 ` Linus Torvalds
2004-10-29 16:06 ` Andreas Steinmetz
` (3 subsequent siblings)
4 siblings, 1 reply; 99+ messages in thread
From: Andi Kleen @ 2004-10-29 15:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-os, Kernel Mailing List, Richard Henderson, Andrew Morton,
Jan Hubicka
> Btw, this is another case where we _really_ want "asmlinkage" to mean that
> the compiler does not own the argument stack. Is there any chance of
> getting a function attribute like that into future versions of gcc?
> Richard, Jan, Andi? Or does it already exist somewhere?
How about just using __attribute__((regparm(1))) ? Then the
problem doesn't appear.
Would be faster too. It should work reliably on all supported compilers.
-Andi
* Re: Semaphore assembly-code bug
2004-10-29 12:12 ` Semaphore assembly-code bug linux-os
@ 2004-10-29 14:46 ` Linus Torvalds
2004-10-29 15:11 ` Andi Kleen
` (4 more replies)
0 siblings, 5 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-10-29 14:46 UTC (permalink / raw)
To: linux-os
Cc: Kernel Mailing List, Richard Henderson, Andi Kleen,
Andrew Morton, Jan Hubicka
On Fri, 29 Oct 2004, linux-os wrote:
>
> Linus, please check this out.
Yes, I concur. However, I'd suggest changing the "addl $4,%esp" into a
"popl %ecx", which is smaller and apparently faster on some CPU's (ecx
obviously gets immediately overwritten by the next popl).
Btw, this is another case where we _really_ want "asmlinkage" to mean that
the compiler does not own the argument stack. Is there any chance of
getting a function attribute like that into future versions of gcc?
Richard, Jan, Andi? Or does it already exist somewhere?
Linus
--- saved for gcc people commentary ---
>
> asmlinkage void __up(struct semaphore *sem)
> {
> wake_up(&sem->wait);
> }
>
> This was from /usr/src/linux-2.6.9/arch/i386/kernel/semaphore.c
> In this case, the value of 'sem' is destroyed, which means that
> certain assembly-language helper functions no longer work.
>
> This was discovered by Aleksey Gorelov <Aleksey_Gorelov@Phoenix.com>.
>
> This patch fixes it, but I think somebody may need to rework
> the semaphore code to eliminate the assembly because the newer
> compilers are not consistent in their handling of passed parameters
> so some assembly optimization may no longer be possible.
>
>
> --- linux-2.6.9/arch/i386/kernel/semaphore.c.orig 2004-08-14 01:36:56.000000000 -0400
> +++ linux-2.6.9/arch/i386/kernel/semaphore.c 2004-10-19 08:06:15.000000000 -0400
> @@ -198,9 +198,11 @@
> #endif
> "pushl %eax\n\t"
> "pushl %edx\n\t"
> - "pushl %ecx\n\t"
> + "pushl %ecx\n\t" // Register to save
> + "pushl %ecx\n\t" // Passed parameter
> "call __down\n\t"
> - "popl %ecx\n\t"
> + "addl $0x04, %esp\n\t" // Bypass corrupted parameter
> + "popl %ecx\n\t" // Restore original
> "popl %edx\n\t"
> "popl %eax\n\t"
> #if defined(CONFIG_FRAME_POINTER)
> @@ -220,9 +222,11 @@
> "movl %esp,%ebp\n\t"
> #endif
> "pushl %edx\n\t"
> - "pushl %ecx\n\t"
> + "pushl %ecx\n\t" // Save register
> + "pushl %ecx\n\t" // Passed parameter
> "call __down_interruptible\n\t"
> - "popl %ecx\n\t"
> + "addl $0x04, %esp\n\t" // Bypass corrupted parameter
> + "popl %ecx\n\t" // Restore register
> "popl %edx\n\t"
> #if defined(CONFIG_FRAME_POINTER)
> "movl %ebp,%esp\n\t"
> @@ -241,9 +245,11 @@
> "movl %esp,%ebp\n\t"
> #endif
> "pushl %edx\n\t"
> - "pushl %ecx\n\t"
> + "pushl %ecx\n\t" // Save register
> + "pushl %ecx\n\t" // Passed parameter
> "call __down_trylock\n\t"
> - "popl %ecx\n\t"
> + "addl $0x04, %esp\n\t" // Bypass corrupted parameter
> + "popl %ecx\n\t" // Restore register
> "popl %edx\n\t"
> #if defined(CONFIG_FRAME_POINTER)
> "movl %ebp,%esp\n\t"
> @@ -259,9 +265,11 @@
> "__up_wakeup:\n\t"
> "pushl %eax\n\t"
> "pushl %edx\n\t"
> - "pushl %ecx\n\t"
> + "pushl %ecx\n\t" // Save register
> + "pushl %ecx\n\t" // Passed parameter
> "call __up\n\t"
> - "popl %ecx\n\t"
> + "addl $0x04, %esp\n\t" // Bypass corrupted parameter
> + "popl %ecx\n\t" // Restore register
> "popl %edx\n\t"
> "popl %eax\n\t"
> "ret"
>
>
> Cheers,
> Dick Johnson
> Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
> Notice : All mail here is now cached for review by John Ashcroft.
> 98.36% of all statistics are fiction.
>
* Semaphore assembly-code bug
2004-10-20 11:49 ` Richard B. Johnson
@ 2004-10-29 12:12 ` linux-os
2004-10-29 14:46 ` Linus Torvalds
0 siblings, 1 reply; 99+ messages in thread
From: linux-os @ 2004-10-29 12:12 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kernel Mailing List
Linus, please check this out.
This 'C' compiler destroys parameters passed to functions
even though the code does not alter those parameters.
gcc (GCC) 3.3.3 20040412 (Red Hat Linux 3.3.3-7)
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The 'C' compiler is provided in a recent Fedora distribution.
For instance:
asmlinkage void __up(struct semaphore *sem)
{
wake_up(&sem->wait);
}
This was from /usr/src/linux-2.6.9/arch/i386/kernel/semaphore.c
In this case, the value of 'sem' is destroyed, which means that
certain assembly-language helper functions no longer work.
This was discovered by Aleksey Gorelov <Aleksey_Gorelov@Phoenix.com>.
This patch fixes it, but I think somebody may need to rework
the semaphore code to eliminate the assembly because the newer
compilers are not consistent in their handling of passed parameters
so some assembly optimization may no longer be possible.
--- linux-2.6.9/arch/i386/kernel/semaphore.c.orig 2004-08-14 01:36:56.000000000 -0400
+++ linux-2.6.9/arch/i386/kernel/semaphore.c 2004-10-19 08:06:15.000000000 -0400
@@ -198,9 +198,11 @@
#endif
"pushl %eax\n\t"
"pushl %edx\n\t"
- "pushl %ecx\n\t"
+ "pushl %ecx\n\t" // Register to save
+ "pushl %ecx\n\t" // Passed parameter
"call __down\n\t"
- "popl %ecx\n\t"
+ "addl $0x04, %esp\n\t" // Bypass corrupted parameter
+ "popl %ecx\n\t" // Restore original
"popl %edx\n\t"
"popl %eax\n\t"
#if defined(CONFIG_FRAME_POINTER)
@@ -220,9 +222,11 @@
"movl %esp,%ebp\n\t"
#endif
"pushl %edx\n\t"
- "pushl %ecx\n\t"
+ "pushl %ecx\n\t" // Save register
+ "pushl %ecx\n\t" // Passed parameter
"call __down_interruptible\n\t"
- "popl %ecx\n\t"
+ "addl $0x04, %esp\n\t" // Bypass corrupted parameter
+ "popl %ecx\n\t" // Restore register
"popl %edx\n\t"
#if defined(CONFIG_FRAME_POINTER)
"movl %ebp,%esp\n\t"
@@ -241,9 +245,11 @@
"movl %esp,%ebp\n\t"
#endif
"pushl %edx\n\t"
- "pushl %ecx\n\t"
+ "pushl %ecx\n\t" // Save register
+ "pushl %ecx\n\t" // Passed parameter
"call __down_trylock\n\t"
- "popl %ecx\n\t"
+ "addl $0x04, %esp\n\t" // Bypass corrupted parameter
+ "popl %ecx\n\t" // Restore register
"popl %edx\n\t"
#if defined(CONFIG_FRAME_POINTER)
"movl %ebp,%esp\n\t"
@@ -259,9 +265,11 @@
"__up_wakeup:\n\t"
"pushl %eax\n\t"
"pushl %edx\n\t"
- "pushl %ecx\n\t"
+ "pushl %ecx\n\t" // Save register
+ "pushl %ecx\n\t" // Passed parameter
"call __up\n\t"
- "popl %ecx\n\t"
+ "addl $0x04, %esp\n\t" // Bypass corrupted parameter
+ "popl %ecx\n\t" // Restore register
"popl %edx\n\t"
"popl %eax\n\t"
"ret"
Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.
end of thread, other threads:[~2004-11-03 21:23 UTC | newest]
Thread overview: 99+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <Pine.LNX.4.58.0410181540080.2287@ppc970.osdl.org.suse.lists.linux.kernel>
[not found] ` <417550FB.8020404@drdos.com.suse.lists.linux.kernel>
[not found] ` <1098218286.8675.82.camel@mentorng.gurulabs.com.suse.lists.linux.kernel>
[not found] ` <41757478.4090402@drdos.com.suse.lists.linux.kernel>
[not found] ` <20041020034524.GD10638@michonline.com.suse.lists.linux.kernel>
[not found] ` <1098245904.23628.84.camel@krustophenia.net.suse.lists.linux.kernel>
[not found] ` <1098247307.23628.91.camel@krustophenia.net.suse.lists.linux.kernel>
[not found] ` <Pine.LNX.4.61.0410200744310.10521@chaos.analogic.com.suse.lists.linux.kernel>
[not found] ` <Pine.LNX.4.61.0410290805570.11823@chaos.analogic.com.suse.lists.linux.kernel>
[not found] ` <Pine.LNX.4.58.0410290740120.28839@ppc970.osdl.org.suse.lists.linux.kernel>
[not found] ` <41826A7E.6020801@domdv.de.suse.lists.linux.kernel>
[not found] ` <Pine.LNX.4.61.0410291255400.17270@chaos.analogic.com.suse.lists.linux.kernel>
[not found] ` <Pine.LNX.4.58.0410291103000.28839@ppc970.osdl.org.suse.lists.linux.kernel>
[not found] ` <Pine.LNX.4.61.0410291631250.8616@twinlark.arctic.org.suse.lists.linux.kernel>
2004-10-30 2:04 ` Semaphore assembly-code bug Andi Kleen
[not found] ` <Pine.LNX.4.61.0410291316470.3945@chaos.analogic.com.suse.lists.linux.kernel>
[not found] ` <20041029175527.GB25764@redhat.com.suse.lists.linux.kernel>
[not found] ` <Pine.LNX.4.61.0410291416040.4844@chaos.analogic.com.suse.lists.linux.kernel>
[not found] ` <Pine.LNX.4.58.0410291133220.28839@ppc970.osdl.org.suse.lists.linux.kernel>
2004-10-30 2:13 ` Andi Kleen
2004-10-30 9:28 ` Denis Vlasenko
2004-10-30 17:53 ` Linus Torvalds
2004-10-30 21:00 ` Denis Vlasenko
2004-10-30 21:14 ` code bloat [was Re: Semaphore assembly-code bug] Lee Revell
2004-10-30 22:11 ` Denis Vlasenko
2004-10-30 22:25 ` Lee Revell
2004-10-31 14:06 ` Diego Calleja
2004-10-31 20:53 ` Z Smith
2004-10-31 23:35 ` Rogério Brito
2004-11-01 1:20 ` Z Smith
2004-11-01 14:48 ` Diego Calleja
2004-11-01 15:09 ` [OT] " Russell Miller
2004-10-30 22:27 ` Tim Hockin
2004-10-30 22:44 ` Jeff Garzik
2004-10-30 22:50 ` Tim Hockin
2004-10-31 20:15 ` Theodore Ts'o
2004-10-31 20:21 ` Jeff Garzik
2004-10-31 21:06 ` Jan Engelhardt
2004-11-01 11:27 ` Alan Cox
2004-11-01 13:40 ` Denis Vlasenko
2004-11-01 23:04 ` Alan Cox
2004-10-30 23:13 ` Denis Vlasenko
2004-10-30 22:45 ` Alan Cox
2004-10-31 1:21 ` Z Smith
2004-10-31 2:47 ` Jim Nelson
2004-10-31 15:19 ` Alan Cox
2004-10-31 20:18 ` Z Smith
2004-11-01 11:05 ` Alan Cox
2004-10-30 23:20 ` [OT] " Lee Revell
2004-10-30 22:52 ` Alan Cox
2004-10-31 1:09 ` Ken Moffat
2004-10-31 2:42 ` Tim Connors
2004-10-31 4:45 ` Paul
2004-10-31 14:44 ` Alan Cox
2004-10-31 0:48 ` Andi Kleen
2004-10-30 23:28 ` Tim Hockin
2004-10-31 2:04 ` Michael Clark
2004-10-31 6:49 ` Jan Engelhardt
2004-10-31 21:09 ` Z Smith
2004-10-31 21:13 ` Jan Engelhardt
2004-10-31 21:48 ` Z Smith
2004-11-01 11:29 ` Alan Cox
2004-11-01 12:36 ` Jan Engelhardt
2004-11-01 15:17 ` Lee Revell
2004-11-01 16:56 ` Kristian Høgsberg
2004-10-31 6:37 ` Jan Engelhardt
2004-10-31 0:39 ` Semaphore assembly-code bug Andi Kleen
2004-10-31 1:43 ` Linus Torvalds
2004-10-31 2:04 ` Andi Kleen
2004-10-18 22:45 Linux v2.6.9 Linus Torvalds
2004-10-19 17:38 ` Linux v2.6.9 and GPL Buyout Jeff V. Merkey
2004-10-19 20:38 ` Dax Kelson
2004-10-19 20:09 ` Jeff V. Merkey
2004-10-20 3:45 ` Ryan Anderson
2004-10-20 4:18 ` Lee Revell
2004-10-20 4:41 ` Lee Revell
2004-10-20 11:49 ` Richard B. Johnson
2004-10-29 12:12 ` Semaphore assembly-code bug linux-os
2004-10-29 14:46 ` Linus Torvalds
2004-10-29 15:11 ` Andi Kleen
2004-10-29 18:18 ` Linus Torvalds
2004-10-29 18:35 ` Richard Henderson
2004-10-29 16:06 ` Andreas Steinmetz
2004-10-29 17:08 ` linux-os
2004-10-29 18:06 ` Linus Torvalds
2004-10-29 18:39 ` linux-os
2004-10-29 19:12 ` Linus Torvalds
2004-11-01 1:31 ` linux-os
2004-11-01 5:49 ` Linus Torvalds
2004-11-01 20:23 ` dean gaudet
2004-11-01 20:52 ` linux-os
2004-11-01 21:23 ` dean gaudet
2004-11-01 22:22 ` linux-os
2004-11-01 21:40 ` Linus Torvalds
2004-11-01 21:46 ` Linus Torvalds
2004-11-02 15:02 ` linux-os
2004-11-02 16:02 ` Linus Torvalds
2004-11-02 16:06 ` Linus Torvalds
2004-11-02 16:51 ` linux-os
2004-11-01 22:16 ` linux-os
2004-11-01 22:26 ` Linus Torvalds
2004-11-01 23:14 ` linux-os
2004-11-01 23:42 ` Linus Torvalds
2004-11-03 1:52 ` Horst von Brand
2004-11-03 21:24 ` Bill Davidsen
2004-11-02 6:37 ` Chris Friesen
2004-10-29 18:58 ` Andreas Steinmetz
2004-10-29 19:15 ` Linus Torvalds
2004-10-29 19:40 ` Andreas Steinmetz
2004-10-29 19:56 ` Linus Torvalds
2004-10-29 22:07 ` Jeff Garzik
2004-10-29 23:50 ` dean gaudet
2004-10-30 0:15 ` Linus Torvalds
2004-10-29 23:37 ` dean gaudet
2004-10-29 17:22 ` linux-os
2004-10-29 17:55 ` Richard Henderson
2004-10-29 18:17 ` linux-os
2004-10-29 18:42 ` Linus Torvalds
2004-10-29 18:54 ` Linus Torvalds
2004-10-30 3:35 ` Jeff Garzik
2004-10-29 19:20 ` Linus Torvalds
2004-10-29 19:26 ` Linus Torvalds
2004-10-29 21:03 ` Linus Torvalds
2004-10-29 17:57 ` Richard Henderson
2004-10-29 18:37 ` Gabriel Paubert