linux-kernel.vger.kernel.org archive mirror
* Re: 7.52 second kernel compile
@ 2002-03-18 22:12 Dieter Nützel
  2002-03-18 22:46 ` Linus Torvalds
  0 siblings, 1 reply; 40+ messages in thread
From: Dieter Nützel @ 2002-03-18 22:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel List

On Mon, 18 Mar 2002, 20:23:48 Linus Torvalds wrote:
> On Mon, 18 Mar 2002, Linus Torvalds wrote:
> >
> > Well, I actually think that an x86 comes fairly close.
>
> Btw, here's a program that does a simple histogram of TLB miss cost, and
> shows the interesting pattern on intel I was talking about: every 8th miss
> is most costly, apparently because Intel pre-fetches 8 TLB entries at a
> time.
>
> So on a PII core, you'll see something like
>
>          87.50: 36
>          12.39: 40
>
> ie 87.5% (exactly 7/8) of the TLB misses take 36 cycles, while 12.4% (ie
> 1/8) takes 40 cycles (and I assume that the extra 4 cycles is due to
> actually loading the thing from the data cache).
>
> Yeah, my program might be buggy, so take the numbers with a pinch of salt.
> But it's interesting to see how on an athlon the numbers are
>
>           3.17: 59
>          34.94: 62
>           4.71: 85
>          54.83: 88
>
> ie roughly 60% take 85-90 cycles, and 40% take ~60 cycles. I don't know
> where that pattern would come from..

Linus,

it seems that it depends on gcc and flags.

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 2
model name      : AMD Athlon(tm) Processor
stepping        : 2
cpu MHz         : 998.068
cache size      : 512 KB

/home/nuetzel> gcc -v
Reading specs from /usr/lib/gcc-lib/i486-suse-linux/2.95.3/specs
gcc version 2.95.3 20010315 (SuSE)

SuSE default (-march=i486 -mcpu=i486)
/home/nuetzel> gcc -o TLB_miss TLB_miss.c
/home/nuetzel> time ./TLB_miss
  12.72: 19
  85.15: 21
0.460u 0.050s 0:00.50 102.0%    0+0k 0+0io 101pf+0w

/home/nuetzel> gcc -mcpu=i486 -o TLB_miss TLB_miss.c
/home/nuetzel> time ./TLB_miss
  12.75: 19
  84.92: 21
0.510u 0.010s 0:00.51 101.9%    0+0k 0+0io 101pf+0w

/home/nuetzel> gcc -mcpu=i686 -o TLB_miss TLB_miss.c
/home/nuetzel> time ./TLB_miss
  12.96: 19
  84.57: 21
0.460u 0.050s 0:00.50 102.0%    0+0k 0+0io 101pf+0w

/home/nuetzel> gcc -mcpu=k6 -o TLB_miss TLB_miss.c
/home/nuetzel> time ./TLB_miss
  13.16: 19
  84.88: 21
0.490u 0.010s 0:00.50 100.0%    0+0k 0+0io 101pf+0w

/home/nuetzel> gcc -O2 -mcpu=i686 -o TLB_miss TLB_miss.c
/home/nuetzel> time ./TLB_miss
   2.03: 67
   1.33: 80
   3.50: 82
  19.65: 91
   1.37: 92
  18.17: 93
   1.59: 94
  41.68: 97
   2.83: 98
   1.82: 106
   1.60: 107
0.450u 0.000s 0:00.46 97.8%     0+0k 0+0io 101pf+0w

/home/nuetzel> gcc -O2 -mcpu=i486 -o TLB_miss TLB_miss.c
/home/nuetzel> time ./TLB_miss
   1.98: 67
   1.28: 80
   3.37: 82
  19.78: 91
   1.37: 92
  18.30: 93
   1.59: 94
  41.71: 97
   2.84: 98
   1.82: 106
   1.60: 107
0.440u 0.010s 0:00.46 97.8%     0+0k 0+0io 101pf+0w

/home/nuetzel> gcc -O -mcpu=i486 -o TLB_miss TLB_miss.c
/home/nuetzel> time ./TLB_miss
  31.73: 19
  46.76: 22
   9.90: 29
   8.23: 30
0.430u 0.030s 0:00.45 102.2%    0+0k 0+0io 101pf+0w

/home/nuetzel> gcc -O1 -mcpu=i486 -o TLB_miss TLB_miss.c
/home/nuetzel> time ./TLB_miss
  35.17: 19
  47.28: 22
   7.92: 29
   6.70: 30
0.420u 0.040s 0:00.45 102.2%    0+0k 0+0io 101pf+0w

/home/nuetzel> gcc -Os -mcpu=i486 -o TLB_miss TLB_miss.c
/home/nuetzel> time ./TLB_miss
   2.66: 67
   1.79: 80
   4.51: 82
  18.58: 91
   1.31: 92
  17.11: 93
   1.68: 94
  40.38: 97
   2.87: 98
   1.80: 106
   1.68: 107
0.470u 0.010s 0:00.49 97.9%     0+0k 0+0io 101pf+0w

/home/nuetzel> gcc -march=i486 -mcpu=i486 -o TLB_miss TLB_miss.c
/home/nuetzel> time ./TLB_miss
  17.12: 19
  80.45: 21
0.480u 0.030s 0:00.50 102.0%    0+0k 0+0io 101pf+0w

/home/nuetzel> gcc -march=i686 -mcpu=i686 -o TLB_miss TLB_miss.c
/home/nuetzel> time ./TLB_miss
  17.23: 19
  80.57: 21
0.480u 0.010s 0:00.50 98.0%     0+0k 0+0io 101pf+0w

/home/nuetzel> gcc -march=k6 -mcpu=k6 -o TLB_miss TLB_miss.c
/home/nuetzel> time ./TLB_miss
  14.15: 19
  83.81: 21
0.480u 0.030s 0:00.50 102.0%    0+0k 0+0io 101pf+0w

-- 
Dieter Nützel
Graduate Student, Computer Science

University of Hamburg
Department of Computer Science
@home: Dieter.Nuetzel@hamburg.de


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 22:12 7.52 second kernel compile Dieter Nützel
@ 2002-03-18 22:46 ` Linus Torvalds
  2002-03-18 23:53   ` Davide Libenzi
  2002-03-19  0:20   ` David S. Miller
  0 siblings, 2 replies; 40+ messages in thread
From: Linus Torvalds @ 2002-03-18 22:46 UTC (permalink / raw)
  To: Dieter Nützel; +Cc: Linux Kernel List


On Mon, 18 Mar 2002, Dieter Nützel wrote:
>
> it seems that it depends on gcc and flags.

That instability doesn't seem to show up on a PII. Interesting. Looks like 
the athlon may be reordering TLB accesses, while the PII apparently 
doesn't.

Or maybe the program is just flawed, and the interesting 1/8 pattern comes 
from something else altogether.

			Linus



* Re: 7.52 second kernel compile
  2002-03-18 22:46 ` Linus Torvalds
@ 2002-03-18 23:53   ` Davide Libenzi
  2002-03-19  0:20   ` David S. Miller
  1 sibling, 0 replies; 40+ messages in thread
From: Davide Libenzi @ 2002-03-18 23:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dieter Nützel, Linux Kernel List


On Mon, 18 Mar 2002, Linus Torvalds wrote:

>
> On Mon, 18 Mar 2002, Dieter Nützel wrote:
> >
> > it seems that it depends on gcc and flags.
>
> That instability doesn't seem to show up on a PII. Interesting. Looks like
> the athlon may be reordering TLB accesses, while the PII apparently
> doesn't.
>
> Or maybe the program is just flawed, and the interesting 1/8 pattern comes
> from something else altogether.


Umhh, something magic must be happening inside the Athlon pipeline to explain this:


processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 4
model name      : AMD Athlon(tm) Processor
stepping        : 2
cpu MHz         : 999.561
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov
			pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips        : 1992.29



$ gcc -o tlb_test tlb_test.c

#APP
    rdtsc
#NO_APP
    movl    %eax, -16(%ebp)
    movl    -4(%ebp), %eax
    addl    -12(%ebp), %eax
    movl    (%eax), %eax
#APP
    rdtsc
#NO_APP
    movl    %eax, -20(%ebp)


98.76: 21



$ gcc -O2 -o tlb_test tlb_test.c

#APP
    rdtsc
#NO_APP
    movl    -16(%ebp), %edx
    movl    %eax, %ecx
    movl    (%ebx,%edx), %eax
#APP
    rdtsc
#NO_APP
    subl    %ecx, %eax


97.59: 94


The only thing I can think of is that stuff is moved between the two rdtsc
instructions ... maybe a barrier would help to get more consistent results.




- Davide




* Re: 7.52 second kernel compile
  2002-03-18 22:46 ` Linus Torvalds
  2002-03-18 23:53   ` Davide Libenzi
@ 2002-03-19  0:20   ` David S. Miller
  2002-03-19  0:47     ` Davide Libenzi
                       ` (2 more replies)
  1 sibling, 3 replies; 40+ messages in thread
From: David S. Miller @ 2002-03-19  0:20 UTC (permalink / raw)
  To: torvalds; +Cc: Dieter.Nuetzel, linux-kernel

   From: Linus Torvalds <torvalds@transmeta.com>
   Date: Mon, 18 Mar 2002 14:46:04 -0800 (PST)
   
   Or maybe the program is just flawed, and the interesting 1/8 pattern comes 
   from something else altogether.

I think the weird Athlon behavior has to do with the fact that
you've made your little test program as much of a cache tester
as a TLB tester :-)

I've made some modifications to the program:

1) Killed 4096 PAGE_SIZE assumption
2) Size of BUFFER_SIZE made it a cache miss measurement rather
   than a TLB miss measurement tool in certain cases (non-set
   associative L2 caches). I've decreased it to 16MB.  But
   see below for more discussion on this.
3) Made tick measurements take into account the cost of
   the tick reads themselves (which typically do flush the
   pipeline on either side of the tick read).  This is computed
   portably before the tests run and the result is used in
   the rdtsc() macro. 
4) Sparc64 rdtsc()

Actually, with non-set associative caches, it is often the case that
the TLB can hold entries for more than the size of the L2 cache
_IFF_ we access the first word of each page in the access() loops.

A great fix for this is to offset each access by some cache line
size, I've used 128 for this in my changes.  In this way we are much
less likely to make this turn into a cache miss tester.

I've chosen 16MB for BUFFER_SIZE because this amounts to a:

	(16MB / PAGE_SIZE)

such that for the largest normal page size (8192) it gives the
largest number of TLB entries I know any D-TLB has.  This is
1024 entries for UltraSPARC-III's data TLB (it's actually 512
entry, 2-way set associative).  I am potentially way off in this
estimate, so if there is some chip Linux runs on which has more
D-TLB entries, please fix up the code and let me know :-)

I have a program called lat_tlb which I wrote a long time ago; it is
very Sparc64 specific and I used it to measure the best case TLB miss
overhead our software TLB refill could get.  Oh, this program also
used jumps into a special assembly file full of "return" instructions
to measure instruction TLB misses as well, which I thought was neat.
I can send the lat_tlb sources to anyone who is interested.

On UltraSPARC-III this "best case" data TLB miss cost is ~80 cycles,
on UltraSPARC-I/II/IIi/IIe it is ~50 cycles.

The result of "linus_lattlb" on UltraSPARC-III is:

pagesize: 8192 pageshift: 13 cachelines: 64
tick read overhead is 7
  14.39: 79
  69.48: 93
   8.00: 94
   2.32: 95
   2.26: 105

on all the older UltraSPARCs it is:

pagesize: 8192 pageshift: 13 cachelines: 64
tick read overhead is 5
   5.43: 41
  87.12: 43
   6.37: 48

On my Athlon 1800+ XP I get:

pagesize: 4096 pageshift: 12 cachelines: 32
tick read overhead is 11
  92.95: 16
   1.54: 18
   1.10: 21
   1.10: 28

(Just to make sure, on the Athlon I increased BUFFER_SIZE over and
 over again until it was 128MB, Linus's original value, this
 did not change the results at all)

Below are my changes to "linus_lattlb.c" :-)

To compile on UltraSPARC please add the "-Wa,-Av9a" option to gcc
so that it allows the TICK register read instructions.

Also be sure to compile with -O2 as this can change the results
slightly as well.

--- linus_lattlb.c.~1~	Mon Mar 18 14:13:58 2002
+++ linus_lattlb.c	Mon Mar 18 16:06:38 2002
@@ -1,28 +1,84 @@
 #include <stdlib.h>
 
+#if defined(__i386__)
 #define rdtsc(low) \
-   __asm__ __volatile__("rdtsc" : "=a" (low) : : "edx")
+do {	__asm__ __volatile__("rdtsc" : "=a" (low) : : "edx"); \
+	low -= overhead; \
+} while (0)
+#elif defined(__sparc__)
+#define rdtsc(low) \
+do {    __asm__ __volatile__("rd %%tick, %0" : "=r" (low)); \
+	low -= overhead; \
+} while (0)
+#endif
 
 #define MAXTIMES 1000
-#define BUFSIZE (128*1024*1024)
+#define BUFSIZE (16*1024*1024)
 #define access(x) (*(volatile unsigned int *)&(x))
+#define CACHE_LINE_SIZE	128
+
+#define COMPUTE_INDEX(idx, i)	\
+do {	(idx) = (i) + ((((i)>>pageshift) & (cachelines - 1)) * CACHE_LINE_SIZE); \
+} while (0)
 
 int main()
 {
+	unsigned long overhead, overhead_test, pagesize, pageshift, cachelines;
 	unsigned int i, j;
 	static int times[MAXTIMES];
 	char *buffer = malloc(BUFSIZE);
 
-	for (i = 0; i < BUFSIZE; i += 4096)
-		access(buffer[i]);
+	pagesize = getpagesize();
+	cachelines = (pagesize / CACHE_LINE_SIZE);
+
+	for (i = 0; i < 32; i++)
+		if ((1 << i) == pagesize)
+			break;
+
+	if (i == 32)
+		exit(1);
+
+	pageshift = i;
+	printf("pagesize: %lu pageshift: %lu cachelines: %lu\n",
+	       pagesize, pageshift, cachelines);
+
+	/* Remember, overhead is subtracted from the tick values read
+	 * so we have to calibrate it with a variable of a different
+	 * name.
+	 */
+	overhead = 0UL;
+	overhead_test = ~0UL;
+
+	for (i = 0; i < 8; i++) {
+		unsigned long start, end;
+		rdtsc(start);
+		rdtsc(end);
+		end -= start;
+		if (end < overhead_test)
+			overhead_test = end;
+	}
+	overhead = overhead_test;
+	printf("tick read overhead is %lu\n", overhead);
+
+	for (i = 0; i < BUFSIZE; i += pagesize) {
+		int idx;
+
+		COMPUTE_INDEX(idx, i);
+		access(buffer[idx]);
+	}
+
 	for (i = 0; i < MAXTIMES; i++)
 		times[i] = 0;
+
 	for (j = 0; j < 100; j++) {
-		for (i = 0; i < BUFSIZE ; i+= 4096) {
+		for (i = 0; i < BUFSIZE ; i+= pagesize) {
 			unsigned long start, end;
+			int idx;
+
+			COMPUTE_INDEX(idx, i);
 
 			rdtsc(start);
-			access(buffer[i]);
+			access(buffer[idx]);
 			rdtsc(end);
 			end -= start;
 			if (end >= MAXTIMES)
@@ -32,7 +88,7 @@
 	}
 	for (i = 0; i < MAXTIMES; i++) {
 		int count = times[i];
-		double percent = (double)count / (BUFSIZE/4096);
+		double percent = (double)count / (BUFSIZE/pagesize);
 		if (percent < 1)
 			continue;
 		printf("%7.2f: %d\n", percent, i);




* Re: 7.52 second kernel compile
  2002-03-19  0:20   ` David S. Miller
@ 2002-03-19  0:47     ` Davide Libenzi
  2002-03-19  1:37     ` Andreas Ferber
  2002-03-19  2:08     ` Linus Torvalds
  2 siblings, 0 replies; 40+ messages in thread
From: Davide Libenzi @ 2002-03-19  0:47 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linus Torvalds, Dieter.Nuetzel, Linux Kernel Mailing List

On Mon, 18 Mar 2002, David S. Miller wrote:

>    From: Linus Torvalds <torvalds@transmeta.com>
>    Date: Mon, 18 Mar 2002 14:46:04 -0800 (PST)
>
>    Or maybe the program is just flawed, and the interesting 1/8 pattern comes
>    from something else altogether.
>
> I think the weird Athlon behavior has to do with the fact that
> you've made your little test program as much of a cache tester
> as a TLB tester :-)

Uhm, it's moving to different pages and it does it consecutively. I think
Linus was trying to prove the multiple-TLB-entry fill per single miss ...



- Davide




* Re: 7.52 second kernel compile
  2002-03-19  0:20   ` David S. Miller
  2002-03-19  0:47     ` Davide Libenzi
@ 2002-03-19  1:37     ` Andreas Ferber
  2002-03-19  1:38       ` David S. Miller
  2002-03-19  2:08     ` Linus Torvalds
  2 siblings, 1 reply; 40+ messages in thread
From: Andreas Ferber @ 2002-03-19  1:37 UTC (permalink / raw)
  To: David S. Miller; +Cc: torvalds, Dieter.Nuetzel, linux-kernel

On Mon, Mar 18, 2002 at 04:20:31PM -0800, David S. Miller wrote:
>    
>    Or maybe the program is just flawed, and the interesting 1/8 pattern comes 
>    from something else altogether.
> I think the weird Athlon behavior has to do with the fact that
> you've made your little test program as much of a cache tester
> as a TLB tester :-)

Erm, you forgot COW semantics. The accesses to buffer are actually all
going to the same physical address. As CPU caches work on physical
addresses AFAIK (everything else would be just stupid ;-), there are
no cache misses (disregarding a few produced by IRQs/scheduling etc.).

Andreas
-- 
       Andreas Ferber - dev/consulting GmbH - Bielefeld, FRG
     ---------------------------------------------------------
         +49 521 1365800 - af@devcon.net - www.devcon.net


* Re: 7.52 second kernel compile
  2002-03-19  1:37     ` Andreas Ferber
@ 2002-03-19  1:38       ` David S. Miller
  0 siblings, 0 replies; 40+ messages in thread
From: David S. Miller @ 2002-03-19  1:38 UTC (permalink / raw)
  To: aferber; +Cc: torvalds, Dieter.Nuetzel, linux-kernel

   From: Andreas Ferber <aferber@techfak.uni-bielefeld.de>
   Date: Tue, 19 Mar 2002 02:37:55 +0100
   
   Erm, you forgot COW semantics. The accesses to buffer are actually all
   going to the same physical address. As CPU caches work on physical
   addresses AFAIK (everything else would be just stupid ;-), there are
   no cache misses (disregarding a few produced by IRQs/scheduling etc.).

ROFL, ignore that part of my postings then :-)


* Re: 7.52 second kernel compile
  2002-03-19  0:20   ` David S. Miller
  2002-03-19  0:47     ` Davide Libenzi
  2002-03-19  1:37     ` Andreas Ferber
@ 2002-03-19  2:08     ` Linus Torvalds
  2002-03-19  5:24       ` Erik Andersen
  2 siblings, 1 reply; 40+ messages in thread
From: Linus Torvalds @ 2002-03-19  2:08 UTC (permalink / raw)
  To: David S. Miller; +Cc: Dieter.Nuetzel, linux-kernel


On Mon, 18 Mar 2002, David S. Miller wrote:
>    
>    Or maybe the program is just flawed, and the interesting 1/8 pattern comes 
>    from something else altogether.
> 
> I think the weird Athlon behavior has to do with the fact that
> you've made your little test program as much of a cache tester
> as a TLB tester :-)

Oh, I was assuming that malloc(BIG) would do a mmap() of MAP_ANONYMOUS, 
which should make all the pages 100% shared, and thus basically zero cache 
overhead on a physically indexed machine like an x86. 

So it was designed to really only stress the TLB, not the regular caches.

Although I have to admit that I didn't actually _test_ that hypothesis.

		Linus



* Re: 7.52 second kernel compile
  2002-03-19  2:08     ` Linus Torvalds
@ 2002-03-19  5:24       ` Erik Andersen
  0 siblings, 0 replies; 40+ messages in thread
From: Erik Andersen @ 2002-03-19  5:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, Dieter.Nuetzel, linux-kernel

On Mon Mar 18, 2002 at 06:08:17PM -0800, Linus Torvalds wrote:
> 
> On Mon, 18 Mar 2002, David S. Miller wrote:
> >    
> >    Or maybe the program is just flawed, and the interesting 1/8 pattern comes 
> >    from something else altogether.
> > 
> > I think the weird Athlon behavior has to do with the fact that
> > you've made your little test program as much of a cache tester
> > as a TLB tester :-)
> 
> Oh, I was assuming that malloc(BIG) would do a mmap() of MAP_ANONYMOUS, 

Perhaps adding an explicit 

    void *malloc(size_t size)
    {
	void *result = mmap((void *) 0, size + sizeof(size_t), PROT_READ | PROT_WRITE,
		MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (result == MAP_FAILED)
	    exit(EXIT_FAILURE);
	* (size_t *) result = size;
	return(result + sizeof(size_t));
    }

would ensure libc isn't trying to do something sneaky,

 -Erik

--
Erik B. Andersen             http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--


* Re: 7.52 second kernel compile
  2002-03-18 20:23           ` Linus Torvalds
                               ` (3 preceding siblings ...)
  2002-03-27  2:53             ` Richard Henderson
@ 2002-04-02 10:50             ` Pablo Alcaraz
  4 siblings, 0 replies; 40+ messages in thread
From: Pablo Alcaraz @ 2002-04-02 10:50 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds wrote:

>
>But it's interesting to see how on an athlon the numbers are
>
>	   3.17: 59
>	  34.94: 62
>	   4.71: 85
>	  54.83: 88
>
>ie roughly 60% take 85-90 cycles, and 40% take ~60 cycles. I don't know
>where that pattern would come from..
>
In an athlon 1Ghz the numbers are:

94.49: 20
 2.51: 21

I don't know why the numbers are so different.

Pablo



* Re: 7.52 second kernel compile
  2002-03-27  2:53             ` Richard Henderson
@ 2002-04-02  4:32               ` Linus Torvalds
  0 siblings, 0 replies; 40+ messages in thread
From: Linus Torvalds @ 2002-04-02  4:32 UTC (permalink / raw)
  To: Richard Henderson; +Cc: linux-kernel


On Tue, 26 Mar 2002, Richard Henderson wrote:
>
> For the record, Alpha timings:
> 
> pca164 @ 533MHz:
>   72.79: 19
>    1.50: 20
>   21.30: 35
>    1.50: 36
>    1.30: 105

Interesting. There seem to be three peaks: a big 4/1 split at 19-20 vs
35-36 cycles, which is probably just the L1 cache (8 bytes per entry,
32-byte cachelines on the EV5 gives 4 entries per cache load), while the
much smaller peak at 105 cycles might possibly be due to the virtual
lookup miss, causing a double TLB miss and a real walk every 8kB entries
(actually, much more often than that, since there's TLB pressure and the
virtual PTE mappings get thrown out faster than the theoretical numbers
would indicate)

It also shows how pretty studly it is to take a sw TLB miss quite that
quickly. Getting in and out of PAL-mode that quickly is rather impressive.

> ev6 @ 500MHz:
>    2.43: 78
>   72.13: 84
>    2.55: 89
>    5.87: 90
>    1.38: 105
>    5.94: 108
>    1.36: 112
> 
> I wonder how much of that ev6 slowdown is due to an SRM that
> has to handle both 3 and 4 level page tables, and how much is
> due to the more expensive syncing of the OOO pipeline...

The multi-level page table shouldn't hurt at all for the common case (ie
the virtual PTE lookup success), so my money would be on the pipeline
flush.

The other profile difference seems to be due to the 64-byte cacheline (ie
a cacheline now holds 8 entries, so 7/8th can be filled that way).

However, I doubt whether that third peak could be a double PTE fault, it
seems too big and too close in cycles to the others. So maybe the third
peak at 108 cycles is something else... As it seems to balance out very
nicely with the second peak, I wonder if there might not be something
making every other cache fill faster - like a 128-byte prefetch or an
external 128-byte line on the L2/L3? (Ie the third peak would be really
just the "other half" of the second peak).

		Linus



* Re: 7.52 second kernel compile
  2002-03-18 20:23           ` Linus Torvalds
                               ` (2 preceding siblings ...)
  2002-03-19  2:42             ` Paul Mackerras
@ 2002-03-27  2:53             ` Richard Henderson
  2002-04-02  4:32               ` Linus Torvalds
  2002-04-02 10:50             ` Pablo Alcaraz
  4 siblings, 1 reply; 40+ messages in thread
From: Richard Henderson @ 2002-03-27  2:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

For the record, Alpha timings:

pca164 @ 533MHz:
  72.79: 19
   1.50: 20
  21.30: 35
   1.50: 36
   1.30: 105

ev6 @ 500MHz:
   2.43: 78
  72.13: 84
   2.55: 89
   5.87: 90
   1.38: 105
   5.94: 108
   1.36: 112

I wonder how much of that ev6 slowdown is due to an SRM that
has to handle both 3 and 4 level page tables, and how much is
due to the more expensive syncing of the OOO pipeline...


r~


* Re: 7.52 second kernel compile
  2002-03-19  0:57                   ` Dave Jones
@ 2002-03-19  3:35                     ` Jeff Garzik
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff Garzik @ 2002-03-19  3:35 UTC (permalink / raw)
  To: Dave Jones; +Cc: Paul Mackerras, linux-kernel

Dave Jones wrote:

>On Tue, Mar 19, 2002 at 10:52:40AM +1100, Paul Mackerras wrote:
> > The G4 has 4 performance monitor counters that you can set up to
> > measure things like ITLB misses, DTLB misses, cycles spent doing
> > tablewalks for ITLB misses and DTLB misses, etc.
> > What I need to do now is
> > to put some better infrastructure for using those counters in place
> > and try your program using those counters instead of the timebase.
>
> Sounds like a good candidate for the first non-x86 port of oprofile[1].
> Write the kernel part, and all the nice userspace tools come for free.
> There are also a few other perfctr abstraction projects, which are
> linked off the oprofile pages somewhere iirc.
>

Maybe this is why drepper doesn't like threaded profiling... he wants us 
all to use oprofile.

/me ducks and runs....






* Re: 7.52 second kernel compile
  2002-03-18 20:23           ` Linus Torvalds
  2002-03-18 21:50             ` Rene Herman
  2002-03-18 22:36             ` Cort Dougan
@ 2002-03-19  2:42             ` Paul Mackerras
  2002-03-27  2:53             ` Richard Henderson
  2002-04-02 10:50             ` Pablo Alcaraz
  4 siblings, 0 replies; 40+ messages in thread
From: Paul Mackerras @ 2002-03-19  2:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Cort Dougan, linux-kernel

Linus Torvalds writes:

> Btw, here's a program that does a simple histogram of TLB miss cost, and
> shows the interesting pattern on intel I was talking about: every 8th miss
> is most costly, apparently because Intel pre-fetches 8 TLB entries at a
> time.

Here are the results on my 500Mhz G4 laptop:

   1.85: 22
  17.86: 26
  14.41: 28
  16.88: 42
  34.03: 46
   9.61: 48
   2.07: 88
   1.04: 90

The numbers are fairly repeatable except that the last two tend to
wobble around a little.  These are numbers of cycles obtained using
one of the performance monitor counters set to count every cycle.
The average is 40.6 cycles.

This was with a 512kB MMU hash table, which translates to 8192 hash
buckets each holding 8 ptes.  The machine has 1MB of L2 cache.

Paul.


* Re: 7.52 second kernel compile
  2002-03-19  0:38                         ` David S. Miller
@ 2002-03-19  1:28                           ` Davide Libenzi
  0 siblings, 0 replies; 40+ messages in thread
From: Davide Libenzi @ 2002-03-19  1:28 UTC (permalink / raw)
  To: David S. Miller; +Cc: cort, torvalds, linux-kernel

On Mon, 18 Mar 2002, David S. Miller wrote:

>    From: Cort Dougan <cort@fsmlabs.com>
>    Date: Mon, 18 Mar 2002 17:36:35 -0700
>
>    The structure of the program you suggested with more portable timing.
>
> Oh, just something like:
>
>
> 	gettimeofday(&stamp1);
> 	for (A MILLION TIMES) {
> 		TLB miss;
> 	}
> 	gettimeofday(&stamp2);

This makes the measurement stable on my machine:

#define rdtsc(low) \
   __asm__ __volatile__("rdtsc" : "=A" (low) : )


            unsigned long long start, end;

            rdtsc(start);
            access(buffer[i]);
            rdtsc(end);



processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 4
model name      : AMD Athlon(tm) Processor
stepping        : 2
cpu MHz         : 999.561
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov
		pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips        : 1992.29



$ gcc -o tlb_test tlb_test.c

#APP
    rdtsc
#NO_APP
    movl    %eax, -24(%ebp)
    movl    %edx, -20(%ebp)
    movl    -4(%ebp), %eax
    addl    -12(%ebp), %eax
    movl    (%eax), %eax
#APP
    rdtsc


  11.89: 18
   4.70: 20
  81.90: 23



$ gcc -O2 -o tlb_test tlb_test.c

#APP
    rdtsc
#NO_APP
    movl    %edx, -28(%ebp)
    movl    -24(%ebp), %edx
    movl    %eax, -32(%ebp)
    movl    (%esi,%edx), %ecx
#APP
    rdtsc
#NO_APP


  87.70: 20
  11.24: 25




- Davide




* Re: 7.52 second kernel compile
  2002-03-18 23:52                 ` Paul Mackerras
@ 2002-03-19  0:57                   ` Dave Jones
  2002-03-19  3:35                     ` Jeff Garzik
  0 siblings, 1 reply; 40+ messages in thread
From: Dave Jones @ 2002-03-19  0:57 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linux-kernel

On Tue, Mar 19, 2002 at 10:52:40AM +1100, Paul Mackerras wrote:
 > The G4 has 4 performance monitor counters that you can set up to
 > measure things like ITLB misses, DTLB misses, cycles spent doing
 > tablewalks for ITLB misses and DTLB misses, etc.
 > What I need to do now is
 > to put some better infrastructure for using those counters in place
 > and try your program using those counters instead of the timebase.

 Sounds like a good candidate for the first non-x86 port of oprofile[1].
 Write the kernel part, and all the nice userspace tools come for free.
 There are also a few other perfctr abstraction projects, which are
 linked off the oprofile pages somewhere iirc.

[1] http://oprofile.sf.net

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs


* Re: 7.52 second kernel compile
  2002-03-19  0:36                       ` Cort Dougan
@ 2002-03-19  0:38                         ` David S. Miller
  2002-03-19  1:28                           ` Davide Libenzi
  0 siblings, 1 reply; 40+ messages in thread
From: David S. Miller @ 2002-03-19  0:38 UTC (permalink / raw)
  To: cort; +Cc: torvalds, linux-kernel

   From: Cort Dougan <cort@fsmlabs.com>
   Date: Mon, 18 Mar 2002 17:36:35 -0700

   The structure of the program you suggested with more portable timing.
   
Oh, just something like:


	gettimeofday(&stamp1);
	for (A MILLION TIMES) {
		TLB miss;
	}
	gettimeofday(&stamp2);

Franks a lot,
David S. Miller
davem@redhat.com


* Re: 7.52 second kernel compile
  2002-03-19  0:27                     ` David S. Miller
@ 2002-03-19  0:36                       ` Cort Dougan
  2002-03-19  0:38                         ` David S. Miller
  0 siblings, 1 reply; 40+ messages in thread
From: Cort Dougan @ 2002-03-19  0:36 UTC (permalink / raw)
  To: David S. Miller; +Cc: torvalds, linux-kernel

The structure of the program you suggested with more portable timing.

}    Any suggestions for a structure, Dave?
} 
} Structure?  Of what?


* Re: 7.52 second kernel compile
  2002-03-19  0:27                   ` Cort Dougan
@ 2002-03-19  0:27                     ` David S. Miller
  2002-03-19  0:36                       ` Cort Dougan
  0 siblings, 1 reply; 40+ messages in thread
From: David S. Miller @ 2002-03-19  0:27 UTC (permalink / raw)
  To: cort; +Cc: torvalds, linux-kernel

   From: Cort Dougan <cort@fsmlabs.com>
   Date: Mon, 18 Mar 2002 17:27:05 -0700
   
   Any suggestions for a structure, Dave?

Structure?  Of what?


* Re: 7.52 second kernel compile
  2002-03-19  0:22                 ` David S. Miller
@ 2002-03-19  0:27                   ` Cort Dougan
  2002-03-19  0:27                     ` David S. Miller
  0 siblings, 1 reply; 40+ messages in thread
From: Cort Dougan @ 2002-03-19  0:27 UTC (permalink / raw)
  To: David S. Miller; +Cc: torvalds, linux-kernel

It would be easy to do with the debug registers on PPC but they're
supervisor level only.  Users have no need to profile their code, after
all.

A logic analyzer would be really handy here.  Dave, think you can swing
one? :)

I ended up using averages for my tests with the PPC when doing the MM
optimizations.  Wall-clock time tells you if you did a good thing or not,
but not what it was that you actually did :)

Any suggestions for a structure, Dave?

}    On Mon, 18 Mar 2002, Cort Dougan wrote:
}    > The cycle timer in this case is about 16.6MHz.
}    
}    Oh, your cycle timer is too slow to be interesting, apparently ;(
} 
} We could modify the test program to use more portable timing functions
} and do the TLB accesses several times over.  While this would get
} us something more reasonable on PPC, and be more portable, the results
} would be a bit less accurate because we'd be dealing effectively with
} averages instead of real cycle count samples.


* Re: 7.52 second kernel compile
  2002-03-18 22:47               ` Linus Torvalds
  2002-03-18 22:56                 ` Cort Dougan
  2002-03-18 23:52                 ` Paul Mackerras
@ 2002-03-19  0:22                 ` David S. Miller
  2002-03-19  0:27                   ` Cort Dougan
  2 siblings, 1 reply; 40+ messages in thread
From: David S. Miller @ 2002-03-19  0:22 UTC (permalink / raw)
  To: torvalds; +Cc: cort, linux-kernel

   From: Linus Torvalds <torvalds@transmeta.com>
   Date: Mon, 18 Mar 2002 14:47:19 -0800 (PST)

   On Mon, 18 Mar 2002, Cort Dougan wrote:
   > The cycle timer in this case is about 16.6MHz.
   
   Oh, your cycle timer is too slow to be interesting, apparently ;(

We could modify the test program to use more portable timing functions
and do the TLB accesses several times over.  While this would get
us something more reasonable on PPC, and be more portable, the results
would be a bit less accurate because we'd be dealing effectively with
averages instead of real cycle count samples.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 22:47               ` Linus Torvalds
  2002-03-18 22:56                 ` Cort Dougan
@ 2002-03-18 23:52                 ` Paul Mackerras
  2002-03-19  0:57                   ` Dave Jones
  2002-03-19  0:22                 ` David S. Miller
  2 siblings, 1 reply; 40+ messages in thread
From: Paul Mackerras @ 2002-03-18 23:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds writes:

> Oh, your cycle timer is too slow to be interesting, apparently ;(

The G4 has 4 performance monitor counters that you can set up to
measure things like ITLB misses, DTLB misses, cycles spent doing
tablewalks for ITLB misses and DTLB misses, etc.  I hacked up a
measurement of the misses and total cycles doing tablewalks during a
kernel compile and got an average of 36 cycles per DTLB miss and 40
cycles per ITLB miss on a 500MHz G4 machine.  What I need to do now is
to put some better infrastructure for using those counters in place
and try your program using those counters instead of the timebase.

Paul.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 22:47               ` Linus Torvalds
@ 2002-03-18 22:56                 ` Cort Dougan
  2002-03-18 23:52                 ` Paul Mackerras
  2002-03-19  0:22                 ` David S. Miller
  2 siblings, 0 replies; 40+ messages in thread
From: Cort Dougan @ 2002-03-18 22:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Unfortunately so.  I have some boards here that have higher precision
timers but nothing approaching the clock rate of the chip.  I don't think
there are any PPC boards with timers at that rate.

Some of the 6xx or 74xx model debug registers may have something useful
here, though.

} Oh, your cycle timer is too slow to be interesting, apparently ;(

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 22:36             ` Cort Dougan
@ 2002-03-18 22:47               ` Linus Torvalds
  2002-03-18 22:56                 ` Cort Dougan
                                   ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Linus Torvalds @ 2002-03-18 22:47 UTC (permalink / raw)
  To: Cort Dougan; +Cc: linux-kernel


On Mon, 18 Mar 2002, Cort Dougan wrote:
> 
> The cycle timer in this case is about 16.6MHz.

Oh, your cycle timer is too slow to be interesting, apparently ;(

		Linus


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 20:23           ` Linus Torvalds
  2002-03-18 21:50             ` Rene Herman
@ 2002-03-18 22:36             ` Cort Dougan
  2002-03-18 22:47               ` Linus Torvalds
  2002-03-19  2:42             ` Paul Mackerras
                               ` (2 subsequent siblings)
  4 siblings, 1 reply; 40+ messages in thread
From: Cort Dougan @ 2002-03-18 22:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Here's the modified for PPC version and the results.

The cycle timer in this case is about 16.6MHz.

# ./foo
  92.01: 1
   7.98: 2
# ./foo
   3.71: 0
  92.30: 1
   3.99: 2
# ./foo
  92.01: 1
   7.97: 2
# ./foo
  92.01: 1
   7.97: 2
# ./foo
   3.71: 0
  92.30: 1
   3.99: 2
# ./foo
   3.71: 0
  92.30: 1
   3.99: 2

#include <stdio.h>
#include <stdlib.h>

#if defined(__powerpc__)
#define rdtsc(low) \
   __asm__ __volatile__ ("mftb %0": "=r" (low))
#else
#define rdtsc(low) \
  __asm__ __volatile__("rdtsc" : "=a" (low) : : "edx")
#endif

#define MAXTIMES 1000
#define BUFSIZE (128*1024*1024)
#define access(x) (*(volatile unsigned int *)&(x))

int main()
{
	unsigned int i, j;
	static int times[MAXTIMES];
	char *buffer = malloc(BUFSIZE);

	for (i = 0; i < BUFSIZE; i += 4096)
		access(buffer[i]);
	for (i = 0; i < MAXTIMES; i++)
		times[i] = 0;
	for (j = 0; j < 100; j++) {
		for (i = 0; i < BUFSIZE ; i+= 4096) {
			unsigned long start, end;

			rdtsc(start);
			access(buffer[i]);
			rdtsc(end);
			end -= start;
			if (end >= MAXTIMES)
				end = MAXTIMES-1;
			times[end]++;
		}
	}
	for (i = 0; i < MAXTIMES; i++) {
		int count = times[i];
		double percent = (double)count / (BUFSIZE/4096);
		if (percent < 1)
			continue;
		printf("%7.2f: %d\n", percent, i);
	}
	return 0;
}


} Btw, here's a program that does a simple histogram of TLB miss cost, and
} shows the interesting pattern on intel I was talking about: every 8th miss
} is most costly, apparently because Intel pre-fetches 8 TLB entries at a
} time.
} 
} So on a PII core, you'll see something like
} 
} 	  87.50: 36
} 	  12.39: 40
} 
} ie 87.5% (exactly 7/8) of the TLB misses take 36 cycles, while 12.4% (ie
} 1/8) takes 40 cycles (and I assume that the extra 4 cycles is due to
} actually loading the thing from the data cache).
} 
} Yeah, my program might be buggy, so take the numbers with a pinch of salt.
} But it's interesting to see how on an athlon the numbers are
} 
} 	   3.17: 59
} 	  34.94: 62
} 	   4.71: 85
} 	  54.83: 88
} 
} ie roughly 60% take 85-90 cycles, and 40% take ~60 cycles. I don't know
} where that pattern would come from..
} 
} What are the ppc numbers like (after modifying the rdtsc implementation,
} of course)? I suspect you'll get a less clear distribution depending on
} whether the hash lookup ends up hitting in the primary or secondary hash,
} and where in the list it hits, but..
} 
} 			Linus

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 21:34           ` Cort Dougan
@ 2002-03-18 22:00             ` Linus Torvalds
  0 siblings, 0 replies; 40+ messages in thread
From: Linus Torvalds @ 2002-03-18 22:00 UTC (permalink / raw)
  To: Cort Dougan; +Cc: Paul Mackerras, linux-kernel



On Mon, 18 Mar 2002, Cort Dougan wrote:
>
> } But the whole point of _scattering_ is so incredibly broken in itself!
> } Don't do it.
>
> Yes, that is indeed correct theoretically.  The problem is that we actually
> measured it and there was very little locality.  When I added some
> multiple-tlb loads it actually decreased wall-clock performance for nearly
> every user load I put on the machine.

This is what I meant by hardware support for multiple loads - you mustn't
let speculative TLB loads displace real TLB entries, for example.

> Linus, I knew that deep in my heart 8 years ago when I started in on all
> this.  I'm with you but I'm not good enough with a soldering iron to fix
> every powerpc out there that forces that crappy IBM spawned madness upon
> us.

Oh, I agree, we can't fix existing broken hardware, we'll have to just live
with it.

		Linus


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 20:23           ` Linus Torvalds
@ 2002-03-18 21:50             ` Rene Herman
  2002-03-18 22:36             ` Cort Dougan
                               ` (3 subsequent siblings)
  4 siblings, 0 replies; 40+ messages in thread
From: Rene Herman @ 2002-03-18 21:50 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds wrote:

> So on a PII core, you'll see something like
> 
> 87.50: 36
> 12.39: 40
> 
> ie 87.5% (exactly 7/8) of the TLB misses take 36 cycles, while 12.4%
> (ie 1/8) takes 40 cycles (and I assume that the extra 4 cycles is due
> to actually loading the thing from the data cache).
> 
> Yeah, my program might be buggy, so take the numbers with a pinch of
> salt. But it's interesting to see how on an athlon the numbers are
> 
>  3.17: 59
> 34.94: 62
>  4.71: 85
> 54.83: 88
> 
> ie roughly 60% take 85-90 cycles, and 40% take ~60 cycles. I don't
> know where that pattern would come from..

You scared me, so I ran the program on my AMD duron. Results are 
completely repeatable (4 runs):

   4.17: 20
  92.89: 21
   1.17: 26

   4.17: 20
  93.00: 21
   1.18: 26

   4.17: 20
  92.86: 21
   1.18: 26

   4.16: 20
  92.78: 21
   1.16: 26

Ie, rather violently different from the numbers you quoted for the 
Athlon...

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 3
model name      : AMD Duron(tm) Processor 
stepping        : 1
cpu MHz         : 757.472
cache size      : 64 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat 
pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips        : 1510.60

Rene.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 20:04         ` Linus Torvalds
  2002-03-18 20:23           ` Linus Torvalds
@ 2002-03-18 21:34           ` Cort Dougan
  2002-03-18 22:00             ` Linus Torvalds
  1 sibling, 1 reply; 40+ messages in thread
From: Cort Dougan @ 2002-03-18 21:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Paul Mackerras, linux-kernel

I agree with you there.  On many PowerPC's, we're screwed.  The best thing
I can think of is the clever VSID allocation and trying to make a sane data
structure out of the hash table but that would involve a _lot_ of work with
very likely no reward.

} Hashes simply do not do the right thing. You cannot do a speculative load
} from a hash, and the hash overhead gets _bigger_ for TLB loads that miss
} (ie they optimize for the hit case, which is the wrong optimization if the
} on-chip TLB is big enough - and since Moore's law says that the on-chip
} TLB _will_ be big enough, that's just stupid).

What's the alternative for some PowerPC's?  Every shared library program
likes to use the exact same addresses which load (and thus create htab
entries) at exactly the same location.  A machine running 100+ processes is
not going to be usable because every process is sharing the same 8 PTE
slots.

} But the whole point of _scattering_ is so incredibly broken in itself!
} Don't do it.

Yes, that is indeed correct theoretically.  The problem is that we actually
measured it and there was very little locality.  When I added some
multiple-tlb loads it actually decreased wall-clock performance for nearly
every user load I put on the machine.  The common apps nowadays are using
tens of shared libs so that would make it even worse.

} You can load many TLB entries in one go, if you just keep them close-by to
} each other. Load them into a prefetch-buffer (so that you don't dirty your
} real TLB with speculative TLB loads), and since there tends to be locality
} to TLB's, you've just automatically speeded up your hardware walker.
} 
} In contrast, a hash algorithm automatically means that you cannot sanely
} do this _trivial_ optimization.

Linus, I knew that deep in my heart 8 years ago when I started in on all
this.  I'm with you but I'm not good enough with a soldering iron to fix
every powerpc out there that forces that crappy IBM spawned madness upon
us.

I even wrote a paper about how bad the design is and how the designers should
be whipped for their foolish choices on the PPC.  I'll hold the torch if
you knock on the castle door...

} Face it, hashes are BAD for things like this.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 20:04         ` Linus Torvalds
@ 2002-03-18 20:23           ` Linus Torvalds
  2002-03-18 21:50             ` Rene Herman
                               ` (4 more replies)
  2002-03-18 21:34           ` Cort Dougan
  1 sibling, 5 replies; 40+ messages in thread
From: Linus Torvalds @ 2002-03-18 20:23 UTC (permalink / raw)
  To: Cort Dougan; +Cc: Paul Mackerras, linux-kernel



On Mon, 18 Mar 2002, Linus Torvalds wrote:
>
> Well, I actually think that an x86 comes fairly close.

Btw, here's a program that does a simple histogram of TLB miss cost, and
shows the interesting pattern on intel I was talking about: every 8th miss
is most costly, apparently because Intel pre-fetches 8 TLB entries at a
time.

So on a PII core, you'll see something like

	  87.50: 36
	  12.39: 40

ie 87.5% (exactly 7/8) of the TLB misses take 36 cycles, while 12.4% (ie
1/8) takes 40 cycles (and I assume that the extra 4 cycles is due to
actually loading the thing from the data cache).

Yeah, my program might be buggy, so take the numbers with a pinch of salt.
But it's interesting to see how on an athlon the numbers are

	   3.17: 59
	  34.94: 62
	   4.71: 85
	  54.83: 88

ie roughly 60% take 85-90 cycles, and 40% take ~60 cycles. I don't know
where that pattern would come from..

What are the ppc numbers like (after modifying the rdtsc implementation,
of course)? I suspect you'll get a less clear distribution depending on
whether the hash lookup ends up hitting in the primary or secondary hash,
and where in the list it hits, but..

			Linus

-----
#include <stdio.h>
#include <stdlib.h>

#define rdtsc(low) \
   __asm__ __volatile__("rdtsc" : "=a" (low) : : "edx")

#define MAXTIMES 1000
#define BUFSIZE (128*1024*1024)
#define access(x) (*(volatile unsigned int *)&(x))

int main()
{
	unsigned int i, j;
	static int times[MAXTIMES];
	char *buffer = malloc(BUFSIZE);

	for (i = 0; i < BUFSIZE; i += 4096)
		access(buffer[i]);
	for (i = 0; i < MAXTIMES; i++)
		times[i] = 0;
	for (j = 0; j < 100; j++) {
		for (i = 0; i < BUFSIZE ; i+= 4096) {
			unsigned long start, end;

			rdtsc(start);
			access(buffer[i]);
			rdtsc(end);
			end -= start;
			if (end >= MAXTIMES)
				end = MAXTIMES-1;
			times[end]++;
		}
	}
	for (i = 0; i < MAXTIMES; i++) {
		int count = times[i];
		double percent = (double)count / (BUFSIZE/4096);
		if (percent < 1)
			continue;
		printf("%7.2f: %d\n", percent, i);
	}
	return 0;
}


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 19:42       ` Cort Dougan
@ 2002-03-18 20:04         ` Linus Torvalds
  2002-03-18 20:23           ` Linus Torvalds
  2002-03-18 21:34           ` Cort Dougan
  0 siblings, 2 replies; 40+ messages in thread
From: Linus Torvalds @ 2002-03-18 20:04 UTC (permalink / raw)
  To: Cort Dougan; +Cc: Paul Mackerras, linux-kernel



On Mon, 18 Mar 2002, Cort Dougan wrote:
>
> I have a counter-proposal.  How about a hardware tlb load (if we must have
> one) that does the right thing?

Well, I actually think that an x86 comes fairly close.

Hashes simply do not do the right thing. You cannot do a speculative load
from a hash, and the hash overhead gets _bigger_ for TLB loads that miss
(ie they optimize for the hit case, which is the wrong optimization if the
on-chip TLB is big enough - and since Moore's law says that the on-chip
TLB _will_ be big enough, that's just stupid).

Basic premise in caching: hardware gets better, and misses go down.

Which implies that misses due to cache contention are misses that go away
over time, while forced misses (due to program startup etc) matter more
and more over time.

Ergo, you want to make build-up fast, because that's where you can't avoid
the work by trivially just making your caches bigger. So you want to have
architecture support for aggressive TLB pre-loading.

> I still think there are some clever tricks one could do with the VSID's to
} get a much saner system than the current hash table.  It would take some
> serious work I think but the results could be worthwhile.  By carefully
> choosing the VSID scatter algorithm and the size of the hash table I think
> one could get a much better access method.

But the whole point of _scattering_ is so incredibly broken in itself!
Don't do it.

You can load many TLB entries in one go, if you just keep them close-by to
each other. Load them into a prefetch-buffer (so that you don't dirty your
real TLB with speculative TLB loads), and since there tends to be locality
to TLB's, you've just automatically speeded up your hardware walker.

In contrast, a hash algorithm automatically means that you cannot sanely
do this _trivial_ optimization.

Face it, hashes are BAD for things like this.

		Linus


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-17  2:00     ` Paul Mackerras
  2002-03-17  2:40       ` Linus Torvalds
@ 2002-03-18 19:42       ` Cort Dougan
  2002-03-18 20:04         ` Linus Torvalds
  1 sibling, 1 reply; 40+ messages in thread
From: Cort Dougan @ 2002-03-18 19:42 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Linus Torvalds, linux-kernel

I have a counter-proposal.  How about a hardware tlb load (if we must have
one) that does the right thing?  I don't think the PPC is a good example of
a hardware well-managed TLB load process.  Software loads show up so well
on the PPC because, I suspect, it does some very very foolish things.  I've
had some conversations with Moto engineers who have supported my
suspicion that the TLB loads are actually cached when the hardware does
them, so we waste cache space on a line that we had better not be
loading again (otherwise we've thrown out our TLB entry way too early).

I still think there are some clever tricks one could do with the VSID's to
get a much saner system than the current hash table.  It would take some
serious work I think but the results could be worthwhile.  By carefully
choosing the VSID scatter algorithm and the size of the hash table I think
one could get a much better access method.

} However, one good argument against software TLB loading that I have
} heard (and which you mentioned in another message) is that loading a
} TLB entry in software requires taking an exception, which requires
} synchronizing the pipeline, which is expensive.  With hardware TLB
} reload you can just freeze the pipeline while the hardware does a
} couple of fetches from memory.  And PPC64 remains the only
} architecture I know of that supports a full 64-bit virtual address
} space _and_ can do hardware TLB reload.
} 
} I would be interested to see measurements of how many cache misses on
} average each hardware TLB reload takes; for a hash table I expect it
} would be about 1, for a 3-level tree I expect it would be very
} dependent on access pattern but I wouldn't be surprised if it averaged
} about 1 also on real workloads.
} 
} But this is all a bit academic, the real question is how do we deal
} most efficiently with the real hardware that is out there.  And if you
} want a 7.5 second kernel compile the only option currently available
} is a machine whose MMU uses a hash table. :)
} 
} Paul.
} -
} To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
} the body of a message to majordomo@vger.kernel.org
} More majordomo info at  http://vger.kernel.org/majordomo-info.html
} Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-16 11:04   ` Paul Mackerras
  2002-03-16 18:32     ` Linus Torvalds
  2002-03-17  2:00     ` Paul Mackerras
@ 2002-03-18 19:37     ` Cort Dougan
  2 siblings, 0 replies; 40+ messages in thread
From: Cort Dougan @ 2002-03-18 19:37 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Linus Torvalds, linux-kernel

In fact we _did_ do the second part.  Rather, I did anyway.  The zombie
reclaim code (used to live in idle.c before it was removed) would run much
like the zero-paged code I put in there.  It ran with the cache off to
avoid blowing the entire contents of the L1/L2 in the idle task.  It would
just invalidate (genuinely clearing the valid bit) for any hash table entry
that was stale (zombie was the term I used).

That method was a definite win in UP but didn't help terribly well for SMP
since once a processor bogs down it no longer gets the advantage of the
easy to find empty slot in the hash replacement code.

At this point, I think it would be worth throwing out the tlb invalidate
optimization (by changing VSID's) and benchmarking that against the code
with the optimization.  A test a year ago that I did showed that they were
pretty much even.  I'm betting the latest changes have made that
optimization an actual loss now.

Linus, shrinking that hash table was a very very bad thing.  Early on we
used a very small hash table and it really put too much pressure on the
entries and we were throwing them out nearly constantly.  Adding some code
to scatter the entries and use the table more efficiently helped, but a large
hash table is a necessity, unfortunately.

The ultimate solution was actually not using the hash table on the 603's
that I added a few years ago.  I documented how doing this actually
improved performance in an OSDI paper from '99 that I have on my web page.
Linus, it's worth a look - it actually supports most of your opinions of
the PPC MMU.
 
} > I wonder if you wouldn't be better off just getting rid of the TLB range
} > flush altogether, and instead making it select a new VSID in the segment
} > register, and just forgetting about the old TLB contents entirely.
} > 
} > Then, when you do a TLB miss, you just re-use any hash table entries
} > that have a stale VSID.
} 
} We used to do something a bit like that on ppc32 - flush_tlb_mm would
} just assign a new mmu context number to the task, which translates
} into a new set of VSIDs.  We didn't do the second part, reusing hash
} table entries with stale VSIDs, because we couldn't see a good fast
} way to tell whether a given VSID was stale.  Instead, when the hash
} bucket filled up, we just picked an entry to overwrite semi-randomly.
} 
} It turned out that the stale VSIDs were causing us various problems,
} particularly on SMP, so I tried a solution that always cleared all the
} hash table entries, using a bit in the linux pte to say whether there
} was (or had ever been) a hash table entry corresponding to that pte as
} an optimization to avoid doing unnecessary hash lookups.  To my
} surprise, that turned out to be faster, so that's what we do now.
} 
} Your suggestion has the problem that when you get to needing to reuse
} one of the VSIDs that you have thrown away, it becomes very difficult
} and expensive to ensure that there aren't any stale hash table entries
} left around for that VSID - particularly on a system with logical
} partitioning where we don't control the size of the hash table.  And
} there is a finite number of VSIDs so you have to reuse them sooner or
} later.
} 
} [For those not familiar with the PPC MMU, think of the VSID as an MMU
} context number, but separately settable for each 256MB of the virtual
} address space.]
} 
} > It would also be interesting to hear if you can just make the hash table
} > smaller (I forget the details of 64-bit ppc VM horrors, thank God!) or
} 
} On ppc32 we use a hash table 1/4 of the recommended size and it works
} fine.
} 
} > just bypass it altogether (at least the 604e used to be able to just
} > disable the stupid hashing altogether and make the whole thing much
} > saner). 
} 
} That was the 603, actually.  In fact the newest G4 processors also let
} you do this.  When I get hold of a machine with one of these new G4
} chips I'm going to try it again and see how much faster it goes
} without the hash table.
} 
} One other thing - I would *love* it if we could get rid of
} flush_tlb_all and replace it with a flush_tlb_kernel_range, so that
} _all_ of the flush_tlb_* functions tell us what address(es) we need to
} invalidate, and let the architecture code decide whether a complete
} TLB flush is justified.
} 
} Paul.
} -
} To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
} the body of a message to majordomo@vger.kernel.org
} More majordomo info at  http://vger.kernel.org/majordomo-info.html
} Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: 7.52 second kernel compile
  2002-03-17  2:40       ` Linus Torvalds
@ 2002-03-17  2:50         ` M. Edward Borasky
  0 siblings, 0 replies; 40+ messages in thread
From: M. Edward Borasky @ 2002-03-17  2:50 UTC (permalink / raw)
  To: linux-kernel

Well ... along those lines ... I'll settle for my $1500US 5 GFLOP Athlon for
sound processing instead of the 12 MFLOP FPS AP120B I always dreamed of
owning :). We've sure come a long way in 20 years, eh?

M. Edward Borasky
The COUGAR Project

znmeb@borasky-research.net
http://www.borasky-research.com/Cougar.htm

> -----Original Message-----
> Yeah, at a cost of $2M+, if I'm not mistaken. I think I'll settle for my 2
> minute time that is actually available to mere mortals at a small fraction
> of one percent of that ;)
>
> 		Linus


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-17  2:00     ` Paul Mackerras
@ 2002-03-17  2:40       ` Linus Torvalds
  2002-03-17  2:50         ` M. Edward Borasky
  2002-03-18 19:42       ` Cort Dougan
  1 sibling, 1 reply; 40+ messages in thread
From: Linus Torvalds @ 2002-03-17  2:40 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linux-kernel


On Sun, 17 Mar 2002, Paul Mackerras wrote:
> 
> But this is all a bit academic, the real question is how do we deal
> most efficiently with the real hardware that is out there.  And if you
> want a 7.5 second kernel compile the only option currently available
> is a machine whose MMU uses a hash table. :)

Yeah, at a cost of $2M+, if I'm not mistaken. I think I'll settle for my 2
minute time that is actually available to mere mortals at a small fraction
of one percent of that ;)

		Linus


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-16 11:04   ` Paul Mackerras
  2002-03-16 18:32     ` Linus Torvalds
@ 2002-03-17  2:00     ` Paul Mackerras
  2002-03-17  2:40       ` Linus Torvalds
  2002-03-18 19:42       ` Cort Dougan
  2002-03-18 19:37     ` Cort Dougan
  2 siblings, 2 replies; 40+ messages in thread
From: Paul Mackerras @ 2002-03-17  2:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds writes:

> Remember: think about the hashes as just TLB's, and the VSID's are just 
> address space identifiers (yeah, yeah, you can have several VSID's per 
> process at least in 32-bit mode, I don't remember the 64-bit thing). So 
> what you do is the same thing alpha does with its 6-bit ASN thing: when 
> you wrap around, you blast the whole TLB to kingdom come.

I have performance measurements that show that having stale hash-table
entries cluttering up the hash table hurts performance more than
taking the time to get rid of them does.  This is on ppc32 using
kernel compiles and lmbench as the performance measures.

> You _can_ switch the hash table base around on ppc64, can't you?

Not when running under a hypervisor (i.e. on a logically-partitioned
system), unfortunately.  It _may_ be possible to choose the VSIDs so
that we only use half (or less) of the hash table at any time.

> Maybe somebody is seeing the light.

Maybe.  Whenever I have been asked what hardware features should be
added to PPC chips to make Linux run better, I usually put having an
option for software loading of the TLB pretty high on the list.

However, one good argument against software TLB loading that I have
heard (and which you mentioned in another message) is that loading a
TLB entry in software requires taking an exception, which requires
synchronizing the pipeline, which is expensive.  With hardware TLB
reload you can just freeze the pipeline while the hardware does a
couple of fetches from memory.  And PPC64 remains the only
architecture I know of that supports a full 64-bit virtual address
space _and_ can do hardware TLB reload.

I would be interested to see measurements of how many cache misses on
average each hardware TLB reload takes; for a hash table I expect it
would be about 1, for a 3-level tree I expect it would be very
dependent on access pattern but I wouldn't be surprised if it averaged
about 1 also on real workloads.

But this is all a bit academic, the real question is how do we deal
most efficiently with the real hardware that is out there.  And if you
want a 7.5 second kernel compile the only option currently available
is a machine whose MMU uses a hash table. :)

Paul.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-16 11:04   ` Paul Mackerras
@ 2002-03-16 18:32     ` Linus Torvalds
  2002-03-17  2:00     ` Paul Mackerras
  2002-03-18 19:37     ` Cort Dougan
  2 siblings, 0 replies; 40+ messages in thread
From: Linus Torvalds @ 2002-03-16 18:32 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linux-kernel


On Sat, 16 Mar 2002, Paul Mackerras wrote:
> 
> Your suggestion has the problem that when you get to needing to reuse
> one of the VSIDs that you have thrown away, it becomes very difficult
> and expensive to ensure that there aren't any stale hash table entries
> left around for that VSID - particularly on a system with logical
> partitioning where we don't control the size of the hash table.

But the VSID is something like 20 bits, no? So the re-use is a fairly 
uncommon thing, in the end.

Remember: think about the hashes as just TLB's, and the VSID's are just 
address space identifiers (yeah, yeah, you can have several VSID's per 
process at least in 32-bit mode, I don't remember the 64-bit thing). So 
what you do is the same thing alpha does with its 6-bit ASN thing: when 
you wrap around, you blast the whole TLB to kingdom come.

The alpha wraps around a lot more often with just 6 bits, but on the other 
hand it's a lot cheaper to get rid of the TLB too, so it evens out.

Yeah, there are latency issues, but that can be handled by just switching
the hash table base: you have two hash tables, and whenever you increment
the VSID you clear a small part of the other table, designed so that when
the VSID wraps around the other table is 100% clear, and you just switch
the two.

You _can_ switch the hash table base around on ppc64, can't you?

So now the VM invalidate becomes

	++vsid;
	partial_clear_secondary_hash();
	if (vsid > MAXVSID) {
		vsid = 0;
		switch_hashes();
	}

> > just bypass it altogether (at least the 604e used to be able to just
> > disable the stupid hashing altogether and make the whole thing much
> > saner). 
> 
> That was the 603, actually.

Ahh, my mind is going.

>			  In fact the newest G4 processors also let
> you do this.  When I get hold of a machine with one of these new G4
> chips I'm going to try it again and see how much faster it goes
> without the hash table.

Maybe somebody is seeing the light.

> One other thing - I would *love* it if we could get rid of
> flush_tlb_all and replace it with a flush_tlb_kernel_range, so that
> _all_ of the flush_tlb_* functions tell us what address(es) we need to
> invalidate, and let the architecture code decide whether a complete
> TLB flush is justified.

Sure, sounds reasonable.

		Linus


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-16  8:05   ` Linus Torvalds
@ 2002-03-16 11:54     ` yodaiken
  0 siblings, 0 replies; 40+ messages in thread
From: yodaiken @ 2002-03-16 11:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Sat, Mar 16, 2002 at 08:05:14AM +0000, Linus Torvalds wrote:
> It would also be interesting to hear if you can just make the hash table
> smaller (I forget the details of 64-bit ppc VM horrors, thank God!) or
> just bypass it altogether (at least the 604e used to be able to just
> disable the stupid hashing altogether and make the whole thing much
> saner). 

Reference:
Cort Dougan, Paul Mackerras, Victor Yodaiken,
"Optimizing the Idle Task and Other MMU Tricks", OSDI '99.
www.usenix.org/publications/library/proceedings/osdi99/full_papers/dougan/dougan.pdf


Cort's MS thesis was on this topic. IBM seems reluctant to give up on 
hardware page tables though.




* Re: 7.52 second kernel compile
  2002-03-16  6:15 ` 7.52 " Anton Blanchard
  2002-03-16  8:05   ` Linus Torvalds
@ 2002-03-16 11:04   ` Paul Mackerras
  2002-03-16 18:32     ` Linus Torvalds
                       ` (2 more replies)
  1 sibling, 3 replies; 40+ messages in thread
From: Paul Mackerras @ 2002-03-16 11:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds writes:

> I wonder if you wouldn't be better off just getting rid of the TLB range
> flush altogether, and instead making it select a new VSID in the segment
> register, and just forgetting about the old TLB contents entirely.
> 
> Then, when you do a TLB miss, you just re-use any hash table entries
> that have a stale VSID.

We used to do something a bit like that on ppc32 - flush_tlb_mm would
just assign a new mmu context number to the task, which translates
into a new set of VSIDs.  We didn't do the second part, reusing hash
table entries with stale VSIDs, because we couldn't see a good fast
way to tell whether a given VSID was stale.  Instead, when the hash
bucket filled up, we just picked an entry to overwrite semi-randomly.

It turned out that the stale VSIDs were causing us various problems,
particularly on SMP, so I tried a solution that always cleared all the
hash table entries, using a bit in the linux pte to say whether there
was (or had ever been) a hash table entry corresponding to that pte as
an optimization to avoid doing unnecessary hash lookups.  To my
surprise, that turned out to be faster, so that's what we do now.
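A minimal sketch of the pte-bit trick Paul describes, with an invented bit name and pte layout (not the real ppc32 format): the flush path probes the hash table only for ptes that were ever hashed, so ptes that never took a hash fault cost nothing to flush.

```c
#include <assert.h>

#define _PAGE_HASHPTE 0x0002UL  /* hypothetical software pte bit */

typedef unsigned long pte_t;

int hash_lookups;               /* counts expensive hash-table probes */

/* Called when a hash table entry is created for this pte. */
pte_t hash_insert(pte_t pte)
{
    return pte | _PAGE_HASHPTE;
}

/* Flush path: only probe the hash table if the pte was ever hashed. */
pte_t flush_pte(pte_t pte)
{
    if (pte & _PAGE_HASHPTE) {
        hash_lookups++;         /* real code would invalidate the HPTE */
        pte &= ~_PAGE_HASHPTE;
    }
    return pte;
}
```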

Your suggestion has the problem that when you get to needing to reuse
one of the VSIDs that you have thrown away, it becomes very difficult
and expensive to ensure that there aren't any stale hash table entries
left around for that VSID - particularly on a system with logical
partitioning where we don't control the size of the hash table.  And
there is a finite number of VSIDs so you have to reuse them sooner or
later.

[For those not familiar with the PPC MMU, think of the VSID as an MMU
context number, but separately settable for each 256MB of the virtual
address space.]
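To make the bracketed note concrete: with 16 segments of 256MB covering a 32-bit effective address space, the segment (and hence the VSID register) is selected by the top 4 bits of the address. A sketch (hypothetical helper, not kernel code):

```c
#include <assert.h>

/* Each 256MB (2^28 byte) segment of the 32-bit effective address
 * space has its own VSID; the segment index is just bits 31:28. */
unsigned int segment_index(unsigned long ea)
{
    return (unsigned int)(ea >> 28) & 0xf;
}
```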

> It would also be interesting to hear if you can just make the hash table
> smaller (I forget the details of 64-bit ppc VM horrors, thank God!) or

On ppc32 we use a hash table 1/4 of the recommended size and it works
fine.

> just bypass it altogether (at least the 604e used to be able to just
> disable the stupid hashing altogether and make the whole thing much
> saner). 

That was the 603, actually.  In fact the newest G4 processors also let
you do this.  When I get hold of a machine with one of these new G4
chips I'm going to try it again and see how much faster it goes
without the hash table.

One other thing - I would *love* it if we could get rid of
flush_tlb_all and replace it with a flush_tlb_kernel_range, so that
_all_ of the flush_tlb_* functions tell us what address(es) we need to
invalidate, and let the architecture code decide whether a complete
TLB flush is justified.

Paul.


* Re: 7.52 second kernel compile
  2002-03-16  6:15 ` 7.52 " Anton Blanchard
@ 2002-03-16  8:05   ` Linus Torvalds
  2002-03-16 11:54     ` yodaiken
  2002-03-16 11:04   ` Paul Mackerras
  1 sibling, 1 reply; 40+ messages in thread
From: Linus Torvalds @ 2002-03-16  8:05 UTC (permalink / raw)
  To: linux-kernel

In article <20020316061535.GA16653@krispykreme>,
Anton Blanchard  <anton@samba.org> wrote:
>
>hardware: 32 way logical partition, 1.1GHz POWER4, 60G RAM

It's interesting to see that scalability doesn't seem to be the #1
problem by a long shot. 

>7.52 seconds is not a bad result for something running under a hypervisor.
>The profile looks much better now. We still spend a lot of time flushing tlb
>entries but we can look into batching them.

I wonder if you wouldn't be better off just getting rid of the TLB range
flush altogether, and instead making it select a new VSID in the segment
register, and just forgetting about the old TLB contents entirely.

Then, when you do a TLB miss, you just re-use any hash table entries
that have a stale VSID.

It seems that you spend _way_ too much time actually trying to
physically invalidate the hashtables, which sounds like a total waste to
me. Especially as going through them to see whether they need to be
invalidated has to be a horrible thing for the dcache.

It would also be interesting to hear if you can just make the hash table
smaller (I forget the details of 64-bit ppc VM horrors, thank God!) or
just bypass it altogether (at least the 604e used to be able to just
disable the stupid hashing altogether and make the whole thing much
saner). 

Note that the official IBM "minimum recommended page table sizes" stuff
looks like total and utter crap.  Those tables have nothing to do with
sanity, and everything to do with a crappy OS called AIX that takes
forever to fill the hashes.  You should probably make them the minimum
size (which, if I remember correctly, is still quite a large amount of
memory thrown away on a TLB) if you can't just disable them altogether. 

			Linus


* 7.52 second kernel compile
  2002-03-13  8:52 10.31 " Anton Blanchard
@ 2002-03-16  6:15 ` Anton Blanchard
  2002-03-16  8:05   ` Linus Torvalds
  2002-03-16 11:04   ` Paul Mackerras
  0 siblings, 2 replies; 40+ messages in thread
From: Anton Blanchard @ 2002-03-16  6:15 UTC (permalink / raw)
  To: lse-tech; +Cc: linux-kernel


> Let the kernel compile benchmarks continue!

I think I'm addicted. I need help!

In this update we added 8 cpus and rewrote the ppc64 pagetable management
code to do lockless inserts and removals (there is still locking at
the pte level to avoid races).
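One plausible shape for "lockless inserts with locking at the pte level" is a busy bit in the pte itself, taken with compare-and-swap, so no page-table-wide lock is needed. The sketch below is a guess at the idea using C11 atomics, with invented names and bit layout, not Anton's actual code:

```c
#include <stdatomic.h>
#include <assert.h>

#define _PTE_BUSY 0x1UL         /* hypothetical per-pte lock bit */

/* Try to insert a new translation: take the pte-level lock with a
 * compare-and-swap, do the hash insert, then publish and unlock in
 * one store.  Returns 0 if another CPU holds the pte; caller retries. */
int try_insert(_Atomic unsigned long *pte, unsigned long newval)
{
    unsigned long old = atomic_load(pte);
    if (old & _PTE_BUSY)
        return 0;                       /* someone else holds it */
    if (!atomic_compare_exchange_strong(pte, &old, old | _PTE_BUSY))
        return 0;                       /* raced with another insert */
    /* ... insert the entry into the hash table here ... */
    atomic_store(pte, newval & ~_PTE_BUSY); /* publish and unlock */
    return 1;
}
```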

hardware: 32 way logical partition, 1.1GHz POWER4, 60G RAM

kernel: 2.5.7-pre1 + ppc64 pagetable rework

kernel compiled: 2.4.18 x86 with Martin's config

compiler: gcc 2.95.3 x86 cross compiler

make[1]: Leaving directory `/home/anton/intel_kernel/linux/arch/i386/boot'
128.89user 40.23system 0:07.52elapsed 2246%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (437084major+572835minor)pagefaults 0swaps

7.52 seconds is not a bad result for something running under a hypervisor.
The profile looks much better now. We still spend a lot of time flushing tlb
entries but we can look into batching them.

Anton
--
anton@samba.org
anton@au.ibm.com

155912 total                                      0.0550
114562 .cpu_idle                               

 12615 .local_flush_tlb_range                  
  8476 .local_flush_tlb_page                   
  2576 .insert_hpte_into_group                 

  1980 .do_anonymous_page                      
  1813 .lru_cache_add                          
  1390 .d_lookup                               
  1320 .__copy_tofrom_user                     
  1140 .save_remaining_regs                    
   612 .rmqueue                                
   517 .atomic_dec_and_lock                    
   492 .do_page_fault                          
   444 .copy_page                              
   438 .__free_pages_ok                        
   375 .set_page_dirty                         
   350 .zap_page_range                         
   314 .schedule                               
   270 .__find_get_page                        
   245 .page_cache_release                     
   233 .lru_cache_del                          
   231 .hvc_poll                               
   215 .sys_brk                                


end of thread, other threads:[~2002-04-02 12:39 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-03-18 22:12 7.52 second kernel compile Dieter Nützel
2002-03-18 22:46 ` Linus Torvalds
2002-03-18 23:53   ` Davide Libenzi
2002-03-19  0:20   ` David S. Miller
2002-03-19  0:47     ` Davide Libenzi
2002-03-19  1:37     ` Andreas Ferber
2002-03-19  1:38       ` David S. Miller
2002-03-19  2:08     ` Linus Torvalds
2002-03-19  5:24       ` Erik Andersen
  -- strict thread matches above, loose matches on Subject: below --
2002-03-13  8:52 10.31 " Anton Blanchard
2002-03-16  6:15 ` 7.52 " Anton Blanchard
2002-03-16  8:05   ` Linus Torvalds
2002-03-16 11:54     ` yodaiken
2002-03-16 11:04   ` Paul Mackerras
2002-03-16 18:32     ` Linus Torvalds
2002-03-17  2:00     ` Paul Mackerras
2002-03-17  2:40       ` Linus Torvalds
2002-03-17  2:50         ` M. Edward Borasky
2002-03-18 19:42       ` Cort Dougan
2002-03-18 20:04         ` Linus Torvalds
2002-03-18 20:23           ` Linus Torvalds
2002-03-18 21:50             ` Rene Herman
2002-03-18 22:36             ` Cort Dougan
2002-03-18 22:47               ` Linus Torvalds
2002-03-18 22:56                 ` Cort Dougan
2002-03-18 23:52                 ` Paul Mackerras
2002-03-19  0:57                   ` Dave Jones
2002-03-19  3:35                     ` Jeff Garzik
2002-03-19  0:22                 ` David S. Miller
2002-03-19  0:27                   ` Cort Dougan
2002-03-19  0:27                     ` David S. Miller
2002-03-19  0:36                       ` Cort Dougan
2002-03-19  0:38                         ` David S. Miller
2002-03-19  1:28                           ` Davide Libenzi
2002-03-19  2:42             ` Paul Mackerras
2002-03-27  2:53             ` Richard Henderson
2002-04-02  4:32               ` Linus Torvalds
2002-04-02 10:50             ` Pablo Alcaraz
2002-03-18 21:34           ` Cort Dougan
2002-03-18 22:00             ` Linus Torvalds
2002-03-18 19:37     ` Cort Dougan
