* 2.6.0 Huge pages not working as expected
From: Nick Craig-Wood @ 2003-12-26 10:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rohit Seth

I've been trying out the huge page support using 2.6.0.  I compiled
with :-

CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y

And all looks good in /proc/meminfo

HugePages_Total:     8
HugePages_Free:      8
Hugepagesize:     4096 kB

I mounted a hugetlbfs on /mnt/hugetlb.

I wrote a little test program to show the benefits of huge pages by
reducing TLB thrashing - it fills up 16 MB with sequential numbers
then adds them with different strides - very much the sort of thing
FFTs do.  However huge pages show a performance decrease not increase
for large strides!  For smaller ones there is a small speedup.

I've been testing on

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 6
cpu MHz         : 551.405
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 1089.53

What's happening? Is there something broken in my program, the kernel,
or my understanding?  I know this isn't a particularly good
demonstration of reducing TLB thrashing as it should only read in
cacheline multiples, but I wasn't expecting it to slow down!

I've also been trying huge pages with mprime (which does lots of FFTs),
and that does show some improvement (just 2% or so, because it is already
very TLB aware).

Here are the results :-

------------------------------------------------------------
Memory from malloc()
Testing memory at 0x4015e008
span =        1, time =     71.212 ms, total = -2097152
span =        2, time =     71.744 ms, total = -2097152
span =        4, time =     88.352 ms, total = -2097152
span =        8, time =    176.207 ms, total = -2097152
span =       16, time =    176.166 ms, total = -2097152
span =       32, time =    176.385 ms, total = -2097152
span =       64, time =    179.042 ms, total = -2097152
span =      128, time =    184.059 ms, total = -2097152
span =      256, time =    195.014 ms, total = -2097152
span =      512, time =    217.084 ms, total = -2097152
span =     1024, time =    260.899 ms, total = -2097152
span =     2048, time =    259.714 ms, total = -2097152
span =     4096, time =    261.059 ms, total = -2097152

Memory from hugetlbfs
Testing memory at 0x41400000
span =        1, time =     70.815 ms, total = -2097152
span =        2, time =     71.261 ms, total = -2097152
span =        4, time =     88.178 ms, total = -2097152
span =        8, time =    175.512 ms, total = -2097152
span =       16, time =    174.996 ms, total = -2097152
span =       32, time =    175.689 ms, total = -2097152
span =       64, time =    177.301 ms, total = -2097152
span =      128, time =    181.705 ms, total = -2097152
span =      256, time =    191.232 ms, total = -2097152
span =      512, time =    209.886 ms, total = -2097152
span =     1024, time =    247.646 ms, total = -2097152
span =     2048, time =    279.525 ms, total = -2097152
span =     4096, time =    344.605 ms, total = -2097152

Memory from /dev/zero
Testing memory at 0x42400000
span =        1, time =     70.916 ms, total = -2097152
span =        2, time =     71.405 ms, total = -2097152
span =        4, time =     89.584 ms, total = -2097152
span =        8, time =    176.190 ms, total = -2097152
span =       16, time =    175.730 ms, total = -2097152
span =       32, time =    176.377 ms, total = -2097152
span =       64, time =    178.675 ms, total = -2097152
span =      128, time =    183.429 ms, total = -2097152
span =      256, time =    194.153 ms, total = -2097152
span =      512, time =    215.089 ms, total = -2097152
span =     1024, time =    256.428 ms, total = -2097152
span =     2048, time =    268.468 ms, total = -2097152
span =     4096, time =    268.702 ms, total = -2097152
------------------------------------------------------------

And here is the program...

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/mman.h>

#define MEMORY_FILE_NAME "/mnt/hugetlb/z"
#define MEMORY_SIZE (4*1024*1024)

/****************************************************************************
Returns the time in floating point seconds since the epoch - useful for more
accurate timing than time() allows for
****************************************************************************/

static double timef(void)
{
    struct timeval tv = {0, 0};
    gettimeofday(&tv, 0);
    return (double)tv.tv_sec + ((double)tv.tv_usec)/1E6;
}

/****************************************************************************
Test the memory with different spans - should show TLB thrashing nicely
****************************************************************************/

static void test(int *p)
{
    int i;
    int span;

    printf("Testing memory at %p\n", p);

    /* fill it */
    for (i = 0; i < MEMORY_SIZE; i++)
	p[i] = i;

    /* test it with different spans */
    for (span = 1; span <= 4096; span *= 2)
    {
	double start = timef();
	int j;
	int total = 0;

	for (j = 0; j < span; j++)
	{
	    for (i = j; i < MEMORY_SIZE; i+= span)
		total += p[i];
	}
	start = timef() - start;
	printf("span = %8d, time = %10.3f ms, total = %d\n", span, 1000*start, total);
    }
    printf("\n");
}

/****************************************************************************
Thrash the hugetlb
****************************************************************************/

int main(void)
{
    int *malloc_memory;
    int *hugepage_memory;
    int *devzero_memory;
    int fd;

    /* get some malloc memory */
    malloc_memory = calloc(MEMORY_SIZE, sizeof(int));
    if (malloc_memory == 0)
    {
	fprintf(stderr, "Couldn't allocate memory\n");
	exit(EXIT_FAILURE);
    }

    /* get some hugepage memory */
    fd = open(MEMORY_FILE_NAME, O_CREAT|O_RDWR, 0600);
    if (fd < 0)
    {
	fprintf(stderr, "Failed to open huge page memory file '%s': %s\n", MEMORY_FILE_NAME, strerror(errno));
	exit(EXIT_FAILURE);
    }
    hugepage_memory = mmap(0, MEMORY_SIZE * sizeof(int), PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (hugepage_memory == MAP_FAILED)
    {
	fprintf(stderr, "Huge page mmap() failed: %s\n", strerror(errno));
	exit(EXIT_FAILURE);
    }

    /* get some /dev/zero memory */
    fd = open("/dev/zero", O_CREAT|O_RDWR, 0600);
    if (fd < 0)
    {
	fprintf(stderr, "Failed to open /dev/zero memory file: %s\n", strerror(errno));
	exit(EXIT_FAILURE);
    }
    devzero_memory = mmap(0, MEMORY_SIZE * sizeof(int), PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (devzero_memory == MAP_FAILED)
    {
	fprintf(stderr, "Huge page mmap() failed: %s\n", strerror(errno));
	exit(EXIT_FAILURE);
    }

    printf("Memory from malloc()\n");
    test(malloc_memory);

    printf("Memory from hugetlbfs\n");
    test(hugepage_memory);

    printf("Memory from /dev/zero\n");
    test(devzero_memory);

    unlink(MEMORY_FILE_NAME);

    return EXIT_SUCCESS;
}


-- 
Nick Craig-Wood
ncw1@axis.demon.co.uk

* Re: 2.6.0 Huge pages not working as expected
From: William Lee Irwin III @ 2003-12-26 11:56 UTC (permalink / raw)
  To: Nick Craig-Wood; +Cc: linux-kernel, Rohit Seth

On Fri, Dec 26, 2003 at 10:54:33AM +0000, Nick Craig-Wood wrote:
> I wrote a little test program to show the benefits of huge pages by
> reducing TLB thrashing - it fills up 16 MB with sequential numbers
> then adds them with different strides - very much the sort of thing
> FFTs do.  However huge pages show a performance decrease not increase
> for large strides!  For smaller ones there is a small speedup.
> I've been testing on
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 8
> model name      : Pentium III (Coppermine)

P-III has something like 2 TLB entries usable for large pages.
I recommend trying this again on a P-IV.


-- wli

* Re: 2.6.0 Huge pages not working as expected
From: Nick Craig-Wood @ 2003-12-26 20:10 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel, Rohit Seth

On Fri, Dec 26, 2003 at 03:56:47AM -0800, William Lee Irwin III wrote:
> On Fri, Dec 26, 2003 at 10:54:33AM +0000, Nick Craig-Wood wrote:
> > I wrote a little test program to show the benefits of huge pages by
> > reducing TLB thrashing - it fills up 16 MB with sequential numbers
> > then adds them with different strides - very much the sort of thing
> > FFTs do.  However huge pages show a performance decrease not increase
> > for large strides!  For smaller ones there is a small speedup.
> > I've been testing on
> > processor       : 0
> > vendor_id       : GenuineIntel
> > cpu family      : 6
> > model           : 8
> > model name      : Pentium III (Coppermine)
> 
> P-III has something like 2 TLB entries usable for large pages.
> I recommend trying this again on a P-IV.

I tried again on a P4

  processor       : 0
  vendor_id       : GenuineIntel
  cpu family      : 15
  model           : 1
  model name      : Intel(R) Pentium(R) 4 CPU 1.70GHz
  stepping        : 2
  cpu MHz         : 1717.286
  cache size      : 256 KB
  fdiv_bug        : no
  hlt_bug         : no
  f00f_bug        : no
  coma_bug        : no
  fpu             : yes
  fpu_exception   : yes
  cpuid level     : 2
  wp              : yes
  flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
  bogomips        : 3383.29

The results are just about the same - a slight slowdown for
hugepages...

Memory from malloc()
Testing memory at 0x40157008
span =        1, time =     11.727 ms, total = -2097152
span =        2, time =     21.997 ms, total = -2097152
span =        4, time =     37.835 ms, total = -2097152
span =        8, time =     71.097 ms, total = -2097152
span =       16, time =    149.218 ms, total = -2097152
span =       32, time =    284.334 ms, total = -2097152
span =       64, time =    287.300 ms, total = -2097152
span =      128, time =    294.139 ms, total = -2097152
span =      256, time =    307.001 ms, total = -2097152
span =      512, time =    337.929 ms, total = -2097152
span =     1024, time =    427.346 ms, total = -2097152
span =     2048, time =    483.303 ms, total = -2097152
span =     4096, time =    482.394 ms, total = -2097152

Memory from hugetlbfs
Testing memory at 0x41400000
span =        1, time =     11.567 ms, total = -2097152
span =        2, time =     21.339 ms, total = -2097152
span =        4, time =     37.473 ms, total = -2097152
span =        8, time =     70.646 ms, total = -2097152
span =       16, time =    148.426 ms, total = -2097152
span =       32, time =    283.675 ms, total = -2097152
span =       64, time =    286.539 ms, total = -2097152
span =      128, time =    293.116 ms, total = -2097152
span =      256, time =    305.257 ms, total = -2097152
span =      512, time =    338.163 ms, total = -2097152
span =     1024, time =    426.377 ms, total = -2097152
span =     2048, time =    483.237 ms, total = -2097152
span =     4096, time =    489.516 ms, total = -2097152

I tried to test your theory by altering the test to run over just one
4 MB page, which produced similar results on the P3 and P4.
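
A minimal sketch of the alteration (assuming the test program from the
first mail - shrinking the buffer to 1M ints, i.e. one 4 MB huge page,
matches the -524288 totals below):

#define MEMORY_SIZE (1*1024*1024)   /* was 4*1024*1024: now one 4 MB page worth of ints */

These are the results from the P4: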

Memory from malloc()
Testing memory at 0x40157008
span =        1, time =      3.178 ms, total = -524288
span =        2, time =      5.548 ms, total = -524288
span =        4, time =      9.509 ms, total = -524288
span =        8, time =     17.877 ms, total = -524288
span =       16, time =     37.348 ms, total = -524288
span =       32, time =     71.231 ms, total = -524288
span =       64, time =     72.066 ms, total = -524288
span =      128, time =     73.971 ms, total = -524288
span =      256, time =     77.575 ms, total = -524288
span =      512, time =     86.041 ms, total = -524288
span =     1024, time =    108.016 ms, total = -524288
span =     2048, time =    117.772 ms, total = -524288
span =     4096, time =    115.696 ms, total = -524288

Memory from hugetlbfs
Testing memory at 0x40800000
span =        1, time =      3.061 ms, total = -524288
span =        2, time =      5.419 ms, total = -524288
span =        4, time =      9.406 ms, total = -524288
span =        8, time =     17.731 ms, total = -524288
span =       16, time =     37.213 ms, total = -524288
span =       32, time =     70.973 ms, total = -524288
span =       64, time =     71.695 ms, total = -524288
span =      128, time =     73.393 ms, total = -524288
span =      256, time =     76.395 ms, total = -524288
span =      512, time =     84.490 ms, total = -524288
span =     1024, time =    106.709 ms, total = -524288
span =     2048, time =    120.795 ms, total = -524288
span =     4096, time =    122.431 ms, total = -524288

Any other ideas?

(Interesting note - the 700 MHz P3 laptop is nearly twice as fast as
the 1.7 GHz P4 desktop (261ms vs 489ms) at the span 4096 case, but the
P4 beats the P3 by a factor of 23 for the stride 1 (3ms vs 71 ms)!)

-- 
Nick Craig-Wood
ncw1@axis.demon.co.uk

* Re: 2.6.0 Huge pages not working as expected
From: William Lee Irwin III @ 2003-12-26 20:15 UTC (permalink / raw)
  To: Nick Craig-Wood; +Cc: linux-kernel, Rohit Seth

On Fri, Dec 26, 2003 at 08:10:11PM +0000, Nick Craig-Wood wrote:
> Any other ideas?
> (Interesting note - the 700 MHz P3 laptop is nearly twice as fast as
> the 1.7 GHz P4 desktop (261ms vs 489ms) at the span 4096 case, but the
> P4 beats the P3 by a factor of 23 for the stride 1 (3ms vs 71 ms)!)

Well, at this point I'd say point oprofile at it to try to figure
out what the overhead(s) are causing the degradation.


-- wli

* Re: 2.6.0 Huge pages not working as expected
From: Linus Torvalds @ 2003-12-26 20:33 UTC (permalink / raw)
  To: Nick Craig-Wood; +Cc: William Lee Irwin III, linux-kernel, Rohit Seth



On Fri, 26 Dec 2003, Nick Craig-Wood wrote:
> 
> The results are just about the same - a slight slowdown for
> hugepages...

I don't think you are really testing the TLB - you are testing the data 
cache.

And the thing is, using huge pages will mean that the pages are 1:1
mapped, and thus get "perfectly" cache-coloured, while the anonymous mmap 
will give you random placement.

And what you are seeing is likely the fact that random placement is 
guaranteed to not have any worst-case behaviour. While perfect 
cache-coloring very much _does_ have worst-case scenarios, and you're 
likely triggering one of them.

In particular, using a pure power-of-two stride means that you are
limiting your cache to a certain subset of the full result with the
perfect coloring.

This, btw, is why I don't like page coloring: it does give nicely
reproducible results, but it does not necessarily improve performance.  
Random placement has a lot of advantages, one of which is a lot smoother
performance degradation - which I personally think is a good thing.

Try your program with non-power-of-two, and non-page-aligned strides. I
suspect the results will change (but I suspect that the TLB wins will 
still be pretty much in the noise compared to the actual data cache 
effects).

		Linus

* Re: 2.6.0 Huge pages not working as expected
From: Andrea Arcangeli @ 2003-12-27  3:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Craig-Wood, William Lee Irwin III, linux-kernel, Rohit Seth

On Fri, Dec 26, 2003 at 12:33:58PM -0800, Linus Torvalds wrote:
> This, btw, is why I don't like page coloring: it does give nicely
> reproducible results, but it does not necessarily improve performance.  

static page coloring doesn't mean you have to map 1:1 (though with the
large pages there's no choice but 1:1 ;).

the best algorithm, giving three-digit percent improvements when I tested
it with Sebastien Cabaniols on some alpha last year, is the below mode == 1
(the mode is selectable both at runtime and at boot time):

+       /*
+        * If pfn is negative just try to distribute the page colors globally
+        * with a dynamic page coloring.
+        */
+       color = pfn;
+       switch (page_coloring_mode) {
+       case 0:
+               break;
+       case 1:
+               /* when == 1 optimize FFT (avoids some cache trashing) */
+               color = color + (color / page_colors);
+               break;
+       }
+       if (pfn < 0)
+               color = global_color;
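
For illustration only - this is not from the patch, and the final modulo
plus the page_colors value of 512 are assumptions for the example - pfns
that all collide on the same color under strict 1:1 coloring get spread
across different colors by mode 1:

#include <stdio.h>

#define PAGE_COLORS 512   /* assumed: e.g. 512 page-sized colors in the L2 */

int main(void)
{
    long pfn;

    /* pfns that are exact multiples of PAGE_COLORS apart, e.g. the rows
       of a power-of-two sized matrix */
    for (pfn = 0; pfn < 4L * PAGE_COLORS; pfn += PAGE_COLORS) {
        long mode0 = pfn % PAGE_COLORS;                        /* strict 1:1: always color 0 */
        long mode1 = (pfn + pfn / PAGE_COLORS) % PAGE_COLORS;  /* mode 1: shifts one per lap */
        printf("pfn %5ld: mode0 color %3ld, mode1 color %3ld\n",
               pfn, mode0, mode1);
    }
    return 0;
}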


the perfect static page coloring 1:1 (mode == 0) was the worst IIRC at
some math algorithm walking a matrix horizontally and vertically at the
same time, especially if every row is a page or similar multiple, for
the reasons you just said. But the mode == 1 was the very best, much
better than random and 1:1.

> Random placement has a lot of advantages, one of which is a lot smoother

well, at least on the alpha the above mode = 1 is reproducibly a lot
better (we're talking about a wall time 2/3 times shorter IIRC) than
random placement. The l2 is huge and one way cache associative, we
couldn't reproduce the same results on an alpha with tiny caches and
16-way set associative or similar. Note the above has nothing to do with
the patches I've seen floating around for the last years.  Those are all
dynamic page coloring, the above does dynamic coloring of the kernel
code only, and it makes sure the dynamic coloring of the kernel is never
strict, while it can be strict for userspace optionally (strict means,
shrink the cache hard until it finds the asked color, which is a must
have feature on the alpha for the math apps with tiny vm working set and
lots of ram, though I'm sure the 'strict' mode would make no sense on
the x86, except during pure benchmarking where reproducible results are
valuable). It also colors the pagecache with the inode offset (plus a
random offset from the inode pointer IIRC).

I guess gcc developers and most other cpu-benchmarking efforts would
benefit from an algorithm like the above (plus the strict mode in the
same patch), so they can remove some (at least theoretical) noise from
the nightly spec runs. This is ignoring the benefits on the non-x86 archs.

The current patch is for 2.2 with an horrible API (it uses a kernel
module to set those params instead of a sysctl, despite all the real
code is linked into the kernel), while developing it I only focused on
the algorithms and the final behaviour in production. the engine to ask
the allocator a page of the right color works O(1) with the number of
free pages and it's from Jason.  the allocator engine is completely
shared between my implementation and the other patches floating around.
The engine was so well designed and correctly implemented that there was
no reason for me to touch it.  Really the implementation of the engine
could be cleaner but I didn't bother to clean it up.

* Re: 2.6.0 Huge pages not working as expected
From: Linus Torvalds @ 2003-12-27  4:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Craig-Wood, William Lee Irwin III, linux-kernel, Rohit Seth



On Sat, 27 Dec 2003, Andrea Arcangeli wrote:
> 
> well, at least on the alpha the above mode = 1 is reproducibly a lot
> better (we're talking about a wall time 2/3 times shorter IIRC) than
> random placement. The l2 is huge and one way cache associative,

What kind of strange and misguided hw engineer did that?

I can understand a one-way L1, simply to keep the cycle time low, but 
what's the point of a one-way L2? Braindead external cache controller?

> The current patch is for 2.2 with an horrible API (it uses a kernel
> module to set those params instead of a sysctl, despite all the real
> code is linked into the kernel), while developing it I only focused on
> the algorithms and the final behaviour in production. the engine to ask
> the allocator a page of the right color works O(1) with the number of
> free pages and it's from Jason.

Does it keep fragmentation down?

That's the problem that Davem had in one of his cache-coloring patches: it
worked well enough if you had lots of memory, but it _totally_ broke down
when memory was low. You couldn't allocate higher-order pages at all after
a while because of the fragmented memory.

			Linus

* Re: 2.6.0 Huge pages not working as expected
From: Nick Craig-Wood @ 2003-12-27  9:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: William Lee Irwin III, linux-kernel, Rohit Seth

On Fri, Dec 26, 2003 at 12:33:58PM -0800, Linus Torvalds wrote:
> On Fri, 26 Dec 2003, Nick Craig-Wood wrote:
> > 
> > The results are just about the same - a slight slowdown for
> > hugepages...
> 
> I don't think you are really testing the TLB - you are testing the data 
> cache.
> 
> And the thing is, using huge pages will mean that the pages are 1:1
> mapped, and thus get "perfectly" cache-coloured, while the anonymous mmap 
> will give you random placement.

Mmmm, yes.

> And what you are seeing is likely the fact that random placement is 
> guaranteed to not have any worst-case behaviour. While perfect 
> cache-coloring very much _does_ have worst-case scenarios, and you're 
> likely triggering one of them.
> 
> In particular, using a pure power-of-two stride means that you are
> limiting your cache to a certain subset of the full result with the
> perfect coloring.
> 
> This, btw, is why I don't like page coloring: it does give nicely
> reproducible results, but it does not necessarily improve performance.  
> Random placement has a lot of advantages, one of which is a lot smoother
> performance degradation - which I personally think is a good thing.
> 
> Try your program with non-power-of-two, and non-page-aligned strides. I
> suspect the results will change (but I suspect that the TLB wins will 
> still be pretty much in the noise compared to the actual data cache 
> effects).

Yes, you are right, and I should have thought of that, as I know that
FFTs often have a bit of padding on each row to make them a non power
of two to avoid this effect!
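
The change is basically just the span loop in test() - a minimal sketch,
with the span table taken from the results below (the actual run clearly
also used a different buffer size, since the totals differ):

    /* hypothetical replacement for the power-of-two span loop in test();
       i and p are the same variables as in the original function */
    static const int spans[] = { 1, 2, 3, 5, 7, 11, 13, 17, 33, 77,
                                 119, 221, 561, 963, 1309, 2023, 4335 };
    int s;

    for (s = 0; s < (int)(sizeof(spans)/sizeof(spans[0])); s++)
    {
	int span = spans[s];
	double start = timef();
	int j;
	int total = 0;

	for (j = 0; j < span; j++)
	    for (i = j; i < MEMORY_SIZE; i += span)
		total += p[i];

	start = timef() - start;
	printf("span = %8d, time = %10.3f ms, total = %d\n", span, 1000*start, total);
    }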

Here are the results again with some non-power-of-two strides, run on
a P4.  Apart from the variable results, the hugetlb times are always
lower than the small-page ones.

Memory from /dev/zero
Testing memory at 0x42400000
span =        1, time =     12.103 ms, total = -973807672
span =        2, time =     21.051 ms, total = -973807672
span =        3, time =     28.391 ms, total = -973807672
span =        5, time =     44.004 ms, total = -973807672
span =        7, time =     60.622 ms, total = -973807672
span =       11, time =     96.537 ms, total = -973807672
span =       13, time =    116.335 ms, total = -973807672
span =       17, time =    153.163 ms, total = -973807672
span =       33, time =    276.764 ms, total = -973807672
span =       77, time =    282.419 ms, total = -973807672
span =      119, time =    287.168 ms, total = -973807672
span =      221, time =    298.292 ms, total = -973807672
span =      561, time =    343.215 ms, total = -973807672
span =      963, time =    418.078 ms, total = -973807672
span =     1309, time =    446.026 ms, total = -973807672
span =     2023, time =    253.098 ms, total = -973807672
span =     4335, time =     68.616 ms, total = -973807672

Memory from hugetlbfs
Testing memory at 0x41400000
span =        1, time =     12.059 ms, total = -973807672
span =        2, time =     20.745 ms, total = -973807672
span =        3, time =     28.324 ms, total = -973807672
span =        5, time =     43.683 ms, total = -973807672
span =        7, time =     60.228 ms, total = -973807672
span =       11, time =     95.680 ms, total = -973807672
span =       13, time =    115.695 ms, total = -973807672
span =       17, time =    152.603 ms, total = -973807672
span =       33, time =    275.821 ms, total = -973807672
span =       77, time =    280.759 ms, total = -973807672
span =      119, time =    285.515 ms, total = -973807672
span =      221, time =    295.163 ms, total = -973807672
span =      561, time =    335.941 ms, total = -973807672
span =      963, time =    411.387 ms, total = -973807672
span =     1309, time =    433.168 ms, total = -973807672
span =     2023, time =    119.780 ms, total = -973807672
span =     4335, time =     32.085 ms, total = -973807672

Isn't modern memory management fun ;-)

-- 
Nick Craig-Wood
ncw1@axis.demon.co.uk

* Re: 2.6.0 Huge pages not working as expected
From: David S. Miller @ 2003-12-27  9:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: andrea, ncw1, wli, linux-kernel, rohit.seth

On Fri, 26 Dec 2003 20:01:57 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> I can understand a one-way L1, simply to keep the cycle time low, but 
> what's the point of a one-way L2? Braindead external cache controller?

Most sparc64's are the same, as are the ancient sparc32 chips.

Most R4000/R5000 mips chips are like this as well.

It's stupid, but unfortunately pervasive. :)

> Does it keep fragmentation down?
> 
> That's the problem that Davem had in one of his cache-coloring patches: it
> worked well enough if you had lots of memory, but it _totally_ broke down
> when memory was low. You couldn't allocate higher-order pages at all after
> a while because of the fragmented memory.

That's right, but it could also have been because my approach to
the implementation sucked.

For example, if you just keep breaking apart order 1 or greater chunks
to give particular colors out, and later some of them get freed and some
of them don't, you get less and less buddy coalescing over time.

One idea to combat this is to make page liberation (ie. vmscan and friends)
smarter about this when swapping, kicking out page cache pages, or whatever.
Ie. see which freed pages have buddies we can liberate.  I never experimented
with any ideas like that.
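
A toy, self-contained sketch of that idea (the bitmap and helpers are made
up for illustration, not kernel code): at order 0 a page's buddy is just
its pfn with the low bit flipped, so a reclaim pass could prefer victims
whose buddy is already sitting in the free lists:

#include <stdio.h>

#define NPAGES 16
static int page_is_free[NPAGES];        /* toy model: 1 = in the buddy free lists */

static unsigned long buddy_of(unsigned long pfn)
{
    return pfn ^ 1UL;                   /* order-0 buddy */
}

/* would freeing this page let the buddy allocator coalesce immediately? */
static int coalesces_if_freed(unsigned long pfn)
{
    return page_is_free[buddy_of(pfn)];
}

int main(void)
{
    unsigned long pfn;

    page_is_free[3] = page_is_free[8] = 1;      /* say pfns 3 and 8 are already free */

    for (pfn = 0; pfn < NPAGES; pfn++)
        if (!page_is_free[pfn] && coalesces_if_freed(pfn))
            printf("reclaiming pfn %lu would merge with free buddy %lu\n",
                   pfn, buddy_of(pfn));
    return 0;
}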

* Re: 2.6.0 Huge pages not working as expected
From: Andrea Arcangeli @ 2003-12-27 15:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Craig-Wood, William Lee Irwin III, linux-kernel, Rohit Seth

On Fri, Dec 26, 2003 at 08:01:57PM -0800, Linus Torvalds wrote:
> 
> 
> On Sat, 27 Dec 2003, Andrea Arcangeli wrote:
> > 
> > well, at least on the alpha the above mode = 1 is reproducibly a lot
> > better (we're talking about a wall time 2/3 times shorter IIRC) than
> > random placement. The l2 is huge and one way cache associative,
> 
> What kind of strange and misguided hw engineer did that?

;)

> I can understand a one-way L1, simply to keep the cycle time low, but 
> what's the point of a one-way L2? Braindead external cache controller?

dunno, though if you implement the cache coloring with that algorithm
it runs fast compared to other archs. No idea how much hardware gain one
can get with a huge one-way set associative cache (with some help from
the OS to ensure it's all used and not thrashed) compared to a (tiny)
16-way associative cache.

> 
> > The current patch is for 2.2 with an horrible API (it uses a kernel
> > module to set those params instead of a sysctl, despite all the real
> > code is linked into the kernel), while developing it I only focused on
> > the algorithms and the final behaviour in production. the engine to ask
> > the allocator a page of the right color works O(1) with the number of
> > free pages and it's from Jason.
> 
> Does it keep fragmentation down?
> 
> That's the problem that Davem had in one of his cache-coloring patches: it
> worked well enough if you had lots of memory, but it _totally_ broke down
> when memory was low. You couldn't allocate higher-order pages at all after
> a while because of the fragmented memory.

what can happen is that you have a leaf at order 0 but you don't take it
and you split some order 1-MAX_ORDER page instead to get an order 0 of the
right color (the same can happen with order 1 vs order 2-MAX_ORDER). So yes,
it can fragment the memory more quickly, but the very same fragmentation
patterns can happen w/o page coloring, it's just quicker to get into a
fragmented state.

However it defragments down perfectly if two contiguous order 0 pages are
free, no difference in such a case; it's just that you may end up with more
non-mergeable order 0 pages more quickly, but it's all quite random and
I never heard of problems with that. The kernel must free cache and
possibly swap too if it fails order > 0 allocations anyway, or the
order 1 could not succeed. The swapping/cache-shrinking basically
relocates the right-colored pages into non-right-colored pages until the
defragmentation has happened.

It should be simple to add a very weak mode (the opposite of the strict
mode) that forbids the color-aware allocator from splitting a high-order page
if there's at least one page already available of the right order. That
would provide an even weaker guarantee of right coloring though; it could
end up being the same as no coloring at all in practice, which is probably
why I never even considered it until today. It's probably simpler to
disable it via sysctl than to implement this "weak" mode.

* Re: 2.6.0 Huge pages not working as expected
From: Kurt Garloff @ 2004-01-06 14:24 UTC (permalink / raw)
  To: Nick Craig-Wood; +Cc: Linux kernel list

On Fri, Dec 26, 2003 at 08:10:11PM +0000, Nick Craig-Wood wrote:
> (Interesting note - the 700 MHz P3 laptop is nearly twice as fast as
> the 1.7 GHz P4 desktop (261ms vs 489ms) at the span 4096 case, but the
> P4 beats the P3 by a factor of 23 for the stride 1 (3ms vs 71 ms)!)

The factor is 6.2 (11.5ms vs 71ms), which is 2 * 1700/550.
Still impressive: P4 is doing twice the work per cycle compared to PIII.

Regards,
-- 
Kurt Garloff                   <kurt@garloff.de>             [Koeln, DE]
Physics:Plasma modeling <garloff@plasimo.phys.tue.nl> [TU Eindhoven, NL]
Linux:SCSI, Security           <garloff@suse.de>    [SUSE Nuernberg, DE]
