linux-kernel.vger.kernel.org archive mirror
* [PATCH] patch-slab-split-03-tail
@ 2002-10-04 17:04 Manfred Spraul
  2002-10-04 19:06 ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Manfred Spraul @ 2002-10-04 17:04 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: mbligh

[-- Attachment #1: Type: text/plain, Size: 385 bytes --]

part 3:
[depends on -02-SMP]

If an object is freed from a full slab, then move the slab to the tail of
the partial list - this should increase the probability that the other
objects from the same page are freed, too, and that the page can be
returned to gfp later.

The cpu arrays are now always in front of the lists, i.e. cache hit rates
should not be affected.


Please apply

--
	Manfred


[-- Attachment #2: patch-slab-split-03-tail --]
[-- Type: text/plain, Size: 331 bytes --]

--- 2.5/mm/slab.c	Fri Oct  4 18:59:01 2002
+++ build-2.5/mm/slab.c	Fri Oct  4 18:59:11 2002
@@ -1478,7 +1478,7 @@
 		} else if (unlikely(inuse == cachep->num)) {
 			/* Was full. */
 			list_del(&slabp->list);
-			list_add(&slabp->list, &cachep->slabs_partial);
+			list_add_tail(&slabp->list, &cachep->slabs_partial);
 		}
 	}
 }
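
For reference, the one-line change hinges on where the two list helpers
splice an entry relative to the list head; the allocator takes partial
slabs from the front of slabs_partial. Simplified from
include/linux/list.h of the era:

/*
 * Both helpers splice 'new' into a circular doubly linked list;
 * they differ only in which side of 'head' the entry lands on.
 */
static inline void __list_add(struct list_head *new,
			      struct list_head *prev,
			      struct list_head *next)
{
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}

/* insert just after head: first in line for the next allocation */
static inline void list_add(struct list_head *new, struct list_head *head)
{
	__list_add(new, head, head->next);
}

/* insert just before head: last in line for the next allocation */
static inline void list_add_tail(struct list_head *new, struct list_head *head)
{
	__list_add(new, head->prev, head);
}

So with list_add_tail the freshly demoted slab becomes the last partial
slab considered for new allocations, giving its remaining objects the
longest grace period in which to be freed.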


* Re: [PATCH] patch-slab-split-03-tail
  2002-10-04 17:04 [PATCH] patch-slab-split-03-tail Manfred Spraul
@ 2002-10-04 19:06 ` Andrew Morton
  2002-10-04 19:07   ` Martin J. Bligh
  2002-10-04 19:15   ` Manfred Spraul
  0 siblings, 2 replies; 10+ messages in thread
From: Andrew Morton @ 2002-10-04 19:06 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel, mbligh

Manfred Spraul wrote:
> 
> part 3:
> [depends on -02-SMP]
> 
> If an object is freed from a full slab, then move the slab to the tail of
> the partial list - this should increase the probability that the other
> objects from the same page are freed, too, and that the page can be
> returned to gfp later.
> 
> The cpu arrays are now always in front of the lists, i.e. cache hit rates
> should not be affected.
> 

Run that by me again?  So we're saying "if we just freed an
object from this page then make this page be the *last* page
which is eligible for new allocations"?  Under the assumption
that other objects in that same page are about to be freed
up as well?

Makes sense.  It would be nice to get this confirmed in 
targeted testing ;)


* Re: [PATCH] patch-slab-split-03-tail
  2002-10-04 19:06 ` Andrew Morton
@ 2002-10-04 19:07   ` Martin J. Bligh
  2002-10-04 19:15     ` Andrew Morton
  2002-10-04 19:15   ` Manfred Spraul
  1 sibling, 1 reply; 10+ messages in thread
From: Martin J. Bligh @ 2002-10-04 19:07 UTC (permalink / raw)
  To: Andrew Morton, Manfred Spraul; +Cc: linux-kernel

> Run that by me again?  So we're saying "if we just freed an
> object from this page then make this page be the *last* page
> which is eligible for new allocations"?  Under the assumption
> that other objects in that same page are about to be freed
> up as well?
> 
> Makes sense.  It would be nice to get this confirmed in 
> targeted testing ;)

Just doing my normal boring kernel compile suggests Manfred's 
last big rollup performs exactly the same as without it. Not
sure if that's any help or not ....

M.



* Re: [PATCH] patch-slab-split-03-tail
  2002-10-04 19:07   ` Martin J. Bligh
@ 2002-10-04 19:15     ` Andrew Morton
  0 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2002-10-04 19:15 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Manfred Spraul, linux-kernel

"Martin J. Bligh" wrote:
> 
> > Run that by me again?  So we're saying "if we just freed an
> > object from this page then make this page be the *last* page
> > which is eligible for new allocations"?  Under the assumption
> > that other objects in that same page are about to be freed
> > up as well?
> >
> > Makes sense.  It would be nice to get this confirmed in
> > targeted testing ;)
> 
> Just doing my normal boring kernel compile suggests Manfred's
> last big rollup performs exactly the same as without it. Not
> sure if that's any help or not ....
> 

Well.  This patch is supposed to decrease internal fragmentation.
We need to prove that theory.  An appropriate test would be:

- boot with `mem=48m'
- untar kernel
- build kernel
- capture /proc/slabinfo

- apply patch

- repeat

- compare and explain.

I know what your reboot times are like ;)  I'll do it.


* Re: [PATCH] patch-slab-split-03-tail
  2002-10-04 19:06 ` Andrew Morton
  2002-10-04 19:07   ` Martin J. Bligh
@ 2002-10-04 19:15   ` Manfred Spraul
  2002-10-04 20:22     ` Randy.Dunlap
  1 sibling, 1 reply; 10+ messages in thread
From: Manfred Spraul @ 2002-10-04 19:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, mbligh

Andrew Morton wrote:
> 
> Makes sense.  It would be nice to get this confirmed in 
> targeted testing ;)
>
Not yet done.

The right way to test it would be to collect alloc/free traces in the 
kernel, then replay them against both versions and check which version 
gives less internal fragmentation.
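
A toy user-space replay of that experiment could look like the sketch
below: a random alloc/free stream stands in for a recorded kernel trace,
and the only knob is whether a formerly-full page re-enters the partial
list at the head (old behaviour) or the tail (this patch). Everything
here is an illustrative model, not slab.c:

/* toy-replay.c - count pages returned under both insertion policies.
 * The trace is faked with rand(); a real test would feed recorded
 * kernel alloc/free events instead.  Build: cc -O2 toy-replay.c */
#include <stdio.h>
#include <stdlib.h>

#define OBJS_PER_PAGE	8
#define NOPS		1000000

struct page { int inuse; struct page *next, *prev; };

static struct page head;		/* circular partial-list head */
static struct page *live[NOPS];		/* page of each live object */
static int nlive;

static void add_head(struct page *p)
{
	p->next = head.next; p->prev = &head;
	head.next->prev = p; head.next = p;
}

static void add_tail(struct page *p)
{
	p->prev = head.prev; p->next = &head;
	head.prev->next = p; head.prev = p;
}

static void del(struct page *p)
{
	p->prev->next = p->next; p->next->prev = p->prev;
}

static long run(int tail_policy, unsigned seed)
{
	long released = 0;
	int op;

	head.next = head.prev = &head;
	nlive = 0;
	srand(seed);
	for (op = 0; op < NOPS; op++) {
		if (nlive == 0 || (rand() & 1)) {	/* allocate */
			struct page *p = head.next;
			if (p == &head) {		/* no partial page left */
				p = calloc(1, sizeof(*p));
				add_head(p);
			}
			live[nlive++] = p;
			if (++p->inuse == OBJS_PER_PAGE)
				del(p);			/* full pages leave the list */
		} else {				/* free a random live object */
			int i = rand() % nlive;
			struct page *p = live[i];
			live[i] = live[--nlive];
			if (p->inuse == OBJS_PER_PAGE) {
				if (tail_policy)	/* patched behaviour */
					add_tail(p);
				else			/* old behaviour */
					add_head(p);
			}
			if (--p->inuse == 0) {		/* page empty: back to gfp */
				del(p);
				free(p);
				released++;
			}
		}
	}
	return released;	/* pages still in use are leaked; fine for a sketch */
}

int main(void)
{
	printf("head policy: %ld pages released\n", run(0, 42));
	printf("tail policy: %ld pages released\n", run(1, 42));
	return 0;
}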

Or perhaps Bonwick has done that for his slab paper, but I don't have it :-(

* An implementation of the Slab Allocator as described in outline in;
*      UNIX Internals: The New Frontiers by Uresh Vahalia
*      Pub: Prentice Hall      ISBN 0-13-101908-2
* or with a little more detail in;
*      The Slab Allocator: An Object-Caching Kernel Memory Allocator
*      Jeff Bonwick (Sun Microsystems).
*      Presented at: USENIX Summer 1994 Technical Conference


--
	Manfred




* Re: [PATCH] patch-slab-split-03-tail
  2002-10-04 19:15   ` Manfred Spraul
@ 2002-10-04 20:22     ` Randy.Dunlap
  2002-10-04 21:25       ` Manfred Spraul
  0 siblings, 1 reply; 10+ messages in thread
From: Randy.Dunlap @ 2002-10-04 20:22 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Andrew Morton, linux-kernel, mbligh

On Fri, 4 Oct 2002, Manfred Spraul wrote:

| Andrew Morton wrote:
| >
| > Makes sense.  It would be nice to get this confirmed in
| > targeted testing ;)
|  >
| Not yet done.
|
| The right way to test it would be to collect alloc/free traces in the
| kernel, then replay them against both versions and check which version
| gives less internal fragmentation.
|
| Or perhaps Bonwick has done that for his slab paper, but I don't have it :-(

Did you look at http://www.usenix.org/events/usenix01/bonwick.html
for it?

| * An implementation of the Slab Allocator as described in outline in;
| *      UNIX Internals: The New Frontiers by Uresh Vahalia
| *      Pub: Prentice Hall      ISBN 0-13-101908-2
| * or with a little more detail in;
| *      The Slab Allocator: An Object-Caching Kernel Memory Allocator
| *      Jeff Bonwick (Sun Microsystems).
| *      Presented at: USENIX Summer 1994 Technical Conference
| --

-- 
~Randy



* Re: [PATCH] patch-slab-split-03-tail
  2002-10-04 20:22     ` Randy.Dunlap
@ 2002-10-04 21:25       ` Manfred Spraul
  2002-10-04 21:43         ` Robert Love
  2002-10-05  0:14         ` Anton Blanchard
  0 siblings, 2 replies; 10+ messages in thread
From: Manfred Spraul @ 2002-10-04 21:25 UTC (permalink / raw)
  To: Randy.Dunlap; +Cc: Andrew Morton, linux-kernel, mbligh

Randy.Dunlap wrote:
> 
> Did you look at http://www.usenix.org/events/usenix01/bonwick.html
> for it?
> 
Thanks for the link - that describes the newer, per-cpu extensions to 
slab. Quite similar to the Linux implementation.

The text also contains a link to the original paper:

http://www.usenix.org/publications/library/proceedings/bos94/bonwick.html

Bonwick used one partially sorted list [as Linux in 2.2 and 2.4.<10] 
instead of separate lists - moving a slab to the tail was not an option.

The new paper contains one interesting comment:
<<<<<<<
An object cache's CPU layer contains per-CPU state that must be 
protected either by per-CPU locking or by disabling interrupts. We 
selected per-CPU locking for several reasons:
[...]
  x    Performance. On most modern processors, grabbing an uncontended 
lock is cheaper than modifying the processor interrupt level.
<<<<<<<<

Which CPUs have slow local_irq_disable() implementations? At least for 
my Duron, this doesn't seem to be the case [~4 CPU cycles for cli].
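
For concreteness, the two protection styles the paper compares would 
look roughly like this around the per-cpu array fast path (sketch only - 
cc_entry() and the cc/objp names are from slab.c, but the per-CPU 'lock' 
member in variant (b) is hypothetical; Linux's cpucache_t has no such 
field):

	/* (a) Linux: shut out local interrupts */
	local_irq_save(flags);
	cc_entry(cc)[cc->avail++] = objp;
	local_irq_restore(flags);

	/* (b) Bonwick-style: per-CPU spinlock (hypothetical field) */
	spin_lock(&cc->lock);
	cc_entry(cc)[cc->avail++] = objp;
	spin_unlock(&cc->lock);

Variant (b) never blocks interrupt delivery, but pays for a locked 
operation on every alloc/free - whether that is a win is exactly the 
cli-cost question.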


--
	Manfred



* Re: [PATCH] patch-slab-split-03-tail
  2002-10-04 21:25       ` Manfred Spraul
@ 2002-10-04 21:43         ` Robert Love
  2002-10-04 22:30           ` Manfred Spraul
  2002-10-05  0:14         ` Anton Blanchard
  1 sibling, 1 reply; 10+ messages in thread
From: Robert Love @ 2002-10-04 21:43 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Randy.Dunlap, Andrew Morton, linux-kernel, mbligh

On Fri, 2002-10-04 at 17:25, Manfred Spraul wrote:

> Which CPUs have slow local_irq_disable() implementations? At least
> for my Duron, this doesn't seem to be the case [~4 CPU cycles
> for cli].

I believe disabling interrupts has pipeline effects, e.g. the pipeline
has to be flushed?

	Robert Love



* Re: [PATCH] patch-slab-split-03-tail
  2002-10-04 21:43         ` Robert Love
@ 2002-10-04 22:30           ` Manfred Spraul
  0 siblings, 0 replies; 10+ messages in thread
From: Manfred Spraul @ 2002-10-04 22:30 UTC (permalink / raw)
  To: Robert Love; +Cc: Randy.Dunlap, Andrew Morton, linux-kernel, mbligh

[-- Attachment #1: Type: text/plain, Size: 495 bytes --]

Robert Love wrote:
> On Fri, 2002-10-04 at 17:25, Manfred Spraul wrote:
> 
> 
>>Which CPUs have slow local_irq_disable() implementations? At least
>>for my Duron, this doesn't seem to be the case [~4 CPU cycles
>>for cli].
> 
> 
> I believe disabling interrupts has pipeline effects, e.g. the pipeline
> has to be flushed?
> 
At least my Duron [700 MHz] obviously doesn't flush the pipeline. If the 
Pentium 4 flushes its pipeline, that could mean 20+ cycles - test app is 
attached.

--
	Manfred

[-- Attachment #2: cli.cpp --]
[-- Type: text/plain, Size: 3296 bytes --]

/*
 * cli.cpp: RDTSC based performance tester.
 *
 * Copyright (C) 1999, 2001, 2002 by Manfred Spraul.
 *	All rights reserved except the rights granted by the GPL.
 *
 * Redistribution of this file is permitted under the terms of the GNU 
 * General Public License (GPL) version 2 or later.
 * $Header: /pub/home/manfred/cvs-tree/timetest/cli.cpp,v 1.4 2002/10/04 21:22:09 manfred Exp $
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <getopt.h>

// define CACHE_FLUSH to flush the CPU cache between measurements
#undef CACHE_FLUSH

// Intel recommends that a serializing instruction
// should be called before and after rdtsc.
// CPUID is a serializing instruction.
// ".align 128:" P 4 L2 cache line size
#define read_rdtsc_before(time)		\
	__asm__ __volatile__(		\
		".align 128\n\t"	\
		"xor %%eax,%%eax\n\t"	\
		"cpuid\n\t"		\
		"rdtsc\n\t"		\
		"mov %%eax,(%0)\n\t"	\
		"mov %%edx,4(%0)\n\t"	\
		"xor %%eax,%%eax\n\t"	\
		"cpuid\n\t"		\
		: /* no output */	\
		: "S"(&time)		\
		: "eax", "ebx", "ecx", "edx", "memory")

#define read_rdtsc_after(time)		\
	__asm__ __volatile__(		\
		"xor %%eax,%%eax\n\t"	\
		"cpuid\n\t"		\
		"rdtsc\n\t"		\
		"mov %%eax,(%0)\n\t"	\
		"mov %%edx,4(%0)\n\t"	\
		"xor %%eax,%%eax\n\t"	\
		"cpuid\n\t"		\
		"sti\n\t"		\
		: /* no output */	\
		: "S"(&time)		\
		: "eax", "ebx", "ecx", "edx", "memory")

#define BUILD_TESTFNC(name, text, instructions) \
void name##_dummy(void)				\
{						\
	__asm__ __volatile__(			\
		".align 4096\n\t"		\
		"xor %%eax, %%eax\n\t"		\
		: : : "eax");			\
}						\
static unsigned long name##_best = 1024*1024*1024; \
\
static void name(void) \
{ \
	unsigned long long time; \
	unsigned long long time2; \
 \
	read_rdtsc_before(time); \
	instructions; \
	read_rdtsc_after(time2); \
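	/* report each new minimum, net of the zerotest (empty-harness) baseline */ \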
	if(time2-time < name##_best) { \
		printf( text ":\t%10Ld ticks; \n", \
			time2-time-zerotest_best); \
		name##_best = time2-time; \
	} \
}

void filler(void)
{
	static int i = 3;
	static int j;
	j = i*i;
}

#define DO_3(x) \
	do { x; x; x; } while(0)

#define DO_10(x) \
	do { x; x; x; x; x; x; x; x; x; x;} while(0)

#define DO_50(x) \
	do { DO_10(x); DO_10(x);DO_10(x); DO_10(x);DO_10(x);} while(0)


#define DO_T(y) do { \
	DO_3(filler()); \
	y; \
	DO_3(filler());} while(0)

#ifdef CACHE_FLUSH
#define DRAIN_SZ	(4*1024*1024)
int other[3*DRAIN_SZ] __attribute__ ((aligned (4096)));
static inline void drain_cache(void)
{
	int i;
	for(i=0;i<DRAIN_SZ;i++) other[DRAIN_SZ+i]=0;
	for(i=0;i<DRAIN_SZ;i++) if(other[DRAIN_SZ+i]!=0) break;
}
#else
static inline void drain_cache(void)
{
}
#endif

#define DO_TEST(x) \
	do { \
		int i; \
		for(i=0;i<500000;i++) \
			x; \
	} while(0)

//////////////////////////////////////////////////////////////////////////////

static inline void nothing()
{
	__asm__ __volatile__("nop": : : "memory");
}

BUILD_TESTFNC(zerotest,"zerotest", DO_T(nothing()));

//////////////////////////////////////////////////////////////////////////////

static inline void test0()
{
	__asm__ __volatile__("cli": : : "memory");
}

BUILD_TESTFNC(test_0, "cli", DO_T(test0()))

//////////////////////////////////////////////////////////////////////////////
extern "C" int iopl __P ((int __level));

int main()
{
	printf("CLI bench\n");
	iopl(3);

	for(;;) {
		DO_TEST(zerotest());
		DO_TEST(test_0());
	}
	return 0;
}


* Re: [PATCH] patch-slab-split-03-tail
  2002-10-04 21:25       ` Manfred Spraul
  2002-10-04 21:43         ` Robert Love
@ 2002-10-05  0:14         ` Anton Blanchard
  1 sibling, 0 replies; 10+ messages in thread
From: Anton Blanchard @ 2002-10-05  0:14 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Randy.Dunlap, Andrew Morton, linux-kernel, mbligh


> <<<<<<<
> An object cache's CPU layer contains per-CPU state that must be 
> protected either by per-CPU locking or by disabling interrupts. We 
> selected per-CPU locking for several reasons:
> [...]
>  x    Performance. On most modern processors, grabbing an uncontended 
> lock is cheaper than modifying the processor interrupt level.
> <<<<<<<<
> 
> Which CPUs have slow local_irq_disable() implementations? At least for
> my Duron, this doesn't seem to be the case [~4 CPU cycles for cli].

Rusty did some tests and found on the intel chips he tested
local_irq_disable was slower. He posted the results to lkml a few weeks
ago.

On ppc64 it varies between chips.

Anton

