linux-kernel.vger.kernel.org archive mirror
* [PATH] slab cleanup
@ 2002-09-29 12:43 Manfred Spraul
  2002-09-29 13:56 ` Ed Tomlinson
  2002-09-30  6:55 ` Martin J. Bligh
  0 siblings, 2 replies; 4+ messages in thread
From: Manfred Spraul @ 2002-09-29 12:43 UTC (permalink / raw)
  To: lse-tech
  Cc: Martin J. Bligh, akpm, tomlins, Kamble, Nitin A, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2396 bytes --]

I think we should make some cleanups before adding NUMA:

a) make the per-cpu arrays unconditional, even on UP.
	The arrays provide LIFO ordering, which should improve
	cache hit rates. The disadvantage is probably higher
	internal fragmentation [not measured], but memory is cheap,
	and cache hits are important.
	In addition, this makes it possible to perform some cleanups.
	No more kmem_cache_alloc_one, for example.
	[A condensed sketch of the head array hot path follows after
	this list.]

b) use arrays for all caches, even with large objects.
	E.g. right now, the 16 kB loopback socket buffers
	are not handled per-cpu.
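
The sketch referenced above - condensed from the attached patch, with the
interrupt disabling, statistics and debug checks stripped out. The names
cache_alloc/cache_free are only for illustration; cc_data(), cc_entry(),
kmem_cache_alloc_refill() and kmem_cache_flusharray() are the helpers from
the patch:

static inline void *cache_alloc(kmem_cache_t *cachep, int flags)
{
	cpucache_t *cc = cc_data(cachep);	/* this cpu's LIFO head array */

	if (likely(cc->avail))
		/* hand out the most recently freed, cache-warm object */
		return cc_entry(cc)[--cc->avail];
	/* array empty: take the spinlock once and refill a whole batch */
	return kmem_cache_alloc_refill(cachep, flags);
}

static inline void cache_free(kmem_cache_t *cachep, void *objp)
{
	cpucache_t *cc = cc_data(cachep);

	if (unlikely(cc->avail == cc->limit))
		/* overflow: return one batch to the slab lists, one lock op */
		kmem_cache_flusharray(cachep, cc);
	cc_entry(cc)[cc->avail++] = objp;
}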

Nitin, in your NUMA patch, you return each "wrong-node" object
immediately, without batching them together. Have you considered
batching? It could reduce the number of spinlock operations dramatically.
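
Roughly what I have in mind - only a sketch, nothing like it is in the
attached patch yet. cc_data_othernode() is the hypothetical second per-cpu
array that the patch comments mention, and with per-node lists the lock
would of course move into the remote node's list3:

/* called with local interrupts disabled, like __kmem_cache_free */
static void free_wrongnode(kmem_cache_t *cachep, void *objp)
{
	cpucache_t *cc = cc_data_othernode(cachep);	/* hypothetical */

	cc_entry(cc)[cc->avail++] = objp;
	if (cc->avail == cc->limit) {
		/* one spinlock round trip returns a whole batch */
		spin_lock(&cachep->spinlock);
		__free_block(cachep, cc_entry(cc), cc->avail);
		spin_unlock(&cachep->spinlock);
		cc->avail = 0;
	}
}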

There is one additional problem without an obvious solution:
kmem_cache_reap() is too inefficient. Before 2.5.39, it served 2 purposes:

1) return freeable slabs back to gfp.

	<2.5.39 scans through the caches in every try_to_free_pages.
	That scan is terribly inefficient, and akpm noticed lots of
	idle reschedules on the cache_chain_sem

	2.5.39 limits the freeable slabs list to one entry, and doesn't
	scan at all. On the downside, this locks up one slab in each
	cache [in the worst case, an order=5 allocation]. For very
	bursty, slightly fragmented caches, it could lead to
	more kmem_cache_grow().

	My patch scans every 2 seconds, and returns 80% of the pages.

2) regularly flush the per-cpu arrays
	<2.5.39 does that in every try_to_free_pages.

	My patch does that every 2 seconds, on a random cpu.

	2.5.39 never does that. Not flushing probably reduces cpucache
	thrashing [alloc a batch of 120 objects, return one object,
	release a batch of 119 objects due to the next try_to_free_pages].
	The problem is that without flushing, the cpuarrays can
	lock up lots of pages [in the worst case, several thousand
	for each cpu].

The attached patch is WIP:
* boots on UP
* partially boots with bochs [cpu simulator], until bochs locks up.
* contains comments, where I'd add modifications for NUMA.

What do you think?
For NUMA, is it possible to figure out efficiently into which node a
pointer points? That would happen in every kmem_cache_free().
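
I'd hope for something as cheap as the sketch below. pfn_to_nid() here just
stands in for whatever pfn-to-node mapping an architecture can provide
cheaply - that it exists (and is cheap) is an assumption on my side:

/* sketch: map an object pointer to its home node */
static inline int obj_to_node(const void *objp)
{
	return pfn_to_nid(__pa(objp) >> PAGE_SHIFT);
}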

Could someone test that it works on real SMP?

What is the best replacement for kmem_cache_reap()? 2.5.39 obviously
contains the simplest approach, but I'm not sure if it's the best.

--
	Manfred



[-- Attachment #2: patch-slab-LIFO --]
[-- Type: text/plain, Size: 51804 bytes --]

--- 2.5/mm/slab.c	Sat Sep 28 12:55:16 2002
+++ build-2.5/mm/slab.c	Sun Sep 29 13:16:49 2002
@@ -8,6 +8,9 @@
  * Major cleanup, different bufctl logic, per-cpu arrays
  *	(c) 2000 Manfred Spraul
  *
+ * Cleanup, make the head arrays unconditional, preparation for NUMA
+ * 	(c) 2002 Manfred Spraul
+ *
  * An implementation of the Slab Allocator as described in outline in;
  *	UNIX Internals: The New Frontiers by Uresh Vahalia
  *	Pub: Prentice Hall	ISBN 0-13-101908-2
@@ -16,7 +19,6 @@
  *	Jeff Bonwick (Sun Microsystems).
  *	Presented at: USENIX Summer 1994 Technical Conference
  *
- *
  * The memory is organized in caches, one cache for each object type.
  * (e.g. inode_cache, dentry_cache, buffer_head, vm_area_struct)
  * Each cache consists out of many slabs (they are small (usually one
@@ -38,12 +40,14 @@
  * kmem_cache_destroy() CAN CRASH if you try to allocate from the cache
  * during kmem_cache_destroy(). The caller must prevent concurrent allocs.
  *
- * On SMP systems, each cache has a short per-cpu head array, most allocs
+ * Each cache has a short per-cpu head array, most allocs
  * and frees go into that array, and if that array overflows, then 1/2
  * of the entries in the array are given back into the global cache.
- * This reduces the number of spinlock operations.
+ * The head array is strictly LIFO and should improve the cache hit rates.
+ * On SMP, it additionally reduces the spinlock operations.
  *
- * The c_cpuarray may not be read with enabled local interrupts.
+ * The c_cpuarray may not be read with enabled local interrupts - 
+ * it's changed with a smp_call_function().
  *
  * SMP synchronization:
  *  constructors and destructors are called without any locking.
@@ -53,6 +57,10 @@
  *  	and local interrupts are disabled so slab code is preempt-safe.
  *  The non-constant members are protected with a per-cache irq spinlock.
  *
+ * Many thanks to Mark Hemment, who wrote another per-cpu slab patch
+ * in 2000 - many ideas in the current implementation are derived from
+ * his patch.
+ *
  * Further notes from the original documentation:
  *
  * 11 April '97.  Started multi-threading - markhe
@@ -61,10 +69,6 @@
  *	can never happen inside an interrupt (kmem_cache_create(),
  *	kmem_cache_shrink() and kmem_cache_reap()).
  *
- *	To prevent kmem_cache_shrink() trying to shrink a 'growing' cache (which
- *	maybe be sleeping and therefore not holding the semaphore/lock), the
- *	growing field is used.  This also prevents reaping from a cache.
- *
  *	At present, each engine can be growing a cache.  This should be blocked.
  *
  */
@@ -77,6 +81,7 @@
 #include	<linux/init.h>
 #include	<linux/compiler.h>
 #include	<linux/seq_file.h>
+#include	<linux/notifier.h>
 #include	<asm/uaccess.h>
 
 /*
@@ -100,11 +105,8 @@
 #define	FORCED_DEBUG	0
 #endif
 
-/*
- * Parameters for kmem_cache_reap
- */
-#define REAP_SCANLEN	10
-#define REAP_PERFECT	10
+
+static unsigned long g_cache_count;
 
 /* Shouldn't this be in a header file somewhere? */
 #define	BYTES_PER_WORD		sizeof(void *)
@@ -170,39 +172,91 @@
  * cpucache_t
  *
  * Per cpu structures
+ * Purpose:
+ * - LIFO ordering, to hand out cache-warm objects from _alloc
+ * - reduce spinlock operations
+ *
  * The limit is stored in the per-cpu structure to reduce the data cache
  * footprint.
+ * On NUMA systems, 2 per-cpu structures exist: one for the current
+ * node, one for wrong node free calls.
+ * Memory from the wrong node is never returned by alloc, it's returned
+ * to the home node as soon as the cpu cache is filled
+ *
  */
 typedef struct cpucache_s {
 	unsigned int avail;
 	unsigned int limit;
+	unsigned int batchcount;
 } cpucache_t;
 
+/* bootstrap: The caches do not work without cpuarrays anymore,
+ * but the cpuarrays are allocated from the generic caches...
+ */
+#define BOOT_CPUCACHE_ENTRIES	1
+struct cpucache_int {
+	cpucache_t cache;
+	void * entries[BOOT_CPUCACHE_ENTRIES];
+};
+
 #define cc_entry(cpucache) \
 	((void **)(((cpucache_t*)(cpucache))+1))
 #define cc_data(cachep) \
 	((cachep)->cpudata[smp_processor_id()])
 /*
+ * NUMA: check if 'ptr' points into the current node,
+ * 	use the alternate cpudata cache if wrong
+ */
+#define cc_data_ptr(cachep, ptr) \
+		cc_data(cachep)
+
+/*
+ * The slab lists of all objects.
+ * They provide some aging, and hopefully reduce the 
+ * internal fragmentation
+ * TODO: The spinlock should be moved from the kmem_cache_t
+ * into this structure, too.
+ */
+struct kmem_list3 {
+	struct list_head	slabs_partial;	/* partial list first, better asm code */
+	struct list_head	slabs_full;
+	struct list_head	slabs_free;
+};
+
+#define LIST3_INIT(parent) \
+	{ \
+		.slabs_full	= LIST_HEAD_INIT(parent.slabs_full), \
+		.slabs_partial	= LIST_HEAD_INIT(parent.slabs_partial), \
+		.slabs_free	= LIST_HEAD_INIT(parent.slabs_free) \
+	}
+#define list3_data(cachep) \
+	(&(cachep)->lists)
+
+/* NUMA: per-node */
+#define list3_data_ptr(cachep, ptr) \
+		list3_data(cachep)
+
+/*
  * kmem_cache_t
  *
  * manages a cache.
  */
-
+	
 struct kmem_cache_s {
-/* 1) each alloc & free */
-	/* full, partial first, then free */
-	struct list_head	slabs_full;
-	struct list_head	slabs_partial;
-	struct list_head	slabs_free;
+/* 1) per-cpu data, touched during every alloc/free */
+	cpucache_t		*cpudata[NR_CPUS];
+	/* NUMA: cpucache_t	*cpudata_othernode[NR_CPUS]; */
+	unsigned int		batchcount;
+	unsigned int		limit;
+/* 2) touched by every alloc & free from the backend */
+	struct kmem_list3	lists;
+	/* NUMA: kmem_3list_t	*nodelists[NR_NODES] */
 	unsigned int		objsize;
 	unsigned int	 	flags;	/* constant flags */
 	unsigned int		num;	/* # of objs per slab */
 	spinlock_t		spinlock;
-#ifdef CONFIG_SMP
-	unsigned int		batchcount;
-#endif
 
-/* 2) slab additions /removals */
+/* 3) kmem_cache_grow/shrink */
 	/* order of pgs per slab (2^n) */
 	unsigned int		gfporder;
 
@@ -213,7 +267,6 @@
 	unsigned int		colour_off;	/* colour offset */
 	unsigned int		colour_next;	/* cache colouring */
 	kmem_cache_t		*slabp_cache;
-	unsigned int		growing;
 	unsigned int		dflags;		/* dynamic flags */
 
 	/* constructor func */
@@ -222,15 +275,11 @@
 	/* de-constructor func */
 	void (*dtor)(void *, kmem_cache_t *, unsigned long);
 
-	unsigned long		failures;
-
-/* 3) cache creation/removal */
+/* 4) cache creation/removal */
 	const char		*name;
 	struct list_head	next;
-#ifdef CONFIG_SMP
-/* 4) per-cpu data */
-	cpucache_t		*cpudata[NR_CPUS];
-#endif
+
+/* 5) statistics */
 #if STATS
 	unsigned long		num_active;
 	unsigned long		num_allocations;
@@ -238,13 +287,12 @@
 	unsigned long		grown;
 	unsigned long		reaped;
 	unsigned long 		errors;
-#ifdef CONFIG_SMP
+	unsigned long		max_freeable;
 	atomic_t		allochit;
 	atomic_t		allocmiss;
 	atomic_t		freehit;
 	atomic_t		freemiss;
 #endif
-#endif
 };
 
 /* internal c_flags */
@@ -268,6 +316,15 @@
 					(x)->high_mark = (x)->num_active; \
 				} while (0)
 #define	STATS_INC_ERR(x)	((x)->errors++)
+#define	STATS_SET_FREEABLE(x, i) \
+				do { if ((x)->max_freeable < i) \
+					(x)->max_freeable = i; \
+				} while (0)
+#
+#define STATS_INC_ALLOCHIT(x)	atomic_inc(&(x)->allochit)
+#define STATS_INC_ALLOCMISS(x)	atomic_inc(&(x)->allocmiss)
+#define STATS_INC_FREEHIT(x)	atomic_inc(&(x)->freehit)
+#define STATS_INC_FREEMISS(x)	atomic_inc(&(x)->freemiss)
 #else
 #define	STATS_INC_ACTIVE(x)	do { } while (0)
 #define	STATS_DEC_ACTIVE(x)	do { } while (0)
@@ -276,18 +333,14 @@
 #define	STATS_INC_REAPED(x)	do { } while (0)
 #define	STATS_SET_HIGH(x)	do { } while (0)
 #define	STATS_INC_ERR(x)	do { } while (0)
-#endif
+#define	STATS_SET_FREEABLE(x, i) \
+				do { } while (0)
 
-#if STATS && defined(CONFIG_SMP)
-#define STATS_INC_ALLOCHIT(x)	atomic_inc(&(x)->allochit)
-#define STATS_INC_ALLOCMISS(x)	atomic_inc(&(x)->allocmiss)
-#define STATS_INC_FREEHIT(x)	atomic_inc(&(x)->freehit)
-#define STATS_INC_FREEMISS(x)	atomic_inc(&(x)->freemiss)
-#else
 #define STATS_INC_ALLOCHIT(x)	do { } while (0)
 #define STATS_INC_ALLOCMISS(x)	do { } while (0)
 #define STATS_INC_FREEHIT(x)	do { } while (0)
 #define STATS_INC_FREEMISS(x)	do { } while (0)
+
 #endif
 
 #if DEBUG
@@ -382,11 +435,15 @@
 }; 
 #undef CN
 
+struct cpucache_int cpuarray_cache __initdata = { { 0, BOOT_CPUCACHE_ENTRIES, 1} };
+struct cpucache_int cpuarray_generic __initdata = { { 0, BOOT_CPUCACHE_ENTRIES, 1} };
+
 /* internal cache of cache description objs */
 static kmem_cache_t cache_cache = {
-	.slabs_full	= LIST_HEAD_INIT(cache_cache.slabs_full),
-	.slabs_partial	= LIST_HEAD_INIT(cache_cache.slabs_partial),
-	.slabs_free	= LIST_HEAD_INIT(cache_cache.slabs_free),
+	.lists		= LIST3_INIT(cache_cache.lists),
+	.cpudata	= { [0] = &cpuarray_cache.cache },
+	.batchcount	= 1,
+	.limit		= BOOT_CPUCACHE_ENTRIES,
 	.objsize	= sizeof(kmem_cache_t),
 	.flags		= SLAB_NO_REAP,
 	.spinlock	= SPIN_LOCK_UNLOCKED,
@@ -402,16 +459,21 @@
 
 #define cache_chain (cache_cache.next)
 
-#ifdef CONFIG_SMP
 /*
  * chicken and egg problem: delay the per-cpu array allocation
  * until the general caches are up.
  */
-static int g_cpucache_up;
+enum {
+	NONE,
+	PARTIAL,
+	FULL
+} g_cpucache_up;
+
+static struct timer_list reap_timer;
+static void reap_timer_fnc(unsigned long data);
 
 static void enable_cpucache (kmem_cache_t *cachep);
 static void enable_all_cpucaches (void);
-#endif
 
 /* Cal the num objs, wastage, and bytes left over for a given slab size. */
 static void kmem_cache_estimate (unsigned long gfporder, size_t size,
@@ -441,6 +503,46 @@
 	*left_over = wastage;
 }
 
+static int __devinit cpuup_callback(struct notifier_block *nfb,
+				  unsigned long action,
+				  void *hcpu)
+{
+	int cpu = (int)hcpu;
+	if (action == CPU_ONLINE) {
+		struct list_head *p;
+		cpucache_t *nc;
+
+		down(&cache_chain_sem);
+
+		p = &cache_cache.next;
+		do {
+			int memsize;
+
+			kmem_cache_t* cachep = list_entry(p, kmem_cache_t, next);
+			memsize = sizeof(void*)*cachep->limit+sizeof(cpucache_t);
+			nc = kmalloc(memsize, GFP_KERNEL);
+			if (!nc)
+				goto bad;
+			nc->avail = 0;
+			nc->limit = cachep->limit;
+			nc->batchcount = cachep->batchcount;
+
+			cachep->cpudata[cpu] = nc;
+
+			p = cachep->next.next;
+		} while (p != &cache_cache.next);
+
+		up(&cache_chain_sem);
+	}
+
+	return NOTIFY_OK;
+bad:
+	up(&cache_chain_sem);
+	return NOTIFY_BAD;
+}
+
+static struct notifier_block cpucache_notifier = { &cpuup_callback, NULL, 0 };
+
 /* Initialisation - setup the `cache' cache. */
 void __init kmem_cache_init(void)
 {
@@ -456,6 +558,22 @@
 
 	cache_cache.colour = left_over/cache_cache.colour_off;
 	cache_cache.colour_next = 0;
+
+	g_cache_count = 1;
+
+	/* Register a cpu startup notifier callback
+	 * that initializes cc_data for all new cpus
+	 */
+
+	register_cpu_notifier(&cpucache_notifier);
+	
+	/* register the timer that return unneeded
+	 * pages to gfp.
+	 */
+	init_timer(&reap_timer);
+	reap_timer.expires = jiffies + HZ;
+	reap_timer.function = reap_timer_fnc;
+	add_timer(&reap_timer);
 }
 
 
@@ -465,6 +583,7 @@
 void __init kmem_cache_sizes_init(void)
 {
 	cache_sizes_t *sizes = cache_sizes;
+
 	/*
 	 * Fragmentation resistance on low memory - only use bigger
 	 * page orders on machines with more than 32MB of memory.
@@ -497,14 +616,34 @@
 			BUG();
 		sizes++;
 	} while (sizes->cs_size);
+	/*
+	 * The generic caches are running - time to kick out the
+	 * bootstrap cpucaches.
+	 */
+	{
+		void * ptr;
+		
+		ptr = kmalloc(sizeof(struct cpucache_int), GFP_KERNEL);
+		local_irq_disable();
+		BUG_ON(cc_data(&cache_cache) != &cpuarray_cache.cache);
+		memcpy(ptr, cc_data(&cache_cache), sizeof(struct cpucache_int));
+		cc_data(&cache_cache) = ptr;
+		local_irq_enable();
+	
+		ptr = kmalloc(sizeof(struct cpucache_int), GFP_KERNEL);
+		local_irq_disable();
+		BUG_ON(cc_data(cache_sizes[0].cs_cachep) != &cpuarray_generic.cache);
+		memcpy(ptr, cc_data(cache_sizes[0].cs_cachep),
+				sizeof(struct cpucache_int));
+		cc_data(cache_sizes[0].cs_cachep) = ptr;
+		local_irq_enable();
+	}
 }
 
 int __init kmem_cpucache_init(void)
 {
-#ifdef CONFIG_SMP
-	g_cpucache_up = 1;
+	g_cpucache_up = FULL;
 	enable_all_cpucaches();
-#endif
 	return 0;
 }
 
@@ -552,7 +691,7 @@
 }
 
 #if DEBUG
-static inline void kmem_poison_obj (kmem_cache_t *cachep, void *addr)
+static void kmem_poison_obj (kmem_cache_t *cachep, void *addr)
 {
 	int size = cachep->objsize;
 	if (cachep->flags & SLAB_RED_ZONE) {
@@ -563,7 +702,7 @@
 	*(unsigned char *)(addr+size-1) = POISON_END;
 }
 
-static inline int kmem_check_poison_obj (kmem_cache_t *cachep, void *addr)
+static int kmem_check_poison_obj (kmem_cache_t *cachep, void *addr)
 {
 	int size = cachep->objsize;
 	void *end;
@@ -584,37 +723,34 @@
  */
 static void kmem_slab_destroy (kmem_cache_t *cachep, slab_t *slabp)
 {
-	if (cachep->dtor
 #if DEBUG
-		|| cachep->flags & (SLAB_POISON | SLAB_RED_ZONE)
-#endif
-	) {
+	int i;
+	for (i = 0; i < cachep->num; i++) {
+		void* objp = slabp->s_mem+cachep->objsize*i;
+		if (cachep->flags & SLAB_POISON)
+			kmem_check_poison_obj(cachep, objp);
+
+		if (cachep->flags & SLAB_RED_ZONE) {
+			if (*((unsigned long*)(objp)) != RED_MAGIC1)
+				BUG();
+			if (*((unsigned long*)(objp + cachep->objsize -
+					BYTES_PER_WORD)) != RED_MAGIC1)
+				BUG();
+			objp += BYTES_PER_WORD;
+		}
+		if (cachep->dtor && !(cachep->flags & SLAB_POISON))
+			(cachep->dtor)(objp, cachep, 0);
+	}
+#else
+	if (cachep->dtor) {
 		int i;
 		for (i = 0; i < cachep->num; i++) {
 			void* objp = slabp->s_mem+cachep->objsize*i;
-#if DEBUG
-			if (cachep->flags & SLAB_RED_ZONE) {
-				if (*((unsigned long*)(objp)) != RED_MAGIC1)
-					BUG();
-				if (*((unsigned long*)(objp + cachep->objsize
-						-BYTES_PER_WORD)) != RED_MAGIC1)
-					BUG();
-				objp += BYTES_PER_WORD;
-			}
-#endif
-			if (cachep->dtor)
-				(cachep->dtor)(objp, cachep, 0);
-#if DEBUG
-			if (cachep->flags & SLAB_RED_ZONE) {
-				objp -= BYTES_PER_WORD;
-			}	
-			if ((cachep->flags & SLAB_POISON)  &&
-				kmem_check_poison_obj(cachep, objp))
-				BUG();
-#endif
+			(cachep->dtor)(objp, cachep, 0);
 		}
 	}
-
+#endif
+	
 	kmem_freepages(cachep, slabp->s_mem-slabp->colouroff);
 	if (OFF_SLAB(cachep))
 		kmem_cache_free(cachep->slabp_cache, slabp);
@@ -701,8 +837,7 @@
 	 * Always checks flags, a caller might be expecting debug
 	 * support which isn't available.
 	 */
-	if (flags & ~CREATE_MASK)
-		BUG();
+	BUG_ON(flags & ~CREATE_MASK);
 
 	/* Get cache's description obj. */
 	cachep = (kmem_cache_t *) kmem_cache_alloc(&cache_cache, SLAB_KERNEL);
@@ -745,7 +880,6 @@
 	if (flags & SLAB_HWCACHE_ALIGN) {
 		/* Need to adjust size so that objs are cache aligned. */
 		/* Small obj size, can get at least two per cache line. */
-		/* FIXME: only power of 2 supported, was better */
 		while (size < align/2)
 			align /= 2;
 		size = (size+align-1)&(~(align-1));
@@ -822,9 +956,10 @@
 		cachep->gfpflags |= GFP_DMA;
 	spin_lock_init(&cachep->spinlock);
 	cachep->objsize = size;
-	INIT_LIST_HEAD(&cachep->slabs_full);
-	INIT_LIST_HEAD(&cachep->slabs_partial);
-	INIT_LIST_HEAD(&cachep->slabs_free);
+	/* NUMA */
+	INIT_LIST_HEAD(&cachep->lists.slabs_full);
+	INIT_LIST_HEAD(&cachep->lists.slabs_partial);
+	INIT_LIST_HEAD(&cachep->lists.slabs_free);
 
 	if (flags & CFLGS_OFF_SLAB)
 		cachep->slabp_cache = kmem_find_general_cachep(slab_size,0);
@@ -832,10 +967,27 @@
 	cachep->dtor = dtor;
 	cachep->name = name;
 
-#ifdef CONFIG_SMP
-	if (g_cpucache_up)
+	if (g_cpucache_up == FULL) {
 		enable_cpucache(cachep);
-#endif
+	} else {
+		if (g_cpucache_up == NONE) {
+			/* Note: the first kmem_cache_create must create
+			 * the cache that's used by kmalloc(16), otherwise
+			 * the creation of further caches will BUG().
+			 */
+			cc_data(cachep) = &cpuarray_generic.cache;
+			g_cpucache_up = PARTIAL;
+		} else {
+			cc_data(cachep) = kmalloc(sizeof(struct cpucache_int),GFP_KERNEL);
+		}
+		BUG_ON(!cc_data(cachep));
+		cc_data(cachep)->avail = 0;
+		cc_data(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
+		cc_data(cachep)->batchcount = 1;
+		cachep->batchcount = 1;
+		cachep->limit = BOOT_CPUCACHE_ENTRIES;
+	} 
+
 	/* Need the semaphore to access the chain. */
 	down(&cache_chain_sem);
 	{
@@ -869,96 +1021,14 @@
 	 */
 	list_add(&cachep->next, &cache_chain);
 	up(&cache_chain_sem);
+	g_cache_count++;
 opps:
 	return cachep;
 }
 
+static void drain_cpu_caches(kmem_cache_t *cachep);
 
-#if DEBUG
-/*
- * This check if the kmem_cache_t pointer is chained in the cache_cache
- * list. -arca
- */
-static int is_chained_kmem_cache(kmem_cache_t * cachep)
-{
-	struct list_head *p;
-	int ret = 0;
-
-	/* Find the cache in the chain of caches. */
-	down(&cache_chain_sem);
-	list_for_each(p, &cache_chain) {
-		if (p == &cachep->next) {
-			ret = 1;
-			break;
-		}
-	}
-	up(&cache_chain_sem);
-
-	return ret;
-}
-#else
-#define is_chained_kmem_cache(x) 1
-#endif
-
-#ifdef CONFIG_SMP
-/*
- * Waits for all CPUs to execute func().
- */
-static void smp_call_function_all_cpus(void (*func) (void *arg), void *arg)
-{
-	local_irq_disable();
-	func(arg);
-	local_irq_enable();
-
-	if (smp_call_function(func, arg, 1, 1))
-		BUG();
-}
-typedef struct ccupdate_struct_s
-{
-	kmem_cache_t *cachep;
-	cpucache_t *new[NR_CPUS];
-} ccupdate_struct_t;
-
-static void do_ccupdate_local(void *info)
-{
-	ccupdate_struct_t *new = (ccupdate_struct_t *)info;
-	cpucache_t *old = cc_data(new->cachep);
-	
-	cc_data(new->cachep) = new->new[smp_processor_id()];
-	new->new[smp_processor_id()] = old;
-}
-
-static void free_block (kmem_cache_t* cachep, void** objpp, int len);
-
-static void drain_cpu_caches(kmem_cache_t *cachep)
-{
-	ccupdate_struct_t new;
-	int i;
-
-	memset(&new.new,0,sizeof(new.new));
-
-	new.cachep = cachep;
-
-	down(&cache_chain_sem);
-	smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);
-
-	for (i = 0; i < NR_CPUS; i++) {
-		cpucache_t* ccold = new.new[i];
-		if (!ccold || (ccold->avail == 0))
-			continue;
-		local_irq_disable();
-		free_block(cachep, cc_entry(ccold), ccold->avail);
-		local_irq_enable();
-		ccold->avail = 0;
-	}
-	smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);
-	up(&cache_chain_sem);
-}
-
-#else
-#define drain_cpu_caches(cachep)	do { } while (0)
-#endif
-
+/* NUMA shrink all list3s */
 static int __kmem_cache_shrink(kmem_cache_t *cachep)
 {
 	slab_t *slabp;
@@ -968,26 +1038,23 @@
 
 	spin_lock_irq(&cachep->spinlock);
 
-	/* If the cache is growing, stop shrinking. */
-	while (!cachep->growing) {
+	for(;;) {
 		struct list_head *p;
 
-		p = cachep->slabs_free.prev;
-		if (p == &cachep->slabs_free)
+		p = cachep->lists.slabs_free.prev;
+		if (p == &cachep->lists.slabs_free)
 			break;
 
-		slabp = list_entry(cachep->slabs_free.prev, slab_t, list);
-#if DEBUG
-		if (slabp->inuse)
-			BUG();
-#endif
+		slabp = list_entry(cachep->lists.slabs_free.prev, slab_t, list);
+		BUG_ON(slabp->inuse);
 		list_del(&slabp->list);
 
 		spin_unlock_irq(&cachep->spinlock);
 		kmem_slab_destroy(cachep, slabp);
 		spin_lock_irq(&cachep->spinlock);
 	}
-	ret = !list_empty(&cachep->slabs_full) || !list_empty(&cachep->slabs_partial);
+	ret = !list_empty(&cachep->lists.slabs_full) ||
+		!list_empty(&cachep->lists.slabs_partial);
 	spin_unlock_irq(&cachep->spinlock);
 	return ret;
 }
@@ -1001,8 +1068,7 @@
  */
 int kmem_cache_shrink(kmem_cache_t *cachep)
 {
-	if (!cachep || in_interrupt() || !is_chained_kmem_cache(cachep))
-		BUG();
+	BUG_ON(!cachep || in_interrupt());
 
 	return __kmem_cache_shrink(cachep);
 }
@@ -1024,8 +1090,7 @@
  */
 int kmem_cache_destroy (kmem_cache_t * cachep)
 {
-	if (!cachep || in_interrupt() || cachep->growing)
-		BUG();
+	BUG_ON(!cachep || in_interrupt());
 
 	/* Find the cache in the chain of caches. */
 	down(&cache_chain_sem);
@@ -1044,14 +1109,14 @@
 		up(&cache_chain_sem);
 		return 1;
 	}
-#ifdef CONFIG_SMP
 	{
 		int i;
 		for (i = 0; i < NR_CPUS; i++)
 			kfree(cachep->cpudata[i]);
+		/* NUMA: free the list3 structures */
 	}
-#endif
 	kmem_cache_free(&cache_cache, cachep);
+	g_cache_count--;
 
 	return 0;
 }
@@ -1068,10 +1133,6 @@
 		if (!slabp)
 			return NULL;
 	} else {
-		/* FIXME: change to
-			slabp = objp
-		 * if you enable OPTIMIZE
-		 */
 		slabp = objp+colour_off;
 		colour_off += L1_CACHE_ALIGN(cachep->num *
 				sizeof(kmem_bufctl_t) + sizeof(slab_t));
@@ -1091,34 +1152,30 @@
 	for (i = 0; i < cachep->num; i++) {
 		void* objp = slabp->s_mem+cachep->objsize*i;
 #if DEBUG
+		/* need to poison the objs? */
+		if (cachep->flags & SLAB_POISON)
+			kmem_poison_obj(cachep, objp);
+
 		if (cachep->flags & SLAB_RED_ZONE) {
 			*((unsigned long*)(objp)) = RED_MAGIC1;
 			*((unsigned long*)(objp + cachep->objsize -
 					BYTES_PER_WORD)) = RED_MAGIC1;
 			objp += BYTES_PER_WORD;
 		}
-#endif
-
-		/*
-		 * Constructors are not allowed to allocate memory from
-		 * the same cache which they are a constructor for.
-		 * Otherwise, deadlock. They must also be threaded.
-		 */
-		if (cachep->ctor)
+		if (cachep->ctor && !(cachep->flags & SLAB_POISON))
 			cachep->ctor(objp, cachep, ctor_flags);
-#if DEBUG
-		if (cachep->flags & SLAB_RED_ZONE)
-			objp -= BYTES_PER_WORD;
-		if (cachep->flags & SLAB_POISON)
-			/* need to poison the objs */
-			kmem_poison_obj(cachep, objp);
+
 		if (cachep->flags & SLAB_RED_ZONE) {
+			objp -= BYTES_PER_WORD;
 			if (*((unsigned long*)(objp)) != RED_MAGIC1)
 				BUG();
 			if (*((unsigned long*)(objp + cachep->objsize -
 					BYTES_PER_WORD)) != RED_MAGIC1)
 				BUG();
 		}
+#else
+		if (cachep->ctor)
+			cachep->ctor(objp, cachep, ctor_flags);
 #endif
 		slab_bufctl(slabp)[i] = i+1;
 	}
@@ -1126,6 +1183,31 @@
 	slabp->free = 0;
 }
 
+static void kmem_flagcheck(kmem_cache_t *cachep, int flags)
+{
+/* FIXME: disable in release
+#if DEBUG
+*/
+	if (flags & __GFP_WAIT)
+		might_sleep();
+
+	if (flags & SLAB_DMA) {
+		if (!(cachep->gfpflags & GFP_DMA))
+			BUG();
+	} else {
+		if (cachep->gfpflags & GFP_DMA)
+			BUG();
+	}
+/* #endif */
+}
+
+static inline void check_irq_off(void)
+{
+#if DEBUG
+	BUG_ON(!irqs_disabled());
+#endif
+}
+
 /*
  * Grow (by 1) the number of slabs within a cache.  This is called by
  * kmem_cache_alloc() when there are no active objs left in a cache.
@@ -1138,7 +1220,6 @@
 	size_t		 offset;
 	unsigned int	 i, local_flags;
 	unsigned long	 ctor_flags;
-	unsigned long	 save_flags;
 
 	/* Be lazy and only check for valid flags here,
  	 * keeping it out of the critical path in kmem_cache_alloc().
@@ -1148,15 +1229,6 @@
 	if (flags & SLAB_NO_GROW)
 		return 0;
 
-	/*
-	 * The test for missing atomic flag is performed here, rather than
-	 * the more obvious place, simply to reduce the critical path length
-	 * in kmem_cache_alloc(). If a caller is seriously mis-behaving they
-	 * will eventually be caught here (where it matters).
-	 */
-	if (in_interrupt() && (flags & __GFP_WAIT))
-		BUG();
-
 	ctor_flags = SLAB_CTOR_CONSTRUCTOR;
 	local_flags = (flags & SLAB_LEVEL_MASK);
 	if (!(local_flags & __GFP_WAIT))
@@ -1167,7 +1239,8 @@
 		ctor_flags |= SLAB_CTOR_ATOMIC;
 
 	/* About to mess with non-constant members - lock. */
-	spin_lock_irqsave(&cachep->spinlock, save_flags);
+	check_irq_off();
+	spin_lock(&cachep->spinlock);
 
 	/* Get colour for the slab, and cal the next value. */
 	offset = cachep->colour_next;
@@ -1177,17 +1250,19 @@
 	offset *= cachep->colour_off;
 	cachep->dflags |= DFLGS_GROWN;
 
-	cachep->growing++;
-	spin_unlock_irqrestore(&cachep->spinlock, save_flags);
+	spin_unlock(&cachep->spinlock);
+
+	if (local_flags & __GFP_WAIT)
+		local_irq_enable();
 
-	/* A series of memory allocations for a new slab.
-	 * Neither the cache-chain semaphore, or cache-lock, are
-	 * held, but the incrementing c_growing prevents this
-	 * cache from being reaped or shrunk.
-	 * Note: The cache could be selected in for reaping in
-	 * kmem_cache_reap(), but when the final test is made the
-	 * growing value will be seen.
+	/*
+	 * The test for missing atomic flag is performed here, rather than
+	 * the more obvious place, simply to reduce the critical path length
+	 * in kmem_cache_alloc(). If a caller is seriously mis-behaving they
+	 * will eventually be caught here (where it matters).
 	 */
+	kmem_flagcheck(cachep, flags);
+
 
 	/* Get mem for the objs. */
 	if (!(objp = kmem_getpages(cachep, flags)))
@@ -1210,62 +1285,107 @@
 
 	kmem_cache_init_objs(cachep, slabp, ctor_flags);
 
-	spin_lock_irqsave(&cachep->spinlock, save_flags);
-	cachep->growing--;
+	if (local_flags & __GFP_WAIT)
+		local_irq_disable();
+	check_irq_off();
+	spin_lock(&cachep->spinlock);
 
 	/* Make slab active. */
-	list_add_tail(&slabp->list, &cachep->slabs_free);
+	list_add_tail(&slabp->list, &(list3_data(cachep)->slabs_free));
 	STATS_INC_GROWN(cachep);
-	cachep->failures = 0;
-
-	spin_unlock_irqrestore(&cachep->spinlock, save_flags);
+	spin_unlock(&cachep->spinlock);
 	return 1;
 opps1:
 	kmem_freepages(cachep, objp);
 failed:
-	spin_lock_irqsave(&cachep->spinlock, save_flags);
-	cachep->growing--;
-	spin_unlock_irqrestore(&cachep->spinlock, save_flags);
 	return 0;
 }
 
 /*
  * Perform extra freeing checks:
- * - detect double free
  * - detect bad pointers.
- * Called with the cache-lock held.
+ * - POISON/RED_ZONE checking
+ * - destructor calls, for caches with POISON+dtor
  */
-
-#if DEBUG
-static int kmem_extra_free_checks (kmem_cache_t * cachep,
-			slab_t *slabp, void * objp)
+static inline void kfree_debugcheck(const void *objp)
 {
-	int i;
-	unsigned int objnr = (objp-slabp->s_mem)/cachep->objsize;
+#if DEBUG
+	struct page *page;
 
-	if (objnr >= cachep->num)
-		BUG();
-	if (objp != slabp->s_mem + objnr*cachep->objsize)
+	if (!virt_addr_valid(objp)) {
+		printk(KERN_ERR "kfree_debugcheck: out of range ptr %lxh.\n",
+			(unsigned long)objp);	
+		BUG();	
+	}
+	page = virt_to_page(objp);
+	if (!PageSlab(page)) {
+		printk(KERN_ERR "kfree_debugcheck: bad ptr %lxh.\n", (unsigned long)objp);
 		BUG();
+	}
+#endif 
+}
 
-	/* Check slab's freelist to see if this obj is there. */
-	for (i = slabp->free; i != BUFCTL_END; i = slab_bufctl(slabp)[i]) {
-		if (i == objnr)
+static inline void *kmem_cache_free_debugcheck (kmem_cache_t * cachep, void * objp)
+{
+#if DEBUG
+	struct page *page;
+	unsigned int objnr;
+	slab_t *slabp;
+
+	kfree_debugcheck(objp);
+	page = virt_to_page(objp);
+
+	BUG_ON(GET_PAGE_CACHE(page) != cachep);
+	slabp = GET_PAGE_SLAB(page);
+
+	if (cachep->flags & SLAB_RED_ZONE) {
+		objp -= BYTES_PER_WORD;
+		if (xchg((unsigned long *)objp, RED_MAGIC1) != RED_MAGIC2)
+			/* Either write before start, or a double free. */
+			BUG();
+		if (xchg((unsigned long *)(objp+cachep->objsize -
+				BYTES_PER_WORD), RED_MAGIC1) != RED_MAGIC2)
+			/* Either write past end, or a double free. */
 			BUG();
 	}
-	return 0;
-}
+
+	objnr = (objp-slabp->s_mem)/cachep->objsize;
+
+	BUG_ON(objnr >= cachep->num);
+	BUG_ON(objp != slabp->s_mem + objnr*cachep->objsize);
+
+	if (cachep->flags & SLAB_DEBUG_INITIAL) {
+		/* Need to call the slab's constructor so the
+		 * caller can perform a verify of its state (debugging).
+		 * Called without the cache-lock held.
+		 */
+		cachep->ctor(objp, cachep, SLAB_CTOR_CONSTRUCTOR|SLAB_CTOR_VERIFY);
+	}
+	if (cachep->flags & SLAB_POISON && cachep->dtor) {
+		/* we want to cache poison the object,
+		 * call the destruction callback
+		 */
+		cachep->dtor(objp, cachep, 0);
+	}
+	if (cachep->flags & SLAB_POISON) {
+		kmem_poison_obj(cachep, objp);
+	}
 #endif
+	return objp;
+}
 
-static inline void kmem_cache_alloc_head(kmem_cache_t *cachep, int flags)
+static inline void kmem_check_slabp(kmem_cache_t *cachep, slab_t *slabp)
 {
-	if (flags & SLAB_DMA) {
-		if (!(cachep->gfpflags & GFP_DMA))
-			BUG();
-	} else {
-		if (cachep->gfpflags & GFP_DMA)
-			BUG();
+#if DEBUG
+	int i;
+	int entries = 0;
+	/* Check slab's freelist to see if this obj is there. */
+	for (i = slabp->free; i != BUFCTL_END; i = slab_bufctl(slabp)[i]) {
+		entries++;
+		BUG_ON(entries > cachep->num);
 	}
+	BUG_ON(entries != cachep->num - slabp->inuse);
+#endif
 }
 
 static inline void * kmem_cache_alloc_one_tail (kmem_cache_t *cachep,
@@ -1282,209 +1402,135 @@
 	objp = slabp->s_mem + slabp->free*cachep->objsize;
 	slabp->free=slab_bufctl(slabp)[slabp->free];
 
-	if (unlikely(slabp->free == BUFCTL_END)) {
-		list_del(&slabp->list);
-		list_add(&slabp->list, &cachep->slabs_full);
-	}
-#if DEBUG
-	if (cachep->flags & SLAB_POISON)
-		if (kmem_check_poison_obj(cachep, objp))
-			BUG();
-	if (cachep->flags & SLAB_RED_ZONE) {
-		/* Set alloc red-zone, and check old one. */
-		if (xchg((unsigned long *)objp, RED_MAGIC2) !=
-							 RED_MAGIC1)
-			BUG();
-		if (xchg((unsigned long *)(objp+cachep->objsize -
-			  BYTES_PER_WORD), RED_MAGIC2) != RED_MAGIC1)
-			BUG();
-		objp += BYTES_PER_WORD;
-	}
-#endif
 	return objp;
 }
 
-/*
- * Returns a ptr to an obj in the given cache.
- * caller must guarantee synchronization
- * #define for the goto optimization 8-)
- */
-#define kmem_cache_alloc_one(cachep)				\
-({								\
-	struct list_head * slabs_partial, * entry;		\
-	slab_t *slabp;						\
-								\
-	slabs_partial = &(cachep)->slabs_partial;		\
-	entry = slabs_partial->next;				\
-	if (unlikely(entry == slabs_partial)) {			\
-		struct list_head * slabs_free;			\
-		slabs_free = &(cachep)->slabs_free;		\
-		entry = slabs_free->next;			\
-		if (unlikely(entry == slabs_free))		\
-			goto alloc_new_slab;			\
-		list_del(entry);				\
-		list_add(entry, slabs_partial);			\
-	}							\
-								\
-	slabp = list_entry(entry, slab_t, list);		\
-	kmem_cache_alloc_one_tail(cachep, slabp);		\
-})
+static inline void kmem_cache_alloc_listfixup(struct kmem_list3 *l3, slab_t *slabp)
+{
+	list_del(&slabp->list);
+	if (slabp->free == BUFCTL_END) {
+		list_add(&slabp->list, &l3->slabs_full);
+	} else {
+		list_add(&slabp->list, &l3->slabs_partial);
+	}
+}
 
-#ifdef CONFIG_SMP
-void* kmem_cache_alloc_batch(kmem_cache_t* cachep, int flags)
+static void* kmem_cache_alloc_refill(kmem_cache_t* cachep, int flags)
 {
-	int batchcount = cachep->batchcount;
-	cpucache_t* cc = cc_data(cachep);
+	int batchcount;
+	struct kmem_list3 *l3;
+	cpucache_t *cc;
+
+retry:
+	/* reload cache state after each grow:
+	 * it might have reenabled the cpu interrupts
+	 */
+	cc = cc_data(cachep);
+	batchcount = cc->batchcount;
+	l3 = list3_data(cachep);
 
+	check_irq_off();
 	spin_lock(&cachep->spinlock);
-	while (batchcount--) {
-		struct list_head * slabs_partial, * entry;
+	while (batchcount > 0) {
+		struct list_head *entry;
 		slab_t *slabp;
 		/* Get slab alloc is to come from. */
-		slabs_partial = &(cachep)->slabs_partial;
-		entry = slabs_partial->next;
-		if (unlikely(entry == slabs_partial)) {
-			struct list_head * slabs_free;
-			slabs_free = &(cachep)->slabs_free;
-			entry = slabs_free->next;
-			if (unlikely(entry == slabs_free))
-				break;
-			list_del(entry);
-			list_add(entry, slabs_partial);
+		entry = l3->slabs_partial.next;
+		if (entry == &l3->slabs_partial) {
+			entry = l3->slabs_free.next;
+			if (entry == &l3->slabs_free)
+				goto must_grow;
 		}
 
 		slabp = list_entry(entry, slab_t, list);
-		cc_entry(cc)[cc->avail++] =
+		kmem_check_slabp(cachep, slabp);
+		while (slabp->inuse < cachep->num && batchcount--)
+			cc_entry(cc)[cc->avail++] =
 				kmem_cache_alloc_one_tail(cachep, slabp);
+		kmem_check_slabp(cachep, slabp);
+		kmem_cache_alloc_listfixup(l3, slabp);
 	}
+
+must_grow:
 	spin_unlock(&cachep->spinlock);
 
 	if (cc->avail)
 		return cc_entry(cc)[--cc->avail];
+	
+	if(kmem_cache_grow(cachep, flags))
+		goto retry;
+
 	return NULL;
 }
+
+static inline void kmem_cache_alloc_debugcheck_before(kmem_cache_t *cachep, int flags)
+{
+#if DEBUG
+	kmem_flagcheck(cachep, flags);
 #endif
+}
+
+static inline void *kmem_cache_alloc_debugcheck_after (kmem_cache_t *cachep, unsigned long flags, void *objp)
+{
+#if DEBUG
+	if (cachep->flags & SLAB_POISON)
+		if (kmem_check_poison_obj(cachep, objp))
+			BUG();
+	if (cachep->flags & SLAB_RED_ZONE) {
+		/* Set alloc red-zone, and check old one. */
+		if (xchg((unsigned long *)objp, RED_MAGIC2) !=
+							 RED_MAGIC1)
+			BUG();
+		if (xchg((unsigned long *)(objp+cachep->objsize -
+			  BYTES_PER_WORD), RED_MAGIC2) != RED_MAGIC1)
+			BUG();
+		objp += BYTES_PER_WORD;
+		if (cachep->ctor) {
+			unsigned long	ctor_flags = SLAB_CTOR_CONSTRUCTOR;
+
+			if (!flags & __GFP_WAIT)
+				ctor_flags |= SLAB_CTOR_ATOMIC;
+
+			cachep->ctor(objp, cachep, ctor_flags);
+		}	
+	}
+#endif
+	return objp;
+}
+
 
 static inline void * __kmem_cache_alloc (kmem_cache_t *cachep, int flags)
 {
 	unsigned long save_flags;
 	void* objp;
+	cpucache_t *cc;
 
-	if (flags & __GFP_WAIT)
-		might_sleep();
+	kmem_cache_alloc_debugcheck_before(cachep, flags);
 
-	kmem_cache_alloc_head(cachep, flags);
-try_again:
 	local_irq_save(save_flags);
-#ifdef CONFIG_SMP
-	{
-		cpucache_t *cc = cc_data(cachep);
-
-		if (cc) {
-			if (cc->avail) {
-				STATS_INC_ALLOCHIT(cachep);
-				objp = cc_entry(cc)[--cc->avail];
-			} else {
-				STATS_INC_ALLOCMISS(cachep);
-				objp = kmem_cache_alloc_batch(cachep,flags);
-				local_irq_restore(save_flags);
-				if (!objp)
-					goto alloc_new_slab_nolock;
-				return objp;
-			}
-		} else {
-			spin_lock(&cachep->spinlock);
-			objp = kmem_cache_alloc_one(cachep);
-			spin_unlock(&cachep->spinlock);
-		}
+	cc = cc_data(cachep);
+	if (likely(cc->avail)) {
+		STATS_INC_ALLOCHIT(cachep);
+		objp = cc_entry(cc)[--cc->avail];
+	} else {
+		STATS_INC_ALLOCMISS(cachep);
+		objp = kmem_cache_alloc_refill(cachep, flags);
 	}
-#else
-	objp = kmem_cache_alloc_one(cachep);
-#endif
 	local_irq_restore(save_flags);
+	objp = kmem_cache_alloc_debugcheck_after(cachep, flags, objp);
 	return objp;
-alloc_new_slab:
-#ifdef CONFIG_SMP
-	spin_unlock(&cachep->spinlock);
-#endif
-	local_irq_restore(save_flags);
-#ifdef CONFIG_SMP
-alloc_new_slab_nolock:
-#endif
-	if (kmem_cache_grow(cachep, flags))
-		/* Someone may have stolen our objs.  Doesn't matter, we'll
-		 * just come back here again.
-		 */
-		goto try_again;
-	return NULL;
 }
 
-/*
- * Release an obj back to its cache. If the obj has a constructed
- * state, it should be in this state _before_ it is released.
- * - caller is responsible for the synchronization
+/* 
+ * NUMA: different approach needed if the spinlock is moved into
+ * the l3 structure
  */
-
-#if DEBUG
-# define CHECK_NR(pg)						\
-	do {							\
-		if (!virt_addr_valid(pg)) {			\
-			printk(KERN_ERR "kfree: out of range ptr %lxh.\n", \
-				(unsigned long)objp);		\
-			BUG();					\
-		} \
-	} while (0)
-# define CHECK_PAGE(addr)					\
-	do {							\
-		struct page *page = virt_to_page(addr);		\
-		CHECK_NR(addr);					\
-		if (!PageSlab(page)) {				\
-			printk(KERN_ERR "kfree: bad ptr %lxh.\n", \
-				(unsigned long)objp);		\
-			BUG();					\
-		}						\
-	} while (0)
-
-#else
-# define CHECK_PAGE(pg)	do { } while (0)
-#endif
-
 static inline void kmem_cache_free_one(kmem_cache_t *cachep, void *objp)
 {
 	slab_t* slabp;
 
-	CHECK_PAGE(objp);
-	/* reduces memory footprint
-	 *
-	if (OPTIMIZE(cachep))
-		slabp = (void*)((unsigned long)objp&(~(PAGE_SIZE-1)));
-	 else
-	 */
 	slabp = GET_PAGE_SLAB(virt_to_page(objp));
-
-#if DEBUG
-	if (cachep->flags & SLAB_DEBUG_INITIAL)
-		/* Need to call the slab's constructor so the
-		 * caller can perform a verify of its state (debugging).
-		 * Called without the cache-lock held.
-		 */
-		cachep->ctor(objp, cachep, SLAB_CTOR_CONSTRUCTOR|SLAB_CTOR_VERIFY);
-
-	if (cachep->flags & SLAB_RED_ZONE) {
-		objp -= BYTES_PER_WORD;
-		if (xchg((unsigned long *)objp, RED_MAGIC1) != RED_MAGIC2)
-			/* Either write before start, or a double free. */
-			BUG();
-		if (xchg((unsigned long *)(objp+cachep->objsize -
-				BYTES_PER_WORD), RED_MAGIC1) != RED_MAGIC2)
-			/* Either write past end, or a double free. */
-			BUG();
-	}
-	if (cachep->flags & SLAB_POISON)
-		kmem_poison_obj(cachep, objp);
-	if (kmem_extra_free_checks(cachep, slabp, objp))
-		return;
-#endif
+	list_del(&slabp->list);
 	{
 		unsigned int objnr = (objp-slabp->s_mem)/cachep->objsize;
 
@@ -1494,69 +1540,82 @@
 	STATS_DEC_ACTIVE(cachep);
 	
 	/* fixup slab chains */
-	{
-		int inuse = slabp->inuse;
-		if (unlikely(!--slabp->inuse)) {
-			/* Was partial or full, now empty. */
-			list_del(&slabp->list);
-/*			list_add(&slabp->list, &cachep->slabs_free); 		*/
-			if (unlikely(list_empty(&cachep->slabs_partial)))
-				list_add(&slabp->list, &cachep->slabs_partial);
-			else
-				kmem_slab_destroy(cachep, slabp);
-		} else if (unlikely(inuse == cachep->num)) {
-			/* Was full. */
-			list_del(&slabp->list);
-			list_add(&slabp->list, &cachep->slabs_partial);
-		}
+	if (unlikely(!--slabp->inuse)) {
+		list_add(&slabp->list, &list3_data_ptr(cachep, objp)->slabs_free);
+	} else {
+		/* Unconditionally move a slab to the end of the
+		 * partial list on free - maximum time for the
+		 * other objects to be freed, too.
+		 */
+		list_add_tail(&slabp->list, &list3_data_ptr(cachep, objp)->slabs_partial);
 	}
 }
 
-#ifdef CONFIG_SMP
-static inline void __free_block (kmem_cache_t* cachep,
-							void** objpp, int len)
+static inline void __free_block (kmem_cache_t* cachep, void** objpp, int len)
 {
 	for ( ; len > 0; len--, objpp++)
 		kmem_cache_free_one(cachep, *objpp);
 }
 
-static void free_block (kmem_cache_t* cachep, void** objpp, int len)
+
+static void kmem_cache_flusharray (kmem_cache_t* cachep, cpucache_t *cc)
 {
+	int batchcount;
+
+	batchcount = cc->batchcount;
+#if DEBUG
+	BUG_ON(!batchcount || batchcount > cc->avail);
+#endif
+	check_irq_off();
 	spin_lock(&cachep->spinlock);
-	__free_block(cachep, objpp, len);
+	__free_block(cachep, &cc_entry(cc)[0], batchcount);
+
+#if STATS
+	{
+		int i = 0;
+		struct list_head *p;
+		p = list3_data(cachep)->slabs_free.next;
+		while (p != &(list3_data(cachep)->slabs_free)) {
+			slab_t *slabp;
+
+			slabp = list_entry(p, slab_t, list);
+			BUG_ON(slabp->inuse);
+
+			i++;
+			p = p->next;
+		}
+		STATS_SET_FREEABLE(cachep, i);
+	}
+#endif
+
 	spin_unlock(&cachep->spinlock);
+	cc->avail -= batchcount;
+	memcpy(&cc_entry(cc)[0], &cc_entry(cc)[batchcount],
+			sizeof(void*)*cc->avail);
 }
-#endif
 
 /*
  * __kmem_cache_free
- * called with disabled ints
+ * Release an obj back to its cache. If the obj has a constructed
+ * state, it must be in this state _before_ it is released.
+ *
+ * Called with disabled ints.
  */
 static inline void __kmem_cache_free (kmem_cache_t *cachep, void* objp)
 {
-#ifdef CONFIG_SMP
-	cpucache_t *cc = cc_data(cachep);
+	cpucache_t *cc = cc_data_ptr(cachep, objp);
 
-	CHECK_PAGE(objp);
-	if (cc) {
-		int batchcount;
-		if (cc->avail < cc->limit) {
-			STATS_INC_FREEHIT(cachep);
-			cc_entry(cc)[cc->avail++] = objp;
-			return;
-		}
-		STATS_INC_FREEMISS(cachep);
-		batchcount = cachep->batchcount;
-		cc->avail -= batchcount;
-		free_block(cachep, &cc_entry(cc)[cc->avail], batchcount);
+	objp = kmem_cache_free_debugcheck(cachep, objp);
+
+	if (likely(cc->avail < cc->limit)) {
+		STATS_INC_FREEHIT(cachep);
 		cc_entry(cc)[cc->avail++] = objp;
 		return;
 	} else {
-		free_block(cachep, &objp, 1);
+		STATS_INC_FREEMISS(cachep);
+		kmem_cache_flusharray(cachep, cc);
+		cc_entry(cc)[cc->avail++] = objp;
 	}
-#else
-	kmem_cache_free_one(cachep, objp);
-#endif
 }
 
 /**
@@ -1600,6 +1659,13 @@
 	for (; csizep->cs_size; csizep++) {
 		if (size > csizep->cs_size)
 			continue;
+#if DEBUG
+		/* This happens if someone tries to call
+		 * kmem_cache_create(), or kmalloc(), before
+		 * the generic caches are initialized.
+		 */
+		BUG_ON(csizep->cs_cachep == NULL);
+#endif
 		return __kmem_cache_alloc(flags & GFP_DMA ?
 			 csizep->cs_dmacachep : csizep->cs_cachep, flags);
 	}
@@ -1617,11 +1683,6 @@
 void kmem_cache_free (kmem_cache_t *cachep, void *objp)
 {
 	unsigned long flags;
-#if DEBUG
-	CHECK_PAGE(objp);
-	if (cachep != GET_PAGE_CACHE(virt_to_page(objp)))
-		BUG();
-#endif
 
 	local_irq_save(flags);
 	__kmem_cache_free(cachep, objp);
@@ -1643,7 +1704,7 @@
 	if (!objp)
 		return;
 	local_irq_save(flags);
-	CHECK_PAGE(objp);
+	kfree_debugcheck(objp);
 	c = GET_PAGE_CACHE(virt_to_page(objp));
 	__kmem_cache_free(c, (void*)objp);
 	local_irq_restore(flags);
@@ -1674,86 +1735,114 @@
 	return (gfpflags & GFP_DMA) ? csizep->cs_dmacachep : csizep->cs_cachep;
 }
 
-#ifdef CONFIG_SMP
+/*
+ * Waits for all CPUs to execute func().
+ */
+static void smp_call_function_all_cpus(void (*func) (void *arg), void *arg)
+{
+	local_irq_disable();
+	func(arg);
+	local_irq_enable();
 
-/* called with cache_chain_sem acquired.  */
-static int kmem_tune_cpucache (kmem_cache_t* cachep, int limit, int batchcount)
+	if (smp_call_function(func, arg, 1, 1))
+		BUG();
+}
+
+static void do_drain(void *arg)
 {
-	ccupdate_struct_t new;
-	int i;
+	kmem_cache_t *cachep = (kmem_cache_t*)arg;
+	cpucache_t *cc;
 
-	/*
-	 * These are admin-provided, so we are more graceful.
-	 */
-	if (limit < 0)
-		return -EINVAL;
-	if (batchcount < 0)
-		return -EINVAL;
-	if (batchcount > limit)
-		return -EINVAL;
-	if (limit != 0 && !batchcount)
-		return -EINVAL;
+	cc = cc_data(cachep);
+	check_irq_off();
+	spin_lock(&cachep->spinlock);
+	__free_block(cachep, &cc_entry(cc)[0], cc->avail);
+	spin_unlock(&cachep->spinlock);
+	cc->avail = 0;
+}
+
+static void drain_cpu_caches(kmem_cache_t *cachep)
+{
+	smp_call_function_all_cpus(do_drain, cachep);
+}
+
+
+struct ccupdate_struct {
+	kmem_cache_t *cachep;
+	cpucache_t *new[NR_CPUS];
+};
+
+static void do_ccupdate_local(void *info)
+{
+	struct ccupdate_struct *new = (struct ccupdate_struct *)info;
+	cpucache_t *old;
+	old = cc_data(new->cachep);
+	
+	cc_data(new->cachep) = new->new[smp_processor_id()];
+	new->new[smp_processor_id()] = old;
+}
+
+
+static int do_tune_cpucache (kmem_cache_t* cachep, int limit, int batchcount)
+{
+	struct ccupdate_struct new;
+	int i;
 
 	memset(&new.new,0,sizeof(new.new));
-	if (limit) {
-		for (i = 0; i < NR_CPUS; i++) {
-			cpucache_t* ccnew;
-
-			ccnew = kmalloc(sizeof(void*)*limit+
-					sizeof(cpucache_t), GFP_KERNEL);
-			if (!ccnew) {
-				for (i--; i >= 0; i--) kfree(new.new[i]);
-				return -ENOMEM;
-			}
-			ccnew->limit = limit;
-			ccnew->avail = 0;
-			new.new[i] = ccnew;
+	for (i = 0; i < NR_CPUS; i++) {
+		cpucache_t* ccnew;
+
+		ccnew = kmalloc(sizeof(void*)*limit+
+				sizeof(cpucache_t), GFP_KERNEL);
+		if (!ccnew) {
+			for (i--; i >= 0; i--) kfree(new.new[i]);
+			return -ENOMEM;
 		}
+		ccnew->avail = 0;
+		ccnew->limit = limit;
+		ccnew->batchcount = batchcount;
+		new.new[i] = ccnew;
 	}
 	new.cachep = cachep;
+
+	smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);
+	
 	spin_lock_irq(&cachep->spinlock);
 	cachep->batchcount = batchcount;
+	cachep->limit = limit;
 	spin_unlock_irq(&cachep->spinlock);
 
-	smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);
-
 	for (i = 0; i < NR_CPUS; i++) {
 		cpucache_t* ccold = new.new[i];
 		if (!ccold)
 			continue;
 		local_irq_disable();
-		free_block(cachep, cc_entry(ccold), ccold->avail);
+		__free_block(cachep, cc_entry(ccold), ccold->avail);
 		local_irq_enable();
 		kfree(ccold);
 	}
 	return 0;
 }
 
-/* 
- * If slab debugging is enabled, don't batch slabs
- * on the per-cpu lists by defaults.
- */
+
 static void enable_cpucache (kmem_cache_t *cachep)
 {
-#ifndef CONFIG_DEBUG_SLAB
 	int err;
 	int limit;
 
-	/* FIXME: optimize */
 	if (cachep->objsize > PAGE_SIZE)
-		return;
-	if (cachep->objsize > 1024)
+		limit = 4;
+	else if (cachep->objsize > 1024)
 		limit = 60;
 	else if (cachep->objsize > 256)
 		limit = 124;
 	else
 		limit = 252;
 
-	err = kmem_tune_cpucache(cachep, limit, limit/2);
+	err = do_tune_cpucache(cachep, limit, limit/2);
 	if (err)
 		printk(KERN_ERR "enable_cpucache failed for %s, error %d.\n",
 					cachep->name, -err);
-#endif
 }
 
 static void enable_all_cpucaches (void)
@@ -1765,143 +1854,114 @@
 	p = &cache_cache.next;
 	do {
 		kmem_cache_t* cachep = list_entry(p, kmem_cache_t, next);
-
 		enable_cpucache(cachep);
 		p = cachep->next.next;
 	} while (p != &cache_cache.next);
 
 	up(&cache_chain_sem);
 }
-#endif
+
+#define REAP_SCANLEN	10
 
 /**
- * kmem_cache_reap - Reclaim memory from caches.
- * @gfp_mask: the type of memory required.
+ * cache_reap - Reclaim memory from caches.
  *
- * Called from do_try_to_free_pages() and __alloc_pages()
+ * Called from a timer, 10 times/second.
+ * Purpose:
+ * - clear the per-cpu caches
+ * - return freeable pages to gfp.
  */
-int kmem_cache_reap (int gfp_mask)
+static int cache_reap (void)
 {
-	slab_t *slabp;
+	int todo;
 	kmem_cache_t *searchp;
-	kmem_cache_t *best_cachep;
-	unsigned int best_pages;
-	unsigned int best_len;
-	unsigned int scan;
-	int ret = 0;
 
-	if (gfp_mask & __GFP_WAIT)
-		down(&cache_chain_sem);
-	else
-		if (down_trylock(&cache_chain_sem))
-			return 0;
+	if (down_trylock(&cache_chain_sem))
+		return -1;
 
-	scan = REAP_SCANLEN;
-	best_len = 0;
-	best_pages = 0;
-	best_cachep = NULL;
+	todo = (g_cache_count+REAP_SCANLEN-1)/REAP_SCANLEN;
 	searchp = clock_searchp;
+
 	do {
-		unsigned int pages;
 		struct list_head* p;
 		unsigned int full_free;
+		cpucache_t *cc;
+		slab_t *slabp;
 
-		/* It's safe to test this without holding the cache-lock. */
 		if (searchp->flags & SLAB_NO_REAP)
 			goto next;
+
 		spin_lock_irq(&searchp->spinlock);
-		if (searchp->growing)
-			goto next_unlock;
 		if (searchp->dflags & DFLGS_GROWN) {
 			searchp->dflags &= ~DFLGS_GROWN;
 			goto next_unlock;
 		}
-#ifdef CONFIG_SMP
-		{
-			cpucache_t *cc = cc_data(searchp);
-			if (cc && cc->avail) {
-				__free_block(searchp, cc_entry(cc), cc->avail);
-				cc->avail = 0;
-			}
+		cc = cc_data(searchp);
+		if (cc && cc->avail) {
+			__free_block(searchp, cc_entry(cc), cc->avail);
+			cc->avail = 0;
 		}
-#endif
 
 		full_free = 0;
-		p = searchp->slabs_free.next;
-		while (p != &searchp->slabs_free) {
+		p = searchp->lists.slabs_free.next;
+		while (p != &searchp->lists.slabs_free) {
 			slabp = list_entry(p, slab_t, list);
-#if DEBUG
-			if (slabp->inuse)
-				BUG();
-#endif
+			BUG_ON(slabp->inuse);
+
 			full_free++;
 			p = p->next;
 		}
 
-		/*
-		 * Try to avoid slabs with constructors and/or
-		 * more than one page per slab (as it can be difficult
-		 * to get high orders from gfp()).
-		 */
-		pages = full_free * (1<<searchp->gfporder);
-		if (searchp->ctor)
-			pages = (pages*4+1)/5;
-		if (searchp->gfporder)
-			pages = (pages*4+1)/5;
-		if (pages > best_pages) {
-			best_cachep = searchp;
-			best_len = full_free;
-			best_pages = pages;
-			if (pages >= REAP_PERFECT) {
-				clock_searchp = list_entry(searchp->next.next,
-							kmem_cache_t,next);
-				goto perfect;
+		if(full_free) {
+			full_free = (full_free*4+1)/5;
+			while (full_free--) {
+				p = searchp->lists.slabs_free.next;
+				if (p == &searchp->lists.slabs_free)
+					break;
+				slabp = list_entry(p, slab_t, list);
+				BUG_ON(slabp->inuse);
+				list_del(&slabp->list);
+				STATS_INC_REAPED(searchp);
+
+				/* Safe to drop the lock. The slab is no longer
+				 * linked to the cache.
+				 * cachep cannot disappear, we hold
+				 * cache_chain_sem
+				 */
+				spin_unlock_irq(&searchp->spinlock);
+				kmem_slab_destroy(searchp, slabp);
+				spin_lock_irq(&searchp->spinlock);
 			}
 		}
+
 next_unlock:
 		spin_unlock_irq(&searchp->spinlock);
 next:
 		searchp = list_entry(searchp->next.next,kmem_cache_t,next);
-	} while (--scan && searchp != clock_searchp);
+	} while (--todo);
 
 	clock_searchp = searchp;
+	up(&cache_chain_sem);
+	return 0;
+}
 
-	if (!best_cachep)
-		/* couldn't find anything to reap */
-		goto out;
-
-	spin_lock_irq(&best_cachep->spinlock);
-perfect:
-	/* free only 50% of the free slabs */
-	best_len = (best_len + 1)/2;
-	for (scan = 0; scan < best_len; scan++) {
-		struct list_head *p;
+static void reap_timer_fnc(unsigned long data)
+{
+	unsigned long expires;
 
-		if (best_cachep->growing)
-			break;
-		p = best_cachep->slabs_free.prev;
-		if (p == &best_cachep->slabs_free)
-			break;
-		slabp = list_entry(p,slab_t,list);
-#if DEBUG
-		if (slabp->inuse)
-			BUG();
-#endif
-		list_del(&slabp->list);
-		STATS_INC_REAPED(best_cachep);
+	// Optimization question: fewer reaps
+	// means less probability for unnecessary
+	// cpucache drain/refill cycles
+	//
+	// OTOH a rare flush means that e.g.
+	// 4 order=5 objects could sit in the per-cpu array
+	if(cache_reap() < 0)
+		expires = 1;
+	else
+		expires = 2*HZ/REAP_SCANLEN;
 
-		/* Safe to drop the lock. The slab is no longer linked to the
-		 * cache.
-		 */
-		spin_unlock_irq(&best_cachep->spinlock);
-		kmem_slab_destroy(best_cachep, slabp);
-		spin_lock_irq(&best_cachep->spinlock);
-	}
-	spin_unlock_irq(&best_cachep->spinlock);
-	ret = scan * (1 << best_cachep->gfporder);
-out:
-	up(&cache_chain_sem);
-	return ret;
+	reap_timer.expires = jiffies + expires;
+	add_timer(&reap_timer);
 }
 
 #ifdef CONFIG_PROC_FS
@@ -1954,13 +2014,10 @@
 		 * Output format version, so at least we can change it
 		 * without _too_ many complaints.
 		 */
-		seq_puts(m, "slabinfo - version: 1.1"
+		seq_puts(m, "slabinfo - version: 1.2"
 #if STATS
 				" (statistics)"
 #endif
-#ifdef CONFIG_SMP
-				" (SMP)"
-#endif
 				"\n");
 		return 0;
 	}
@@ -1968,21 +2025,21 @@
 	spin_lock_irq(&cachep->spinlock);
 	active_objs = 0;
 	num_slabs = 0;
-	list_for_each(q,&cachep->slabs_full) {
+	list_for_each(q,&cachep->lists.slabs_full) {
 		slabp = list_entry(q, slab_t, list);
 		if (slabp->inuse != cachep->num)
 			BUG();
 		active_objs += cachep->num;
 		active_slabs++;
 	}
-	list_for_each(q,&cachep->slabs_partial) {
+	list_for_each(q,&cachep->lists.slabs_partial) {
 		slabp = list_entry(q, slab_t, list);
-		if (slabp->inuse == cachep->num)
+		if (slabp->inuse == cachep->num || !slabp->inuse)
 			BUG();
 		active_objs += slabp->inuse;
 		active_slabs++;
 	}
-	list_for_each(q,&cachep->slabs_free) {
+	list_for_each(q,&cachep->lists.slabs_free) {
 		slabp = list_entry(q, slab_t, list);
 		if (slabp->inuse)
 			BUG();
@@ -2007,19 +2064,6 @@
 		name, active_objs, num_objs, cachep->objsize,
 		active_slabs, num_slabs, (1<<cachep->gfporder));
 
-#if STATS
-	{
-		unsigned long errors = cachep->errors;
-		unsigned long high = cachep->high_mark;
-		unsigned long grown = cachep->grown;
-		unsigned long reaped = cachep->reaped;
-		unsigned long allocs = cachep->num_allocations;
-
-		seq_printf(m, " : %6lu %7lu %5lu %4lu %4lu",
-				high, allocs, grown, reaped, errors);
-	}
-#endif
-#ifdef CONFIG_SMP
 	{
 		unsigned int batchcount = cachep->batchcount;
 		unsigned int limit;
@@ -2030,13 +2074,21 @@
 			limit = 0;
 		seq_printf(m, " : %4u %4u", limit, batchcount);
 	}
-#endif
-#if STATS && defined(CONFIG_SMP)
+#if STATS
 	{
+		unsigned long high = cachep->high_mark;
+		unsigned long allocs = cachep->num_allocations;
+		unsigned long grown = cachep->grown;
+		unsigned long reaped = cachep->reaped;
+		unsigned long errors = cachep->errors;
+		unsigned long max_freeable = cachep->max_freeable;
 		unsigned long allochit = atomic_read(&cachep->allochit);
 		unsigned long allocmiss = atomic_read(&cachep->allocmiss);
 		unsigned long freehit = atomic_read(&cachep->freehit);
 		unsigned long freemiss = atomic_read(&cachep->freemiss);
+
+		seq_printf(m, " : %6lu %7lu %5lu %4lu %4lu %4lu",
+				high, allocs, grown, reaped, errors, max_freeable);
 		seq_printf(m, " : %6lu %6lu %6lu %6lu",
 				allochit, allocmiss, freehit, freemiss);
 	}
@@ -2078,7 +2130,6 @@
 ssize_t slabinfo_write(struct file *file, const char *buffer,
 				size_t count, loff_t *ppos)
 {
-#ifdef CONFIG_SMP
 	char kbuf[MAX_SLABINFO_WRITE+1], *tmp;
 	int limit, batchcount, res;
 	struct list_head *p;
@@ -2106,7 +2157,13 @@
 		kmem_cache_t *cachep = list_entry(p, kmem_cache_t, next);
 
 		if (!strcmp(cachep->name, kbuf)) {
-			res = kmem_tune_cpucache(cachep, limit, batchcount);
+			if (limit < 1 ||
+			    batchcount < 1 ||
+			    batchcount > limit) {
+				res = -EINVAL;
+			} else {
+				res = do_tune_cpucache(cachep, limit, batchcount);
+			}
 			break;
 		}
 	}
@@ -2114,8 +2171,5 @@
 	if (res >= 0)
 		res = count;
 	return res;
-#else
-	return -EINVAL;
-#endif
 }
 #endif




* Re: [PATH] slab cleanup
  2002-09-29 12:43 [PATH] slab cleanup Manfred Spraul
@ 2002-09-29 13:56 ` Ed Tomlinson
  2002-09-30  6:55 ` Martin J. Bligh
  1 sibling, 0 replies; 4+ messages in thread
From: Ed Tomlinson @ 2002-09-29 13:56 UTC (permalink / raw)
  To: Manfred Spraul, lse-tech
  Cc: Martin J. Bligh, akpm, Kamble, Nitin A, linux-mm, linux-kernel

On September 29, 2002 08:43 am, Manfred Spraul wrote:
> I think we should make some cleanups before adding NUMA:

Exactly my thoughts.

> a) make the per-cpu arrays unconditional, even on UP.
> 	The arrays provide LIFO ordering, which should improve
> 	cache hit rates. The disadvantage is probably higher
> 	internal fragmentation [not measured], but memory is cheap,
> 	and cache hits are important.
> 	In addition, this makes it possible to perform some cleanups.
> 	No more kmem_cache_alloc_one, for example.

This sounds like a very good idea.

> b) use arrays for all caches, even with large objects.
> 	E.g. right now, the 16 kB loopback socket buffers
> 	are not handled per-cpu.

Agree here also

> Nitin, in your NUMA patch, you return each "wrong-node" object
> immediately, without batching them together. Have you considered
> batching? It could reduce the number of spinlock operations dramatically.
>
> There is one additional problem without an obvious solution:
> kmem_cache_reap() is too inefficient. Before 2.5.39, it served 2 purposes:
>
> 1) return freeable slabs back to gfp.
>
> 	<2.5.39 scans through the caches in every try_to_free_pages.
> 	That scan is terribly inefficient, and akpm noticed lots of
> 	idle reschedules on the cache_chain_sem
>
> 	2.5.39 limits the freeable slabs list to one entry, and doesn't
> 	scan at all. On the downside, this locks up one slab in each
> 	cache [in the worst case, an order=5 allocation]. For very
> 	bursty, slightly fragmented caches, it could lead to
> 	more kmem_cache_grow().
>
> 	My patch scans every 2 seconds, and returns 80% of the pages.

I would rather eliminate the need for scanning and drop the cachep freelists.  Scanning
every two seconds, from a vm perspective, is probably not a good idea.  We want the
free pages to reach the vm asap - two seconds can be a long time, or alternately it can
be much too fast...  This fails to tie freeing slab memory to vm pressure.

If we need to, I would rather add intelligence to the object deletion path, ensuring we
keep enough slabps around to handle bursty use patterns.  The simplest way to do
this would be to add a field to cachep and use this to ensure that number of slabps
is kept for the cache.  We could default this to zero and set it only for known 'important'
caches.  It would be better to make this self-tuning though - we need to be careful with
the hotpaths in this case.
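
Something like this is what I mean - only a sketch, min_free_slabs and
free_slabs are made-up cachep fields (defaulting to zero), nothing from
Manfred's patch:

/* sketch: called from the object deletion path when a slab becomes empty */
static void slab_became_empty(kmem_cache_t *cachep, slab_t *slabp)
{
	if (cachep->free_slabs >= cachep->min_free_slabs) {
		/* reserve already full - hand the pages back to gfp now */
		kmem_slab_destroy(cachep, slabp);
	} else {
		/* keep the empty slab around for the next burst */
		list_add(&slabp->list, &cachep->lists.slabs_free);
		cachep->free_slabs++;
	}
}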

> 2) regularly flush the per-cpu arrays
> 	<2.5.39 does that in every try_to_free_pages.
>
> 	My patch does that every 2 seconds, on a random cpu.

Yes this needs to be done - missed this in slabasap...

> 	2.5.39 never does that. Not flushing probably reduces cpucache
> 	thrashing [alloc a batch of 120 objects, return one object,
> 	release a batch of 119 objects due to the next try_to_free_pages].
> 	The problem is that without flushing, the cpuarrays can
> 	lock up lots of pages [in the worst case, several thousand
> 	for each cpu].
>
> The attached patch is WIP:
> * boots on UP
> * partially boots with bochs [cpu simulator], until bochs locks up.
> * contains comments, where I'd add modifications for NUMA.
>
> What do you think?
> For NUMA, is it possible to figure out efficiently into which node a
> pointer points? That would happen in every kmem_cache_free().
>
> Could someone test that it works on real SMP?
>
> What is the best replacement for kmem_cache_reap()? 2.5.39 obviously
> contains the simplest approach, but I'm not sure if it's the best.

It does allow quite a bit of code to be made redundant, and it seems to
work OK in mm; there have not been complaints in Linus' tree (excepting the
kmem_cache_destroy bug).  It also can be extended to handle bursty
caches that are usually near empty - if benchmarks show it's required.

Now to see exactly how you implemented things.

Ed


* Re: [PATH] slab cleanup
  2002-09-29 12:43 [PATH] slab cleanup Manfred Spraul
  2002-09-29 13:56 ` Ed Tomlinson
@ 2002-09-30  6:55 ` Martin J. Bligh
  2002-09-30 15:59   ` Manfred Spraul
  1 sibling, 1 reply; 4+ messages in thread
From: Martin J. Bligh @ 2002-09-30  6:55 UTC (permalink / raw)
  To: Manfred Spraul, lse-tech
  Cc: akpm, tomlins, Kamble, Nitin A, linux-mm, linux-kernel

> Could someone test that it works on real SMP?

Tested on 16-way NUMA-Q (shows up races quicker than anything ;-)). 
Boots, compiles the kernel 5 times OK. That's good enough for me. 
No performance regression, in fact was marginally faster (within 
experimental error though).

M.



* Re: [PATH] slab cleanup
  2002-09-30  6:55 ` Martin J. Bligh
@ 2002-09-30 15:59   ` Manfred Spraul
  0 siblings, 0 replies; 4+ messages in thread
From: Manfred Spraul @ 2002-09-30 15:59 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: lse-tech, akpm, tomlins, Kamble, Nitin A, linux-mm, linux-kernel

Martin J. Bligh wrote:
>>Could someone test that it works on real SMP?
> 
> 
> Tested on 16-way NUMA-Q (shows up races quicker than anything ;-)). 
> Boots, compiles the kernel 5 times OK. That's good enough for me. 
> No performance regression, in fact was marginally faster (within 
> experimental error though).
> 
Thanks for the test. NUMA is on my TODO list, after figuring out 
where/how to drain cpu caches and the free list.

I've found one stupid bug with debugging enabled: the new debug code 
tries to poison NULL pointers, with limited success :-(

And one limitation might be important for arch-specific code:
kmem_cache_create() during mem_init() is not possible anymore.

--
	Manfred



