Subject: [patch] SLQB slab allocator
From: Nick Piggin @ 2009-01-21 14:30 UTC
  To: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton
  Cc: Lin Ming, Zhang, Yanmin, Christoph Lameter

Hi,

Since last posted, I've cleaned up a few bits and pieces, (hopefully)
fixed a known bug where it wouldn't boot on memoryless nodes (I don't
have a system to test with), and improved performance and reduced
locking somewhat for node-specific and interleaved allocations.

There are a few TODOs remaining (see "TODO"). Most are hopefully
obscure or relatively unimportant cases. The biggest thing really
is to test and tune on a wider range of workloads, so I'll ask for
it to be merged into the slab tree and from there into linux-next
to see what comes up. I'll work on tuning things and the TODO items
before a possible mainline merge. Actually, it would be kind of
instructive if people ran into issues on the TODO list, because that
would help guide improvements...

BTW, if anybody wants explicit copyright attribution on the files,
that's fine, just send patches. I just dislike big header buildups,
which is why I make a broader acknowledgement. In fact, the other
allocators don't even explicitly acknowledge SLAB, so I didn't think
it would be a problem. I don't really know the legal issues, but we've
set plenty of precedent...

---
Introducing the SLQB slab allocator.

SLQB takes code and ideas from all other slab allocators in the tree.

The primary method for keeping lists of free objects within the allocator
is a singly-linked list, storing a pointer within the object memory itself
(or in a small additional space in the case of RCU-destroyed slabs). This is
like SLOB and SLUB, as opposed to SLAB, which uses arrays of object pointers
plus separate metadata. This reduces memory consumption and makes smaller
object sizes more practical, as there is less per-object overhead.
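
To illustrate the idea, here is a minimal standalone C sketch (toy names,
not the patch's code): each free object's first word holds the pointer to
the next free object, so the cache needs nothing beyond a head pointer. In
the real patch the link is stored at a configurable offset (s->offset)
rather than always at offset zero.

#include <stddef.h>

/* Toy cache: the free list is threaded through the free objects
 * themselves, so the only bookkeeping is a head pointer. */
struct toy_cache {
	size_t	size;		/* object size, must be >= sizeof(void *) */
	void	*freelist;	/* head of the singly-linked free list */
};

static void toy_free(struct toy_cache *c, void *obj)
{
	*(void **)obj = c->freelist;	/* link lives inside the object */
	c->freelist = obj;
}

static void *toy_alloc(struct toy_cache *c)
{
	void *obj = c->freelist;

	if (obj)
		c->freelist = *(void **)obj;	/* pop the head */
	return obj;
}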

Using lists rather than arrays can reduce the cacheline footprint. When moving
objects around, SLQB can move a list of objects from one CPU to another by
simply manipulating a head pointer, whereas SLAB needs to memcpy arrays. Some
SLAB per-CPU arrays can be up to 1K in size, which is a lot of cachelines that
can be touched during alloc/free. Newly freed objects tend to be cache hot,
and newly allocated ones tend to be touched soon anyway, so there is often
little cost to keeping the metadata in the objects themselves.
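
For example, handing a whole batch of queued objects from one CPU's list to
another is a constant-time splice of head/tail pointers. The standalone
sketch below is illustrative only (the patch's struct kmlist and
set_freepointer() do the equivalent, with the link at a configurable offset):

#include <stddef.h>

/* Toy queue of free objects, in the spirit of struct kmlist; here the
 * link is simply the first word of each queued object. */
struct toy_list {
	unsigned long	nr;
	void		**head, **tail;	/* first and last queued objects */
};

/* Move everything on src to the tail of dst: O(1), touching only the
 * head/tail words rather than copying per-object array entries. */
static void toy_splice(struct toy_list *dst, struct toy_list *src)
{
	if (!src->nr)
		return;

	if (!dst->head)
		dst->head = src->head;
	else
		*dst->tail = (void *)src->head;	/* old tail -> new chunk */
	dst->tail = src->tail;
	dst->nr += src->nr;

	src->head = src->tail = NULL;
	src->nr = 0;
}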

SLQB has a per-CPU LIFO freelist of objects like SLAB (but using lists rather
than arrays). Freed objects are returned to this freelist if they belong to
the node that the freeing CPU belongs to. So objects allocated on one CPU can
be added to the freelist of another CPU on the same node. When LIFO freelists
need to be refilled or trimmed, SLQB takes objects from, or returns them to,
a list of slabs.
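
A sketch of that refill/trim policy is below. The hiwater/freebatch names
mirror fields of struct kmem_cache in this patch, but the structure and the
example numbers are made up purely for illustration:

/* Toy watermark policy for a per-CPU freelist. */
struct toy_cpu_list {
	unsigned long	nr;		/* objects on the LIFO freelist */
	unsigned long	hiwater;	/* flush threshold, e.g. 64 */
	unsigned long	freebatch;	/* objects to flush at once, e.g. 16 */
};

/* Stand-in for handing a batch of objects back to their slab pages. */
static void toy_trim_to_pages(struct toy_cpu_list *l, unsigned long nr)
{
	l->nr -= nr;
}

static void toy_queue_free(struct toy_cpu_list *l)
{
	l->nr++;			/* LIFO push (list plumbing omitted) */
	if (l->nr > l->hiwater)
		toy_trim_to_pages(l, l->freebatch);
}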

SLQB has per-CPU lists of slabs (which use struct page as their metadata,
including the list head for this list). Each slab contains a singly-linked
list of the objects that are free in that slab (free, and not on a LIFO
freelist). Slabs are freed as soon as all their objects are freed, and new
slabs are allocated only when no partial slabs remain. Slabs are taken off
this slab list when they have no free objects left, so the slab lists only
ever contain "partial" slabs: those which are neither completely full nor
completely empty. SLQB slab lists can be manipulated with no locking, unlike
other allocators, which tend to use per-node locks. As the number of threads
per socket increases, this should help improve the scalability of slab
operations.
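
The page state transitions described above boil down to something like the
following standalone sketch (a simplified, illustrative analogue of
free_object_to_page() later in this patch, not the code itself):

/* Track how a slab page moves on and off the partial list as objects
 * are freed back to it. */
struct toy_slab {
	unsigned int	inuse;		/* objects currently allocated */
	unsigned int	objects;	/* objects the page can hold */
	int		on_partial;	/* 1 if linked on the partial list */
};

static void toy_free_to_slab(struct toy_slab *page)
{
	page->inuse--;

	if (!page->inuse) {
		/* Last object freed: leave the list and return the
		 * whole page to the page allocator. */
		page->on_partial = 0;
	} else if (page->inuse + 1 == page->objects) {
		/* The page was full (and so off the list); it is now
		 * partial again and goes back on. */
		page->on_partial = 1;
	}
}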

Freeing objects to remote slab lists first batches up the objects on the
freeing CPU, then moves them over at once to a list on the allocating CPU. The
allocating CPU will then notice those objects and pull them onto the end of its
freelist. This remote freeing scheme is designed to minimise the number of
cross-CPU cachelines touched, short of going to a "crossbar" arrangement like
SLAB has. SLAB has "crossbars" of arrays of objects: NR_CPUS*MAX_NUMNODES
arrays, which can become very bloated on huge systems (this could be hundreds
of GBs of kmem caches on a 4096 CPU, 1024 node system).
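
(Back-of-the-envelope, using the numbers above and only as an estimate of
the scaling: with queue arrays of up to 1K each, a 4096 CPU x 1024 node
crossbar is 4096 * 1024 * 1K = 4GB of queue space per kmem cache, and with
on the order of a hundred caches in a typical kernel, that is how the total
reaches hundreds of GBs.)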

SLQB also has a similar freelist and slab-list structure per node, protected
by a lock and usable by any CPU in order to do node-specific allocations.
These allocations tend not to be too frequent (short-lived allocations should
be node-local, and long-lived allocations should not be too frequent).

There is a good overview and illustration of the design here:

http://lwn.net/Articles/311502/

By using LIFO freelists like SLAB, SLQB tries to be very page-size agnostic.
It tries very hard to use order-0 pages. This helps avoid both page allocator
fragmentation and slab fragmentation.

SLQB initialisation code attempts to be as simple and un-clever as possible.
There are no multiple phases where different things come up. There is no
weird self-bootstrapping stuff. It just statically allocates the structures
required to create the slabs that allocate other slab structures.

SLQB reuses much of the debugging infrastructure and the fine-grained sysfs
statistics from SLUB. There is also Documentation/vm/slqbinfo.c, derived
from slabinfo.c, which can query the sysfs data.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/linux/rcupdate.h
===================================================================
--- linux-2.6.orig/include/linux/rcupdate.h
+++ linux-2.6/include/linux/rcupdate.h
@@ -33,6 +33,7 @@
 #ifndef __LINUX_RCUPDATE_H
 #define __LINUX_RCUPDATE_H
 
+#include <linux/rcu_types.h>
 #include <linux/cache.h>
 #include <linux/spinlock.h>
 #include <linux/threads.h>
@@ -42,16 +43,6 @@
 #include <linux/lockdep.h>
 #include <linux/completion.h>
 
-/**
- * struct rcu_head - callback structure for use with RCU
- * @next: next update requests in a list
- * @func: actual update function to call after the grace period.
- */
-struct rcu_head {
-	struct rcu_head *next;
-	void (*func)(struct rcu_head *head);
-};
-
 #if defined(CONFIG_CLASSIC_RCU)
 #include <linux/rcuclassic.h>
 #elif defined(CONFIG_TREE_RCU)
Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/slqb_def.h
@@ -0,0 +1,283 @@
+#ifndef _LINUX_SLQB_DEF_H
+#define _LINUX_SLQB_DEF_H
+
+/*
+ * SLQB : A slab allocator with object queues.
+ *
+ * (C) 2008 Nick Piggin <npiggin@suse.de>
+ */
+#include <linux/types.h>
+#include <linux/gfp.h>
+#include <linux/workqueue.h>
+#include <linux/kobject.h>
+#include <linux/rcu_types.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+#include <linux/kobject.h>
+
+enum stat_item {
+	ALLOC,			/* Allocation count */
+	ALLOC_SLAB_FILL,	/* Fill freelist from page list */
+	ALLOC_SLAB_NEW,		/* New slab acquired from page allocator */
+	FREE,			/* Free count */
+	FREE_REMOTE,		/* NUMA: freeing to remote list */
+	FLUSH_FREE_LIST,	/* Freelist flushed */
+	FLUSH_FREE_LIST_OBJECTS, /* Objects flushed from freelist */
+	FLUSH_FREE_LIST_REMOTE,	/* Objects flushed from freelist to remote */
+	FLUSH_SLAB_PARTIAL,	/* Freeing moves slab to partial list */
+	FLUSH_SLAB_FREE,	/* Slab freed to the page allocator */
+	FLUSH_RFREE_LIST,	/* Rfree list flushed */
+	FLUSH_RFREE_LIST_OBJECTS, /* Rfree objects flushed */
+	CLAIM_REMOTE_LIST,	/* Remote freed list claimed */
+	CLAIM_REMOTE_LIST_OBJECTS, /* Remote freed objects claimed */
+	NR_SLQB_STAT_ITEMS
+};
+
+/*
+ * Singly-linked list with head, tail, and nr
+ */
+struct kmlist {
+	unsigned long nr;
+	void **head, **tail;
+};
+
+/*
+ * Every kmem_cache_list has a kmem_cache_remote_free structure, by which
+ * objects can be returned to the kmem_cache_list from remote CPUs.
+ */
+struct kmem_cache_remote_free {
+	spinlock_t lock;
+	struct kmlist list;
+} ____cacheline_aligned;
+
+/*
+ * A kmem_cache_list manages all the slabs and objects allocated from a given
+ * source. Per-cpu kmem_cache_lists allow node-local allocations. Per-node
+ * kmem_cache_lists allow off-node allocations (but require locking).
+ */
+struct kmem_cache_list {
+	struct kmlist freelist;	/* Fastpath LIFO freelist of objects */
+#ifdef CONFIG_SMP
+	int remote_free_check;	/* remote_free has reached a watermark */
+#endif
+	struct kmem_cache *cache; /* kmem_cache corresponding to this list */
+
+	unsigned long nr_partial; /* Number of partial slabs (pages) */
+	struct list_head partial; /* Slabs which have some free objects */
+
+	unsigned long nr_slabs;	/* Total number of slabs allocated */
+
+	//struct list_head full;
+
+#ifdef CONFIG_SMP
+	/*
+	 * In the case of per-cpu lists, remote_free is for objects freed by
+	 * non-owner CPU back to its home list. For per-node lists, remote_free
+	 * is always used to free objects.
+	 */
+	struct kmem_cache_remote_free remote_free;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+	unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Primary per-cpu, per-kmem_cache structure.
+ */
+struct kmem_cache_cpu {
+	struct kmem_cache_list list; /* List for node-local slabs. */
+
+	unsigned int colour_next;
+
+#ifdef CONFIG_SMP
+	/*
+	 * rlist is a list of objects that don't fit on list.freelist (ie.
+	 * wrong node). The objects all correspond to a given kmem_cache_list,
+	 * remote_cache_list. To free objects to another list, we must first
+	 * flush the existing objects, then switch remote_cache_list.
+	 *
+	 * An NR_CPUS or MAX_NUMNODES array would be nice here, but then we
+	 * get to O(NR_CPUS^2) memory consumption situation.
+	 */
+	struct kmlist rlist;
+	struct kmem_cache_list *remote_cache_list;
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Per-node, per-kmem_cache structure.
+ */
+struct kmem_cache_node {
+	struct kmem_cache_list list;
+	spinlock_t list_lock; /* protects access to list */
+} ____cacheline_aligned;
+
+/*
+ * Management object for a slab cache.
+ */
+struct kmem_cache {
+	unsigned long flags;
+	int hiwater;		/* LIFO list high watermark */
+	int freebatch;		/* LIFO freelist batch flush size */
+	int objsize;		/* The size of an object without meta data */
+	int offset;		/* Free pointer offset. */
+	int objects;		/* Number of objects in slab */
+
+	int size;		/* The size of an object including meta data */
+	int order;		/* Allocation order */
+	gfp_t allocflags;	/* gfp flags to use on allocation */
+	unsigned int colour_range;	/* range of colour counter */
+	unsigned int colour_off;		/* offset per colour */
+	void (*ctor)(void *);
+
+	const char *name;	/* Name (only for display!) */
+	struct list_head list;	/* List of slab caches */
+
+	int align;		/* Alignment */
+	int inuse;		/* Offset to metadata */
+
+#ifdef CONFIG_SLQB_SYSFS
+	struct kobject kobj;	/* For sysfs */
+#endif
+#ifdef CONFIG_NUMA
+	struct kmem_cache_node *node[MAX_NUMNODES];
+#endif
+#ifdef CONFIG_SMP
+	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
+#else
+	struct kmem_cache_cpu cpu_slab;
+#endif
+};
+
+/*
+ * Kmalloc subsystem.
+ */
+#if defined(ARCH_KMALLOC_MINALIGN) && ARCH_KMALLOC_MINALIGN > 8
+#define KMALLOC_MIN_SIZE ARCH_KMALLOC_MINALIGN
+#else
+#define KMALLOC_MIN_SIZE 8
+#endif
+
+#define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
+#define KMALLOC_SHIFT_SLQB_HIGH (PAGE_SHIFT + 9)
+
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1];
+extern struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1];
+
+/*
+ * Constant size allocations use this path to find index into kmalloc caches
+ * arrays. get_slab() function is used for non-constant sizes.
+ */
+static __always_inline int kmalloc_index(size_t size)
+{
+	if (unlikely(!size))
+		return 0;
+	if (unlikely(size > 1UL << KMALLOC_SHIFT_SLQB_HIGH))
+		return 0;
+
+	if (unlikely(size <= KMALLOC_MIN_SIZE))
+		return KMALLOC_SHIFT_LOW;
+
+#if L1_CACHE_BYTES < 64
+	if (size > 64 && size <= 96)
+		return 1;
+#endif
+#if L1_CACHE_BYTES < 128
+	if (size > 128 && size <= 192)
+		return 2;
+#endif
+	if (size <=	  8) return 3;
+	if (size <=	 16) return 4;
+	if (size <=	 32) return 5;
+	if (size <=	 64) return 6;
+	if (size <=	128) return 7;
+	if (size <=	256) return 8;
+	if (size <=	512) return 9;
+	if (size <=       1024) return 10;
+	if (size <=   2 * 1024) return 11;
+	if (size <=   4 * 1024) return 12;
+	if (size <=   8 * 1024) return 13;
+	if (size <=  16 * 1024) return 14;
+	if (size <=  32 * 1024) return 15;
+	if (size <=  64 * 1024) return 16;
+	if (size <= 128 * 1024) return 17;
+	if (size <= 256 * 1024) return 18;
+	if (size <= 512 * 1024) return 19;
+	if (size <= 1024 * 1024) return 20;
+	if (size <=  2 * 1024 * 1024) return 21;
+	return -1;
+}
+
+#ifdef CONFIG_ZONE_DMA
+#define SLQB_DMA __GFP_DMA
+#else
+/* Disable "DMA slabs" */
+#define SLQB_DMA (__force gfp_t)0
+#endif
+
+/*
+ * Find the kmalloc slab cache for a given combination of allocation flags and
+ * size.
+ */
+static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
+{
+	int index = kmalloc_index(size);
+
+	if (unlikely(index == 0))
+		return NULL;
+
+	if (likely(!(flags & SLQB_DMA)))
+		return &kmalloc_caches[index];
+	else
+		return &kmalloc_caches_dma[index];
+}
+
+void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
+void *__kmalloc(size_t size, gfp_t flags);
+
+#ifndef ARCH_KMALLOC_MINALIGN
+#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#ifndef ARCH_SLAB_MINALIGN
+#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? sizeof(void *) : ARCH_KMALLOC_MINALIGN)
+
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
+{
+	if (__builtin_constant_p(size)) {
+		struct kmem_cache *s;
+
+		s = kmalloc_slab(size, flags);
+		if (unlikely(ZERO_OR_NULL_PTR(s)))
+			return s;
+
+		return kmem_cache_alloc(s, flags);
+	}
+	return __kmalloc(size, flags);
+}
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node);
+void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
+
+static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+{
+	if (__builtin_constant_p(size)) {
+		struct kmem_cache *s;
+
+		s = kmalloc_slab(size, flags);
+		if (unlikely(ZERO_OR_NULL_PTR(s)))
+			return s;
+
+		return kmem_cache_alloc_node(s, flags, node);
+	}
+	return __kmalloc_node(size, flags, node);
+}
+#endif
+
+#endif /* _LINUX_SLQB_DEF_H */
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -806,7 +806,7 @@ config SLUB_DEBUG
 
 choice
 	prompt "Choose SLAB allocator"
-	default SLUB
+	default SLQB
 	help
 	   This option allows to select a slab allocator.
 
@@ -827,6 +827,11 @@ config SLUB
 	   and has enhanced diagnostics. SLUB is the default choice for
 	   a slab allocator.
 
+config SLQB
+	bool "SLQB (Qeued allocator)"
+	help
+	  SLQB is a proposed new slab allocator.
+
 config SLOB
 	depends on EMBEDDED
 	bool "SLOB (Simple Allocator)"
@@ -868,7 +873,7 @@ config HAVE_GENERIC_DMA_COHERENT
 config SLABINFO
 	bool
 	depends on PROC_FS
-	depends on SLAB || SLUB_DEBUG
+	depends on SLAB || SLUB_DEBUG || SLQB
 	default y
 
 config RT_MUTEXES
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug
+++ linux-2.6/lib/Kconfig.debug
@@ -298,6 +298,26 @@ config SLUB_STATS
 	  out which slabs are relevant to a particular load.
 	  Try running: slabinfo -DA
 
+config SLQB_DEBUG
+	default y
+	bool "Enable SLQB debugging support"
+	depends on SLQB
+
+config SLQB_DEBUG_ON
+	default n
+	bool "SLQB debugging on by default"
+	depends on SLQB_DEBUG
+
+config SLQB_SYSFS
+	bool "Create SYSFS entries for slab caches"
+	default n
+	depends on SLQB
+
+config SLQB_STATS
+	bool "Enable SLQB performance statistics"
+	default n
+	depends on SLQB_SYSFS
+
 config DEBUG_PREEMPT
 	bool "Debug preemptible kernel"
 	depends on DEBUG_KERNEL && PREEMPT && (TRACE_IRQFLAGS_SUPPORT || PPC64)
Index: linux-2.6/mm/slqb.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/slqb.c
@@ -0,0 +1,3436 @@
+/*
+ * SLQB: A slab allocator that focuses on per-CPU scaling, and good performance
+ * with order-0 allocations. Fastpath emphasis is placed on local allocation
+ * and freeing, but with a secondary goal of good remote freeing (freeing on
+ * another CPU from that which allocated).
+ *
+ * Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/bit_spinlock.h>
+#include <linux/interrupt.h>
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/ctype.h>
+#include <linux/kallsyms.h>
+#include <linux/memory.h>
+
+/*
+ * TODO
+ * - fix up releasing of offlined data structures. Not a big deal because
+ *   they don't get cumulatively leaked with successive online/offline cycles
+ * - improve fallback paths, allow OOM conditions to flush back per-CPU pages
+ *   to common lists to be reused by other CPUs.
+ * - investigate performance with memoryless nodes. Perhaps CPUs can be given
+ *   a default closest home node via which they can use fastpath functions.
+ *   Perhaps it is not a big problem.
+ */
+
+/*
+ * slqb_page overloads struct page, and is used to manage some slab allocation
+ * aspects. However, to avoid the horrible mess in include/linux/mm_types.h,
+ * we'll just define our own struct slqb_page type variant here.
+ */
+struct slqb_page {
+	union {
+		struct {
+			unsigned long flags;	/* mandatory */
+			atomic_t _count;	/* mandatory */
+			unsigned int inuse;	/* Nr of objects */
+		   	struct kmem_cache_list *list; /* Pointer to list */
+			void **freelist;	/* freelist req. slab lock */
+			union {
+				struct list_head lru; /* misc. list */
+				struct rcu_head rcu_head; /* for rcu freeing */
+			};
+		};
+		struct page page;
+	};
+};
+static inline void struct_slqb_page_wrong_size(void)
+{ BUILD_BUG_ON(sizeof(struct slqb_page) != sizeof(struct page)); }
+
+#define PG_SLQB_BIT (1 << PG_slab)
+
+static int kmem_size __read_mostly;
+#ifdef CONFIG_NUMA
+static int numa_platform __read_mostly;
+#else
+#define numa_platform 0
+#endif
+
+static inline int slab_hiwater(struct kmem_cache *s)
+{
+	return s->hiwater;
+}
+
+static inline int slab_freebatch(struct kmem_cache *s)
+{
+	return s->freebatch;
+}
+
+/*
+ * Lock order:
+ * kmem_cache_node->list_lock
+ *   kmem_cache_remote_free->lock
+ *
+ * Data structures:
+ * SLQB is primarily per-cpu. For each kmem_cache, each CPU has:
+ *
+ * - A LIFO list of node-local objects. Allocation and freeing of node local
+ *   objects goes first to this list.
+ *
+ * - 2 Lists of slab pages, free and partial pages. If an allocation misses
+ *   the object list, it tries from the partial list, then the free list.
+ *   After freeing an object to the object list, if it is over a watermark,
+ *   some objects are freed back to pages. If an allocation misses these lists,
+ *   a new slab page is allocated from the page allocator. If the free list
+ *   reaches a watermark, some of its pages are returned to the page allocator.
+ *
+ * - A remote free queue, where objects freed that did not come from the local
+ *   node are queued to. When this reaches a watermark, the objects are
+ *   flushed.
+ *
+ * - A remotely freed queue, where objects allocated from this CPU are flushed
+ *   to from other CPUs' remote free queues. kmem_cache_remote_free->lock is
+ *   used to protect access to this queue.
+ *
+ *   When the remotely freed queue reaches a watermark, a flag is set to tell
+ *   the owner CPU to check it. The owner CPU will then check the queue on the
+ *   next allocation that misses the object list. It will move all objects from
+ *   this list onto the object list and then allocate one.
+ *
+ *   This system of remote queueing is intended to reduce lock and remote
+ *   cacheline acquisitions, and give a cooling off period for remotely freed
+ *   objects before they are re-allocated.
+ *
+ * node specific allocations from somewhere other than the local node are
+ * handled by a per-node list which is the same as the above per-CPU data
+ * structures except for the following differences:
+ *
+ * - kmem_cache_node->list_lock is used to protect access for multiple CPUs to
+ *   allocate from a given node.
+ *
+ * - There is no remote free queue. Nodes don't free objects, CPUs do.
+ */
+
+static inline void slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
+{
+#ifdef CONFIG_SLQB_STATS
+	list->stats[si]++;
+#endif
+}
+
+static inline void slqb_stat_add(struct kmem_cache_list *list, enum stat_item si,
+					unsigned long nr)
+{
+#ifdef CONFIG_SLQB_STATS
+	list->stats[si] += nr;
+#endif
+}
+
+static inline int slqb_page_to_nid(struct slqb_page *page)
+{
+	return page_to_nid(&page->page);
+}
+
+static inline void *slqb_page_address(struct slqb_page *page)
+{
+	return page_address(&page->page);
+}
+
+static inline struct zone *slqb_page_zone(struct slqb_page *page)
+{
+	return page_zone(&page->page);
+}
+
+static inline int virt_to_nid(const void *addr)
+{
+#ifdef virt_to_page_fast
+	return page_to_nid(virt_to_page_fast(addr));
+#else
+	return page_to_nid(virt_to_page(addr));
+#endif
+}
+
+static inline struct slqb_page *virt_to_head_slqb_page(const void *addr)
+{
+	struct page *p;
+
+	p = virt_to_head_page(addr);
+	return (struct slqb_page *)p;
+}
+
+static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
+						unsigned int order)
+{
+	struct page *p;
+
+	if (nid == -1)
+		p = alloc_pages(flags, order);
+	else
+		p = alloc_pages_node(nid, flags, order);
+
+	return (struct slqb_page *)p;
+}
+
+static inline void __free_slqb_pages(struct slqb_page *page, unsigned int order)
+{
+	struct page *p = &page->page;
+
+	reset_page_mapcount(p);
+	p->mapping = NULL;
+	VM_BUG_ON(!(p->flags & PG_SLQB_BIT));
+	p->flags &= ~PG_SLQB_BIT;
+
+	__free_pages(p, order);
+}
+
+#ifdef CONFIG_SLQB_DEBUG
+static inline int slab_debug(struct kmem_cache *s)
+{
+	return (s->flags &
+			(SLAB_DEBUG_FREE |
+			 SLAB_RED_ZONE |
+			 SLAB_POISON |
+			 SLAB_STORE_USER |
+			 SLAB_TRACE));
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+	return s->flags & SLAB_POISON;
+}
+#else
+static inline int slab_debug(struct kmem_cache *s)
+{
+	return 0;
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+	return 0;
+}
+#endif
+
+#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
+				SLAB_POISON | SLAB_STORE_USER)
+
+/* Internal SLQB flags */
+#define __OBJECT_POISON		0x80000000 /* Poison object */
+
+/* Not all arches define cache_line_size */
+#ifndef cache_line_size
+#define cache_line_size()	L1_CACHE_BYTES
+#endif
+
+#ifdef CONFIG_SMP
+static struct notifier_block slab_notifier;
+#endif
+
+/* A list of all slab caches on the system */
+static DECLARE_RWSEM(slqb_lock);
+static LIST_HEAD(slab_caches);
+
+/*
+ * Tracking user of a slab.
+ */
+struct track {
+	void *addr;		/* Called from address */
+	int cpu;		/* Was running on cpu */
+	int pid;		/* Pid context */
+	unsigned long when;	/* When did the operation occur */
+};
+
+enum track_item { TRACK_ALLOC, TRACK_FREE };
+
+static struct kmem_cache kmem_cache_cache;
+
+#ifdef CONFIG_SLQB_SYSFS
+static int sysfs_slab_add(struct kmem_cache *s);
+static void sysfs_slab_remove(struct kmem_cache *s);
+#else
+static inline int sysfs_slab_add(struct kmem_cache *s)
+{
+	return 0;
+}
+static inline void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kmem_cache_free(&kmem_cache_cache, s);
+}
+#endif
+
+/********************************************************************
+ * 			Core slab cache functions
+ *******************************************************************/
+
+static int __slab_is_available __read_mostly;
+int slab_is_available(void)
+{
+	return __slab_is_available;
+}
+
+static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
+{
+#ifdef CONFIG_SMP
+	VM_BUG_ON(!s->cpu_slab[cpu]);
+	return s->cpu_slab[cpu];
+#else
+	return &s->cpu_slab;
+#endif
+}
+
+static inline int check_valid_pointer(struct kmem_cache *s,
+				struct slqb_page *page, const void *object)
+{
+	void *base;
+
+	base = slqb_page_address(page);
+	if (object < base || object >= base + s->objects * s->size ||
+		(object - base) % s->size) {
+		return 0;
+	}
+
+	return 1;
+}
+
+static inline void *get_freepointer(struct kmem_cache *s, void *object)
+{
+	return *(void **)(object + s->offset);
+}
+
+static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
+{
+	*(void **)(object + s->offset) = fp;
+}
+
+/* Loop over all objects in a slab */
+#define for_each_object(__p, __s, __addr) \
+	for (__p = (__addr); __p < (__addr) + (__s)->objects * (__s)->size;\
+			__p += (__s)->size)
+
+/* Scan freelist */
+#define for_each_free_object(__p, __s, __free) \
+	for (__p = (__free); (__p) != NULL; __p = get_freepointer((__s),\
+		__p))
+
+#ifdef CONFIG_SLQB_DEBUG
+/*
+ * Debug settings:
+ */
+#ifdef CONFIG_SLQB_DEBUG_ON
+static int slqb_debug __read_mostly = DEBUG_DEFAULT_FLAGS;
+#else
+static int slqb_debug __read_mostly;
+#endif
+
+static char *slqb_debug_slabs;
+
+/*
+ * Object debugging
+ */
+static void print_section(char *text, u8 *addr, unsigned int length)
+{
+	int i, offset;
+	int newline = 1;
+	char ascii[17];
+
+	ascii[16] = 0;
+
+	for (i = 0; i < length; i++) {
+		if (newline) {
+			printk(KERN_ERR "%8s 0x%p: ", text, addr + i);
+			newline = 0;
+		}
+		printk(KERN_CONT " %02x", addr[i]);
+		offset = i % 16;
+		ascii[offset] = isgraph(addr[i]) ? addr[i] : '.';
+		if (offset == 15) {
+			printk(KERN_CONT " %s\n", ascii);
+			newline = 1;
+		}
+	}
+	if (!newline) {
+		i %= 16;
+		while (i < 16) {
+			printk(KERN_CONT "   ");
+			ascii[i] = ' ';
+			i++;
+		}
+		printk(KERN_CONT " %s\n", ascii);
+	}
+}
+
+static struct track *get_track(struct kmem_cache *s, void *object,
+	enum track_item alloc)
+{
+	struct track *p;
+
+	if (s->offset)
+		p = object + s->offset + sizeof(void *);
+	else
+		p = object + s->inuse;
+
+	return p + alloc;
+}
+
+static void set_track(struct kmem_cache *s, void *object,
+				enum track_item alloc, void *addr)
+{
+	struct track *p;
+
+	if (s->offset)
+		p = object + s->offset + sizeof(void *);
+	else
+		p = object + s->inuse;
+
+	p += alloc;
+	if (addr) {
+		p->addr = addr;
+		p->cpu = raw_smp_processor_id();
+		p->pid = current ? current->pid : -1;
+		p->when = jiffies;
+	} else
+		memset(p, 0, sizeof(struct track));
+}
+
+static void init_tracking(struct kmem_cache *s, void *object)
+{
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
+	set_track(s, object, TRACK_FREE, NULL);
+	set_track(s, object, TRACK_ALLOC, NULL);
+}
+
+static void print_track(const char *s, struct track *t)
+{
+	if (!t->addr)
+		return;
+
+	printk(KERN_ERR "INFO: %s in ", s);
+	__print_symbol("%s", (unsigned long)t->addr);
+	printk(" age=%lu cpu=%u pid=%d\n", jiffies - t->when, t->cpu, t->pid);
+}
+
+static void print_tracking(struct kmem_cache *s, void *object)
+{
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
+	print_track("Allocated", get_track(s, object, TRACK_ALLOC));
+	print_track("Freed", get_track(s, object, TRACK_FREE));
+}
+
+static void print_page_info(struct slqb_page *page)
+{
+	printk(KERN_ERR "INFO: Slab 0x%p used=%u fp=0x%p flags=0x%04lx\n",
+		page, page->inuse, page->freelist, page->flags);
+
+}
+
+static void slab_bug(struct kmem_cache *s, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printk(KERN_ERR "========================================"
+			"=====================================\n");
+	printk(KERN_ERR "BUG %s: %s\n", s->name, buf);
+	printk(KERN_ERR "----------------------------------------"
+			"-------------------------------------\n\n");
+}
+
+static void slab_fix(struct kmem_cache *s, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printk(KERN_ERR "FIX %s: %s\n", s->name, buf);
+}
+
+static void print_trailer(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+	unsigned int off;	/* Offset of last byte */
+	u8 *addr = slqb_page_address(page);
+
+	print_tracking(s, p);
+
+	print_page_info(page);
+
+	printk(KERN_ERR "INFO: Object 0x%p @offset=%tu fp=0x%p\n\n",
+			p, p - addr, get_freepointer(s, p));
+
+	if (p > addr + 16)
+		print_section("Bytes b4", p - 16, 16);
+
+	print_section("Object", p, min(s->objsize, 128));
+
+	if (s->flags & SLAB_RED_ZONE)
+		print_section("Redzone", p + s->objsize,
+			s->inuse - s->objsize);
+
+	if (s->offset)
+		off = s->offset + sizeof(void *);
+	else
+		off = s->inuse;
+
+	if (s->flags & SLAB_STORE_USER)
+		off += 2 * sizeof(struct track);
+
+	if (off != s->size)
+		/* Beginning of the filler is the free pointer */
+		print_section("Padding", p + off, s->size - off);
+
+	dump_stack();
+}
+
+static void object_err(struct kmem_cache *s, struct slqb_page *page,
+			u8 *object, char *reason)
+{
+	slab_bug(s, reason);
+	print_trailer(s, page, object);
+}
+
+static void slab_err(struct kmem_cache *s, struct slqb_page *page, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	slab_bug(s, "%s", buf);
+	print_page_info(page);
+	dump_stack();
+}
+
+static void init_object(struct kmem_cache *s, void *object, int active)
+{
+	u8 *p = object;
+
+	if (s->flags & __OBJECT_POISON) {
+		memset(p, POISON_FREE, s->objsize - 1);
+		p[s->objsize - 1] = POISON_END;
+	}
+
+	if (s->flags & SLAB_RED_ZONE)
+		memset(p + s->objsize,
+			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
+			s->inuse - s->objsize);
+}
+
+static u8 *check_bytes(u8 *start, unsigned int value, unsigned int bytes)
+{
+	while (bytes) {
+		if (*start != (u8)value)
+			return start;
+		start++;
+		bytes--;
+	}
+	return NULL;
+}
+
+static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
+						void *from, void *to)
+{
+	slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
+	memset(from, data, to - from);
+}
+
+static int check_bytes_and_report(struct kmem_cache *s, struct slqb_page *page,
+			u8 *object, char *what,
+			u8 *start, unsigned int value, unsigned int bytes)
+{
+	u8 *fault;
+	u8 *end;
+
+	fault = check_bytes(start, value, bytes);
+	if (!fault)
+		return 1;
+
+	end = start + bytes;
+	while (end > fault && end[-1] == value)
+		end--;
+
+	slab_bug(s, "%s overwritten", what);
+	printk(KERN_ERR "INFO: 0x%p-0x%p. First byte 0x%x instead of 0x%x\n",
+					fault, end - 1, fault[0], value);
+	print_trailer(s, page, object);
+
+	restore_bytes(s, what, value, fault, end);
+	return 0;
+}
+
+/*
+ * Object layout:
+ *
+ * object address
+ * 	Bytes of the object to be managed.
+ * 	If the freepointer may overlay the object then the free
+ * 	pointer is the first word of the object.
+ *
+ * 	Poisoning uses 0x6b (POISON_FREE) and the last byte is
+ * 	0xa5 (POISON_END)
+ *
+ * object + s->objsize
+ * 	Padding to reach word boundary. This is also used for Redzoning.
+ * 	Padding is extended by another word if Redzoning is enabled and
+ * 	objsize == inuse.
+ *
+ * 	We fill with 0xbb (RED_INACTIVE) for inactive objects and with
+ * 	0xcc (RED_ACTIVE) for objects in use.
+ *
+ * object + s->inuse
+ * 	Meta data starts here.
+ *
+ * 	A. Free pointer (if we cannot overwrite object on free)
+ * 	B. Tracking data for SLAB_STORE_USER
+ * 	C. Padding to reach required alignment boundary or at minimum
+ * 		one word if debugging is on to be able to detect writes
+ * 		before the word boundary.
+ *
+ *	Padding is done using 0x5a (POISON_INUSE)
+ *
+ * object + s->size
+ * 	Nothing is used beyond s->size.
+ */
+
+static int check_pad_bytes(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+	unsigned long off = s->inuse;	/* The end of info */
+
+	if (s->offset)
+		/* Freepointer is placed after the object. */
+		off += sizeof(void *);
+
+	if (s->flags & SLAB_STORE_USER)
+		/* We also have user information there */
+		off += 2 * sizeof(struct track);
+
+	if (s->size == off)
+		return 1;
+
+	return check_bytes_and_report(s, page, p, "Object padding",
+				p + off, POISON_INUSE, s->size - off);
+}
+
+static int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+{
+	u8 *start;
+	u8 *fault;
+	u8 *end;
+	int length;
+	int remainder;
+
+	if (!(s->flags & SLAB_POISON))
+		return 1;
+
+	start = slqb_page_address(page);
+	end = start + (PAGE_SIZE << s->order);
+	length = s->objects * s->size;
+	remainder = end - (start + length);
+	if (!remainder)
+		return 1;
+
+	fault = check_bytes(start + length, POISON_INUSE, remainder);
+	if (!fault)
+		return 1;
+	while (end > fault && end[-1] == POISON_INUSE)
+		end--;
+
+	slab_err(s, page, "Padding overwritten. 0x%p-0x%p", fault, end - 1);
+	print_section("Padding", start, length);
+
+	restore_bytes(s, "slab padding", POISON_INUSE, start, end);
+	return 0;
+}
+
+static int check_object(struct kmem_cache *s, struct slqb_page *page,
+					void *object, int active)
+{
+	u8 *p = object;
+	u8 *endobject = object + s->objsize;
+
+	if (s->flags & SLAB_RED_ZONE) {
+		unsigned int red =
+			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE;
+
+		if (!check_bytes_and_report(s, page, object, "Redzone",
+			endobject, red, s->inuse - s->objsize))
+			return 0;
+	} else {
+		if ((s->flags & SLAB_POISON) && s->objsize < s->inuse) {
+			check_bytes_and_report(s, page, p, "Alignment padding",
+				endobject, POISON_INUSE, s->inuse - s->objsize);
+		}
+	}
+
+	if (s->flags & SLAB_POISON) {
+		if (!active && (s->flags & __OBJECT_POISON) &&
+			(!check_bytes_and_report(s, page, p, "Poison", p,
+					POISON_FREE, s->objsize - 1) ||
+			 !check_bytes_and_report(s, page, p, "Poison",
+				p + s->objsize - 1, POISON_END, 1)))
+			return 0;
+		/*
+		 * check_pad_bytes cleans up on its own.
+		 */
+		check_pad_bytes(s, page, p);
+	}
+
+	return 1;
+}
+
+static int check_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	if (!(page->flags & PG_SLQB_BIT)) {
+		slab_err(s, page, "Not a valid slab page");
+		return 0;
+	}
+	if (page->inuse == 0) {
+		slab_err(s, page, "inuse before free / after alloc", s->name);
+		return 0;
+	}
+	if (page->inuse > s->objects) {
+		slab_err(s, page, "inuse %u > max %u",
+			s->name, page->inuse, s->objects);
+		return 0;
+	}
+	/* Slab_pad_check fixes things up after itself */
+	slab_pad_check(s, page);
+	return 1;
+}
+
+static void trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
+{
+	if (s->flags & SLAB_TRACE) {
+		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+			s->name,
+			alloc ? "alloc" : "free",
+			object, page->inuse,
+			page->freelist);
+
+		if (!alloc)
+			print_section("Object", (void *)object, s->objsize);
+
+		dump_stack();
+	}
+}
+
+static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
+								void *object)
+{
+	if (!slab_debug(s))
+		return;
+
+	if (!(s->flags & (SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON)))
+		return;
+
+	init_object(s, object, 0);
+	init_tracking(s, object);
+}
+
+static int alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
+{
+	struct slqb_page *page;
+	page = virt_to_head_slqb_page(object);
+
+	if (!check_slab(s, page))
+		goto bad;
+
+	if (!check_valid_pointer(s, page, object)) {
+		object_err(s, page, object, "Freelist Pointer check fails");
+		goto bad;
+	}
+
+	if (object && !check_object(s, page, object, 0))
+		goto bad;
+
+	/* Success perform special debug activities for allocs */
+	if (s->flags & SLAB_STORE_USER)
+		set_track(s, object, TRACK_ALLOC, addr);
+	trace(s, page, object, 1);
+	init_object(s, object, 1);
+	return 1;
+
+bad:
+	return 0;
+}
+
+static int free_debug_processing(struct kmem_cache *s, void *object, void *addr)
+{
+	struct slqb_page *page;
+	page = virt_to_head_slqb_page(object);
+
+	if (!check_slab(s, page))
+		goto fail;
+
+	if (!check_valid_pointer(s, page, object)) {
+		slab_err(s, page, "Invalid object pointer 0x%p", object);
+		goto fail;
+	}
+
+	if (!check_object(s, page, object, 1))
+		return 0;
+
+	/* Special debug activities for freeing objects */
+	if (s->flags & SLAB_STORE_USER)
+		set_track(s, object, TRACK_FREE, addr);
+	trace(s, page, object, 0);
+	init_object(s, object, 0);
+	return 1;
+
+fail:
+	slab_fix(s, "Object at 0x%p not freed", object);
+	return 0;
+}
+
+static int __init setup_slqb_debug(char *str)
+{
+	slqb_debug = DEBUG_DEFAULT_FLAGS;
+	if (*str++ != '=' || !*str)
+		/*
+		 * No options specified. Switch on full debugging.
+		 */
+		goto out;
+
+	if (*str == ',')
+		/*
+		 * No options but restriction on slabs. This means full
+		 * debugging for slabs matching a pattern.
+		 */
+		goto check_slabs;
+
+	slqb_debug = 0;
+	if (*str == '-')
+		/*
+		 * Switch off all debugging measures.
+		 */
+		goto out;
+
+	/*
+	 * Determine which debug features should be switched on
+	 */
+	for (; *str && *str != ','; str++) {
+		switch (tolower(*str)) {
+		case 'f':
+			slqb_debug |= SLAB_DEBUG_FREE;
+			break;
+		case 'z':
+			slqb_debug |= SLAB_RED_ZONE;
+			break;
+		case 'p':
+			slqb_debug |= SLAB_POISON;
+			break;
+		case 'u':
+			slqb_debug |= SLAB_STORE_USER;
+			break;
+		case 't':
+			slqb_debug |= SLAB_TRACE;
+			break;
+		default:
+			printk(KERN_ERR "slqb_debug option '%c' "
+				"unknown. skipped\n", *str);
+		}
+	}
+
+check_slabs:
+	if (*str == ',')
+		slqb_debug_slabs = str + 1;
+out:
+	return 1;
+}
+
+__setup("slqb_debug", setup_slqb_debug);
+
+static unsigned long kmem_cache_flags(unsigned long objsize,
+	unsigned long flags, const char *name,
+	void (*ctor)(void *))
+{
+	/*
+	 * Enable debugging if selected on the kernel commandline.
+	 */
+	if (slqb_debug && (!slqb_debug_slabs ||
+	    strncmp(slqb_debug_slabs, name,
+		strlen(slqb_debug_slabs)) == 0))
+			flags |= slqb_debug;
+
+	return flags;
+}
+#else
+static inline void setup_object_debug(struct kmem_cache *s,
+			struct slqb_page *page, void *object) {}
+
+static inline int alloc_debug_processing(struct kmem_cache *s,
+	void *object, void *addr) { return 0; }
+
+static inline int free_debug_processing(struct kmem_cache *s,
+	void *object, void *addr) { return 0; }
+
+static inline int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+			{ return 1; }
+static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
+			void *object, int active) { return 1; }
+static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page) {}
+static inline unsigned long kmem_cache_flags(unsigned long objsize,
+	unsigned long flags, const char *name, void (*ctor)(void *))
+{
+	return flags;
+}
+#define slqb_debug 0
+#endif
+
+/*
+ * allocate a new slab (return its corresponding struct slqb_page)
+ */
+static struct slqb_page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+{
+	struct slqb_page *page;
+	int pages = 1 << s->order;
+
+	flags |= s->allocflags;
+
+	page = alloc_slqb_pages_node(node, flags, s->order);
+	if (!page)
+		return NULL;
+
+	mod_zone_page_state(slqb_page_zone(page),
+		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+		pages);
+
+	return page;
+}
+
+/*
+ * Called once for each object on a new slab page
+ */
+static void setup_object(struct kmem_cache *s, struct slqb_page *page,
+				void *object)
+{
+	setup_object_debug(s, page, object);
+	if (unlikely(s->ctor))
+		s->ctor(object);
+}
+
+/*
+ * Allocate a new slab, set up its object list.
+ */
+static struct slqb_page *new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
+{
+	struct slqb_page *page;
+	void *start;
+	void *last;
+	void *p;
+
+	BUG_ON(flags & GFP_SLAB_BUG_MASK);
+
+	page = allocate_slab(s,
+		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
+	if (!page)
+		goto out;
+
+	page->flags |= PG_SLQB_BIT;
+
+	start = page_address(&page->page);
+
+	if (unlikely(slab_poison(s)))
+		memset(start, POISON_INUSE, PAGE_SIZE << s->order);
+
+	start += colour;
+
+	last = start;
+	for_each_object(p, s, start) {
+		setup_object(s, page, p);
+		set_freepointer(s, last, p);
+		last = p;
+	}
+	set_freepointer(s, last, NULL);
+
+	page->freelist = start;
+	page->inuse = 0;
+out:
+	return page;
+}
+
+/*
+ * Free a slab page back to the page allocator
+ */
+static void __free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	int pages = 1 << s->order;
+
+	if (unlikely(slab_debug(s))) {
+		void *p;
+
+		slab_pad_check(s, page);
+		for_each_free_object(p, s, page->freelist)
+			check_object(s, page, p, 0);
+	}
+
+	mod_zone_page_state(slqb_page_zone(page),
+		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+		-pages);
+
+	__free_slqb_pages(page, s->order);
+}
+
+static void rcu_free_slab(struct rcu_head *h)
+{
+	struct slqb_page *page;
+
+	page = container_of((struct list_head *)h, struct slqb_page, lru);
+	__free_slab(page->list->cache, page);
+}
+
+static void free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	VM_BUG_ON(page->inuse);
+	if (unlikely(s->flags & SLAB_DESTROY_BY_RCU))
+		call_rcu(&page->rcu_head, rcu_free_slab);
+	else
+		__free_slab(s, page);
+}
+
+/*
+ * Return an object to its slab.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static int free_object_to_page(struct kmem_cache *s, struct kmem_cache_list *l, struct slqb_page *page, void *object)
+{
+	VM_BUG_ON(page->list != l);
+
+	set_freepointer(s, object, page->freelist);
+	page->freelist = object;
+	page->inuse--;
+
+	if (!page->inuse) {
+		if (likely(s->objects > 1)) {
+			l->nr_partial--;
+			list_del(&page->lru);
+		}
+		l->nr_slabs--;
+		free_slab(s, page);
+		slqb_stat_inc(l, FLUSH_SLAB_FREE);
+		return 1;
+	} else if (page->inuse + 1 == s->objects) {
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, FLUSH_SLAB_PARTIAL);
+		return 0;
+	}
+	return 0;
+}
+
+#ifdef CONFIG_SMP
+static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c);
+#endif
+
+/*
+ * Flush the LIFO list of objects on a list. They are sent back to their pages
+ * in case the pages also belong to the list, or to our CPU's remote-free list
+ * in the case they do not.
+ *
+ * Doesn't flush the entire list. flush_free_list_all does.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void flush_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	struct kmem_cache_cpu *c;
+	void **head;
+	int nr;
+
+	nr = l->freelist.nr;
+	if (unlikely(!nr))
+		return;
+
+	nr = min(slab_freebatch(s), nr);
+
+	slqb_stat_inc(l, FLUSH_FREE_LIST);
+	slqb_stat_add(l, FLUSH_FREE_LIST_OBJECTS, nr);
+
+	c = get_cpu_slab(s, smp_processor_id());
+
+	l->freelist.nr -= nr;
+	head = l->freelist.head;
+
+	do {
+		struct slqb_page *page;
+		void **object;
+
+		object = head;
+		VM_BUG_ON(!object);
+		head = get_freepointer(s, object);
+		page = virt_to_head_slqb_page(object);
+
+#ifdef CONFIG_SMP
+		if (page->list != l) {
+			slab_free_to_remote(s, page, object, c);
+			slqb_stat_inc(l, FLUSH_FREE_LIST_REMOTE);
+		} else
+#endif
+			free_object_to_page(s, l, page, object);
+
+		nr--;
+	} while (nr);
+
+	l->freelist.head = head;
+	if (!l->freelist.nr)
+		l->freelist.tail = NULL;
+}
+
+static void flush_free_list_all(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	while (l->freelist.nr)
+		flush_free_list(s, l);
+}
+
+#ifdef CONFIG_SMP
+/*
+ * If enough objects have been remotely freed back to this list,
+ * remote_free_check will be set. In which case, we'll eventually come here
+ * to take those objects off our remote_free list and onto our LIFO freelist.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	void **head, **tail;
+	int nr;
+
+	VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
+
+	if (!l->remote_free.list.nr)
+		return;
+
+	l->remote_free_check = 0;
+	head = l->remote_free.list.head;
+	/* Get the head hot for the likely subsequent allocation or flush */
+	prefetchw(head);
+
+	spin_lock(&l->remote_free.lock);
+	l->remote_free.list.head = NULL;
+	tail = l->remote_free.list.tail;
+	l->remote_free.list.tail = NULL;
+	nr = l->remote_free.list.nr;
+	l->remote_free.list.nr = 0;
+	spin_unlock(&l->remote_free.lock);
+
+	if (!l->freelist.nr)
+		l->freelist.head = head;
+	else
+		set_freepointer(s, l->freelist.tail, head);
+	l->freelist.tail = tail;
+
+	l->freelist.nr += nr;
+
+	slqb_stat_inc(l, CLAIM_REMOTE_LIST);
+	slqb_stat_add(l, CLAIM_REMOTE_LIST_OBJECTS, nr);
+}
+#endif
+
+/*
+ * Allocation fastpath. Get an object from the list's LIFO freelist, or
+ * return NULL if it is empty.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static __always_inline void *__cache_list_get_object(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	void *object;
+
+	object = l->freelist.head;
+	if (likely(object)) {
+		void *next = get_freepointer(s, object);
+		VM_BUG_ON(!l->freelist.nr);
+		l->freelist.nr--;
+		l->freelist.head = next;
+//		if (next)
+//			prefetchw(next);
+		return object;
+	}
+	VM_BUG_ON(l->freelist.nr);
+
+#ifdef CONFIG_SMP
+	if (unlikely(l->remote_free_check)) {
+		claim_remote_free_list(s, l);
+
+		if (l->freelist.nr > slab_hiwater(s))
+			flush_free_list(s, l);
+
+		/* repetition here helps gcc :( */
+		object = l->freelist.head;
+		if (likely(object)) {
+			void *next = get_freepointer(s, object);
+			VM_BUG_ON(!l->freelist.nr);
+			l->freelist.nr--;
+			l->freelist.head = next;
+//			if (next)
+//				prefetchw(next);
+			return object;
+		}
+		VM_BUG_ON(l->freelist.nr);
+	}
+#endif
+
+	return NULL;
+}
+
+/*
+ * Slow(er) path. Get a page from this list's existing pages. Will be a
+ * new empty page in the case that __slab_alloc_page has just been called
+ * (empty pages otherwise never get queued up on the lists), or a partial page
+ * already on the list.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static noinline void *__cache_list_get_page(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	struct slqb_page *page;
+	void *object;
+
+	if (unlikely(!l->nr_partial))
+		return NULL;
+
+	page = list_first_entry(&l->partial, struct slqb_page, lru);
+	VM_BUG_ON(page->inuse == s->objects);
+	if (page->inuse + 1 == s->objects) {
+		l->nr_partial--;
+		list_del(&page->lru);
+/*XXX		list_move(&page->lru, &l->full); */
+	}
+
+	VM_BUG_ON(!page->freelist);
+
+	page->inuse++;
+
+//	VM_BUG_ON(node != -1 && node != slqb_page_to_nid(page));
+
+	object = page->freelist;
+	page->freelist = get_freepointer(s, object);
+	if (page->freelist)
+		prefetchw(page->freelist);
+	VM_BUG_ON((page->inuse == s->objects) != (page->freelist == NULL));
+	slqb_stat_inc(l, ALLOC_SLAB_FILL);
+
+	return object;
+}
+
+/*
+ * Allocation slowpath. Allocate a new slab page from the page allocator, and
+ * put it on the list's partial list. Must be followed by an allocation so
+ * that we don't have dangling empty pages on the partial list.
+ *
+ * Returns 0 on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void *__slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+	struct slqb_page *page;
+	struct kmem_cache_list *l;
+	struct kmem_cache_cpu *c;
+	unsigned int colour;
+	void *object;
+
+	c = get_cpu_slab(s, smp_processor_id());
+	colour = c->colour_next;
+	c->colour_next += s->colour_off;
+	if (c->colour_next >= s->colour_range)
+		c->colour_next = 0;
+
+	/* XXX: load any partial? */
+
+	/* Caller handles __GFP_ZERO */
+	gfpflags &= ~__GFP_ZERO;
+
+	if (gfpflags & __GFP_WAIT)
+		local_irq_enable();
+	page = new_slab_page(s, gfpflags, node, colour);
+	if (gfpflags & __GFP_WAIT)
+		local_irq_disable();
+	if (unlikely(!page))
+		return page;
+
+	if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == numa_node_id())) {
+		struct kmem_cache_cpu *c;
+		int cpu = smp_processor_id();
+
+		c = get_cpu_slab(s, cpu);
+		l = &c->list;
+		page->list = l;
+
+		l->nr_slabs++;
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, ALLOC);
+		slqb_stat_inc(l, ALLOC_SLAB_NEW);
+		object = __cache_list_get_page(s, l);
+#ifdef CONFIG_NUMA
+	} else {
+		struct kmem_cache_node *n;
+
+		n = s->node[slqb_page_to_nid(page)];
+		l = &n->list;
+		page->list = l;
+
+		spin_lock(&n->list_lock);
+		l->nr_slabs++;
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, ALLOC);
+		slqb_stat_inc(l, ALLOC_SLAB_NEW);
+		object = __cache_list_get_page(s, l);
+		spin_unlock(&n->list_lock);
+#endif
+	}
+	VM_BUG_ON(!object);
+	return object;
+}
+
+#ifdef CONFIG_NUMA
+static noinline int alternate_nid(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+	if (in_interrupt() || (gfpflags & __GFP_THISNODE))
+		return node;
+	if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD))
+		return cpuset_mem_spread_node();
+	else if (current->mempolicy)
+		return slab_node(current->mempolicy);
+	return node;
+}
+
+/*
+ * Allocate an object from a remote node. Return NULL if none could be found
+ * (in which case, caller should allocate a new slab)
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void *__remote_slab_alloc(struct kmem_cache *s,
+		gfp_t gfpflags, int node)
+{
+	struct kmem_cache_node *n;
+	struct kmem_cache_list *l;
+	void *object;
+
+	n = s->node[node];
+	if (unlikely(!n)) /* node has no memory */
+		return NULL;
+	l = &n->list;
+
+//	if (unlikely(!(l->freelist.nr | l->nr_partial | l->remote_free_check)))
+//		return NULL;
+
+	spin_lock(&n->list_lock);
+
+	object = __cache_list_get_object(s, l);
+	if (unlikely(!object)) {
+		object = __cache_list_get_page(s, l);
+		if (unlikely(!object)) {
+			spin_unlock(&n->list_lock);
+			return __slab_alloc_page(s, gfpflags, node);
+		}
+	}
+	if (likely(object))
+		slqb_stat_inc(l, ALLOC);
+	spin_unlock(&n->list_lock);
+	return object;
+}
+#endif
+
+/*
+ * Main allocation path. Return an object, or NULL on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void *__slab_alloc(struct kmem_cache *s,
+		gfp_t gfpflags, int node)
+{
+	void *object;
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_list *l;
+
+#ifdef CONFIG_NUMA
+	if (unlikely(node != -1) && unlikely(node != numa_node_id()))
+		return __remote_slab_alloc(s, gfpflags, node);
+#endif
+
+	c = get_cpu_slab(s, smp_processor_id());
+	VM_BUG_ON(!c);
+	l = &c->list;
+	object = __cache_list_get_object(s, l);
+	if (unlikely(!object)) {
+		object = __cache_list_get_page(s, l);
+		if (unlikely(!object))
+			return __slab_alloc_page(s, gfpflags, node);
+	}
+	if (likely(object))
+		slqb_stat_inc(l, ALLOC);
+	return object;
+}
+
+/*
+ * Perform some interrupts-on processing around the main allocation path
+ * (debug checking and memset()ing).
+ */
+static __always_inline void *slab_alloc(struct kmem_cache *s,
+		gfp_t gfpflags, int node, void *addr)
+{
+	void *object;
+	unsigned long flags;
+
+again:
+	local_irq_save(flags);
+	object = __slab_alloc(s, gfpflags, node);
+	local_irq_restore(flags);
+
+	if (unlikely(slab_debug(s)) && likely(object)) {
+		if (unlikely(!alloc_debug_processing(s, object, addr)))
+			goto again;
+	}
+
+	if (unlikely(gfpflags & __GFP_ZERO) && likely(object))
+		memset(object, 0, s->objsize);
+
+	return object;
+}
+
+static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags, void *caller)
+{
+	int node = -1;
+#ifdef CONFIG_NUMA
+	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+		node = alternate_nid(s, gfpflags, node);
+#endif
+	return slab_alloc(s, gfpflags, node, caller);
+}
+
+void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+{
+	return __kmem_cache_alloc(s, gfpflags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc);
+
+#ifdef CONFIG_NUMA
+void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+	return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node);
+#endif
+
+#ifdef CONFIG_SMP
+/*
+ * Flush this CPU's remote free list of objects back to the list from where
+ * they originate. They end up on that list's remotely freed list, and
+ * eventually we set its remote_free_check if there are enough objects on it.
+ *
+ * This seems convoluted, but it keeps us from stomping on the target CPU's
+ * fastpath cachelines.
+ *
+ * Must be called with interrupts disabled.
+ */
+static void flush_remote_free_cache(struct kmem_cache *s, struct kmem_cache_cpu *c)
+{
+	struct kmlist *src;
+	struct kmem_cache_list *dst;
+	unsigned int nr;
+	int set;
+
+	src = &c->rlist;
+	nr = src->nr;
+	if (unlikely(!nr))
+		return;
+
+#ifdef CONFIG_SLQB_STATS
+	{
+		struct kmem_cache_list *l = &c->list;
+		slqb_stat_inc(l, FLUSH_RFREE_LIST);
+		slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);
+	}
+#endif
+
+	dst = c->remote_cache_list;
+
+	spin_lock(&dst->remote_free.lock);
+	if (!dst->remote_free.list.head)
+		dst->remote_free.list.head = src->head;
+	else
+		set_freepointer(s, dst->remote_free.list.tail, src->head);
+	dst->remote_free.list.tail = src->tail;
+
+	src->head = NULL;
+	src->tail = NULL;
+	src->nr = 0;
+
+	if (dst->remote_free.list.nr < slab_freebatch(s))
+		set = 1;
+	else
+		set = 0;
+
+	dst->remote_free.list.nr += nr;
+
+	if (unlikely(dst->remote_free.list.nr >= slab_freebatch(s) && set))
+		dst->remote_free_check = 1;
+
+	spin_unlock(&dst->remote_free.lock);
+}
+
+/*
+ * Free an object to this CPU's remote free list.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c)
+{
+	struct kmlist *r;
+
+	/*
+	 * Our remote free list corresponds to a different list. Must
+	 * flush it and switch.
+	 */
+	if (page->list != c->remote_cache_list) {
+		flush_remote_free_cache(s, c);
+		c->remote_cache_list = page->list;
+	}
+
+	r = &c->rlist;
+	if (!r->head)
+		r->head = object;
+	else
+		set_freepointer(s, r->tail, object);
+	set_freepointer(s, object, NULL);
+	r->tail = object;
+	r->nr++;
+
+	if (unlikely(r->nr > slab_freebatch(s)))
+		flush_remote_free_cache(s, c);
+}
+#endif
+ 
+/*
+ * Main freeing path.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void __slab_free(struct kmem_cache *s,
+		struct slqb_page *page, void *object)
+{
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_list *l;
+	int thiscpu = smp_processor_id();
+
+	c = get_cpu_slab(s, thiscpu);
+	l = &c->list;
+
+	slqb_stat_inc(l, FREE);
+
+	if (!NUMA_BUILD || !numa_platform ||
+			likely(slqb_page_to_nid(page) == numa_node_id())) {
+		/*
+		 * Freeing fastpath. Collects all local-node objects, not
+		 * just those allocated from our per-CPU list. This allows
+		 * fast transfer of objects from one CPU to another within
+		 * a given node.
+		 */
+		set_freepointer(s, object, l->freelist.head);
+		l->freelist.head = object;
+		if (!l->freelist.nr)
+			l->freelist.tail = object;
+		l->freelist.nr++;
+
+		if (unlikely(l->freelist.nr > slab_hiwater(s)))
+			flush_free_list(s, l);
+
+#ifdef CONFIG_NUMA
+	} else {
+		/*
+		 * Freeing an object that was allocated on a remote node.
+		 */
+		slab_free_to_remote(s, page, object, c);
+		slqb_stat_inc(l, FREE_REMOTE);
+#endif
+	}
+}
+
+/*
+ * Perform some interrupts-on processing around the main freeing path
+ * (debug checking).
+ */
+static __always_inline void slab_free(struct kmem_cache *s,
+		struct slqb_page *page, void *object)
+{
+	unsigned long flags;
+
+	prefetchw(object);
+
+	debug_check_no_locks_freed(object, s->objsize);
+	if (likely(object) && unlikely(slab_debug(s))) {
+		if (unlikely(!free_debug_processing(s, object, __builtin_return_address(0))))
+			return;
+	}
+
+	local_irq_save(flags);
+	__slab_free(s, page, object);
+	local_irq_restore(flags);
+}
+
+void kmem_cache_free(struct kmem_cache *s, void *object)
+{
+	struct slqb_page *page = NULL;
+	if (numa_platform)
+		page = virt_to_head_slqb_page(object);
+	slab_free(s, page, object);
+}
+EXPORT_SYMBOL(kmem_cache_free);
+
+/*
+ * Calculate the order of allocation given a slab object size.
+ *
+ * Order 0 allocations are preferred since order 0 does not cause fragmentation
+ * in the page allocator, and they have fastpaths in the page allocator. But
+ * also minimise external fragmentation with large objects.
+ */
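+/*
+ * Worked example (illustrative, assuming 4K pages and frac = 4): for a
+ * 1500-byte object, order 0 fits 2 objects and wastes 1096 bytes
+ * (1096 * 4 > 4096), so the order is bumped; order 1 fits 5 objects and
+ * wastes 692 bytes (692 * 4 <= 8192), which is acceptable, so order 1 wins.
+ */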
+static inline int slab_order(int size, int max_order, int frac)
+{
+	int order;
+
+	if (fls(size - 1) <= PAGE_SHIFT)
+		order = 0;
+	else
+		order = fls(size - 1) - PAGE_SHIFT;
+	while (order <= max_order) {
+		unsigned long slab_size = PAGE_SIZE << order;
+		unsigned long objects;
+		unsigned long waste;
+
+		objects = slab_size / size;
+		if (!objects) {
+			order++;
+			continue;
+		}
+
+		waste = slab_size - (objects * size);
+
+		if (waste * frac <= slab_size)
+			break;
+
+		order++;
+	}
+
+	return order;
+}
+
+static inline int calculate_order(int size)
+{
+	int order;
+
+	/*
+	 * Attempt to find best configuration for a slab. This
+	 * works by first attempting to generate a layout with
+	 * the best configuration and backing off gradually.
+	 */
+	order = slab_order(size, 1, 4);
+	if (order <= 1)
+		return order;
+
+	/*
+	 * This size cannot fit in order-1. Allow bigger orders, but
+	 * forget about trying to save space.
+	 */
+	order = slab_order(size, MAX_ORDER, 0);
+	if (order <= MAX_ORDER)
+		return order;
+
+	return -ENOSYS;
+}
+
+/*
+ * Figure out what the alignment of the objects will be.
+ */
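+/*
+ * For example (illustrative, with 64-byte cache lines): a 16-byte object
+ * created with SLAB_HWCACHE_ALIGN ends up 16-byte aligned; the hardware
+ * alignment is halved while the object still fits in half of it, since full
+ * cache-line alignment would mostly waste space for such small objects.
+ */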
+static unsigned long calculate_alignment(unsigned long flags,
+		unsigned long align, unsigned long size)
+{
+	/*
+	 * If the user wants hardware cache aligned objects then follow that
+	 * suggestion if the object is sufficiently large.
+	 *
+	 * The hardware cache alignment cannot override the specified
+	 * alignment, though: if the specified alignment is greater, use it.
+	 */
+	if (flags & SLAB_HWCACHE_ALIGN) {
+		unsigned long ralign = cache_line_size();
+		while (size <= ralign / 2)
+			ralign /= 2;
+		align = max(align, ralign);
+	}
+
+	if (align < ARCH_SLAB_MINALIGN)
+		align = ARCH_SLAB_MINALIGN;
+
+	return ALIGN(align, sizeof(void *));
+}
+
+static void init_kmem_cache_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	l->cache = s;
+	l->freelist.nr = 0;
+	l->freelist.head = NULL;
+	l->freelist.tail = NULL;
+	l->nr_partial = 0;
+	l->nr_slabs = 0;
+	INIT_LIST_HEAD(&l->partial);
+//	INIT_LIST_HEAD(&l->full);
+
+#ifdef CONFIG_SMP
+	l->remote_free_check = 0;
+	spin_lock_init(&l->remote_free.lock);
+	l->remote_free.list.nr = 0;
+	l->remote_free.list.head = NULL;
+	l->remote_free.list.tail = NULL;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+	memset(l->stats, 0, sizeof(l->stats));
+#endif
+}
+
+static void init_kmem_cache_cpu(struct kmem_cache *s,
+			struct kmem_cache_cpu *c)
+{
+	init_kmem_cache_list(s, &c->list);
+
+	c->colour_next = 0;
+#ifdef CONFIG_SMP
+	c->rlist.nr = 0;
+	c->rlist.head = NULL;
+	c->rlist.tail = NULL;
+	c->remote_cache_list = NULL;
+#endif
+}
+
+#ifdef CONFIG_NUMA
+static void init_kmem_cache_node(struct kmem_cache *s, struct kmem_cache_node *n)
+{
+	spin_lock_init(&n->list_lock);
+	init_kmem_cache_list(s, &n->list);
+}
+#endif
+
+/* Initial slabs */
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
+#endif
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache kmem_cpu_cache;
+static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
+#endif
+#endif
+
+#ifdef CONFIG_NUMA
+static struct kmem_cache kmem_node_cache;
+static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
+static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
+{
+	struct kmem_cache_cpu *c;
+
+	c = kmem_cache_alloc_node(&kmem_cpu_cache, GFP_KERNEL, cpu_to_node(cpu));
+	if (!c)
+		return NULL;
+
+	init_kmem_cache_cpu(s, c);
+	return c;
+}
+
+static void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+
+		c = s->cpu_slab[cpu];
+		if (c) {
+			kmem_cache_free(&kmem_cpu_cache, c);
+			s->cpu_slab[cpu] = NULL;
+		}
+	}
+}
+
+static int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+
+		c = s->cpu_slab[cpu];
+		if (c)
+			continue;
+
+		c = alloc_kmem_cache_cpu(s, cpu);
+		if (!c) {
+			free_kmem_cache_cpus(s);
+			return 0;
+		}
+		s->cpu_slab[cpu] = c;
+	}
+	return 1;
+}
+
+#else
+static inline void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+	init_kmem_cache_cpu(s, &s->cpu_slab);
+	return 1;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = s->node[node];
+		if (n) {
+			kmem_cache_free(&kmem_node_cache, n);
+			s->node[node] = NULL;
+		}
+	}
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = kmem_cache_alloc_node(&kmem_node_cache, GFP_KERNEL, node);
+		if (!n) {
+			free_kmem_cache_nodes(s);
+			return 0;
+		}
+		init_kmem_cache_node(s, n);
+		s->node[node] = n;
+	}
+	return 1;
+}
+#else
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+	return 1;
+}
+#endif
+
+/*
+ * calculate_sizes() determines the order and the distribution of data within
+ * a slab object.
+ */
+static int calculate_sizes(struct kmem_cache *s)
+{
+	unsigned long flags = s->flags;
+	unsigned long size = s->objsize;
+	unsigned long align = s->align;
+
+	/*
+	 * Determine if we can poison the object itself. If the user of
+	 * the slab may touch the object after free or before allocation
+	 * then we should never poison the object itself.
+	 */
+	if (slab_poison(s) && !(flags & SLAB_DESTROY_BY_RCU) && !s->ctor)
+		s->flags |= __OBJECT_POISON;
+	else
+		s->flags &= ~__OBJECT_POISON;
+
+	/*
+	 * Round up object size to the next word boundary. We can only
+	 * place the free pointer at word boundaries and this determines
+	 * the possible location of the free pointer.
+	 */
+	size = ALIGN(size, sizeof(void *));
+
+#ifdef CONFIG_SLQB_DEBUG
+	/*
+	 * If we are Redzoning then check if there is some space between the
+	 * end of the object and the free pointer. If not then add an
+	 * additional word to have some bytes to store Redzone information.
+	 */
+	if ((flags & SLAB_RED_ZONE) && size == s->objsize)
+		size += sizeof(void *);
+#endif
+
+	/*
+	 * With that we have determined the number of bytes in actual use
+	 * by the object. This is the potential offset to the free pointer.
+	 */
+	s->inuse = size;
+
+	if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) || s->ctor)) {
+		/*
+		 * Relocate free pointer after the object if it is not
+		 * permitted to overwrite the first word of the object on
+		 * kmem_cache_free.
+		 *
+		 * This is the case if we use RCU, have a constructor, or
+		 * are poisoning the objects.
+		 */
+		s->offset = size;
+		size += sizeof(void *);
+	}
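+	/*
+	 * Illustrative example: a 56-byte object with a constructor cannot
+	 * have the free pointer overlay its first word, so the pointer is
+	 * placed at offset 56 and the per-object footprint grows to 64 bytes.
+	 * Without a constructor (and no RCU or poisoning), offset stays 0 and
+	 * the free pointer reuses the first word of the freed object.
+	 */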
+
+#ifdef CONFIG_SLQB_DEBUG
+	if (flags & SLAB_STORE_USER)
+		/*
+		 * Need to store information about allocs and frees after
+		 * the object.
+		 */
+		size += 2 * sizeof(struct track);
+
+	if (flags & SLAB_RED_ZONE)
+		/*
+		 * Add some empty padding so that we can catch
+		 * overwrites from earlier objects rather than let
+		 * tracking information or the free pointer be
+		 * corrupted if a user writes before the start
+		 * of the object.
+		 */
+		size += sizeof(void *);
+#endif
+
+	/*
+	 * Determine the alignment based on various parameters that the
+	 * user specified and the dynamic determination of cache line size
+	 * on bootup.
+	 */
+	align = calculate_alignment(flags, align, s->objsize);
+
+	/*
+	 * SLQB stores one object immediately after another beginning from
+	 * offset 0. In order to align the objects we have to simply size
+	 * each object to conform to the alignment.
+	 */
+	size = ALIGN(size, align);
+	s->size = size;
+	s->order = calculate_order(size);
+
+	if (s->order < 0)
+		return 0;
+
+	s->allocflags = 0;
+	if (s->order)
+		s->allocflags |= __GFP_COMP;
+
+	if (s->flags & SLAB_CACHE_DMA)
+		s->allocflags |= SLQB_DMA;
+
+	if (s->flags & SLAB_RECLAIM_ACCOUNT)
+		s->allocflags |= __GFP_RECLAIMABLE;
+
+	/*
+	 * Determine the number of objects per slab
+	 */
+	s->objects = (PAGE_SIZE << s->order) / size;
+
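+	/*
+	 * Pick freelist batching and watermark sizes from the object size.
+	 * For example (assuming 4K pages): 256-byte objects get a freebatch
+	 * of max(64, min(256, 1024)) = 256 and a hiwater mark of 1024 objects.
+	 */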
+	s->freebatch = max(4UL*PAGE_SIZE / size, min(256UL, 64*PAGE_SIZE / size));
+	if (!s->freebatch)
+		s->freebatch = 1;
+	s->hiwater = s->freebatch << 2;
+
+	return !!s->objects;
+}
+
+static int kmem_cache_open(struct kmem_cache *s,
+		const char *name, size_t size,
+		size_t align, unsigned long flags,
+		void (*ctor)(void *), int alloc)
+{
+	unsigned int left_over;
+
+	memset(s, 0, kmem_size);
+	s->name = name;
+	s->ctor = ctor;
+	s->objsize = size;
+	s->align = align;
+	s->flags = kmem_cache_flags(size, flags, name, ctor);
+
+	if (!calculate_sizes(s))
+		goto error;
+
+	if (!slab_debug(s)) {
+		left_over = (PAGE_SIZE << s->order) - (s->objects * s->size);
+		s->colour_off = max(cache_line_size(), s->align);
+		s->colour_range = left_over;
+	} else {
+		s->colour_off = 0;
+		s->colour_range = 0;
+	}
+
+	if (likely(alloc)) {
+		if (!alloc_kmem_cache_nodes(s))
+			goto error;
+
+		if (!alloc_kmem_cache_cpus(s))
+			goto error_nodes;
+	}
+
+	down_write(&slqb_lock);
+	sysfs_slab_add(s);
+	list_add(&s->list, &slab_caches);
+	up_write(&slqb_lock);
+
+	return 1;
+
+error_nodes:
+	free_kmem_cache_nodes(s);
+error:
+	if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+	return 0;
+}
+
+/*
+ * Check if a given pointer is valid
+ */
+int kmem_ptr_validate(struct kmem_cache *s, const void *object)
+{
+	struct slqb_page *page = virt_to_head_slqb_page(object);
+
+	if (!(page->flags & PG_SLQB_BIT))
+		return 0;
+
+	/*
+	 * We could also check if the object is on the slab's freelist.
+	 * But this would be too expensive, and it seems that the main
+	 * purpose of kmem_ptr_validate() is to check if the object belongs
+	 * to a certain slab.
+	 */
+	return 1;
+}
+EXPORT_SYMBOL(kmem_ptr_validate);
+
+/*
+ * Determine the size of a slab object
+ */
+unsigned int kmem_cache_size(struct kmem_cache *s)
+{
+	return s->objsize;
+}
+EXPORT_SYMBOL(kmem_cache_size);
+
+const char *kmem_cache_name(struct kmem_cache *s)
+{
+	return s->name;
+}
+EXPORT_SYMBOL(kmem_cache_name);
+
+/*
+ * Release all resources used by a slab cache. The caller must ensure there
+ * is no more concurrency on the cache, so we can safely touch remote
+ * kmem_cache_cpu structures.
+ */
+void kmem_cache_destroy(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+	int cpu;
+
+	down_write(&slqb_lock);
+	list_del(&s->list);
+	up_write(&slqb_lock);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		flush_free_list_all(s, l);
+		flush_remote_free_cache(s, c);
+	}
+#endif
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+		claim_remote_free_list(s, l);
+#endif
+		flush_free_list_all(s, l);
+
+		WARN_ON(l->freelist.nr);
+		WARN_ON(l->nr_slabs);
+		WARN_ON(l->nr_partial);
+	}
+
+	free_kmem_cache_cpus(s);
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		claim_remote_free_list(s, l);
+		flush_free_list_all(s, l);
+
+		WARN_ON(l->freelist.nr);
+		WARN_ON(l->nr_slabs);
+		WARN_ON(l->nr_partial);
+	}
+
+	free_kmem_cache_nodes(s);
+#endif
+
+	sysfs_slab_remove(s);
+}
+EXPORT_SYMBOL(kmem_cache_destroy);
+
+/********************************************************************
+ *		Kmalloc subsystem
+ *******************************************************************/
+
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches);
+
+#ifdef CONFIG_ZONE_DMA
+struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches_dma);
+#endif
+
+#ifndef ARCH_KMALLOC_FLAGS
+#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
+#endif
+
+static struct kmem_cache *open_kmalloc_cache(struct kmem_cache *s,
+		const char *name, int size, gfp_t gfp_flags)
+{
+	unsigned int flags = ARCH_KMALLOC_FLAGS | SLAB_PANIC;
+
+	if (gfp_flags & SLQB_DMA)
+		flags |= SLAB_CACHE_DMA;
+
+	kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN, flags, NULL, 1);
+
+	return s;
+}
+
+/*
+ * Conversion table for small slab sizes / 8 to the index in the
+ * kmalloc array. This is necessary for slabs < 192 since we have
+ * non-power-of-two cache sizes there. The size of larger slabs can be
+ * determined using fls.
+ */
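+/*
+ * For example, kmalloc(100) maps to size_index[(100 - 1) / 8] == size_index[12],
+ * which selects index 7, i.e. the 128-byte kmalloc cache.
+ */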
+static s8 size_index[24] __cacheline_aligned = {
+	3,	/* 8 */
+	4,	/* 16 */
+	5,	/* 24 */
+	5,	/* 32 */
+	6,	/* 40 */
+	6,	/* 48 */
+	6,	/* 56 */
+	6,	/* 64 */
+#if L1_CACHE_BYTES < 64
+	1,	/* 72 */
+	1,	/* 80 */
+	1,	/* 88 */
+	1,	/* 96 */
+#else
+	7,
+	7,
+	7,
+	7,
+#endif
+	7,	/* 104 */
+	7,	/* 112 */
+	7,	/* 120 */
+	7,	/* 128 */
+#if L1_CACHE_BYTES < 128
+	2,	/* 136 */
+	2,	/* 144 */
+	2,	/* 152 */
+	2,	/* 160 */
+	2,	/* 168 */
+	2,	/* 176 */
+	2,	/* 184 */
+	2	/* 192 */
+#else
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1
+#endif
+};
+
+static struct kmem_cache *get_slab(size_t size, gfp_t flags)
+{
+	int index;
+
+#if L1_CACHE_BYTES >= 128
+	if (size <= 128) {
+#else
+	if (size <= 192) {
+#endif
+		if (unlikely(!size))
+			return ZERO_SIZE_PTR;
+
+		index = size_index[(size - 1) / 8];
+	} else
+		index = fls(size - 1);
+
+	if (unlikely((flags & SLQB_DMA)))
+		return &kmalloc_caches_dma[index];
+	else
+		return &kmalloc_caches[index];
+}
+
+void *__kmalloc(size_t size, gfp_t flags)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return __kmem_cache_alloc(s, flags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(__kmalloc);
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return kmem_cache_alloc_node(s, flags, node);
+}
+EXPORT_SYMBOL(__kmalloc_node);
+#endif
+
+size_t ksize(const void *object)
+{
+	struct slqb_page *page;
+	struct kmem_cache *s;
+
+	BUG_ON(!object);
+	if (unlikely(object == ZERO_SIZE_PTR))
+		return 0;
+
+	page = virt_to_head_slqb_page(object);
+	BUG_ON(!(page->flags & PG_SLQB_BIT));
+
+	s = page->list->cache;
+
+	/*
+	 * Debugging requires use of the padding between object
+	 * and whatever may come after it.
+	 */
+	if (s->flags & (SLAB_RED_ZONE | SLAB_POISON))
+		return s->objsize;
+
+	/*
+	 * If we have the need to store the freelist pointer
+	 * back there or track user information then we can
+	 * only use the space before that information.
+	 */
+	if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
+		return s->inuse;
+
+	/*
+	 * Else we can use all the padding etc for the allocation
+	 */
+	return s->size;
+}
+EXPORT_SYMBOL(ksize);
+
+void kfree(const void *object)
+{
+	struct kmem_cache *s;
+	struct slqb_page *page;
+
+	if (unlikely(ZERO_OR_NULL_PTR(object)))
+		return;
+
+	page = virt_to_head_slqb_page(object);
+	s = page->list->cache;
+
+	slab_free(s, page, (void *)object);
+}
+EXPORT_SYMBOL(kfree);
+
+static void kmem_cache_trim_percpu(void *arg)
+{
+	int cpu = smp_processor_id();
+	struct kmem_cache *s = arg;
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+	claim_remote_free_list(s, l);
+#endif
+	flush_free_list(s, l);
+#ifdef CONFIG_SMP
+	flush_remote_free_cache(s, c);
+#endif
+}
+
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+
+	on_each_cpu(kmem_cache_trim_percpu, s, 1);
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		spin_lock_irq(&n->list_lock);
+		claim_remote_free_list(s, l);
+		flush_free_list(s, l);
+		spin_unlock_irq(&n->list_lock);
+	}
+#endif
+
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_shrink);
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void kmem_cache_reap_percpu(void *arg)
+{
+	int cpu = smp_processor_id();
+	struct kmem_cache *s;
+	long phase = (long)arg;
+
+	list_for_each_entry(s, &slab_caches, list) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		if (phase == 0) {
+			flush_free_list_all(s, l);
+			flush_remote_free_cache(s, c);
+		}
+
+		if (phase == 1) {
+			claim_remote_free_list(s, l);
+			flush_free_list_all(s, l);
+		}
+	}
+}
+
+static void kmem_cache_reap(void)
+{
+	struct kmem_cache *s;
+	int node;
+
+	down_read(&slqb_lock);
+	on_each_cpu(kmem_cache_reap_percpu, (void *)0, 1);
+	on_each_cpu(kmem_cache_reap_percpu, (void *)1, 1);
+
+	list_for_each_entry(s, &slab_caches, list) {
+		for_each_node_state(node, N_NORMAL_MEMORY) {
+			struct kmem_cache_node *n = s->node[node];
+			struct kmem_cache_list *l = &n->list;
+
+			spin_lock_irq(&n->list_lock);
+			claim_remote_free_list(s, l);
+			flush_free_list_all(s, l);
+			spin_unlock_irq(&n->list_lock);
+		}
+	}
+	up_read(&slqb_lock);
+}
+#endif
+
+static void cache_trim_worker(struct work_struct *w)
+{
+	struct delayed_work *work =
+		container_of(w, struct delayed_work, work);
+	struct kmem_cache *s;
+	int node;
+
+	if (!down_read_trylock(&slqb_lock))
+		goto out;
+
+	node = numa_node_id();
+	list_for_each_entry(s, &slab_caches, list) {
+#ifdef CONFIG_NUMA
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		spin_lock_irq(&n->list_lock);
+		claim_remote_free_list(s, l);
+		flush_free_list(s, l);
+		spin_unlock_irq(&n->list_lock);
+#endif
+
+		local_irq_disable();
+		kmem_cache_trim_percpu(s);
+		local_irq_enable();
+	}
+
+	up_read(&slqb_lock);
+out:
+	schedule_delayed_work(work, round_jiffies_relative(3*HZ));
+}
+
+static DEFINE_PER_CPU(struct delayed_work, cache_trim_work);
+
+static void __cpuinit start_cpu_timer(int cpu)
+{
+	struct delayed_work *cache_trim_work = &per_cpu(cache_trim_work, cpu);
+
+	/*
+	 * When this gets called from do_initcalls via cpucache_init(),
+	 * init_workqueues() has already run, so keventd will be set up
+	 * at that time.
+	 */
+	if (keventd_up() && cache_trim_work->work.func == NULL) {
+		INIT_DELAYED_WORK(cache_trim_work, cache_trim_worker);
+		schedule_delayed_work_on(cpu, cache_trim_work,
+					__round_jiffies_relative(HZ, cpu));
+	}
+}
+
+static int __init cpucache_init(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		start_cpu_timer(cpu);
+	return 0;
+}
+__initcall(cpucache_init);
+
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void slab_mem_going_offline_callback(void *arg)
+{
+	kmem_cache_reap();
+}
+
+static void slab_mem_offline_callback(void *arg)
+{
+	struct kmem_cache *s;
+	struct memory_notify *marg = arg;
+	int nid = marg->status_change_nid;
+
+	/*
+	 * If the node still has available memory, we still need its
+	 * kmem_cache_node structures, so there is nothing to do here.
+	 */
+	if (nid < 0)
+		return;
+
+#if 0 // XXX: see cpu offline comment
+	down_read(&slqb_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		struct kmem_cache_node *n;
+		n = s->node[nid];
+		if (n) {
+			s->node[nid] = NULL;
+			kmem_cache_free(&kmem_node_cache, n);
+		}
+	}
+	up_read(&slqb_lock);
+#endif
+}
+
+static int slab_mem_going_online_callback(void *arg)
+{
+	struct kmem_cache *s;
+	struct kmem_cache_node *n;
+	struct memory_notify *marg = arg;
+	int nid = marg->status_change_nid;
+	int ret = 0;
+
+	/*
+	 * If the node's memory is already available, then kmem_cache_node is
+	 * already created. Nothing to do.
+	 */
+	if (nid < 0)
+		return 0;
+
+	/*
+	 * We are bringing a node online. No memory is available yet. We must
+	 * allocate a kmem_cache_node structure in order to bring the node
+	 * online.
+	 */
+	down_read(&slqb_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		/*
+		 * XXX: kmem_cache_alloc_node will fall back to other nodes
+		 *      since memory is not yet available from the node that
+		 *      is brought up.
+		 */
+		if (s->node[nid]) /* could be leftover from last online */
+			continue;
+		n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
+		if (!n) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		init_kmem_cache_node(s, n);
+		s->node[nid] = n;
+	}
+out:
+	up_read(&slqb_lock);
+	return ret;
+}
+
+static int slab_memory_callback(struct notifier_block *self,
+				unsigned long action, void *arg)
+{
+	int ret = 0;
+
+	switch (action) {
+	case MEM_GOING_ONLINE:
+		ret = slab_mem_going_online_callback(arg);
+		break;
+	case MEM_GOING_OFFLINE:
+		slab_mem_going_offline_callback(arg);
+		break;
+	case MEM_OFFLINE:
+	case MEM_CANCEL_ONLINE:
+		slab_mem_offline_callback(arg);
+		break;
+	case MEM_ONLINE:
+	case MEM_CANCEL_OFFLINE:
+		break;
+	}
+
+	ret = notifier_from_errno(ret);
+	return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+/********************************************************************
+ *			Basic setup of slabs
+ *******************************************************************/
+
+void __init kmem_cache_init(void)
+{
+	int i;
+	unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
+
+#ifdef CONFIG_NUMA
+	if (num_possible_nodes() == 1)
+		numa_platform = 0;
+	else
+		numa_platform = 1;
+#endif
+
+#ifdef CONFIG_SMP
+	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
+				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#else
+	kmem_size = sizeof(struct kmem_cache);
+#endif
+
+	kmem_cache_open(&kmem_cache_cache, "kmem_cache", kmem_size, 0, flags, NULL, 0);
+#ifdef CONFIG_SMP
+	kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu", sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_cache_open(&kmem_node_cache, "kmem_cache_node", sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
+#endif
+
+#ifdef CONFIG_SMP
+	for_each_possible_cpu(i) {
+		init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cpus[i]);
+		kmem_cache_cache.cpu_slab[i] = &kmem_cache_cpus[i];
+
+		init_kmem_cache_cpu(&kmem_cpu_cache, &kmem_cpu_cpus[i]);
+		kmem_cpu_cache.cpu_slab[i] = &kmem_cpu_cpus[i];
+
+#ifdef CONFIG_NUMA
+		init_kmem_cache_cpu(&kmem_node_cache, &kmem_node_cpus[i]);
+		kmem_node_cache.cpu_slab[i] = &kmem_node_cpus[i];
+#endif
+	}
+#else
+	init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cache.cpu_slab);
+#endif
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(i, N_NORMAL_MEMORY) {
+		init_kmem_cache_node(&kmem_cache_cache, &kmem_cache_nodes[i]);
+		kmem_cache_cache.node[i] = &kmem_cache_nodes[i];
+
+		init_kmem_cache_node(&kmem_cpu_cache, &kmem_cpu_nodes[i]);
+		kmem_cpu_cache.node[i] = &kmem_cpu_nodes[i];
+
+		init_kmem_cache_node(&kmem_node_cache, &kmem_node_nodes[i]);
+		kmem_node_cache.node[i] = &kmem_node_nodes[i];
+	}
+#endif
+
+	/* Caches that are not of power-of-two size */
+	if (L1_CACHE_BYTES < 64 && KMALLOC_MIN_SIZE <= 64) {
+		open_kmalloc_cache(&kmalloc_caches[1],
+				"kmalloc-96", 96, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[1],
+				"kmalloc_dma-96", 96, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+	if (L1_CACHE_BYTES < 128 && KMALLOC_MIN_SIZE <= 128) {
+		open_kmalloc_cache(&kmalloc_caches[2],
+				"kmalloc-192", 192, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[2],
+				"kmalloc_dma-192", 192, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+		open_kmalloc_cache(&kmalloc_caches[i],
+			"kmalloc", 1 << i, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[i],
+				"kmalloc_dma", 1 << i, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+
+
+	/*
+	 * Patch up the size_index table if we have strange large alignment
+	 * requirements for the kmalloc array. This is only the case for
+	 * MIPS, it seems. The standard arches will not generate any code here.
+	 *
+	 * Largest permitted alignment is 256 bytes due to the way we
+	 * handle the index determination for the smaller caches.
+	 *
+	 * Make sure that nothing crazy happens if someone starts tinkering
+	 * around with ARCH_KMALLOC_MINALIGN
+	 */
+	BUILD_BUG_ON(KMALLOC_MIN_SIZE > 256 ||
+		(KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
+
+	for (i = 8; i < KMALLOC_MIN_SIZE; i += 8)
+		size_index[(i - 1) / 8] = KMALLOC_SHIFT_LOW;
+
+	/* Provide the correct kmalloc names now that the caches are up */
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+		kmalloc_caches[i].name =
+			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
+#ifdef CONFIG_ZONE_DMA
+		kmalloc_caches_dma[i].name =
+			kasprintf(GFP_KERNEL, "kmalloc_dma-%d", 1 << i);
+#endif
+	}
+
+#ifdef CONFIG_SMP
+	register_cpu_notifier(&slab_notifier);
+#endif
+#ifdef CONFIG_NUMA
+	hotplug_memory_notifier(slab_memory_callback, 1);
+#endif
+	/*
+	 * smp_init() has not yet been called, so no worries about memory
+	 * ordering here (eg. slab_is_available vs numa_platform)
+	 */
+	__slab_is_available = 1;
+}
+
+/*
+ * Some basic slab creation sanity checks
+ */
+static int kmem_cache_create_ok(const char *name, size_t size,
+		size_t align, unsigned long flags)
+{
+	struct kmem_cache *tmp;
+
+	/*
+	 * Sanity checks... these are all serious usage bugs.
+	 */
+	if (!name || in_interrupt() || (size < sizeof(void *))) {
+		printk(KERN_ERR "kmem_cache_create(): early error in slab %s\n",
+				name);
+		dump_stack();
+		return 0;
+	}
+
+	down_read(&slqb_lock);
+	list_for_each_entry(tmp, &slab_caches, list) {
+		char x;
+		int res;
+
+		/*
+		 * This happens when the module gets unloaded and doesn't
+		 * destroy its slab cache and no-one else reuses the vmalloc
+		 * area of the module.  Print a warning.
+		 */
+		res = probe_kernel_address(tmp->name, x);
+		if (res) {
+			printk(KERN_ERR
+			       "SLAB: cache with size %d has lost its name\n",
+			       tmp->size);
+			continue;
+		}
+
+		if (!strcmp(tmp->name, name)) {
+			printk(KERN_ERR
+			       "kmem_cache_create(): duplicate cache %s\n", name);
+			dump_stack();
+			up_read(&slqb_lock);
+			return 0;
+		}
+	}
+	up_read(&slqb_lock);
+
+	WARN_ON(strchr(name, ' '));	/* It confuses parsers */
+	if (flags & SLAB_DESTROY_BY_RCU)
+		WARN_ON(flags & SLAB_POISON);
+
+	return 1;
+}
+
+struct kmem_cache *kmem_cache_create(const char *name, size_t size,
+		size_t align, unsigned long flags, void (*ctor)(void *))
+{
+	struct kmem_cache *s;
+
+	if (!kmem_cache_create_ok(name, size, align, flags))
+		goto err;
+
+	s = kmem_cache_alloc(&kmem_cache_cache, GFP_KERNEL);
+	if (!s)
+		goto err;
+
+	if (kmem_cache_open(s, name, size, align, flags, ctor, 1))
+		return s;
+
+	kmem_cache_free(&kmem_cache_cache, s);
+
+err:
+	if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+	return NULL;
+}
+EXPORT_SYMBOL(kmem_cache_create);
+
+#ifdef CONFIG_SMP
+/*
+ * Use the cpu notifier to ensure that the cpu slabs are flushed when
+ * necessary.
+ */
+static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
+		unsigned long action, void *hcpu)
+{
+	long cpu = (long)hcpu;
+	struct kmem_cache *s;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		down_read(&slqb_lock);
+		list_for_each_entry(s, &slab_caches, list) {
+			if (s->cpu_slab[cpu]) /* leftover from last online */
+				continue;
+			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu);
+			if (!s->cpu_slab[cpu]) {
+				up_read(&slqb_lock);
+				return NOTIFY_BAD;
+			}
+		}
+		up_read(&slqb_lock);
+		break;
+
+	case CPU_ONLINE:
+	case CPU_ONLINE_FROZEN:
+	case CPU_DOWN_FAILED:
+	case CPU_DOWN_FAILED_FROZEN:
+		start_cpu_timer(cpu);
+		break;
+
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		cancel_rearming_delayed_work(&per_cpu(cache_trim_work, cpu));
+		per_cpu(cache_trim_work, cpu).work.func = NULL;
+		break;
+
+	case CPU_UP_CANCELED:
+	case CPU_UP_CANCELED_FROZEN:
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+#if 0
+		down_read(&slqb_lock);
+		/* XXX: this doesn't work because objects can still be on this
+		 * CPU's list. periodic timer needs to check if a CPU is offline
+		 * and then try to cleanup from there. Same for node offline.
+		 */
+		list_for_each_entry(s, &slab_caches, list) {
+			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+			if (c) {
+				kmem_cache_free(&kmem_cpu_cache, c);
+				s->cpu_slab[cpu] = NULL;
+			}
+		}
+
+		up_read(&slqb_lock);
+#endif
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata slab_notifier = {
+	.notifier_call = slab_cpuup_callback
+};
+
+#endif
+
+#ifdef CONFIG_SLQB_DEBUG
+void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
+{
+	struct kmem_cache *s;
+	int node = -1;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+#ifdef CONFIG_NUMA
+	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+		node = alternate_nid(s, flags, node);
+#endif
+	return slab_alloc(s, flags, node, (void *)caller);
+}
+
+void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
+				unsigned long caller)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return slab_alloc(s, flags, node, (void *)caller);
+}
+#endif
+
+#if defined(CONFIG_SLQB_SYSFS) || defined(CONFIG_SLABINFO)
+struct stats_gather {
+	struct kmem_cache *s;
+	spinlock_t lock;
+	unsigned long nr_slabs;
+	unsigned long nr_partial;
+	unsigned long nr_inuse;
+	unsigned long nr_objects;
+
+#ifdef CONFIG_SLQB_STATS
+	unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+};
+
+static void __gather_stats(void *arg)
+{
+	unsigned long nr_slabs;
+	unsigned long nr_partial;
+	unsigned long nr_inuse;
+	struct stats_gather *gather = arg;
+	int cpu = smp_processor_id();
+	struct kmem_cache *s = gather->s;
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_list *l = &c->list;
+	struct slqb_page *page;
+#ifdef CONFIG_SLQB_STATS
+	int i;
+#endif
+
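+	/*
+	 * Treat slabs that are not on the partial list as full (s->objects
+	 * in use each), then walk the partial list to add the per-page
+	 * inuse counts.
+	 */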
+	nr_slabs = l->nr_slabs;
+	nr_partial = l->nr_partial;
+	nr_inuse = (nr_slabs - nr_partial) * s->objects;
+
+	list_for_each_entry(page, &l->partial, lru) {
+		nr_inuse += page->inuse;
+	}
+
+	spin_lock(&gather->lock);
+	gather->nr_slabs += nr_slabs;
+	gather->nr_partial += nr_partial;
+	gather->nr_inuse += nr_inuse;
+#ifdef CONFIG_SLQB_STATS
+	for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+		gather->stats[i] += l->stats[i];
+	}
+#endif
+	spin_unlock(&gather->lock);
+}
+
+static void gather_stats(struct kmem_cache *s, struct stats_gather *stats)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+
+	memset(stats, 0, sizeof(struct stats_gather));
+	stats->s = s;
+	spin_lock_init(&stats->lock);
+
+	on_each_cpu(__gather_stats, stats, 1);
+
+#ifdef CONFIG_NUMA
+	for_each_online_node(node) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+		struct slqb_page *page;
+		unsigned long flags;
+#ifdef CONFIG_SLQB_STATS
+		int i;
+#endif
+
+		spin_lock_irqsave(&n->list_lock, flags);
+#ifdef CONFIG_SLQB_STATS
+		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+			stats->stats[i] += l->stats[i];
+		}
+#endif
+		stats->nr_slabs += l->nr_slabs;
+		stats->nr_partial += l->nr_partial;
+		stats->nr_inuse += (l->nr_slabs - l->nr_partial) * s->objects;
+
+		list_for_each_entry(page, &l->partial, lru) {
+			stats->nr_inuse += page->inuse;
+		}
+		spin_unlock_irqrestore(&n->list_lock, flags);
+	}
+#endif
+
+	stats->nr_objects = stats->nr_slabs * s->objects;
+}
+#endif
+
+/*
+ * The /proc/slabinfo ABI
+ */
+#ifdef CONFIG_SLABINFO
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+ssize_t slabinfo_write(struct file *file, const char __user * buffer,
+		       size_t count, loff_t *ppos)
+{
+	return -EINVAL;
+}
+
+static void print_slabinfo_header(struct seq_file *m)
+{
+	seq_puts(m, "slabinfo - version: 2.1\n");
+	seq_puts(m, "# name	    <active_objs> <num_objs> <objsize> "
+		 "<objperslab> <pagesperslab>");
+	seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
+	seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
+	seq_putc(m, '\n');
+}
+
+static void *s_start(struct seq_file *m, loff_t *pos)
+{
+	loff_t n = *pos;
+
+	down_read(&slqb_lock);
+	if (!n)
+		print_slabinfo_header(m);
+
+	return seq_list_start(&slab_caches, *pos);
+}
+
+static void *s_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	return seq_list_next(p, &slab_caches, pos);
+}
+
+static void s_stop(struct seq_file *m, void *p)
+{
+	up_read(&slqb_lock);
+}
+
+static int s_show(struct seq_file *m, void *p)
+{
+	struct stats_gather stats;
+	struct kmem_cache *s;
+
+	s = list_entry(p, struct kmem_cache, list);
+
+	gather_stats(s, &stats);
+
+	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, stats.nr_inuse,
+		   stats.nr_objects, s->size, s->objects, (1 << s->order));
+	seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s), slab_freebatch(s), 0);
+	seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs, stats.nr_slabs,
+		   0UL);
+	seq_putc(m, '\n');
+	return 0;
+}
+
+static const struct seq_operations slabinfo_op = {
+	.start = s_start,
+	.next = s_next,
+	.stop = s_stop,
+	.show = s_show,
+};
+
+static int slabinfo_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &slabinfo_op);
+}
+
+static const struct file_operations proc_slabinfo_operations = {
+	.open		= slabinfo_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static int __init slab_proc_init(void)
+{
+	proc_create("slabinfo",S_IWUSR|S_IRUGO,NULL,&proc_slabinfo_operations);
+	return 0;
+}
+module_init(slab_proc_init);
+#endif /* CONFIG_SLABINFO */
+
+#ifdef CONFIG_SLQB_SYSFS
+/*
+ * sysfs API
+ */
+#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
+#define to_slab(n) container_of(n, struct kmem_cache, kobj)
+
+struct slab_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct kmem_cache *s, char *buf);
+	ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
+};
+
+#define SLAB_ATTR_RO(_name) \
+	static struct slab_attribute _name##_attr = __ATTR_RO(_name)
+
+#define SLAB_ATTR(_name) \
+	static struct slab_attribute _name##_attr =  \
+	__ATTR(_name, 0644, _name##_show, _name##_store)
+
+static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->size);
+}
+SLAB_ATTR_RO(slab_size);
+
+static ssize_t align_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->align);
+}
+SLAB_ATTR_RO(align);
+
+static ssize_t object_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->objsize);
+}
+SLAB_ATTR_RO(object_size);
+
+static ssize_t objs_per_slab_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->objects);
+}
+SLAB_ATTR_RO(objs_per_slab);
+
+static ssize_t order_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->order);
+}
+SLAB_ATTR_RO(order);
+
+static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+{
+	if (s->ctor) {
+		int n = sprint_symbol(buf, (unsigned long)s->ctor);
+
+		return n + sprintf(buf + n, "\n");
+	}
+	return 0;
+}
+SLAB_ATTR_RO(ctor);
+
+static ssize_t slabs_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+	gather_stats(s, &stats);
+	return sprintf(buf, "%lu\n", stats.nr_slabs);
+}
+SLAB_ATTR_RO(slabs);
+
+static ssize_t objects_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+	gather_stats(s, &stats);
+	return sprintf(buf, "%lu\n", stats.nr_inuse);
+}
+SLAB_ATTR_RO(objects);
+
+static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+	gather_stats(s, &stats);
+	return sprintf(buf, "%lu\n", stats.nr_objects);
+}
+SLAB_ATTR_RO(total_objects);
+
+static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
+}
+SLAB_ATTR_RO(reclaim_account);
+
+static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_HWCACHE_ALIGN));
+}
+SLAB_ATTR_RO(hwcache_align);
+
+#ifdef CONFIG_ZONE_DMA
+static ssize_t cache_dma_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA));
+}
+SLAB_ATTR_RO(cache_dma);
+#endif
+
+static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
+}
+SLAB_ATTR_RO(destroy_by_rcu);
+
+static ssize_t red_zone_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE));
+}
+SLAB_ATTR_RO(red_zone);
+
+static ssize_t poison_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_POISON));
+}
+SLAB_ATTR_RO(poison);
+
+static ssize_t store_user_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_STORE_USER));
+}
+SLAB_ATTR_RO(store_user);
+
+static ssize_t hiwater_store(struct kmem_cache *s, const char *buf, size_t length)
+{
+	long hiwater;
+	int err;
+
+	err = strict_strtol(buf, 10, &hiwater);
+	if (err)
+		return err;
+
+	if (hiwater < 0)
+		return -EINVAL;
+
+	s->hiwater = hiwater;
+
+	return length;
+}
+
+static ssize_t hiwater_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", slab_hiwater(s));
+}
+SLAB_ATTR(hiwater);
+
+static ssize_t freebatch_store(struct kmem_cache *s, const char *buf, size_t length)
+{
+	long freebatch;
+	int err;
+
+	err = strict_strtol(buf, 10, &freebatch);
+	if (err)
+		return err;
+
+	if (freebatch <= 0 || freebatch - 1 > s->hiwater)
+		return -EINVAL;
+
+	s->freebatch = freebatch;
+
+	return length;
+}
+
+static ssize_t freebatch_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", slab_freebatch(s));
+}
+SLAB_ATTR(freebatch);
+#ifdef CONFIG_SLQB_STATS
+static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
+{
+	struct stats_gather stats;
+	int len;
+#ifdef CONFIG_SMP
+	int cpu;
+#endif
+
+	gather_stats(s, &stats);
+
+	len = sprintf(buf, "%lu", stats.stats[si]);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+		if (len < PAGE_SIZE - 20)
+			len += sprintf(buf + len, " C%d=%lu", cpu, l->stats[si]);
+	}
+#endif
+	return len + sprintf(buf + len, "\n");
+}
+
+#define STAT_ATTR(si, text) 					\
+static ssize_t text##_show(struct kmem_cache *s, char *buf)	\
+{								\
+	return show_stat(s, buf, si);				\
+}								\
+SLAB_ATTR_RO(text)
+
+STAT_ATTR(ALLOC, alloc);
+STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
+STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
+STAT_ATTR(FREE, free);
+STAT_ATTR(FREE_REMOTE, free_remote);
+STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
+STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
+STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
+STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
+STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
+STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
+STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
+STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
+STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
+#endif
+
+static struct attribute *slab_attrs[] = {
+	&slab_size_attr.attr,
+	&object_size_attr.attr,
+	&objs_per_slab_attr.attr,
+	&order_attr.attr,
+	&objects_attr.attr,
+	&total_objects_attr.attr,
+	&slabs_attr.attr,
+	&ctor_attr.attr,
+	&align_attr.attr,
+	&hwcache_align_attr.attr,
+	&reclaim_account_attr.attr,
+	&destroy_by_rcu_attr.attr,
+	&red_zone_attr.attr,
+	&poison_attr.attr,
+	&store_user_attr.attr,
+	&hiwater_attr.attr,
+	&freebatch_attr.attr,
+#ifdef CONFIG_ZONE_DMA
+	&cache_dma_attr.attr,
+#endif
+#ifdef CONFIG_SLQB_STATS
+	&alloc_attr.attr,
+	&alloc_slab_fill_attr.attr,
+	&alloc_slab_new_attr.attr,
+	&free_attr.attr,
+	&free_remote_attr.attr,
+	&flush_free_list_attr.attr,
+	&flush_free_list_objects_attr.attr,
+	&flush_free_list_remote_attr.attr,
+	&flush_slab_partial_attr.attr,
+	&flush_slab_free_attr.attr,
+	&flush_rfree_list_attr.attr,
+	&flush_rfree_list_objects_attr.attr,
+	&claim_remote_list_attr.attr,
+	&claim_remote_list_objects_attr.attr,
+#endif
+	NULL
+};
+
+static struct attribute_group slab_attr_group = {
+	.attrs = slab_attrs,
+};
+
+static ssize_t slab_attr_show(struct kobject *kobj,
+				struct attribute *attr,
+				char *buf)
+{
+	struct slab_attribute *attribute;
+	struct kmem_cache *s;
+	int err;
+
+	attribute = to_slab_attr(attr);
+	s = to_slab(kobj);
+
+	if (!attribute->show)
+		return -EIO;
+
+	err = attribute->show(s, buf);
+
+	return err;
+}
+
+static ssize_t slab_attr_store(struct kobject *kobj,
+				struct attribute *attr,
+				const char *buf, size_t len)
+{
+	struct slab_attribute *attribute;
+	struct kmem_cache *s;
+	int err;
+
+	attribute = to_slab_attr(attr);
+	s = to_slab(kobj);
+
+	if (!attribute->store)
+		return -EIO;
+
+	err = attribute->store(s, buf, len);
+
+	return err;
+}
+
+static void kmem_cache_release(struct kobject *kobj)
+{
+	struct kmem_cache *s = to_slab(kobj);
+
+	kmem_cache_free(&kmem_cache_cache, s);
+}
+
+static struct sysfs_ops slab_sysfs_ops = {
+	.show = slab_attr_show,
+	.store = slab_attr_store,
+};
+
+static struct kobj_type slab_ktype = {
+	.sysfs_ops = &slab_sysfs_ops,
+	.release = kmem_cache_release
+};
+
+static int uevent_filter(struct kset *kset, struct kobject *kobj)
+{
+	struct kobj_type *ktype = get_ktype(kobj);
+
+	if (ktype == &slab_ktype)
+		return 1;
+	return 0;
+}
+
+static struct kset_uevent_ops slab_uevent_ops = {
+	.filter = uevent_filter,
+};
+
+static struct kset *slab_kset;
+
+static int sysfs_available __read_mostly = 0;
+
+static int sysfs_slab_add(struct kmem_cache *s)
+{
+	int err;
+
+	if (!sysfs_available)
+		return 0;
+
+	s->kobj.kset = slab_kset;
+	err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, s->name);
+	if (err) {
+		kobject_put(&s->kobj);
+		return err;
+	}
+
+	err = sysfs_create_group(&s->kobj, &slab_attr_group);
+	if (err)
+		return err;
+	kobject_uevent(&s->kobj, KOBJ_ADD);
+
+	return 0;
+}
+
+static void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kobject_uevent(&s->kobj, KOBJ_REMOVE);
+	kobject_del(&s->kobj);
+	kobject_put(&s->kobj);
+}
+
+static int __init slab_sysfs_init(void)
+{
+	struct kmem_cache *s;
+	int err;
+
+	slab_kset = kset_create_and_add("slab", &slab_uevent_ops, kernel_kobj);
+	if (!slab_kset) {
+		printk(KERN_ERR "Cannot register slab subsystem.\n");
+		return -ENOSYS;
+	}
+
+	down_write(&slqb_lock);
+	sysfs_available = 1;
+	list_for_each_entry(s, &slab_caches, list) {
+		err = sysfs_slab_add(s);
+		if (err)
+			printk(KERN_ERR "SLQB: Unable to add boot slab %s"
+						" to sysfs\n", s->name);
+	}
+	up_write(&slqb_lock);
+
+	return 0;
+}
+
+__initcall(slab_sysfs_init);
+#endif
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -150,6 +150,8 @@ size_t ksize(const void *);
  */
 #ifdef CONFIG_SLUB
 #include <linux/slub_def.h>
+#elif defined(CONFIG_SLQB)
+#include <linux/slqb_def.h>
 #elif defined(CONFIG_SLOB)
 #include <linux/slob_def.h>
 #else
@@ -252,7 +254,7 @@ static inline void *kmem_cache_alloc_nod
  * allocator where we care about the real place the memory allocation
  * request comes from.
  */
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
 extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
 #define kmalloc_track_caller(size, flags) \
 	__kmalloc_track_caller(size, flags, _RET_IP_)
@@ -270,7 +272,7 @@ extern void *__kmalloc_track_caller(size
  * standard allocator where we care about the real place the memory
  * allocation request comes from.
  */
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
 extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long);
 #define kmalloc_node_track_caller(size, flags, node) \
 	__kmalloc_node_track_caller(size, flags, node, \
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
+obj-$(CONFIG_SLQB) += slqb.o
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
Index: linux-2.6/include/linux/rcu_types.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rcu_types.h
@@ -0,0 +1,18 @@
+#ifndef __LINUX_RCU_TYPES_H
+#define __LINUX_RCU_TYPES_H
+
+#ifdef __KERNEL__
+
+/**
+ * struct rcu_head - callback structure for use with RCU
+ * @next: next update requests in a list
+ * @func: actual update function to call after the grace period.
+ */
+struct rcu_head {
+	struct rcu_head *next;
+	void (*func)(struct rcu_head *head);
+};
+
+#endif
+
+#endif
Index: linux-2.6/arch/x86/include/asm/page.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/page.h
+++ linux-2.6/arch/x86/include/asm/page.h
@@ -194,6 +194,7 @@ static inline pteval_t native_pte_flags(
  * virt_addr_valid(kaddr) returns true.
  */
 #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
+#define virt_to_page_fast(kaddr) pfn_to_page(((unsigned long)(kaddr) - PAGE_OFFSET) >> PAGE_SHIFT)
 #define pfn_to_kaddr(pfn)      __va((pfn) << PAGE_SHIFT)
 extern bool __virt_addr_valid(unsigned long kaddr);
 #define virt_addr_valid(kaddr)	__virt_addr_valid((unsigned long) (kaddr))
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -305,7 +305,11 @@ static inline void get_page(struct page
 
 static inline struct page *virt_to_head_page(const void *x)
 {
+#ifdef virt_to_page_fast
+	struct page *page = virt_to_page_fast(x);
+#else
 	struct page *page = virt_to_page(x);
+#endif
 	return compound_head(page);
 }
 
Index: linux-2.6/Documentation/vm/slqbinfo.c
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/vm/slqbinfo.c
@@ -0,0 +1,1054 @@
+/*
+ * Slabinfo: Tool to get reports about slabs
+ *
+ * (C) 2007 sgi, Christoph Lameter
+ *
+ * Reworked by Lin Ming <ming.m.lin@intel.com> for SLQB
+ *
+ * Compile by:
+ *
+ * gcc -o slqbinfo slqbinfo.c
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <strings.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <getopt.h>
+#include <regex.h>
+#include <errno.h>
+
+#define MAX_SLABS 500
+#define MAX_ALIASES 500
+#define MAX_NODES 1024
+
+struct slabinfo {
+	char *name;
+	int align, cache_dma, destroy_by_rcu;
+	int hwcache_align, object_size, objs_per_slab;
+	int slab_size, store_user;
+	int order, poison, reclaim_account, red_zone;
+	int batch;
+	unsigned long objects, slabs, total_objects;
+	unsigned long alloc, alloc_slab_fill, alloc_slab_new;
+	unsigned long free, free_remote;
+	unsigned long claim_remote_list, claim_remote_list_objects;
+	unsigned long flush_free_list, flush_free_list_objects, flush_free_list_remote;
+	unsigned long flush_rfree_list, flush_rfree_list_objects;
+	unsigned long flush_slab_free, flush_slab_partial;
+	int numa[MAX_NODES];
+	int numa_partial[MAX_NODES];
+} slabinfo[MAX_SLABS];
+
+int slabs = 0;
+int actual_slabs = 0;
+int highest_node = 0;
+
+char buffer[4096];
+
+int show_empty = 0;
+int show_report = 0;
+int show_slab = 0;
+int skip_zero = 1;
+int show_numa = 0;
+int show_track = 0;
+int validate = 0;
+int shrink = 0;
+int show_inverted = 0;
+int show_totals = 0;
+int sort_size = 0;
+int sort_active = 0;
+int set_debug = 0;
+int show_ops = 0;
+int show_activity = 0;
+
+/* Debug options */
+int sanity = 0;
+int redzone = 0;
+int poison = 0;
+int tracking = 0;
+int tracing = 0;
+
+int page_size;
+
+regex_t pattern;
+
+void fatal(const char *x, ...)
+{
+	va_list ap;
+
+	va_start(ap, x);
+	vfprintf(stderr, x, ap);
+	va_end(ap);
+	exit(EXIT_FAILURE);
+}
+
+void usage(void)
+{
+	printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
+		"slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+		"-A|--activity          Most active slabs first\n"
+		"-d<options>|--debug=<options> Set/Clear Debug options\n"
+		"-D|--display-active    Switch line format to activity\n"
+		"-e|--empty             Show empty slabs\n"
+		"-h|--help              Show usage information\n"
+		"-i|--inverted          Inverted list\n"
+		"-l|--slabs             Show slabs\n"
+		"-n|--numa              Show NUMA information\n"
+		"-o|--ops		Show kmem_cache_ops\n"
+		"-s|--shrink            Shrink slabs\n"
+		"-r|--report		Detailed report on single slabs\n"
+		"-S|--Size              Sort by size\n"
+		"-t|--tracking          Show alloc/free information\n"
+		"-T|--Totals            Show summary information\n"
+		"-v|--validate          Validate slabs\n"
+		"-z|--zero              Include empty slabs\n"
+		"\nValid debug options (FZPUT may be combined)\n"
+		"a / A          Switch on all debug options (=FZUP)\n"
+		"-              Switch off all debug options\n"
+		"f / F          Sanity Checks (SLAB_DEBUG_FREE)\n"
+		"z / Z          Redzoning\n"
+		"p / P          Poisoning\n"
+		"u / U          Tracking\n"
+		"t / T          Tracing\n"
+	);
+}
+
+unsigned long read_obj(const char *name)
+{
+	FILE *f = fopen(name, "r");
+
+	if (!f)
+		buffer[0] = 0;
+	else {
+		if (!fgets(buffer, sizeof(buffer), f))
+			buffer[0] = 0;
+		fclose(f);
+		if (strlen(buffer) && buffer[strlen(buffer) - 1] == '\n')
+			buffer[strlen(buffer) - 1] = 0;
+	}
+	return strlen(buffer);
+}
+
+
+/*
+ * Get the contents of an attribute
+ */
+unsigned long get_obj(const char *name)
+{
+	if (!read_obj(name))
+		return 0;
+
+	return atol(buffer);
+}
+
+unsigned long get_obj_and_str(const char *name, char **x)
+{
+	unsigned long result = 0;
+	char *p;
+
+	*x = NULL;
+
+	if (!read_obj(name))
+		return 0;
+	result = strtoul(buffer, &p, 10);
+	while (*p == ' ')
+		p++;
+	if (*p)
+		*x = strdup(p);
+	return result;
+}
+
+void set_obj(struct slabinfo *s, const char *name, int n)
+{
+	char x[100];
+	FILE *f;
+
+	snprintf(x, 100, "%s/%s", s->name, name);
+	f = fopen(x, "w");
+	if (!f)
+		fatal("Cannot write to %s\n", x);
+
+	fprintf(f, "%d\n", n);
+	fclose(f);
+}
+
+unsigned long read_slab_obj(struct slabinfo *s, const char *name)
+{
+	char x[100];
+	FILE *f;
+	size_t l;
+
+	snprintf(x, 100, "%s/%s", s->name, name);
+	f = fopen(x, "r");
+	if (!f) {
+		buffer[0] = 0;
+		l = 0;
+	} else {
+		l = fread(buffer, 1, sizeof(buffer), f);
+		buffer[l] = 0;
+		fclose(f);
+	}
+	return l;
+}
+
+
+/*
+ * Put a size string together
+ */
+int store_size(char *buffer, unsigned long value)
+{
+	unsigned long divisor = 1;
+	char trailer = 0;
+	int n;
+
+	if (value > 1000000000UL) {
+		divisor = 100000000UL;
+		trailer = 'G';
+	} else if (value > 1000000UL) {
+		divisor = 100000UL;
+		trailer = 'M';
+	} else if (value > 1000UL) {
+		divisor = 100;
+		trailer = 'K';
+	}
+
+	value /= divisor;
+	n = sprintf(buffer, "%ld", value);
+	if (trailer) {
+		buffer[n] = trailer;
+		n++;
+		buffer[n] = 0;
+	}
+	if (divisor != 1) {
+		memmove(buffer + n - 2, buffer + n - 3, 4);
+		buffer[n-2] = '.';
+		n++;
+	}
+	return n;
+}
+
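+/*
+ * Parse a NUMA distribution string of the form "N<node>=<count> ..."
+ * (e.g. "N0=123 N1=45") into a per-node array, tracking the highest
+ * node number seen.
+ */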
+void decode_numa_list(int *numa, char *t)
+{
+	int node;
+	int nr;
+
+	memset(numa, 0, MAX_NODES * sizeof(int));
+
+	if (!t)
+		return;
+
+	while (*t == 'N') {
+		t++;
+		node = strtoul(t, &t, 10);
+		if (*t == '=') {
+			t++;
+			nr = strtoul(t, &t, 10);
+			numa[node] = nr;
+			if (node > highest_node)
+				highest_node = node;
+		}
+		while (*t == ' ')
+			t++;
+	}
+}
+
+void slab_validate(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	set_obj(s, "validate", 1);
+}
+
+void slab_shrink(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	set_obj(s, "shrink", 1);
+}
+
+int line = 0;
+
+void first_line(void)
+{
+	if (show_activity)
+		printf("Name                   Objects      Alloc       Free   %%Fill %%New  "
+			"FlushR %%FlushR FlushR_Objs O\n");
+	else
+		printf("Name                   Objects Objsize    Space "
+			" O/S O %%Ef Batch Flg\n");
+}
+
+unsigned long slab_size(struct slabinfo *s)
+{
+	return 	s->slabs * (page_size << s->order);
+}
+
+unsigned long slab_activity(struct slabinfo *s)
+{
+	return 	s->alloc + s->free;
+}
+
+void slab_numa(struct slabinfo *s, int mode)
+{
+	int node;
+
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (!highest_node) {
+		printf("\n%s: No NUMA information available.\n", s->name);
+		return;
+	}
+
+	if (skip_zero && !s->slabs)
+		return;
+
+	if (!line) {
+		printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+		for(node = 0; node <= highest_node; node++)
+			printf(" %4d", node);
+		printf("\n----------------------");
+		for(node = 0; node <= highest_node; node++)
+			printf("-----");
+		printf("\n");
+	}
+	printf("%-21s ", mode ? "All slabs" : s->name);
+	for(node = 0; node <= highest_node; node++) {
+		char b[20];
+
+		store_size(b, s->numa[node]);
+		printf(" %4s", b);
+	}
+	printf("\n");
+	if (mode) {
+		printf("%-21s ", "Partial slabs");
+		for(node = 0; node <= highest_node; node++) {
+			char b[20];
+
+			store_size(b, s->numa_partial[node]);
+			printf(" %4s", b);
+		}
+		printf("\n");
+	}
+	line++;
+}
+
+void show_tracking(struct slabinfo *s)
+{
+	printf("\n%s: Kernel object allocation\n", s->name);
+	printf("-----------------------------------------------------------------------\n");
+	if (read_slab_obj(s, "alloc_calls"))
+		printf(buffer);
+	else
+		printf("No Data\n");
+
+	printf("\n%s: Kernel object freeing\n", s->name);
+	printf("------------------------------------------------------------------------\n");
+	if (read_slab_obj(s, "free_calls"))
+		printf(buffer);
+	else
+		printf("No Data\n");
+
+}
+
+void ops(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (read_slab_obj(s, "ops")) {
+		printf("\n%s: kmem_cache operations\n", s->name);
+		printf("--------------------------------------------\n");
+		printf(buffer);
+	} else
+		printf("\n%s has no kmem_cache operations\n", s->name);
+}
+
+const char *onoff(int x)
+{
+	if (x)
+		return "On ";
+	return "Off";
+}
+
+void slab_stats(struct slabinfo *s)
+{
+	unsigned long total_alloc;
+	unsigned long total_free;
+	unsigned long total;
+
+	total_alloc = s->alloc;
+	total_free = s->free;
+
+	if (!total_alloc)
+		return;
+
+	printf("\n");
+	printf("Slab Perf Counter\n");
+	printf("------------------------------------------------------------------------\n");
+	printf("Alloc: %8lu, partial %8lu, page allocator %8lu\n",
+		total_alloc,
+		s->alloc_slab_fill, s->alloc_slab_new);
+	printf("Free:  %8lu, partial %8lu, page allocator %8lu, remote %5lu\n",
+		total_free,
+		s->flush_slab_partial,
+		s->flush_slab_free,
+		s->free_remote);
+	printf("Claim: %8lu, objects %8lu\n",
+		s->claim_remote_list,
+		s->claim_remote_list_objects);
+	printf("Flush: %8lu, objects %8lu, remote: %8lu\n",
+		s->flush_free_list,
+		s->flush_free_list_objects,
+		s->flush_free_list_remote);
+	printf("FlushR:%8lu, objects %8lu\n",
+		s->flush_rfree_list,
+		s->flush_rfree_list_objects);
+}
+
+void report(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	printf("\nSlabcache: %-20s  Order : %2d Objects: %lu\n",
+		s->name, s->order, s->objects);
+	if (s->hwcache_align)
+		printf("** Hardware cacheline aligned\n");
+	if (s->cache_dma)
+		printf("** Memory is allocated in a special DMA zone\n");
+	if (s->destroy_by_rcu)
+		printf("** Slabs are destroyed via RCU\n");
+	if (s->reclaim_account)
+		printf("** Reclaim accounting active\n");
+
+	printf("\nSizes (bytes)     Slabs              Debug                Memory\n");
+	printf("------------------------------------------------------------------------\n");
+	printf("Object : %7d  Total  : %7ld   Sanity Checks : %s  Total: %7ld\n",
+			s->object_size, s->slabs, "N/A",
+			s->slabs * (page_size << s->order));
+	printf("SlabObj: %7d  Full   : %7s   Redzoning     : %s  Used : %7ld\n",
+			s->slab_size, "N/A",
+			onoff(s->red_zone), s->objects * s->object_size);
+	printf("SlabSiz: %7d  Partial: %7s   Poisoning     : %s  Loss : %7ld\n",
+			page_size << s->order, "N/A", onoff(s->poison),
+			s->slabs * (page_size << s->order) - s->objects * s->object_size);
+	printf("Loss   : %7d  CpuSlab: %7s   Tracking      : %s  Lalig: %7ld\n",
+			s->slab_size - s->object_size, "N/A", onoff(s->store_user),
+			(s->slab_size - s->object_size) * s->objects);
+	printf("Align  : %7d  Objects: %7d   Tracing       : %s  Lpadd: %7ld\n",
+			s->align, s->objs_per_slab, "N/A",
+			((page_size << s->order) - s->objs_per_slab * s->slab_size) *
+			s->slabs);
+
+	ops(s);
+	show_tracking(s);
+	slab_numa(s, 1);
+	slab_stats(s);
+}
+
+void slabcache(struct slabinfo *s)
+{
+	char size_str[20];
+	char flags[20];
+	char *p = flags;
+
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (actual_slabs == 1) {
+		report(s);
+		return;
+	}
+
+	if (skip_zero && !show_empty && !s->slabs)
+		return;
+
+	if (show_empty && s->slabs)
+		return;
+
+	store_size(size_str, slab_size(s));
+
+	if (!line++)
+		first_line();
+
+	if (s->cache_dma)
+		*p++ = 'd';
+	if (s->hwcache_align)
+		*p++ = 'A';
+	if (s->poison)
+		*p++ = 'P';
+	if (s->reclaim_account)
+		*p++ = 'a';
+	if (s->red_zone)
+		*p++ = 'Z';
+	if (s->store_user)
+		*p++ = 'U';
+
+	*p = 0;
+	if (show_activity) {
+		unsigned long total_alloc;
+		unsigned long total_free;
+
+		total_alloc = s->alloc;
+		total_free = s->free;
+
+		printf("%-21s %8ld %10ld %10ld %5ld %5ld %7ld %5d %7ld %8d\n",
+			s->name, s->objects,
+			total_alloc, total_free,
+			total_alloc ? (s->alloc_slab_fill * 100 / total_alloc) : 0,
+			total_alloc ? (s->alloc_slab_new * 100 / total_alloc) : 0,
+			s->flush_rfree_list,
+			(total_alloc + total_free) ? s->flush_rfree_list * 100 / (total_alloc + total_free) : 0,
+			s->flush_rfree_list_objects,
+			s->order);
+	}
+	else
+		printf("%-21s %8ld %7d %8s %4d %1d %3ld %4ld %s\n",
+			s->name, s->objects, s->object_size, size_str,
+			s->objs_per_slab, s->order,
+			s->slabs ? (s->objects * s->object_size * 100) /
+				(s->slabs * (page_size << s->order)) : 100,
+			s->batch, flags);
+}
+
+/*
+ * Analyze debug options. Return false if something is amiss.
+ */
+int debug_opt_scan(char *opt)
+{
+	if (!opt || !opt[0] || strcmp(opt, "-") == 0)
+		return 1;
+
+	if (strcasecmp(opt, "a") == 0) {
+		sanity = 1;
+		poison = 1;
+		redzone = 1;
+		tracking = 1;
+		return 1;
+	}
+
+	for ( ; *opt; opt++)
+	 	switch (*opt) {
+		case 'F' : case 'f':
+			if (sanity)
+				return 0;
+			sanity = 1;
+			break;
+		case 'P' : case 'p':
+			if (poison)
+				return 0;
+			poison = 1;
+			break;
+
+		case 'Z' : case 'z':
+			if (redzone)
+				return 0;
+			redzone = 1;
+			break;
+
+		case 'U' : case 'u':
+			if (tracking)
+				return 0;
+			tracking = 1;
+			break;
+
+		case 'T' : case 't':
+			if (tracing)
+				return 0;
+			tracing = 1;
+			break;
+		default:
+			return 0;
+		}
+	return 1;
+}
+
+int slab_empty(struct slabinfo *s)
+{
+	if (s->objects > 0)
+		return 0;
+
+	/*
+	 * We may still have slabs even if there are no objects. Shrinking will
+	 * remove them.
+	 */
+	if (s->slabs != 0)
+		set_obj(s, "shrink", 1);
+
+	return 1;
+}
+
+void slab_debug(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (redzone && !s->red_zone) {
+		if (slab_empty(s))
+			set_obj(s, "red_zone", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
+	}
+	if (!redzone && s->red_zone) {
+		if (slab_empty(s))
+			set_obj(s, "red_zone", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
+	}
+	if (poison && !s->poison) {
+		if (slab_empty(s))
+			set_obj(s, "poison", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
+	}
+	if (!poison && s->poison) {
+		if (slab_empty(s))
+			set_obj(s, "poison", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
+	}
+	if (tracking && !s->store_user) {
+		if (slab_empty(s))
+			set_obj(s, "store_user", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
+	}
+	if (!tracking && s->store_user) {
+		if (slab_empty(s))
+			set_obj(s, "store_user", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
+	}
+}
+
+void totals(void)
+{
+	struct slabinfo *s;
+
+	int used_slabs = 0;
+	char b1[20], b2[20], b3[20], b4[20];
+	unsigned long long max = 1ULL << 63;
+
+	/* Object size */
+	unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
+
+	/* Number of partial slabs in a slabcache */
+	unsigned long long min_partial = max, max_partial = 0,
+				avg_partial, total_partial = 0;
+
+	/* Number of slabs in a slab cache */
+	unsigned long long min_slabs = max, max_slabs = 0,
+				avg_slabs, total_slabs = 0;
+
+	/* Size of the whole slab */
+	unsigned long long min_size = max, max_size = 0,
+				avg_size, total_size = 0;
+
+	/* Bytes used for object storage in a slab */
+	unsigned long long min_used = max, max_used = 0,
+				avg_used, total_used = 0;
+
+	/* Waste: Bytes used for alignment and padding */
+	unsigned long long min_waste = max, max_waste = 0,
+				avg_waste, total_waste = 0;
+	/* Number of objects in a slab */
+	unsigned long long min_objects = max, max_objects = 0,
+				avg_objects, total_objects = 0;
+	/* Waste per object */
+	unsigned long long min_objwaste = max,
+				max_objwaste = 0, avg_objwaste,
+				total_objwaste = 0;
+
+	/* Memory per object */
+	unsigned long long min_memobj = max,
+				max_memobj = 0, avg_memobj,
+				total_objsize = 0;
+
+	for (s = slabinfo; s < slabinfo + slabs; s++) {
+		unsigned long long size;
+		unsigned long used;
+		unsigned long long wasted;
+		unsigned long long objwaste;
+
+		if (!s->slabs || !s->objects)
+			continue;
+
+		used_slabs++;
+
+		size = slab_size(s);
+		used = s->objects * s->object_size;
+		wasted = size - used;
+		objwaste = s->slab_size - s->object_size;
+
+		if (s->object_size < min_objsize)
+			min_objsize = s->object_size;
+		if (s->slabs < min_slabs)
+			min_slabs = s->slabs;
+		if (size < min_size)
+			min_size = size;
+		if (wasted < min_waste)
+			min_waste = wasted;
+		if (objwaste < min_objwaste)
+			min_objwaste = objwaste;
+		if (s->objects < min_objects)
+			min_objects = s->objects;
+		if (used < min_used)
+			min_used = used;
+		if (s->slab_size < min_memobj)
+			min_memobj = s->slab_size;
+
+		if (s->object_size > max_objsize)
+			max_objsize = s->object_size;
+		if (s->slabs > max_slabs)
+			max_slabs = s->slabs;
+		if (size > max_size)
+			max_size = size;
+		if (wasted > max_waste)
+			max_waste = wasted;
+		if (objwaste > max_objwaste)
+			max_objwaste = objwaste;
+		if (s->objects > max_objects)
+			max_objects = s->objects;
+		if (used > max_used)
+			max_used = used;
+		if (s->slab_size > max_memobj)
+			max_memobj = s->slab_size;
+
+		total_slabs += s->slabs;
+		total_size += size;
+		total_waste += wasted;
+
+		total_objects += s->objects;
+		total_used += used;
+
+		total_objwaste += s->objects * objwaste;
+		total_objsize += s->objects * s->slab_size;
+	}
+
+	if (!total_objects) {
+		printf("No objects\n");
+		return;
+	}
+	if (!used_slabs) {
+		printf("No slabs\n");
+		return;
+	}
+
+	/* Per slab averages */
+	avg_slabs = total_slabs / used_slabs;
+	avg_size = total_size / used_slabs;
+	avg_waste = total_waste / used_slabs;
+
+	avg_objects = total_objects / used_slabs;
+	avg_used = total_used / used_slabs;
+
+	/* Per object object sizes */
+	avg_objsize = total_used / total_objects;
+	avg_objwaste = total_objwaste / total_objects;
+	avg_memobj = total_objsize / total_objects;
+
+	printf("Slabcache Totals\n");
+	printf("----------------\n");
+	printf("Slabcaches : %3d      Active: %3d\n",
+			slabs, used_slabs);
+
+	store_size(b1, total_size);store_size(b2, total_waste);
+	store_size(b3, total_waste * 100 / total_used);
+	printf("Memory used: %6s   # Loss   : %6s   MRatio:%6s%%\n", b1, b2, b3);
+
+	store_size(b1, total_objects);
+	printf("# Objects  : %6s\n", b1);
+
+	printf("\n");
+	printf("Per Cache    Average         Min         Max       Total\n");
+	printf("---------------------------------------------------------\n");
+
+	store_size(b1, avg_objects);store_size(b2, min_objects);
+	store_size(b3, max_objects);store_size(b4, total_objects);
+	printf("#Objects  %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_slabs);store_size(b2, min_slabs);
+	store_size(b3, max_slabs);store_size(b4, total_slabs);
+	printf("#Slabs    %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_size);store_size(b2, min_size);
+	store_size(b3, max_size);store_size(b4, total_size);
+	printf("Memory    %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_used);store_size(b2, min_used);
+	store_size(b3, max_used);store_size(b4, total_used);
+	printf("Used      %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_waste);store_size(b2, min_waste);
+	store_size(b3, max_waste);store_size(b4, total_waste);
+	printf("Loss      %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	printf("\n");
+	printf("Per Object   Average         Min         Max\n");
+	printf("---------------------------------------------\n");
+
+	store_size(b1, avg_memobj);store_size(b2, min_memobj);
+	store_size(b3, max_memobj);
+	printf("Memory    %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+	store_size(b1, avg_objsize);store_size(b2, min_objsize);
+	store_size(b3, max_objsize);
+	printf("User      %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+
+	store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
+	store_size(b3, max_objwaste);
+	printf("Loss      %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+}
+
+void sort_slabs(void)
+{
+	struct slabinfo *s1,*s2;
+
+	for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
+		for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
+			int result;
+
+			if (sort_size)
+				result = slab_size(s1) < slab_size(s2);
+			else if (sort_active)
+				result = slab_activity(s1) < slab_activity(s2);
+			else
+				result = strcasecmp(s1->name, s2->name);
+
+			if (show_inverted)
+				result = -result;
+
+			if (result > 0) {
+				struct slabinfo t;
+
+				memcpy(&t, s1, sizeof(struct slabinfo));
+				memcpy(s1, s2, sizeof(struct slabinfo));
+				memcpy(s2, &t, sizeof(struct slabinfo));
+			}
+		}
+	}
+}
+
+int slab_mismatch(char *slab)
+{
+	return regexec(&pattern, slab, 0, NULL, 0);
+}
+
+void read_slab_dir(void)
+{
+	DIR *dir;
+	struct dirent *de;
+	struct slabinfo *slab = slabinfo;
+	char *p;
+	char *t;
+	int count;
+
+	if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
+		fatal("SYSFS support for SLUB not active\n");
+
+	dir = opendir(".");
+	while ((de = readdir(dir))) {
+		if (de->d_name[0] == '.' ||
+			(de->d_name[0] != ':' && slab_mismatch(de->d_name)))
+				continue;
+		switch (de->d_type) {
+		   case DT_DIR:
+			if (chdir(de->d_name))
+				fatal("Unable to access slab %s\n", slab->name);
+		   	slab->name = strdup(de->d_name);
+			slab->align = get_obj("align");
+			slab->cache_dma = get_obj("cache_dma");
+			slab->destroy_by_rcu = get_obj("destroy_by_rcu");
+			slab->hwcache_align = get_obj("hwcache_align");
+			slab->object_size = get_obj("object_size");
+			slab->objects = get_obj("objects");
+			slab->total_objects = get_obj("total_objects");
+			slab->objs_per_slab = get_obj("objs_per_slab");
+			slab->order = get_obj("order");
+			slab->poison = get_obj("poison");
+			slab->reclaim_account = get_obj("reclaim_account");
+			slab->red_zone = get_obj("red_zone");
+			slab->slab_size = get_obj("slab_size");
+			slab->slabs = get_obj_and_str("slabs", &t);
+			decode_numa_list(slab->numa, t);
+			free(t);
+			slab->store_user = get_obj("store_user");
+			slab->batch = get_obj("batch");
+			slab->alloc = get_obj("alloc");
+			slab->alloc_slab_fill = get_obj("alloc_slab_fill");
+			slab->alloc_slab_new = get_obj("alloc_slab_new");
+			slab->free = get_obj("free");
+			slab->free_remote = get_obj("free_remote");
+			slab->claim_remote_list = get_obj("claim_remote_list");
+			slab->claim_remote_list_objects = get_obj("claim_remote_list_objects");
+			slab->flush_free_list = get_obj("flush_free_list");
+			slab->flush_free_list_objects = get_obj("flush_free_list_objects");
+			slab->flush_free_list_remote = get_obj("flush_free_list_remote");
+			slab->flush_rfree_list = get_obj("flush_rfree_list");
+			slab->flush_rfree_list_objects = get_obj("flush_rfree_list_objects");
+			slab->flush_slab_free = get_obj("flush_slab_free");
+			slab->flush_slab_partial = get_obj("flush_slab_partial");
+
+			chdir("..");
+			slab++;
+			break;
+		   default :
+			fatal("Unknown file type %lx\n", de->d_type);
+		}
+	}
+	closedir(dir);
+	slabs = slab - slabinfo;
+	actual_slabs = slabs;
+	if (slabs > MAX_SLABS)
+		fatal("Too many slabs\n");
+}
+
+void output_slabs(void)
+{
+	struct slabinfo *slab;
+
+	for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
+
+		if (show_numa)
+			slab_numa(slab, 0);
+		else if (show_track)
+			show_tracking(slab);
+		else if (validate)
+			slab_validate(slab);
+		else if (shrink)
+			slab_shrink(slab);
+		else if (set_debug)
+			slab_debug(slab);
+		else if (show_ops)
+			ops(slab);
+		else if (show_slab)
+			slabcache(slab);
+		else if (show_report)
+			report(slab);
+	}
+}
+
+struct option opts[] = {
+	{ "activity", 0, NULL, 'A' },
+	{ "debug", 2, NULL, 'd' },
+	{ "display-activity", 0, NULL, 'D' },
+	{ "empty", 0, NULL, 'e' },
+	{ "help", 0, NULL, 'h' },
+	{ "inverted", 0, NULL, 'i'},
+	{ "numa", 0, NULL, 'n' },
+	{ "ops", 0, NULL, 'o' },
+	{ "report", 0, NULL, 'r' },
+	{ "shrink", 0, NULL, 's' },
+	{ "slabs", 0, NULL, 'l' },
+	{ "track", 0, NULL, 't'},
+	{ "validate", 0, NULL, 'v' },
+	{ "zero", 0, NULL, 'z' },
+	{ "1ref", 0, NULL, '1'},
+	{ NULL, 0, NULL, 0 }
+};
+
+int main(int argc, char *argv[])
+{
+	int c;
+	int err;
+	char *pattern_source;
+
+	page_size = getpagesize();
+
+	while ((c = getopt_long(argc, argv, "Ad::Dehil1noprstvzTS",
+						opts, NULL)) != -1)
+		switch (c) {
+		case 'A':
+			sort_active = 1;
+			break;
+		case 'd':
+			set_debug = 1;
+			if (!debug_opt_scan(optarg))
+				fatal("Invalid debug option '%s'\n", optarg);
+			break;
+		case 'D':
+			show_activity = 1;
+			break;
+		case 'e':
+			show_empty = 1;
+			break;
+		case 'h':
+			usage();
+			return 0;
+		case 'i':
+			show_inverted = 1;
+			break;
+		case 'n':
+			show_numa = 1;
+			break;
+		case 'o':
+			show_ops = 1;
+			break;
+		case 'r':
+			show_report = 1;
+			break;
+		case 's':
+			shrink = 1;
+			break;
+		case 'l':
+			show_slab = 1;
+			break;
+		case 't':
+			show_track = 1;
+			break;
+		case 'v':
+			validate = 1;
+			break;
+		case 'z':
+			skip_zero = 0;
+			break;
+		case 'T':
+			show_totals = 1;
+			break;
+		case 'S':
+			sort_size = 1;
+			break;
+
+		default:
+			fatal("%s: Invalid option '%c'\n", argv[0], optopt);
+
+	}
+
+	if (!show_slab && !show_track && !show_report
+		&& !validate && !shrink && !set_debug && !show_ops)
+			show_slab = 1;
+
+	if (argc > optind)
+		pattern_source = argv[optind];
+	else
+		pattern_source = ".*";
+
+	err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
+	if (err)
+		fatal("%s: Invalid pattern '%s' code %d\n",
+			argv[0], pattern_source, err);
+	read_slab_dir();
+	if (show_totals)
+		totals();
+	else {
+		sort_slabs();
+		output_slabs();
+	}
+	return 0;
+}

^ permalink raw reply	[flat|nested] 197+ messages in thread

* [patch] SLQB slab allocator
@ 2009-01-21 14:30 ` Nick Piggin
  0 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-21 14:30 UTC (permalink / raw)
  To: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton
  Cc: Lin Ming, Zhang, Yanmin, Christoph Lameter

Hi,

Since last posted, I've cleaned up a few bits and pieces, (hopefully)
fixed a known bug where it wouldn't boot on memoryless nodes (I don't
have a system to test with), and improved performance and reduced
locking somewhat for node-specific and interleaved allocations.

There are a few TODOs remaining (see "TODO"). Most are hopefully
obscure or relatively unimportant cases. The biggest thing really
is to test and tune on a wider range of workloads, so I'll ask to
merge it in the slab tree and from there linux-next to see what
comes up. I'll work on tuning things and the TODO items before a
possible mainline merge. Actually it would be kind of instructive
if people run into issues on the TODO list because it would help
guide improvements...

BTW, if anybody wants explicit copyright attribution on the files,
that's fine just send patches. I just dislike big header buildups,
which is why I make a broader acknowledgement. In fact, the other allocators
don't even explictly acknowledge SLAB, so I didn't think it would
be a problem. I don't really know the legal issues, but we've set
plenty of precendent...

---
Introducing the SLQB slab allocator.

SLQB takes code and ideas from all other slab allocators in the tree.

The primary method for keeping lists of free objects within the allocator
is a singly-linked list, storing a pointer within the object memory itself
(or a small additional space in the case of RCU destroyed slabs). This is
like SLOB and SLUB, and opposed to SLAB, which uses arrays of objects, and
metadata. This reduces memory consumption and makes smaller sized objects
more realistic as there is less overhead.

Using lists rather than arrays can reduce the cacheline footprint. When moving
objects around, SLQB can move a list of objects from one CPU to another by
simply manipulating a head pointer, wheras SLAB needs to memcpy arrays. Some
SLAB per-CPU arrays can be up to 1K in size, which is a lot of cachelines that
can be touched during alloc/free. Newly freed objects tend to be cache hot,
and newly allocated ones tend to soon be touched anyway, so often there is
little cost to using metadata in the objects.

SLQB has a per-CPU LIFO freelist of objects like SLAB (but using lists rather
than arrays). Freed objects are returned to this freelist if they belong to
the node which our CPU belongs to. So objects allocated on one CPU can be
added to the freelist of another CPU on the same node. When LIFO freelists need
to be refilled or trimmed, SLQB takes or returns objects from a list of slabs.

SLQB has per-CPU lists of slabs (which use struct page as their metadata
including list head for this list). Each slab contains a singly-linked list of
objects that are free in that slab (free, and not on a LIFO freelist). Slabs
are freed as soon as all their objects are freed, and only allocated when there
are no slabs remaining. They are taken off this slab list when if there are no
free objects left. So the slab lists always only contain "partial" slabs; those
slabs which are not completely full and not completely empty. SLQB slabs can be
manipulated with no locking unlike other allocators which tend to use per-node
locks. As the number of threads per socket increases, this should help improve
the scalability of slab operations.

Freeing objects to remote slab lists first batches up the objects on the freeing
CPU, then moves them over at once to a list on the allocating CPU. The allocating
CPU will then notice those objects and pull them onto the end of its freelist.
This remote freeing scheme is designed to minimise the number of cross-CPU
cachelines touched, short of going to a "crossbar" arrangement like SLAB has.
SLAB keeps crossbars of object arrays, that is, NR_CPUS*MAX_NUMNODES sized
arrays, which can become very bloated on huge systems (this could be hundreds
of GB of kmem caches for 4096-CPU, 1024-node systems).
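
As an illustration of that hand-over, here is a minimal userspace sketch (not
taken from the patch) of the kmlist idea that appears in slqb_def.h further
down: a singly-linked list threaded through the free objects themselves, with
head, tail and a count, so a whole batch can be spliced onto another CPU's
remotely-freed list by relinking two pointers. Locking and the watermark
checks are deliberately left out.

#include <stdio.h>
#include <stdlib.h>

struct kmlist {
	unsigned long nr;
	void **head, **tail;
};

/* Push a free object; the next pointer lives in the object's first word. */
static void kmlist_add(struct kmlist *l, void *object)
{
	*(void **)object = NULL;
	if (l->tail)
		*(void **)l->tail = object;
	else
		l->head = object;
	l->tail = object;
	l->nr++;
}

/* Hand a whole batch over to another list by relinking two pointers. */
static void kmlist_splice(struct kmlist *dst, struct kmlist *src)
{
	if (!src->nr)
		return;
	if (dst->tail)
		*(void **)dst->tail = src->head;
	else
		dst->head = src->head;
	dst->tail = src->tail;
	dst->nr += src->nr;
	src->head = src->tail = NULL;
	src->nr = 0;
}

int main(void)
{
	struct kmlist batch  = { 0, NULL, NULL };	/* freeing CPU's batch */
	struct kmlist remote = { 0, NULL, NULL };	/* owner CPU's remote list */
	void *objs[4];
	int i;

	for (i = 0; i < 4; i++) {
		objs[i] = malloc(64);
		if (!objs[i])
			return 1;
		kmlist_add(&batch, objs[i]);	/* object "freed" on the wrong CPU */
	}
	kmlist_splice(&remote, &batch);		/* one-shot transfer of the batch */
	printf("remotely freed list now holds %lu objects\n", remote.nr);

	for (i = 0; i < 4; i++)
		free(objs[i]);
	return 0;
}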

SLQB also has similar freelist and slab-list structures per node, which are
protected by a lock and usable by any CPU in order to do node-specific
allocations. These allocations tend not to be too frequent (short-lived
allocations should be node-local, and long-lived allocations should not be
too frequent).

There is a good overview and illustration of the design here:

http://lwn.net/Articles/311502/

By using LIFO freelists like SLAB, SLQB tries to be very page-size agnostic.
It tries very hard to use order-0 pages. This is good for both page allocator
fragmentation and slab fragmentation.
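
To put numbers on that: with 4KB pages, an order-0 slab for a 192-byte object
holds 21 objects (21 * 192 = 4032 bytes, leaving 64 bytes over, some of which
goes to cache colouring), so the per-CPU freelists can be kept fed without
resorting to higher-order allocations. The figures are only an illustration,
not taken from the patch.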

SLQB initialisation code attempts to be as simple and un-clever as possible.
There are no multiple phases where different things come up. There is no
weird self-bootstrapping stuff. It just statically allocates the structures
required to create the slabs that allocate other slab structures.

SLQB reuses much of the debugging infrastructure and the fine-grained sysfs
statistics from SLUB. There is also a Documentation/vm/slqbinfo.c, derived
from slabinfo.c, which can query the sysfs data.
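
For example (assuming CONFIG_SLQB_SYSFS is enabled, plus CONFIG_SLQB_STATS for
the activity counters), the tool can be built and driven much like slabinfo;
the invocations below are illustrative, based on the option handling in the
file rather than on documented usage:

  gcc -o slqbinfo Documentation/vm/slqbinfo.c
  ./slqbinfo -DA          # show caches sorted by allocation activity
  ./slqbinfo -r dentry    # full report for caches matching "dentry"
  ./slqbinfo -T           # aggregate totals across all caches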

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/linux/rcupdate.h
===================================================================
--- linux-2.6.orig/include/linux/rcupdate.h
+++ linux-2.6/include/linux/rcupdate.h
@@ -33,6 +33,7 @@
 #ifndef __LINUX_RCUPDATE_H
 #define __LINUX_RCUPDATE_H
 
+#include <linux/rcu_types.h>
 #include <linux/cache.h>
 #include <linux/spinlock.h>
 #include <linux/threads.h>
@@ -42,16 +43,6 @@
 #include <linux/lockdep.h>
 #include <linux/completion.h>
 
-/**
- * struct rcu_head - callback structure for use with RCU
- * @next: next update requests in a list
- * @func: actual update function to call after the grace period.
- */
-struct rcu_head {
-	struct rcu_head *next;
-	void (*func)(struct rcu_head *head);
-};
-
 #if defined(CONFIG_CLASSIC_RCU)
 #include <linux/rcuclassic.h>
 #elif defined(CONFIG_TREE_RCU)
Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/slqb_def.h
@@ -0,0 +1,283 @@
+#ifndef _LINUX_SLQB_DEF_H
+#define _LINUX_SLQB_DEF_H
+
+/*
+ * SLQB : A slab allocator with object queues.
+ *
+ * (C) 2008 Nick Piggin <npiggin@suse.de>
+ */
+#include <linux/types.h>
+#include <linux/gfp.h>
+#include <linux/workqueue.h>
+#include <linux/kobject.h>
+#include <linux/rcu_types.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+#include <linux/kobject.h>
+
+enum stat_item {
+	ALLOC,			/* Allocation count */
+	ALLOC_SLAB_FILL,	/* Fill freelist from page list */
+	ALLOC_SLAB_NEW,		/* New slab acquired from page allocator */
+	FREE,			/* Free count */
+	FREE_REMOTE,		/* NUMA: freeing to remote list */
+	FLUSH_FREE_LIST,	/* Freelist flushed */
+	FLUSH_FREE_LIST_OBJECTS, /* Objects flushed from freelist */
+	FLUSH_FREE_LIST_REMOTE,	/* Objects flushed from freelist to remote */
+	FLUSH_SLAB_PARTIAL,	/* Freeing moves slab to partial list */
+	FLUSH_SLAB_FREE,	/* Slab freed to the page allocator */
+	FLUSH_RFREE_LIST,	/* Rfree list flushed */
+	FLUSH_RFREE_LIST_OBJECTS, /* Rfree objects flushed */
+	CLAIM_REMOTE_LIST,	/* Remote freed list claimed */
+	CLAIM_REMOTE_LIST_OBJECTS, /* Remote freed objects claimed */
+	NR_SLQB_STAT_ITEMS
+};
+
+/*
+ * Singly-linked list with head, tail, and nr
+ */
+struct kmlist {
+	unsigned long nr;
+	void **head, **tail;
+};
+
+/*
+ * Every kmem_cache_list has a kmem_cache_remote_free structure, by which
+ * objects can be returned to the kmem_cache_list from remote CPUs.
+ */
+struct kmem_cache_remote_free {
+	spinlock_t lock;
+	struct kmlist list;
+} ____cacheline_aligned;
+
+/*
+ * A kmem_cache_list manages all the slabs and objects allocated from a given
+ * source. Per-cpu kmem_cache_lists allow node-local allocations. Per-node
+ * kmem_cache_lists allow off-node allocations (but require locking).
+ */
+struct kmem_cache_list {
+	struct kmlist freelist;	/* Fastpath LIFO freelist of objects */
+#ifdef CONFIG_SMP
+	int remote_free_check;	/* remote_free has reached a watermark */
+#endif
+	struct kmem_cache *cache; /* kmem_cache corresponding to this list */
+
+	unsigned long nr_partial; /* Number of partial slabs (pages) */
+	struct list_head partial; /* Slabs which have some free objects */
+
+	unsigned long nr_slabs;	/* Total number of slabs allocated */
+
+	//struct list_head full;
+
+#ifdef CONFIG_SMP
+	/*
+	 * In the case of per-cpu lists, remote_free is for objects freed by
+	 * non-owner CPU back to its home list. For per-node lists, remote_free
+	 * is always used to free objects.
+	 */
+	struct kmem_cache_remote_free remote_free;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+	unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Primary per-cpu, per-kmem_cache structure.
+ */
+struct kmem_cache_cpu {
+	struct kmem_cache_list list; /* List for node-local slabs. */
+
+	unsigned int colour_next;
+
+#ifdef CONFIG_SMP
+	/*
+	 * rlist is a list of objects that don't fit on list.freelist (ie.
+	 * wrong node). The objects all correspond to a given kmem_cache_list,
+	 * remote_cache_list. To free objects to another list, we must first
+	 * flush the existing objects, then switch remote_cache_list.
+	 *
+	 * An NR_CPUS or MAX_NUMNODES array would be nice here, but then we
+	 * get to O(NR_CPUS^2) memory consumption situation.
+	 */
+	struct kmlist rlist;
+	struct kmem_cache_list *remote_cache_list;
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Per-node, per-kmem_cache structure.
+ */
+struct kmem_cache_node {
+	struct kmem_cache_list list;
+	spinlock_t list_lock; /* protects access to list */
+} ____cacheline_aligned;
+
+/*
+ * Management object for a slab cache.
+ */
+struct kmem_cache {
+	unsigned long flags;
+	int hiwater;		/* LIFO list high watermark */
+	int freebatch;		/* LIFO freelist batch flush size */
+	int objsize;		/* The size of an object without meta data */
+	int offset;		/* Free pointer offset. */
+	int objects;		/* Number of objects in slab */
+
+	int size;		/* The size of an object including meta data */
+	int order;		/* Allocation order */
+	gfp_t allocflags;	/* gfp flags to use on allocation */
+	unsigned int colour_range;	/* range of colour counter */
+	unsigned int colour_off;		/* offset per colour */
+	void (*ctor)(void *);
+
+	const char *name;	/* Name (only for display!) */
+	struct list_head list;	/* List of slab caches */
+
+	int align;		/* Alignment */
+	int inuse;		/* Offset to metadata */
+
+#ifdef CONFIG_SLQB_SYSFS
+	struct kobject kobj;	/* For sysfs */
+#endif
+#ifdef CONFIG_NUMA
+	struct kmem_cache_node *node[MAX_NUMNODES];
+#endif
+#ifdef CONFIG_SMP
+	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
+#else
+	struct kmem_cache_cpu cpu_slab;
+#endif
+};
+
+/*
+ * Kmalloc subsystem.
+ */
+#if defined(ARCH_KMALLOC_MINALIGN) && ARCH_KMALLOC_MINALIGN > 8
+#define KMALLOC_MIN_SIZE ARCH_KMALLOC_MINALIGN
+#else
+#define KMALLOC_MIN_SIZE 8
+#endif
+
+#define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
+#define KMALLOC_SHIFT_SLQB_HIGH (PAGE_SHIFT + 9)
+
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1];
+extern struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1];
+
+/*
+ * Constant size allocations use this path to find index into kmalloc caches
+ * arrays. get_slab() function is used for non-constant sizes.
+ */
+static __always_inline int kmalloc_index(size_t size)
+{
+	if (unlikely(!size))
+		return 0;
+	if (unlikely(size > 1UL << KMALLOC_SHIFT_SLQB_HIGH))
+		return 0;
+
+	if (unlikely(size <= KMALLOC_MIN_SIZE))
+		return KMALLOC_SHIFT_LOW;
+
+#if L1_CACHE_BYTES < 64
+	if (size > 64 && size <= 96)
+		return 1;
+#endif
+#if L1_CACHE_BYTES < 128
+	if (size > 128 && size <= 192)
+		return 2;
+#endif
+	if (size <=	  8) return 3;
+	if (size <=	 16) return 4;
+	if (size <=	 32) return 5;
+	if (size <=	 64) return 6;
+	if (size <=	128) return 7;
+	if (size <=	256) return 8;
+	if (size <=	512) return 9;
+	if (size <=       1024) return 10;
+	if (size <=   2 * 1024) return 11;
+	if (size <=   4 * 1024) return 12;
+	if (size <=   8 * 1024) return 13;
+	if (size <=  16 * 1024) return 14;
+	if (size <=  32 * 1024) return 15;
+	if (size <=  64 * 1024) return 16;
+	if (size <= 128 * 1024) return 17;
+	if (size <= 256 * 1024) return 18;
+	if (size <= 512 * 1024) return 19;
+	if (size <= 1024 * 1024) return 20;
+	if (size <=  2 * 1024 * 1024) return 21;
+	return -1;
+}
+
+#ifdef CONFIG_ZONE_DMA
+#define SLQB_DMA __GFP_DMA
+#else
+/* Disable "DMA slabs" */
+#define SLQB_DMA (__force gfp_t)0
+#endif
+
+/*
+ * Find the kmalloc slab cache for a given combination of allocation flags and
+ * size.
+ */
+static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
+{
+	int index = kmalloc_index(size);
+
+	if (unlikely(index == 0))
+		return NULL;
+
+	if (likely(!(flags & SLQB_DMA)))
+		return &kmalloc_caches[index];
+	else
+		return &kmalloc_caches_dma[index];
+}
+
+void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
+void *__kmalloc(size_t size, gfp_t flags);
+
+#ifndef ARCH_KMALLOC_MINALIGN
+#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#ifndef ARCH_SLAB_MINALIGN
+#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? sizeof(void *) : ARCH_KMALLOC_MINALIGN)
+
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
+{
+	if (__builtin_constant_p(size)) {
+		struct kmem_cache *s;
+
+		s = kmalloc_slab(size, flags);
+		if (unlikely(ZERO_OR_NULL_PTR(s)))
+			return s;
+
+		return kmem_cache_alloc(s, flags);
+	}
+	return __kmalloc(size, flags);
+}
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node);
+void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
+
+static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+{
+	if (__builtin_constant_p(size)) {
+		struct kmem_cache *s;
+
+		s = kmalloc_slab(size, flags);
+		if (unlikely(ZERO_OR_NULL_PTR(s)))
+			return s;
+
+		return kmem_cache_alloc_node(s, flags, node);
+	}
+	return __kmalloc_node(size, flags, node);
+}
+#endif
+
+#endif /* _LINUX_SLQB_DEF_H */
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -806,7 +806,7 @@ config SLUB_DEBUG
 
 choice
 	prompt "Choose SLAB allocator"
-	default SLUB
+	default SLQB
 	help
 	   This option allows to select a slab allocator.
 
@@ -827,6 +827,11 @@ config SLUB
 	   and has enhanced diagnostics. SLUB is the default choice for
 	   a slab allocator.
 
+config SLQB
+	bool "SLQB (Qeued allocator)"
+	help
+	  SLQB is a proposed new slab allocator.
+
 config SLOB
 	depends on EMBEDDED
 	bool "SLOB (Simple Allocator)"
@@ -868,7 +873,7 @@ config HAVE_GENERIC_DMA_COHERENT
 config SLABINFO
 	bool
 	depends on PROC_FS
-	depends on SLAB || SLUB_DEBUG
+	depends on SLAB || SLUB_DEBUG || SLQB
 	default y
 
 config RT_MUTEXES
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug
+++ linux-2.6/lib/Kconfig.debug
@@ -298,6 +298,26 @@ config SLUB_STATS
 	  out which slabs are relevant to a particular load.
 	  Try running: slabinfo -DA
 
+config SLQB_DEBUG
+	default y
+	bool "Enable SLQB debugging support"
+	depends on SLQB
+
+config SLQB_DEBUG_ON
+	default n
+	bool "SLQB debugging on by default"
+	depends on SLQB_DEBUG
+
+config SLQB_SYSFS
+	bool "Create SYSFS entries for slab caches"
+	default n
+	depends on SLQB
+
+config SLQB_STATS
+	bool "Enable SLQB performance statistics"
+	default n
+	depends on SLQB_SYSFS
+
 config DEBUG_PREEMPT
 	bool "Debug preemptible kernel"
 	depends on DEBUG_KERNEL && PREEMPT && (TRACE_IRQFLAGS_SUPPORT || PPC64)
Index: linux-2.6/mm/slqb.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/slqb.c
@@ -0,0 +1,3436 @@
+/*
+ * SLQB: A slab allocator that focuses on per-CPU scaling, and good performance
+ * with order-0 allocations. Fastpath emphasis is placed on local allocation
+ * and freeing, but with a secondary goal of good remote freeing (freeing on
+ * another CPU from that which allocated).
+ *
+ * Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/bit_spinlock.h>
+#include <linux/interrupt.h>
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/ctype.h>
+#include <linux/kallsyms.h>
+#include <linux/memory.h>
+
+/*
+ * TODO
+ * - fix up releasing of offlined data structures. Not a big deal because
+ *   they don't get cumulatively leaked with successive online/offline cycles
+ * - improve fallback paths, allow OOM conditions to flush back per-CPU pages
+ *   to common lists to be reused by other CPUs.
+ * - investigate performance with memoryless nodes. Perhaps CPUs can be given
+ *   a default closest home node via which they can use fastpath functions.
+ *   Perhaps it is not a big problem.
+ */
+
+/*
+ * slqb_page overloads struct page, and is used to manage some slab allocation
+ * aspects; however, to avoid the horrible mess in include/linux/mm_types.h,
+ * we'll just define our own struct slqb_page type variant here.
+ */
+struct slqb_page {
+	union {
+		struct {
+			unsigned long flags;	/* mandatory */
+			atomic_t _count;	/* mandatory */
+			unsigned int inuse;	/* Nr of objects */
+		   	struct kmem_cache_list *list; /* Pointer to list */
+			void **freelist;	/* freelist req. slab lock */
+			union {
+				struct list_head lru; /* misc. list */
+				struct rcu_head rcu_head; /* for rcu freeing */
+			};
+		};
+		struct page page;
+	};
+};
+static inline void struct_slqb_page_wrong_size(void)
+{ BUILD_BUG_ON(sizeof(struct slqb_page) != sizeof(struct page)); }
+
+#define PG_SLQB_BIT (1 << PG_slab)
+
+static int kmem_size __read_mostly;
+#ifdef CONFIG_NUMA
+static int numa_platform __read_mostly;
+#else
+#define numa_platform 0
+#endif
+
+static inline int slab_hiwater(struct kmem_cache *s)
+{
+	return s->hiwater;
+}
+
+static inline int slab_freebatch(struct kmem_cache *s)
+{
+	return s->freebatch;
+}
+
+/*
+ * Lock order:
+ * kmem_cache_node->list_lock
+ *   kmem_cache_remote_free->lock
+ *
+ * Data structures:
+ * SLQB is primarily per-cpu. For each kmem_cache, each CPU has:
+ *
+ * - A LIFO list of node-local objects. Allocation and freeing of node local
+ *   objects goes first to this list.
+ *
+ * - 2 Lists of slab pages, free and partial pages. If an allocation misses
+ *   the object list, it tries from the partial list, then the free list.
+ *   After freeing an object to the object list, if it is over a watermark,
+ *   some objects are freed back to pages. If an allocation misses these lists,
+ *   a new slab page is allocated from the page allocator. If the free list
+ *   reaches a watermark, some of its pages are returned to the page allocator.
+ *
+ * - A remote free queue, where objects freed that did not come from the local
+ *   node are queued to. When this reaches a watermark, the objects are
+ *   flushed.
+ *
+ * - A remotely freed queue, where objects allocated from this CPU are flushed
+ *   to from other CPUs' remote free queues. kmem_cache_remote_free->lock is
+ *   used to protect access to this queue.
+ *
+ *   When the remotely freed queue reaches a watermark, a flag is set to tell
+ *   the owner CPU to check it. The owner CPU will then check the queue on the
+ *   next allocation that misses the object list. It will move all objects from
+ *   this list onto the object list and then allocate one.
+ *
+ *   This system of remote queueing is intended to reduce lock and remote
+ *   cacheline acquisitions, and give a cooling off period for remotely freed
+ *   objects before they are re-allocated.
+ *
+ * node specific allocations from somewhere other than the local node are
+ * handled by a per-node list which is the same as the above per-CPU data
+ * structures except for the following differences:
+ *
+ * - kmem_cache_node->list_lock is used to protect access for multiple CPUs to
+ *   allocate from a given node.
+ *
+ * - There is no remote free queue. Nodes don't free objects, CPUs do.
+ */
+
+static inline void slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
+{
+#ifdef CONFIG_SLQB_STATS
+	list->stats[si]++;
+#endif
+}
+
+static inline void slqb_stat_add(struct kmem_cache_list *list, enum stat_item si,
+					unsigned long nr)
+{
+#ifdef CONFIG_SLQB_STATS
+	list->stats[si] += nr;
+#endif
+}
+
+static inline int slqb_page_to_nid(struct slqb_page *page)
+{
+	return page_to_nid(&page->page);
+}
+
+static inline void *slqb_page_address(struct slqb_page *page)
+{
+	return page_address(&page->page);
+}
+
+static inline struct zone *slqb_page_zone(struct slqb_page *page)
+{
+	return page_zone(&page->page);
+}
+
+static inline int virt_to_nid(const void *addr)
+{
+#ifdef virt_to_page_fast
+	return page_to_nid(virt_to_page_fast(addr));
+#else
+	return page_to_nid(virt_to_page(addr));
+#endif
+}
+
+static inline struct slqb_page *virt_to_head_slqb_page(const void *addr)
+{
+	struct page *p;
+
+	p = virt_to_head_page(addr);
+	return (struct slqb_page *)p;
+}
+
+static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
+						unsigned int order)
+{
+	struct page *p;
+
+	if (nid == -1)
+		p = alloc_pages(flags, order);
+	else
+		p = alloc_pages_node(nid, flags, order);
+
+	return (struct slqb_page *)p;
+}
+
+static inline void __free_slqb_pages(struct slqb_page *page, unsigned int order)
+{
+	struct page *p = &page->page;
+
+	reset_page_mapcount(p);
+	p->mapping = NULL;
+	VM_BUG_ON(!(p->flags & PG_SLQB_BIT));
+	p->flags &= ~PG_SLQB_BIT;
+
+	__free_pages(p, order);
+}
+
+#ifdef CONFIG_SLQB_DEBUG
+static inline int slab_debug(struct kmem_cache *s)
+{
+	return (s->flags &
+			(SLAB_DEBUG_FREE |
+			 SLAB_RED_ZONE |
+			 SLAB_POISON |
+			 SLAB_STORE_USER |
+			 SLAB_TRACE));
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+	return s->flags & SLAB_POISON;
+}
+#else
+static inline int slab_debug(struct kmem_cache *s)
+{
+	return 0;
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+	return 0;
+}
+#endif
+
+#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
+				SLAB_POISON | SLAB_STORE_USER)
+
+/* Internal SLQB flags */
+#define __OBJECT_POISON		0x80000000 /* Poison object */
+
+/* Not all arches define cache_line_size */
+#ifndef cache_line_size
+#define cache_line_size()	L1_CACHE_BYTES
+#endif
+
+#ifdef CONFIG_SMP
+static struct notifier_block slab_notifier;
+#endif
+
+/* A list of all slab caches on the system */
+static DECLARE_RWSEM(slqb_lock);
+static LIST_HEAD(slab_caches);
+
+/*
+ * Tracking user of a slab.
+ */
+struct track {
+	void *addr;		/* Called from address */
+	int cpu;		/* Was running on cpu */
+	int pid;		/* Pid context */
+	unsigned long when;	/* When did the operation occur */
+};
+
+enum track_item { TRACK_ALLOC, TRACK_FREE };
+
+static struct kmem_cache kmem_cache_cache;
+
+#ifdef CONFIG_SLQB_SYSFS
+static int sysfs_slab_add(struct kmem_cache *s);
+static void sysfs_slab_remove(struct kmem_cache *s);
+#else
+static inline int sysfs_slab_add(struct kmem_cache *s)
+{
+	return 0;
+}
+static inline void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kmem_cache_free(&kmem_cache_cache, s);
+}
+#endif
+
+/********************************************************************
+ * 			Core slab cache functions
+ *******************************************************************/
+
+static int __slab_is_available __read_mostly;
+int slab_is_available(void)
+{
+	return __slab_is_available;
+}
+
+static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
+{
+#ifdef CONFIG_SMP
+	VM_BUG_ON(!s->cpu_slab[cpu]);
+	return s->cpu_slab[cpu];
+#else
+	return &s->cpu_slab;
+#endif
+}
+
+static inline int check_valid_pointer(struct kmem_cache *s,
+				struct slqb_page *page, const void *object)
+{
+	void *base;
+
+	base = slqb_page_address(page);
+	if (object < base || object >= base + s->objects * s->size ||
+		(object - base) % s->size) {
+		return 0;
+	}
+
+	return 1;
+}
+
+static inline void *get_freepointer(struct kmem_cache *s, void *object)
+{
+	return *(void **)(object + s->offset);
+}
+
+static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
+{
+	*(void **)(object + s->offset) = fp;
+}
+
+/* Loop over all objects in a slab */
+#define for_each_object(__p, __s, __addr) \
+	for (__p = (__addr); __p < (__addr) + (__s)->objects * (__s)->size;\
+			__p += (__s)->size)
+
+/* Scan freelist */
+#define for_each_free_object(__p, __s, __free) \
+	for (__p = (__free); (__p) != NULL; __p = get_freepointer((__s),\
+		__p))
+
+#ifdef CONFIG_SLQB_DEBUG
+/*
+ * Debug settings:
+ */
+#ifdef CONFIG_SLQB_DEBUG_ON
+static int slqb_debug __read_mostly = DEBUG_DEFAULT_FLAGS;
+#else
+static int slqb_debug __read_mostly;
+#endif
+
+static char *slqb_debug_slabs;
+
+/*
+ * Object debugging
+ */
+static void print_section(char *text, u8 *addr, unsigned int length)
+{
+	int i, offset;
+	int newline = 1;
+	char ascii[17];
+
+	ascii[16] = 0;
+
+	for (i = 0; i < length; i++) {
+		if (newline) {
+			printk(KERN_ERR "%8s 0x%p: ", text, addr + i);
+			newline = 0;
+		}
+		printk(KERN_CONT " %02x", addr[i]);
+		offset = i % 16;
+		ascii[offset] = isgraph(addr[i]) ? addr[i] : '.';
+		if (offset == 15) {
+			printk(KERN_CONT " %s\n", ascii);
+			newline = 1;
+		}
+	}
+	if (!newline) {
+		i %= 16;
+		while (i < 16) {
+			printk(KERN_CONT "   ");
+			ascii[i] = ' ';
+			i++;
+		}
+		printk(KERN_CONT " %s\n", ascii);
+	}
+}
+
+static struct track *get_track(struct kmem_cache *s, void *object,
+	enum track_item alloc)
+{
+	struct track *p;
+
+	if (s->offset)
+		p = object + s->offset + sizeof(void *);
+	else
+		p = object + s->inuse;
+
+	return p + alloc;
+}
+
+static void set_track(struct kmem_cache *s, void *object,
+				enum track_item alloc, void *addr)
+{
+	struct track *p;
+
+	if (s->offset)
+		p = object + s->offset + sizeof(void *);
+	else
+		p = object + s->inuse;
+
+	p += alloc;
+	if (addr) {
+		p->addr = addr;
+		p->cpu = raw_smp_processor_id();
+		p->pid = current ? current->pid : -1;
+		p->when = jiffies;
+	} else
+		memset(p, 0, sizeof(struct track));
+}
+
+static void init_tracking(struct kmem_cache *s, void *object)
+{
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
+	set_track(s, object, TRACK_FREE, NULL);
+	set_track(s, object, TRACK_ALLOC, NULL);
+}
+
+static void print_track(const char *s, struct track *t)
+{
+	if (!t->addr)
+		return;
+
+	printk(KERN_ERR "INFO: %s in ", s);
+	__print_symbol("%s", (unsigned long)t->addr);
+	printk(" age=%lu cpu=%u pid=%d\n", jiffies - t->when, t->cpu, t->pid);
+}
+
+static void print_tracking(struct kmem_cache *s, void *object)
+{
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
+	print_track("Allocated", get_track(s, object, TRACK_ALLOC));
+	print_track("Freed", get_track(s, object, TRACK_FREE));
+}
+
+static void print_page_info(struct slqb_page *page)
+{
+	printk(KERN_ERR "INFO: Slab 0x%p used=%u fp=0x%p flags=0x%04lx\n",
+		page, page->inuse, page->freelist, page->flags);
+
+}
+
+static void slab_bug(struct kmem_cache *s, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printk(KERN_ERR "========================================"
+			"=====================================\n");
+	printk(KERN_ERR "BUG %s: %s\n", s->name, buf);
+	printk(KERN_ERR "----------------------------------------"
+			"-------------------------------------\n\n");
+}
+
+static void slab_fix(struct kmem_cache *s, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printk(KERN_ERR "FIX %s: %s\n", s->name, buf);
+}
+
+static void print_trailer(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+	unsigned int off;	/* Offset of last byte */
+	u8 *addr = slqb_page_address(page);
+
+	print_tracking(s, p);
+
+	print_page_info(page);
+
+	printk(KERN_ERR "INFO: Object 0x%p @offset=%tu fp=0x%p\n\n",
+			p, p - addr, get_freepointer(s, p));
+
+	if (p > addr + 16)
+		print_section("Bytes b4", p - 16, 16);
+
+	print_section("Object", p, min(s->objsize, 128));
+
+	if (s->flags & SLAB_RED_ZONE)
+		print_section("Redzone", p + s->objsize,
+			s->inuse - s->objsize);
+
+	if (s->offset)
+		off = s->offset + sizeof(void *);
+	else
+		off = s->inuse;
+
+	if (s->flags & SLAB_STORE_USER)
+		off += 2 * sizeof(struct track);
+
+	if (off != s->size)
+		/* Beginning of the filler is the free pointer */
+		print_section("Padding", p + off, s->size - off);
+
+	dump_stack();
+}
+
+static void object_err(struct kmem_cache *s, struct slqb_page *page,
+			u8 *object, char *reason)
+{
+	slab_bug(s, reason);
+	print_trailer(s, page, object);
+}
+
+static void slab_err(struct kmem_cache *s, struct slqb_page *page, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	slab_bug(s, fmt);
+	print_page_info(page);
+	dump_stack();
+}
+
+static void init_object(struct kmem_cache *s, void *object, int active)
+{
+	u8 *p = object;
+
+	if (s->flags & __OBJECT_POISON) {
+		memset(p, POISON_FREE, s->objsize - 1);
+		p[s->objsize - 1] = POISON_END;
+	}
+
+	if (s->flags & SLAB_RED_ZONE)
+		memset(p + s->objsize,
+			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
+			s->inuse - s->objsize);
+}
+
+static u8 *check_bytes(u8 *start, unsigned int value, unsigned int bytes)
+{
+	while (bytes) {
+		if (*start != (u8)value)
+			return start;
+		start++;
+		bytes--;
+	}
+	return NULL;
+}
+
+static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
+						void *from, void *to)
+{
+	slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
+	memset(from, data, to - from);
+}
+
+static int check_bytes_and_report(struct kmem_cache *s, struct slqb_page *page,
+			u8 *object, char *what,
+			u8 *start, unsigned int value, unsigned int bytes)
+{
+	u8 *fault;
+	u8 *end;
+
+	fault = check_bytes(start, value, bytes);
+	if (!fault)
+		return 1;
+
+	end = start + bytes;
+	while (end > fault && end[-1] == value)
+		end--;
+
+	slab_bug(s, "%s overwritten", what);
+	printk(KERN_ERR "INFO: 0x%p-0x%p. First byte 0x%x instead of 0x%x\n",
+					fault, end - 1, fault[0], value);
+	print_trailer(s, page, object);
+
+	restore_bytes(s, what, value, fault, end);
+	return 0;
+}
+
+/*
+ * Object layout:
+ *
+ * object address
+ * 	Bytes of the object to be managed.
+ * 	If the freepointer may overlay the object then the free
+ * 	pointer is the first word of the object.
+ *
+ * 	Poisoning uses 0x6b (POISON_FREE) and the last byte is
+ * 	0xa5 (POISON_END)
+ *
+ * object + s->objsize
+ * 	Padding to reach word boundary. This is also used for Redzoning.
+ * 	Padding is extended by another word if Redzoning is enabled and
+ * 	objsize == inuse.
+ *
+ * 	We fill with 0xbb (RED_INACTIVE) for inactive objects and with
+ * 	0xcc (RED_ACTIVE) for objects in use.
+ *
+ * object + s->inuse
+ * 	Meta data starts here.
+ *
+ * 	A. Free pointer (if we cannot overwrite object on free)
+ * 	B. Tracking data for SLAB_STORE_USER
+ * 	C. Padding to reach required alignment boundary or at minimum
+ * 		one word if debugging is on to be able to detect writes
+ * 		before the word boundary.
+ *
+ *	Padding is done using 0x5a (POISON_INUSE)
+ *
+ * object + s->size
+ * 	Nothing is used beyond s->size.
+ */
+
+static int check_pad_bytes(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+	unsigned long off = s->inuse;	/* The end of info */
+
+	if (s->offset)
+		/* Freepointer is placed after the object. */
+		off += sizeof(void *);
+
+	if (s->flags & SLAB_STORE_USER)
+		/* We also have user information there */
+		off += 2 * sizeof(struct track);
+
+	if (s->size == off)
+		return 1;
+
+	return check_bytes_and_report(s, page, p, "Object padding",
+				p + off, POISON_INUSE, s->size - off);
+}
+
+static int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+{
+	u8 *start;
+	u8 *fault;
+	u8 *end;
+	int length;
+	int remainder;
+
+	if (!(s->flags & SLAB_POISON))
+		return 1;
+
+	start = slqb_page_address(page);
+	end = start + (PAGE_SIZE << s->order);
+	length = s->objects * s->size;
+	remainder = end - (start + length);
+	if (!remainder)
+		return 1;
+
+	fault = check_bytes(start + length, POISON_INUSE, remainder);
+	if (!fault)
+		return 1;
+	while (end > fault && end[-1] == POISON_INUSE)
+		end--;
+
+	slab_err(s, page, "Padding overwritten. 0x%p-0x%p", fault, end - 1);
+	print_section("Padding", start, length);
+
+	restore_bytes(s, "slab padding", POISON_INUSE, start, end);
+	return 0;
+}
+
+static int check_object(struct kmem_cache *s, struct slqb_page *page,
+					void *object, int active)
+{
+	u8 *p = object;
+	u8 *endobject = object + s->objsize;
+
+	if (s->flags & SLAB_RED_ZONE) {
+		unsigned int red =
+			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE;
+
+		if (!check_bytes_and_report(s, page, object, "Redzone",
+			endobject, red, s->inuse - s->objsize))
+			return 0;
+	} else {
+		if ((s->flags & SLAB_POISON) && s->objsize < s->inuse) {
+			check_bytes_and_report(s, page, p, "Alignment padding",
+				endobject, POISON_INUSE, s->inuse - s->objsize);
+		}
+	}
+
+	if (s->flags & SLAB_POISON) {
+		if (!active && (s->flags & __OBJECT_POISON) &&
+			(!check_bytes_and_report(s, page, p, "Poison", p,
+					POISON_FREE, s->objsize - 1) ||
+			 !check_bytes_and_report(s, page, p, "Poison",
+				p + s->objsize - 1, POISON_END, 1)))
+			return 0;
+		/*
+		 * check_pad_bytes cleans up on its own.
+		 */
+		check_pad_bytes(s, page, p);
+	}
+
+	return 1;
+}
+
+static int check_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	if (!(page->flags & PG_SLQB_BIT)) {
+		slab_err(s, page, "Not a valid slab page");
+		return 0;
+	}
+	if (page->inuse == 0) {
+		slab_err(s, page, "inuse before free / after alloc", s->name);
+		return 0;
+	}
+	if (page->inuse > s->objects) {
+		slab_err(s, page, "inuse %u > max %u",
+			page->inuse, s->objects);
+		return 0;
+	}
+	/* Slab_pad_check fixes things up after itself */
+	slab_pad_check(s, page);
+	return 1;
+}
+
+static void trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
+{
+	if (s->flags & SLAB_TRACE) {
+		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+			s->name,
+			alloc ? "alloc" : "free",
+			object, page->inuse,
+			page->freelist);
+
+		if (!alloc)
+			print_section("Object", (void *)object, s->objsize);
+
+		dump_stack();
+	}
+}
+
+static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
+								void *object)
+{
+	if (!slab_debug(s))
+		return;
+
+	if (!(s->flags & (SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON)))
+		return;
+
+	init_object(s, object, 0);
+	init_tracking(s, object);
+}
+
+static int alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
+{
+	struct slqb_page *page;
+	page = virt_to_head_slqb_page(object);
+
+	if (!check_slab(s, page))
+		goto bad;
+
+	if (!check_valid_pointer(s, page, object)) {
+		object_err(s, page, object, "Freelist Pointer check fails");
+		goto bad;
+	}
+
+	if (object && !check_object(s, page, object, 0))
+		goto bad;
+
+	/* Success perform special debug activities for allocs */
+	if (s->flags & SLAB_STORE_USER)
+		set_track(s, object, TRACK_ALLOC, addr);
+	trace(s, page, object, 1);
+	init_object(s, object, 1);
+	return 1;
+
+bad:
+	return 0;
+}
+
+static int free_debug_processing(struct kmem_cache *s, void *object, void *addr)
+{
+	struct slqb_page *page;
+	page = virt_to_head_slqb_page(object);
+
+	if (!check_slab(s, page))
+		goto fail;
+
+	if (!check_valid_pointer(s, page, object)) {
+		slab_err(s, page, "Invalid object pointer 0x%p", object);
+		goto fail;
+	}
+
+	if (!check_object(s, page, object, 1))
+		return 0;
+
+	/* Special debug activities for freeing objects */
+	if (s->flags & SLAB_STORE_USER)
+		set_track(s, object, TRACK_FREE, addr);
+	trace(s, page, object, 0);
+	init_object(s, object, 0);
+	return 1;
+
+fail:
+	slab_fix(s, "Object at 0x%p not freed", object);
+	return 0;
+}
+
+static int __init setup_slqb_debug(char *str)
+{
+	slqb_debug = DEBUG_DEFAULT_FLAGS;
+	if (*str++ != '=' || !*str)
+		/*
+		 * No options specified. Switch on full debugging.
+		 */
+		goto out;
+
+	if (*str == ',')
+		/*
+		 * No options but restriction on slabs. This means full
+		 * debugging for slabs matching a pattern.
+		 */
+		goto check_slabs;
+
+	slqb_debug = 0;
+	if (*str == '-')
+		/*
+		 * Switch off all debugging measures.
+		 */
+		goto out;
+
+	/*
+	 * Determine which debug features should be switched on
+	 */
+	for (; *str && *str != ','; str++) {
+		switch (tolower(*str)) {
+		case 'f':
+			slqb_debug |= SLAB_DEBUG_FREE;
+			break;
+		case 'z':
+			slqb_debug |= SLAB_RED_ZONE;
+			break;
+		case 'p':
+			slqb_debug |= SLAB_POISON;
+			break;
+		case 'u':
+			slqb_debug |= SLAB_STORE_USER;
+			break;
+		case 't':
+			slqb_debug |= SLAB_TRACE;
+			break;
+		default:
+			printk(KERN_ERR "slqb_debug option '%c' "
+				"unknown. skipped\n", *str);
+		}
+	}
+
+check_slabs:
+	if (*str == ',')
+		slqb_debug_slabs = str + 1;
+out:
+	return 1;
+}
+
+__setup("slqb_debug", setup_slqb_debug);
+
+static unsigned long kmem_cache_flags(unsigned long objsize,
+	unsigned long flags, const char *name,
+	void (*ctor)(void *))
+{
+	/*
+	 * Enable debugging if selected on the kernel commandline.
+	 */
+	if (slqb_debug && (!slqb_debug_slabs ||
+	    strncmp(slqb_debug_slabs, name,
+		strlen(slqb_debug_slabs)) == 0))
+			flags |= slqb_debug;
+
+	return flags;
+}
+#else
+static inline void setup_object_debug(struct kmem_cache *s,
+			struct slqb_page *page, void *object) {}
+
+static inline int alloc_debug_processing(struct kmem_cache *s,
+	void *object, void *addr) { return 0; }
+
+static inline int free_debug_processing(struct kmem_cache *s,
+	void *object, void *addr) { return 0; }
+
+static inline int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+			{ return 1; }
+static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
+			void *object, int active) { return 1; }
+static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page) {}
+static inline unsigned long kmem_cache_flags(unsigned long objsize,
+	unsigned long flags, const char *name, void (*ctor)(void *))
+{
+	return flags;
+}
+#define slqb_debug 0
+#endif
+
+/*
+ * allocate a new slab (return its corresponding struct slqb_page)
+ */
+static struct slqb_page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+{
+	struct slqb_page *page;
+	int pages = 1 << s->order;
+
+	flags |= s->allocflags;
+
+	page = alloc_slqb_pages_node(node, flags, s->order);
+	if (!page)
+		return NULL;
+
+	mod_zone_page_state(slqb_page_zone(page),
+		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+		pages);
+
+	return page;
+}
+
+/*
+ * Called once for each object on a new slab page
+ */
+static void setup_object(struct kmem_cache *s, struct slqb_page *page,
+				void *object)
+{
+	setup_object_debug(s, page, object);
+	if (unlikely(s->ctor))
+		s->ctor(object);
+}
+
+/*
+ * Allocate a new slab, set up its object list.
+ */
+static struct slqb_page *new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
+{
+	struct slqb_page *page;
+	void *start;
+	void *last;
+	void *p;
+
+	BUG_ON(flags & GFP_SLAB_BUG_MASK);
+
+	page = allocate_slab(s,
+		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
+	if (!page)
+		goto out;
+
+	page->flags |= PG_SLQB_BIT;
+
+	start = page_address(&page->page);
+
+	if (unlikely(slab_poison(s)))
+		memset(start, POISON_INUSE, PAGE_SIZE << s->order);
+
+	start += colour;
+
+	last = start;
+	for_each_object(p, s, start) {
+		setup_object(s, page, p);
+		set_freepointer(s, last, p);
+		last = p;
+	}
+	set_freepointer(s, last, NULL);
+
+	page->freelist = start;
+	page->inuse = 0;
+out:
+	return page;
+}
+
+/*
+ * Free a slab page back to the page allocator
+ */
+static void __free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	int pages = 1 << s->order;
+
+	if (unlikely(slab_debug(s))) {
+		void *p;
+
+		slab_pad_check(s, page);
+		for_each_free_object(p, s, page->freelist)
+			check_object(s, page, p, 0);
+	}
+
+	mod_zone_page_state(slqb_page_zone(page),
+		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+		-pages);
+
+	__free_slqb_pages(page, s->order);
+}
+
+static void rcu_free_slab(struct rcu_head *h)
+{
+	struct slqb_page *page;
+
+	page = container_of((struct list_head *)h, struct slqb_page, lru);
+	__free_slab(page->list->cache, page);
+}
+
+static void free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	VM_BUG_ON(page->inuse);
+	if (unlikely(s->flags & SLAB_DESTROY_BY_RCU))
+		call_rcu(&page->rcu_head, rcu_free_slab);
+	else
+		__free_slab(s, page);
+}
+
+/*
+ * Return an object to its slab.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static int free_object_to_page(struct kmem_cache *s, struct kmem_cache_list *l, struct slqb_page *page, void *object)
+{
+	VM_BUG_ON(page->list != l);
+
+	set_freepointer(s, object, page->freelist);
+	page->freelist = object;
+	page->inuse--;
+
+	if (!page->inuse) {
+		if (likely(s->objects > 1)) {
+			l->nr_partial--;
+			list_del(&page->lru);
+		}
+		l->nr_slabs--;
+		free_slab(s, page);
+		slqb_stat_inc(l, FLUSH_SLAB_FREE);
+		return 1;
+	} else if (page->inuse + 1 == s->objects) {
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, FLUSH_SLAB_PARTIAL);
+		return 0;
+	}
+	return 0;
+}
+
+#ifdef CONFIG_SMP
+static noinline void slab_free_to_remote(struct kmem_cache *s,
+		struct slqb_page *page, void *object,
+		struct kmem_cache_cpu *c);
+#endif
+
+/*
+ * Flush a batch of objects from a list's LIFO freelist. They are returned to
+ * their pages if those pages belong to this list, or moved to our CPU's
+ * remote-free cache if they do not.
+ *
+ * Doesn't flush the entire list. flush_free_list_all does.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void flush_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	struct kmem_cache_cpu *c;
+	void **head;
+	int nr;
+
+	nr = l->freelist.nr;
+	if (unlikely(!nr))
+		return;
+
+	nr = min(slab_freebatch(s), nr);
+
+	slqb_stat_inc(l, FLUSH_FREE_LIST);
+	slqb_stat_add(l, FLUSH_FREE_LIST_OBJECTS, nr);
+
+	c = get_cpu_slab(s, smp_processor_id());
+
+	l->freelist.nr -= nr;
+	head = l->freelist.head;
+
+	do {
+		struct slqb_page *page;
+		void **object;
+
+		object = head;
+		VM_BUG_ON(!object);
+		head = get_freepointer(s, object);
+		page = virt_to_head_slqb_page(object);
+
+#ifdef CONFIG_SMP
+		if (page->list != l) {
+			slab_free_to_remote(s, page, object, c);
+			slqb_stat_inc(l, FLUSH_FREE_LIST_REMOTE);
+		} else
+#endif
+			free_object_to_page(s, l, page, object);
+
+		nr--;
+	} while (nr);
+
+	l->freelist.head = head;
+	if (!l->freelist.nr)
+		l->freelist.tail = NULL;
+}
+
+static void flush_free_list_all(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	while (l->freelist.nr)
+		flush_free_list(s, l);
+}
+
+#ifdef CONFIG_SMP
+/*
+ * If enough objects have been remotely freed back to this list,
+ * remote_free_check will be set, in which case we eventually come here to
+ * move those objects off our remote_free list and onto our LIFO freelist.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	void **head, **tail;
+	int nr;
+
+	VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
+
+	if (!l->remote_free.list.nr)
+		return;
+
+	l->remote_free_check = 0;
+	head = l->remote_free.list.head;
+	/* Get the head hot for the likely subsequent allocation or flush */
+	prefetchw(head);
+
+	spin_lock(&l->remote_free.lock);
+	l->remote_free.list.head = NULL;
+	tail = l->remote_free.list.tail;
+	l->remote_free.list.tail = NULL;
+	nr = l->remote_free.list.nr;
+	l->remote_free.list.nr = 0;
+	spin_unlock(&l->remote_free.lock);
+
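+	/* Splice the claimed chain onto the tail of our LIFO freelist */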
+	if (!l->freelist.nr)
+		l->freelist.head = head;
+	else
+		set_freepointer(s, l->freelist.tail, head);
+	l->freelist.tail = tail;
+
+	l->freelist.nr += nr;
+
+	slqb_stat_inc(l, CLAIM_REMOTE_LIST);
+	slqb_stat_add(l, CLAIM_REMOTE_LIST_OBJECTS, nr);
+}
+#endif
+
+/*
+ * Allocation fastpath. Get an object from the list's LIFO freelist, or
+ * return NULL if it is empty.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static __always_inline void *__cache_list_get_object(struct kmem_cache *s,
+		struct kmem_cache_list *l)
+{
+	void *object;
+
+	object = l->freelist.head;
+	if (likely(object)) {
+		void *next = get_freepointer(s, object);
+		VM_BUG_ON(!l->freelist.nr);
+		l->freelist.nr--;
+		l->freelist.head = next;
+//		if (next)
+//			prefetchw(next);
+		return object;
+	}
+	VM_BUG_ON(l->freelist.nr);
+
+#ifdef CONFIG_SMP
+	if (unlikely(l->remote_free_check)) {
+		claim_remote_free_list(s, l);
+
+		if (l->freelist.nr > slab_hiwater(s))
+			flush_free_list(s, l);
+
+		/* repetition here helps gcc :( */
+		object = l->freelist.head;
+		if (likely(object)) {
+			void *next = get_freepointer(s, object);
+			VM_BUG_ON(!l->freelist.nr);
+			l->freelist.nr--;
+			l->freelist.head = next;
+//			if (next)
+//				prefetchw(next);
+			return object;
+		}
+		VM_BUG_ON(l->freelist.nr);
+	}
+#endif
+
+	return NULL;
+}
+
+/*
+ * Slow(er) path. Get a page from this list's existing pages. This will be
+ * either a new, empty page (when __slab_alloc_page has just been called;
+ * empty pages otherwise never get queued up on the lists) or a partial page
+ * already on the list.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static noinline void *__cache_list_get_page(struct kmem_cache *s,
+		struct kmem_cache_list *l)
+{
+	struct slqb_page *page;
+	void *object;
+
+	if (unlikely(!l->nr_partial))
+		return NULL;
+
+	page = list_first_entry(&l->partial, struct slqb_page, lru);
+	VM_BUG_ON(page->inuse == s->objects);
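+	/*
+	 * If this allocation takes the slab's last free object, drop it from
+	 * the partial list: full slabs are not kept on any list.
+	 */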
+	if (page->inuse + 1 == s->objects) {
+		l->nr_partial--;
+		list_del(&page->lru);
+/*XXX		list_move(&page->lru, &l->full); */
+	}
+
+	VM_BUG_ON(!page->freelist);
+
+	page->inuse++;
+
+//	VM_BUG_ON(node != -1 && node != slqb_page_to_nid(page));
+
+	object = page->freelist;
+	page->freelist = get_freepointer(s, object);
+	if (page->freelist)
+		prefetchw(page->freelist);
+	VM_BUG_ON((page->inuse == s->objects) != (page->freelist == NULL));
+	slqb_stat_inc(l, ALLOC_SLAB_FILL);
+
+	return object;
+}
+
+/*
+ * Allocation slowpath. Allocate a new slab page from the page allocator, and
+ * put it on the list's partial list. Must be followed by an allocation so
+ * that we don't have dangling empty pages on the partial list.
+ *
+ * Returns NULL on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void *__slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags,
+		int node)
+{
+	struct slqb_page *page;
+	struct kmem_cache_list *l;
+	struct kmem_cache_cpu *c;
+	unsigned int colour;
+	void *object;
+
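+	/* Pick this slab's colour and advance the per-CPU colour cursor */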
+	c = get_cpu_slab(s, smp_processor_id());
+	colour = c->colour_next;
+	c->colour_next += s->colour_off;
+	if (c->colour_next >= s->colour_range)
+		c->colour_next = 0;
+
+	/* XXX: load any partial? */
+
+	/* Caller handles __GFP_ZERO */
+	gfpflags &= ~__GFP_ZERO;
+
+	if (gfpflags & __GFP_WAIT)
+		local_irq_enable();
+	page = new_slab_page(s, gfpflags, node, colour);
+	if (gfpflags & __GFP_WAIT)
+		local_irq_disable();
+	if (unlikely(!page))
+		return NULL;
+
+	if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == numa_node_id())) {
+		struct kmem_cache_cpu *c;
+		int cpu = smp_processor_id();
+
+		c = get_cpu_slab(s, cpu);
+		l = &c->list;
+		page->list = l;
+
+		l->nr_slabs++;
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, ALLOC);
+		slqb_stat_inc(l, ALLOC_SLAB_NEW);
+		object = __cache_list_get_page(s, l);
+#ifdef CONFIG_NUMA
+	} else {
+		struct kmem_cache_node *n;
+
+		n = s->node[slqb_page_to_nid(page)];
+		l = &n->list;
+		page->list = l;
+
+		spin_lock(&n->list_lock);
+		l->nr_slabs++;
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, ALLOC);
+		slqb_stat_inc(l, ALLOC_SLAB_NEW);
+		object = __cache_list_get_page(s, l);
+		spin_unlock(&n->list_lock);
+#endif
+	}
+	VM_BUG_ON(!object);
+	return object;
+}
+
+#ifdef CONFIG_NUMA
+static noinline int alternate_nid(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+	if (in_interrupt() || (gfpflags & __GFP_THISNODE))
+		return node;
+	if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD))
+		return cpuset_mem_spread_node();
+	else if (current->mempolicy)
+		return slab_node(current->mempolicy);
+	return node;
+}
+
+/*
+ * Allocate an object from a remote node. Return NULL if none could be found
+ * (in which case, caller should allocate a new slab)
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void *__remote_slab_alloc(struct kmem_cache *s,
+		gfp_t gfpflags, int node)
+{
+	struct kmem_cache_node *n;
+	struct kmem_cache_list *l;
+	void *object;
+
+	n = s->node[node];
+	if (unlikely(!n)) /* node has no memory */
+		return NULL;
+	l = &n->list;
+
+//	if (unlikely(!(l->freelist.nr | l->nr_partial | l->remote_free_check)))
+//		return NULL;
+
+	spin_lock(&n->list_lock);
+
+	object = __cache_list_get_object(s, l);
+	if (unlikely(!object)) {
+		object = __cache_list_get_page(s, l);
+		if (unlikely(!object)) {
+			spin_unlock(&n->list_lock);
+			return __slab_alloc_page(s, gfpflags, node);
+		}
+	}
+	if (likely(object))
+		slqb_stat_inc(l, ALLOC);
+	spin_unlock(&n->list_lock);
+	return object;
+}
+#endif
+
+/*
+ * Main allocation path. Return an object, or NULL on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void *__slab_alloc(struct kmem_cache *s,
+		gfp_t gfpflags, int node)
+{
+	void *object;
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_list *l;
+
+#ifdef CONFIG_NUMA
+	if (unlikely(node != -1) && unlikely(node != numa_node_id()))
+		return __remote_slab_alloc(s, gfpflags, node);
+#endif
+
+	c = get_cpu_slab(s, smp_processor_id());
+	VM_BUG_ON(!c);
+	l = &c->list;
+	object = __cache_list_get_object(s, l);
+	if (unlikely(!object)) {
+		object = __cache_list_get_page(s, l);
+		if (unlikely(!object))
+			return __slab_alloc_page(s, gfpflags, node);
+	}
+	if (likely(object))
+		slqb_stat_inc(l, ALLOC);
+	return object;
+}
+
+/*
+ * Perform some interrupts-on processing around the main allocation path
+ * (debug checking and memset()ing).
+ */
+static __always_inline void *slab_alloc(struct kmem_cache *s,
+		gfp_t gfpflags, int node, void *addr)
+{
+	void *object;
+	unsigned long flags;
+
+again:
+	local_irq_save(flags);
+	object = __slab_alloc(s, gfpflags, node);
+	local_irq_restore(flags);
+
+	if (unlikely(slab_debug(s)) && likely(object)) {
+		if (unlikely(!alloc_debug_processing(s, object, addr)))
+			goto again;
+	}
+
+	if (unlikely(gfpflags & __GFP_ZERO) && likely(object))
+		memset(object, 0, s->objsize);
+
+	return object;
+}
+
+static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s,
+		gfp_t gfpflags, void *caller)
+{
+	int node = -1;
+#ifdef CONFIG_NUMA
+	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+		node = alternate_nid(s, gfpflags, node);
+#endif
+	return slab_alloc(s, gfpflags, node, caller);
+}
+
+void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+{
+	return __kmem_cache_alloc(s, gfpflags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc);
+
+#ifdef CONFIG_NUMA
+void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+	return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node);
+#endif
+
+#ifdef CONFIG_SMP
+/*
+ * Flush this CPU's cache of remotely-freed objects back to the list they
+ * originated from. They end up on that list's remote_free list, and we
+ * eventually set its remote_free_check flag if enough objects are on it.
+ *
+ * This seems convoluted, but it keeps us from stomping on the target CPU's
+ * fastpath cachelines.
+ *
+ * Must be called with interrupts disabled.
+ */
+static void flush_remote_free_cache(struct kmem_cache *s,
+		struct kmem_cache_cpu *c)
+{
+	struct kmlist *src;
+	struct kmem_cache_list *dst;
+	unsigned int nr;
+	int set;
+
+	src = &c->rlist;
+	nr = src->nr;
+	if (unlikely(!nr))
+		return;
+
+#ifdef CONFIG_SLQB_STATS
+	{
+		struct kmem_cache_list *l = &c->list;
+		slqb_stat_inc(l, FLUSH_RFREE_LIST);
+		slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);
+	}
+#endif
+
+	dst = c->remote_cache_list;
+
+	spin_lock(&dst->remote_free.lock);
+	if (!dst->remote_free.list.head)
+		dst->remote_free.list.head = src->head;
+	else
+		set_freepointer(s, dst->remote_free.list.tail, src->head);
+	dst->remote_free.list.tail = src->tail;
+
+	src->head = NULL;
+	src->tail = NULL;
+	src->nr = 0;
+
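+	/*
+	 * Only set remote_free_check when this flush pushes the remote list
+	 * over the freebatch threshold, so the owning CPU is not prodded on
+	 * every subsequent flush.
+	 */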
+	if (dst->remote_free.list.nr < slab_freebatch(s))
+		set = 1;
+	else
+		set = 0;
+
+	dst->remote_free.list.nr += nr;
+
+	if (unlikely(dst->remote_free.list.nr >= slab_freebatch(s) && set))
+		dst->remote_free_check = 1;
+
+	spin_unlock(&dst->remote_free.lock);
+}
+
+/*
+ * Free an object to this CPU's remote free list.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void slab_free_to_remote(struct kmem_cache *s,
+		struct slqb_page *page, void *object,
+		struct kmem_cache_cpu *c)
+{
+	struct kmlist *r;
+
+	/*
+	 * Our remote free list corresponds to a different list. Must
+	 * flush it and switch.
+	 */
+	if (page->list != c->remote_cache_list) {
+		flush_remote_free_cache(s, c);
+		c->remote_cache_list = page->list;
+	}
+
+	r = &c->rlist;
+	if (!r->head)
+		r->head = object;
+	else
+		set_freepointer(s, r->tail, object);
+	set_freepointer(s, object, NULL);
+	r->tail = object;
+	r->nr++;
+
+	if (unlikely(r->nr > slab_freebatch(s)))
+		flush_remote_free_cache(s, c);
+}
+#endif
+
+/*
+ * Main freeing path. Returns the object to this CPU's LIFO freelist, or to
+ * the remote-free cache if it was allocated on another node.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void __slab_free(struct kmem_cache *s,
+		struct slqb_page *page, void *object)
+{
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_list *l;
+	int thiscpu = smp_processor_id();
+
+	c = get_cpu_slab(s, thiscpu);
+	l = &c->list;
+
+	slqb_stat_inc(l, FREE);
+
+	if (!NUMA_BUILD || !numa_platform ||
+			likely(slqb_page_to_nid(page) == numa_node_id())) {
+		/*
+		 * Freeing fastpath. Collects all local-node objects, not
+		 * just those allocated from our per-CPU list. This allows
+		 * fast transfer of objects from one CPU to another within
+		 * a given node.
+		 */
+		set_freepointer(s, object, l->freelist.head);
+		l->freelist.head = object;
+		if (!l->freelist.nr)
+			l->freelist.tail = object;
+		l->freelist.nr++;
+
+		if (unlikely(l->freelist.nr > slab_hiwater(s)))
+			flush_free_list(s, l);
+
+#ifdef CONFIG_NUMA
+	} else {
+		/*
+		 * Freeing an object that was allocated on a remote node.
+		 */
+		slab_free_to_remote(s, page, object, c);
+		slqb_stat_inc(l, FREE_REMOTE);
+#endif
+	}
+}
+
+/*
+ * Perform some interrupts-on processing around the main freeing path
+ * (debug checking).
+ */
+static __always_inline void slab_free(struct kmem_cache *s,
+		struct slqb_page *page, void *object)
+{
+	unsigned long flags;
+
+	prefetchw(object);
+
+	debug_check_no_locks_freed(object, s->objsize);
+	if (likely(object) && unlikely(slab_debug(s))) {
+		if (unlikely(!free_debug_processing(s, object, __builtin_return_address(0))))
+			return;
+	}
+
+	local_irq_save(flags);
+	__slab_free(s, page, object);
+	local_irq_restore(flags);
+}
+
+void kmem_cache_free(struct kmem_cache *s, void *object)
+{
+	struct slqb_page *page = NULL;
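+
+	/* Only NUMA needs the page, to detect remote-node frees */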
+	if (numa_platform)
+		page = virt_to_head_slqb_page(object);
+	slab_free(s, page, object);
+}
+EXPORT_SYMBOL(kmem_cache_free);
+
+/*
+ * Calculate the page allocation order for a given slab object size.
+ *
+ * Order-0 allocations are preferred, since they do not cause fragmentation in
+ * the page allocator and they have fastpaths there. Higher orders are still
+ * used where needed to limit fragmentation with large objects.
+ */
+static inline int slab_order(int size, int max_order, int frac)
+{
+	int order;
+
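+	/*
+	 * frac bounds acceptable waste: keep increasing the order until the
+	 * wasted space is no more than slab_size / frac (frac == 0 accepts
+	 * any amount of waste).
+	 */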
+	if (fls(size - 1) <= PAGE_SHIFT)
+		order = 0;
+	else
+		order = fls(size - 1) - PAGE_SHIFT;
+	while (order <= max_order) {
+		unsigned long slab_size = PAGE_SIZE << order;
+		unsigned long objects;
+		unsigned long waste;
+
+		objects = slab_size / size;
+		if (!objects)
+			continue;
+
+		waste = slab_size - (objects * size);
+
+		if (waste * frac <= slab_size)
+			break;
+
+		order++;
+	}
+
+	return order;
+}
+
+static inline int calculate_order(int size)
+{
+	int order;
+
+	/*
+	 * Attempt to find best configuration for a slab. This
+	 * works by first attempting to generate a layout with
+	 * the best configuration and backing off gradually.
+	 */
+	order = slab_order(size, 1, 4);
+	if (order <= 1)
+		return order;
+
+	/*
+	 * This size cannot fit in order-1. Allow bigger orders, but
+	 * forget about trying to save space.
+	 */
+	order = slab_order(size, MAX_ORDER, 0);
+	if (order <= MAX_ORDER)
+		return order;
+
+	return -ENOSYS;
+}
+
+/*
+ * Figure out what the alignment of the objects will be.
+ */
+static unsigned long calculate_alignment(unsigned long flags,
+		unsigned long align, unsigned long size)
+{
+	/*
+	 * If the user wants hardware cache aligned objects then follow that
+	 * suggestion if the object is sufficiently large.
+	 *
+	 * The hardware cache alignment cannot override the specified
+ * alignment, though. If that is greater, it is used instead.
+	 */
+	if (flags & SLAB_HWCACHE_ALIGN) {
+		unsigned long ralign = cache_line_size();
+		while (size <= ralign / 2)
+			ralign /= 2;
+		align = max(align, ralign);
+	}
+
+	if (align < ARCH_SLAB_MINALIGN)
+		align = ARCH_SLAB_MINALIGN;
+
+	return ALIGN(align, sizeof(void *));
+}
+
+static void init_kmem_cache_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	l->cache = s;
+	l->freelist.nr = 0;
+	l->freelist.head = NULL;
+	l->freelist.tail = NULL;
+	l->nr_partial = 0;
+	l->nr_slabs = 0;
+	INIT_LIST_HEAD(&l->partial);
+//	INIT_LIST_HEAD(&l->full);
+
+#ifdef CONFIG_SMP
+	l->remote_free_check = 0;
+	spin_lock_init(&l->remote_free.lock);
+	l->remote_free.list.nr = 0;
+	l->remote_free.list.head = NULL;
+	l->remote_free.list.tail = NULL;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+	memset(l->stats, 0, sizeof(l->stats));
+#endif
+}
+
+static void init_kmem_cache_cpu(struct kmem_cache *s,
+			struct kmem_cache_cpu *c)
+{
+	init_kmem_cache_list(s, &c->list);
+
+	c->colour_next = 0;
+#ifdef CONFIG_SMP
+	c->rlist.nr = 0;
+	c->rlist.head = NULL;
+	c->rlist.tail = NULL;
+	c->remote_cache_list = NULL;
+#endif
+}
+
+#ifdef CONFIG_NUMA
+static void init_kmem_cache_node(struct kmem_cache *s, struct kmem_cache_node *n)
+{
+	spin_lock_init(&n->list_lock);
+	init_kmem_cache_list(s, &n->list);
+}
+#endif
+
+/* Initial slabs */
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
+#endif
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache kmem_cpu_cache;
+static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
+#endif
+#endif
+
+#ifdef CONFIG_NUMA
+static struct kmem_cache kmem_node_cache;
+static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
+static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
+{
+	struct kmem_cache_cpu *c;
+
+	c = kmem_cache_alloc_node(&kmem_cpu_cache, GFP_KERNEL, cpu_to_node(cpu));
+	if (!c)
+		return NULL;
+
+	init_kmem_cache_cpu(s, c);
+	return c;
+}
+
+static void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+
+		c = s->cpu_slab[cpu];
+		if (c) {
+			kmem_cache_free(&kmem_cpu_cache, c);
+			s->cpu_slab[cpu] = NULL;
+		}
+	}
+}
+
+static int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+
+		c = s->cpu_slab[cpu];
+		if (c)
+			continue;
+
+		c = alloc_kmem_cache_cpu(s, cpu);
+		if (!c) {
+			free_kmem_cache_cpus(s);
+			return 0;
+		}
+		s->cpu_slab[cpu] = c;
+	}
+	return 1;
+}
+
+#else
+static inline void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+	init_kmem_cache_cpu(s, &s->cpu_slab);
+	return 1;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = s->node[node];
+		if (n) {
+			kmem_cache_free(&kmem_node_cache, n);
+			s->node[node] = NULL;
+		}
+	}
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = kmem_cache_alloc_node(&kmem_node_cache, GFP_KERNEL, node);
+		if (!n) {
+			free_kmem_cache_nodes(s);
+			return 0;
+		}
+		init_kmem_cache_node(s, n);
+		s->node[node] = n;
+	}
+	return 1;
+}
+#else
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+	return 1;
+}
+#endif
+
+/*
+ * calculate_sizes() determines the order and the distribution of data within
+ * a slab object.
+ */
+static int calculate_sizes(struct kmem_cache *s)
+{
+	unsigned long flags = s->flags;
+	unsigned long size = s->objsize;
+	unsigned long align = s->align;
+
+	/*
+	 * Determine if we can poison the object itself. If the user of
+	 * the slab may touch the object after free or before allocation
+	 * then we should never poison the object itself.
+	 */
+	if (slab_poison(s) && !(flags & SLAB_DESTROY_BY_RCU) && !s->ctor)
+		s->flags |= __OBJECT_POISON;
+	else
+		s->flags &= ~__OBJECT_POISON;
+
+	/*
+	 * Round up object size to the next word boundary. We can only
+	 * place the free pointer at word boundaries and this determines
+	 * the possible location of the free pointer.
+	 */
+	size = ALIGN(size, sizeof(void *));
+
+#ifdef CONFIG_SLQB_DEBUG
+	/*
+	 * If we are Redzoning then check if there is some space between the
+	 * end of the object and the free pointer. If not then add an
+	 * additional word to have some bytes to store Redzone information.
+	 */
+	if ((flags & SLAB_RED_ZONE) && size == s->objsize)
+		size += sizeof(void *);
+#endif
+
+	/*
+	 * With that we have determined the number of bytes in actual use
+	 * by the object. This is the potential offset to the free pointer.
+	 */
+	s->inuse = size;
+
+	if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) || s->ctor)) {
+		/*
+		 * Relocate free pointer after the object if it is not
+		 * permitted to overwrite the first word of the object on
+		 * kmem_cache_free.
+		 *
+		 * This is the case if we do RCU, have a constructor or
+		 * destructor or are poisoning the objects.
+		 */
+		s->offset = size;
+		size += sizeof(void *);
+	}
+
+#ifdef CONFIG_SLQB_DEBUG
+	if (flags & SLAB_STORE_USER)
+		/*
+		 * Need to store information about allocs and frees after
+		 * the object.
+		 */
+		size += 2 * sizeof(struct track);
+
+	if (flags & SLAB_RED_ZONE)
+		/*
+		 * Add some empty padding so that we can catch
+		 * overwrites from earlier objects rather than let
+		 * tracking information or the free pointer be
+		 * corrupted if an user writes before the start
+		 * of the object.
+		 */
+		size += sizeof(void *);
+#endif
+
+	/*
+	 * Determine the alignment based on various parameters that the
+	 * user specified and the dynamic determination of cache line size
+	 * on bootup.
+	 */
+	align = calculate_alignment(flags, align, s->objsize);
+
+	/*
+	 * SLQB stores one object immediately after another beginning from
+	 * offset 0. In order to align the objects we have to simply size
+	 * each object to conform to the alignment.
+	 */
+	size = ALIGN(size, align);
+	s->size = size;
+	s->order = calculate_order(size);
+
+	if (s->order < 0)
+		return 0;
+
+	s->allocflags = 0;
+	if (s->order)
+		s->allocflags |= __GFP_COMP;
+
+	if (s->flags & SLAB_CACHE_DMA)
+		s->allocflags |= SLQB_DMA;
+
+	if (s->flags & SLAB_RECLAIM_ACCOUNT)
+		s->allocflags |= __GFP_RECLAIMABLE;
+
+	/*
+	 * Determine the number of objects per slab
+	 */
+	s->objects = (PAGE_SIZE << s->order) / size;
+
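+	/*
+	 * Default watermarks: the free batch size scales with object size,
+	 * and the LIFO freelist is flushed once it grows past four batches.
+	 */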
+	s->freebatch = max(4UL*PAGE_SIZE / size, min(256UL, 64*PAGE_SIZE / size));
+	if (!s->freebatch)
+		s->freebatch = 1;
+	s->hiwater = s->freebatch << 2;
+
+	return !!s->objects;
+
+}
+
+static int kmem_cache_open(struct kmem_cache *s,
+		const char *name, size_t size,
+		size_t align, unsigned long flags,
+		void (*ctor)(void *), int alloc)
+{
+	unsigned int left_over;
+
+	memset(s, 0, kmem_size);
+	s->name = name;
+	s->ctor = ctor;
+	s->objsize = size;
+	s->align = align;
+	s->flags = kmem_cache_flags(size, flags, name, ctor);
+
+	if (!calculate_sizes(s))
+		goto error;
+
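+	/*
+	 * Use the slack at the end of each slab page as the cache colouring
+	 * range; colouring is disabled when debugging is enabled.
+	 */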
+	if (!slab_debug(s)) {
+		left_over = (PAGE_SIZE << s->order) - (s->objects * s->size);
+		s->colour_off = max(cache_line_size(), s->align);
+		s->colour_range = left_over;
+	} else {
+		s->colour_off = 0;
+		s->colour_range = 0;
+	}
+
+	if (likely(alloc)) {
+		if (!alloc_kmem_cache_nodes(s))
+			goto error;
+
+		if (!alloc_kmem_cache_cpus(s))
+			goto error_nodes;
+	}
+
+	down_write(&slqb_lock);
+	sysfs_slab_add(s);
+	list_add(&s->list, &slab_caches);
+	up_write(&slqb_lock);
+
+	return 1;
+
+error_nodes:
+	free_kmem_cache_nodes(s);
+error:
+	if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+	return 0;
+}
+
+/*
+ * Check if a given pointer is valid
+ */
+int kmem_ptr_validate(struct kmem_cache *s, const void *object)
+{
+	struct slqb_page *page = virt_to_head_slqb_page(object);
+
+	if (!(page->flags & PG_SLQB_BIT))
+		return 0;
+
+	/*
+	 * We could also check if the object is on the slabs freelist.
+	 * But this would be too expensive and it seems that the main
+ * purpose of kmem_ptr_validate is to check if the object belongs
+	 * to a certain slab.
+	 */
+	return 1;
+}
+EXPORT_SYMBOL(kmem_ptr_validate);
+
+/*
+ * Determine the size of a slab object
+ */
+unsigned int kmem_cache_size(struct kmem_cache *s)
+{
+	return s->objsize;
+}
+EXPORT_SYMBOL(kmem_cache_size);
+
+const char *kmem_cache_name(struct kmem_cache *s)
+{
+	return s->name;
+}
+EXPORT_SYMBOL(kmem_cache_name);
+
+/*
+ * Release all resources used by a slab cache. No more concurrency on the
+ * slab, so we can touch remote kmem_cache_cpu structures.
+ */
+void kmem_cache_destroy(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+	int cpu;
+
+	down_write(&slqb_lock);
+	list_del(&s->list);
+	up_write(&slqb_lock);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		flush_free_list_all(s, l);
+		flush_remote_free_cache(s, c);
+	}
+#endif
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+		claim_remote_free_list(s, l);
+#endif
+		flush_free_list_all(s, l);
+
+		WARN_ON(l->freelist.nr);
+		WARN_ON(l->nr_slabs);
+		WARN_ON(l->nr_partial);
+	}
+
+	free_kmem_cache_cpus(s);
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		claim_remote_free_list(s, l);
+		flush_free_list_all(s, l);
+
+		WARN_ON(l->freelist.nr);
+		WARN_ON(l->nr_slabs);
+		WARN_ON(l->nr_partial);
+	}
+
+	free_kmem_cache_nodes(s);
+#endif
+
+	sysfs_slab_remove(s);
+}
+EXPORT_SYMBOL(kmem_cache_destroy);
+
+/********************************************************************
+ *		Kmalloc subsystem
+ *******************************************************************/
+
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches);
+
+#ifdef CONFIG_ZONE_DMA
+struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches_dma);
+#endif
+
+#ifndef ARCH_KMALLOC_FLAGS
+#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
+#endif
+
+static struct kmem_cache *open_kmalloc_cache(struct kmem_cache *s,
+		const char *name, int size, gfp_t gfp_flags)
+{
+	unsigned int flags = ARCH_KMALLOC_FLAGS | SLAB_PANIC;
+
+	if (gfp_flags & SLQB_DMA)
+		flags |= SLAB_CACHE_DMA;
+
+	kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN, flags, NULL, 1);
+
+	return s;
+}
+
+/*
+ * Conversion table for small slabs sizes / 8 to the index in the
+ * kmalloc array. This is necessary for slabs < 192 since we have non power
+ * of two cache sizes there. The size of larger slabs can be determined using
+ * fls.
+ */
+static s8 size_index[24] __cacheline_aligned = {
+	3,	/* 8 */
+	4,	/* 16 */
+	5,	/* 24 */
+	5,	/* 32 */
+	6,	/* 40 */
+	6,	/* 48 */
+	6,	/* 56 */
+	6,	/* 64 */
+#if L1_CACHE_BYTES < 64
+	1,	/* 72 */
+	1,	/* 80 */
+	1,	/* 88 */
+	1,	/* 96 */
+#else
+	7,
+	7,
+	7,
+	7,
+#endif
+	7,	/* 104 */
+	7,	/* 112 */
+	7,	/* 120 */
+	7,	/* 128 */
+#if L1_CACHE_BYTES < 128
+	2,	/* 136 */
+	2,	/* 144 */
+	2,	/* 152 */
+	2,	/* 160 */
+	2,	/* 168 */
+	2,	/* 176 */
+	2,	/* 184 */
+	2	/* 192 */
+#else
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1
+#endif
+};
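+
+/*
+ * Example: kmalloc(100) looks up size_index[(100 - 1) / 8], which is 7, and
+ * so is served from the 128-byte kmalloc cache.
+ */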
+
+static struct kmem_cache *get_slab(size_t size, gfp_t flags)
+{
+	int index;
+
+#if L1_CACHE_BYTES >= 128
+	if (size <= 128) {
+#else
+	if (size <= 192) {
+#endif
+		if (unlikely(!size))
+			return ZERO_SIZE_PTR;
+
+		index = size_index[(size - 1) / 8];
+	} else
+		index = fls(size - 1);
+
+	if (unlikely((flags & SLQB_DMA)))
+		return &kmalloc_caches_dma[index];
+	else
+		return &kmalloc_caches[index];
+}
+
+void *__kmalloc(size_t size, gfp_t flags)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return __kmem_cache_alloc(s, flags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(__kmalloc);
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return kmem_cache_alloc_node(s, flags, node);
+}
+EXPORT_SYMBOL(__kmalloc_node);
+#endif
+
+size_t ksize(const void *object)
+{
+	struct slqb_page *page;
+	struct kmem_cache *s;
+
+	BUG_ON(!object);
+	if (unlikely(object == ZERO_SIZE_PTR))
+		return 0;
+
+	page = virt_to_head_slqb_page(object);
+	BUG_ON(!(page->flags & PG_SLQB_BIT));
+
+	s = page->list->cache;
+
+	/*
+	 * Debugging requires use of the padding between object
+	 * and whatever may come after it.
+	 */
+	if (s->flags & (SLAB_RED_ZONE | SLAB_POISON))
+		return s->objsize;
+
+	/*
+	 * If we have the need to store the freelist pointer
+	 * back there or track user information then we can
+	 * only use the space before that information.
+	 */
+	if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
+		return s->inuse;
+
+	/*
+	 * Else we can use all the padding etc for the allocation
+	 */
+	return s->size;
+}
+EXPORT_SYMBOL(ksize);
+
+void kfree(const void *object)
+{
+	struct kmem_cache *s;
+	struct slqb_page *page;
+
+	if (unlikely(ZERO_OR_NULL_PTR(object)))
+		return;
+
+	page = virt_to_head_slqb_page(object);
+	s = page->list->cache;
+
+	slab_free(s, page, (void *)object);
+}
+EXPORT_SYMBOL(kfree);
+
+static void kmem_cache_trim_percpu(void *arg)
+{
+	int cpu = smp_processor_id();
+	struct kmem_cache *s = arg;
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+	claim_remote_free_list(s, l);
+#endif
+	flush_free_list(s, l);
+#ifdef CONFIG_SMP
+	flush_remote_free_cache(s, c);
+#endif
+}
+
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+
+	on_each_cpu(kmem_cache_trim_percpu, s, 1);
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		spin_lock_irq(&n->list_lock);
+		claim_remote_free_list(s, l);
+		flush_free_list(s, l);
+		spin_unlock_irq(&n->list_lock);
+	}
+#endif
+
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_shrink);
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void kmem_cache_reap_percpu(void *arg)
+{
+	int cpu = smp_processor_id();
+	struct kmem_cache *s;
+	long phase = (long)arg;
+
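+	/*
+	 * Phase 0 flushes local freelists and pushes remotely-freed objects
+	 * to their owning lists; phase 1, run after every CPU has completed
+	 * phase 0, claims what was pushed to us and flushes that too.
+	 */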
+	list_for_each_entry(s, &slab_caches, list) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		if (phase == 0) {
+			flush_free_list_all(s, l);
+			flush_remote_free_cache(s, c);
+		}
+
+		if (phase == 1) {
+			claim_remote_free_list(s, l);
+			flush_free_list_all(s, l);
+		}
+	}
+}
+
+static void kmem_cache_reap(void)
+{
+	struct kmem_cache *s;
+	int node;
+
+	down_read(&slqb_lock);
+	on_each_cpu(kmem_cache_reap_percpu, (void *)0, 1);
+	on_each_cpu(kmem_cache_reap_percpu, (void *)1, 1);
+
+	list_for_each_entry(s, &slab_caches, list) {
+		for_each_node_state(node, N_NORMAL_MEMORY) {
+			struct kmem_cache_node *n = s->node[node];
+			struct kmem_cache_list *l = &n->list;
+
+			spin_lock_irq(&n->list_lock);
+			claim_remote_free_list(s, l);
+			flush_free_list_all(s, l);
+			spin_unlock_irq(&n->list_lock);
+		}
+	}
+	up_read(&slqb_lock);
+}
+#endif
+
+static void cache_trim_worker(struct work_struct *w)
+{
+	struct delayed_work *work =
+		container_of(w, struct delayed_work, work);
+	struct kmem_cache *s;
+	int node;
+
+	if (!down_read_trylock(&slqb_lock))
+		goto out;
+
+	node = numa_node_id();
+	list_for_each_entry(s, &slab_caches, list) {
+#ifdef CONFIG_NUMA
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		spin_lock_irq(&n->list_lock);
+		claim_remote_free_list(s, l);
+		flush_free_list(s, l);
+		spin_unlock_irq(&n->list_lock);
+#endif
+
+		local_irq_disable();
+		kmem_cache_trim_percpu(s);
+		local_irq_enable();
+	}
+
+	up_read(&slqb_lock);
+out:
+	schedule_delayed_work(work, round_jiffies_relative(3*HZ));
+}
+
+static DEFINE_PER_CPU(struct delayed_work, cache_trim_work);
+
+static void __cpuinit start_cpu_timer(int cpu)
+{
+	struct delayed_work *cache_trim_work = &per_cpu(cache_trim_work, cpu);
+
+	/*
+	 * When this gets called from do_initcalls via cpucache_init(),
+	 * init_workqueues() has already run, so keventd will be setup
+	 * at that time.
+	 */
+	if (keventd_up() && cache_trim_work->work.func == NULL) {
+		INIT_DELAYED_WORK(cache_trim_work, cache_trim_worker);
+		schedule_delayed_work_on(cpu, cache_trim_work,
+					__round_jiffies_relative(HZ, cpu));
+	}
+}
+
+static int __init cpucache_init(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		start_cpu_timer(cpu);
+	return 0;
+}
+__initcall(cpucache_init);
+
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void slab_mem_going_offline_callback(void *arg)
+{
+	kmem_cache_reap();
+}
+
+static void slab_mem_offline_callback(void *arg)
+{
+	struct kmem_cache *s;
+	struct memory_notify *marg = arg;
+	int nid = marg->status_change_nid;
+
+	/*
+	 * If the node still has available memory, we still need its
+	 * kmem_cache_node, so there is nothing to do here.
+	 */
+	if (nid < 0)
+		return;
+
+#if 0 // XXX: see cpu offline comment
+	down_read(&slqb_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		struct kmem_cache_node *n;
+		n = s->node[nid];
+		if (n) {
+			s->node[nid] = NULL;
+			kmem_cache_free(&kmem_node_cache, n);
+		}
+	}
+	up_read(&slqb_lock);
+#endif
+}
+
+static int slab_mem_going_online_callback(void *arg)
+{
+	struct kmem_cache *s;
+	struct kmem_cache_node *n;
+	struct memory_notify *marg = arg;
+	int nid = marg->status_change_nid;
+	int ret = 0;
+
+	/*
+	 * If the node's memory is already available, then kmem_cache_node is
+	 * already created. Nothing to do.
+	 */
+	if (nid < 0)
+		return 0;
+
+	/*
+	 * We are bringing a node online. No memory is available yet. We must
+	 * allocate a kmem_cache_node structure in order to bring the node
+	 * online.
+	 */
+	down_read(&slqb_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		/*
+		 * XXX: kmem_cache_alloc_node will fallback to other nodes
+		 *      since memory is not yet available from the node that
+		 *      is brought up.
+		 */
+		if (s->node[nid]) /* could be leftover from last online */
+			continue;
+		n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
+		if (!n) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		init_kmem_cache_node(s, n);
+		s->node[nid] = n;
+	}
+out:
+	up_read(&slqb_lock);
+	return ret;
+}
+
+static int slab_memory_callback(struct notifier_block *self,
+				unsigned long action, void *arg)
+{
+	int ret = 0;
+
+	switch (action) {
+	case MEM_GOING_ONLINE:
+		ret = slab_mem_going_online_callback(arg);
+		break;
+	case MEM_GOING_OFFLINE:
+		slab_mem_going_offline_callback(arg);
+		break;
+	case MEM_OFFLINE:
+	case MEM_CANCEL_ONLINE:
+		slab_mem_offline_callback(arg);
+		break;
+	case MEM_ONLINE:
+	case MEM_CANCEL_OFFLINE:
+		break;
+	}
+
+	ret = notifier_from_errno(ret);
+	return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+/********************************************************************
+ *			Basic setup of slabs
+ *******************************************************************/
+
+void __init kmem_cache_init(void)
+{
+	int i;
+	unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
+
+#ifdef CONFIG_NUMA
+	if (num_possible_nodes() == 1)
+		numa_platform = 0;
+	else
+		numa_platform = 1;
+#endif
+
+#ifdef CONFIG_SMP
+	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
+				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#else
+	kmem_size = sizeof(struct kmem_cache);
+#endif
+
+	kmem_cache_open(&kmem_cache_cache, "kmem_cache", kmem_size, 0, flags,
+			NULL, 0);
+#ifdef CONFIG_SMP
+	kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu",
+			sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_cache_open(&kmem_node_cache, "kmem_cache_node",
+			sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
+#endif
+
+#ifdef CONFIG_SMP
+	for_each_possible_cpu(i) {
+		init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cpus[i]);
+		kmem_cache_cache.cpu_slab[i] = &kmem_cache_cpus[i];
+
+		init_kmem_cache_cpu(&kmem_cpu_cache, &kmem_cpu_cpus[i]);
+		kmem_cpu_cache.cpu_slab[i] = &kmem_cpu_cpus[i];
+
+#ifdef CONFIG_NUMA
+		init_kmem_cache_cpu(&kmem_node_cache, &kmem_node_cpus[i]);
+		kmem_node_cache.cpu_slab[i] = &kmem_node_cpus[i];
+#endif
+	}
+#else
+	init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cache.cpu_slab);
+#endif
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(i, N_NORMAL_MEMORY) {
+		init_kmem_cache_node(&kmem_cache_cache, &kmem_cache_nodes[i]);
+		kmem_cache_cache.node[i] = &kmem_cache_nodes[i];
+
+		init_kmem_cache_node(&kmem_cpu_cache, &kmem_cpu_nodes[i]);
+		kmem_cpu_cache.node[i] = &kmem_cpu_nodes[i];
+
+		init_kmem_cache_node(&kmem_node_cache, &kmem_node_nodes[i]);
+		kmem_node_cache.node[i] = &kmem_node_nodes[i];
+	}
+#endif
+
+	/* Caches whose sizes are not a power of two */
+	if (L1_CACHE_BYTES < 64 && KMALLOC_MIN_SIZE <= 64) {
+		open_kmalloc_cache(&kmalloc_caches[1],
+				"kmalloc-96", 96, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[1],
+				"kmalloc_dma-96", 96, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+	if (L1_CACHE_BYTES < 128 && KMALLOC_MIN_SIZE <= 128) {
+		open_kmalloc_cache(&kmalloc_caches[2],
+				"kmalloc-192", 192, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[2],
+				"kmalloc_dma-192", 192, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+		open_kmalloc_cache(&kmalloc_caches[i],
+			"kmalloc", 1 << i, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[i],
+				"kmalloc_dma", 1 << i, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+
+
+	/*
+	 * Patch up the size_index table if we have strange large alignment
+	 * requirements for the kmalloc array. This is only the case for
+	 * mips it seems. The standard arches will not generate any code here.
+	 *
+	 * Largest permitted alignment is 256 bytes due to the way we
+	 * handle the index determination for the smaller caches.
+	 *
+	 * Make sure that nothing crazy happens if someone starts tinkering
+	 * around with ARCH_KMALLOC_MINALIGN
+	 */
+	BUILD_BUG_ON(KMALLOC_MIN_SIZE > 256 ||
+		(KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
+
+	for (i = 8; i < KMALLOC_MIN_SIZE; i += 8)
+		size_index[(i - 1) / 8] = KMALLOC_SHIFT_LOW;
+
+	/* Provide the correct kmalloc names now that the caches are up */
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+		kmalloc_caches[i].name =
+			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
+#ifdef CONFIG_ZONE_DMA
+		kmalloc_caches_dma[i].name =
+			kasprintf(GFP_KERNEL, "kmalloc_dma-%d", 1 << i);
+#endif
+	}
+
+#ifdef CONFIG_SMP
+	register_cpu_notifier(&slab_notifier);
+#endif
+#ifdef CONFIG_NUMA
+	hotplug_memory_notifier(slab_memory_callback, 1);
+#endif
+	/*
+	 * smp_init() has not yet been called, so no worries about memory
+	 * ordering here (eg. slab_is_available vs numa_platform)
+	 */
+	__slab_is_available = 1;
+}
+
+/*
+ * Some basic slab creation sanity checks
+ */
+static int kmem_cache_create_ok(const char *name, size_t size,
+		size_t align, unsigned long flags)
+{
+	struct kmem_cache *tmp;
+
+	/*
+	 * Sanity checks... these are all serious usage bugs.
+	 */
+	if (!name || in_interrupt() || (size < sizeof(void *))) {
+		printk(KERN_ERR "kmem_cache_create(): early error in slab %s\n",
+				name);
+		dump_stack();
+		return 0;
+	}
+
+	down_read(&slqb_lock);
+	list_for_each_entry(tmp, &slab_caches, list) {
+		char x;
+		int res;
+
+		/*
+		 * This happens when the module gets unloaded and doesn't
+		 * destroy its slab cache and no-one else reuses the vmalloc
+		 * area of the module.  Print a warning.
+		 */
+		res = probe_kernel_address(tmp->name, x);
+		if (res) {
+			printk(KERN_ERR
+			       "SLAB: cache with size %d has lost its name\n",
+			       tmp->size);
+			continue;
+		}
+
+		if (!strcmp(tmp->name, name)) {
+			printk(KERN_ERR
+			       "kmem_cache_create(): duplicate cache %s\n", name);
+			dump_stack();
+			up_read(&slqb_lock);
+			return 0;
+		}
+	}
+	up_read(&slqb_lock);
+
+	WARN_ON(strchr(name, ' '));	/* It confuses parsers */
+	if (flags & SLAB_DESTROY_BY_RCU)
+		WARN_ON(flags & SLAB_POISON);
+
+	return 1;
+}
+
+struct kmem_cache *kmem_cache_create(const char *name, size_t size,
+		size_t align, unsigned long flags, void (*ctor)(void *))
+{
+	struct kmem_cache *s;
+
+	if (!kmem_cache_create_ok(name, size, align, flags))
+		goto err;
+
+	s = kmem_cache_alloc(&kmem_cache_cache, GFP_KERNEL);
+	if (!s)
+		goto err;
+
+	if (kmem_cache_open(s, name, size, align, flags, ctor, 1))
+		return s;
+
+	kmem_cache_free(&kmem_cache_cache, s);
+
+err:
+	if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+	return NULL;
+}
+EXPORT_SYMBOL(kmem_cache_create);
+
+#ifdef CONFIG_SMP
+/*
+ * Use the cpu notifier to ensure that the cpu slabs are flushed when
+ * necessary.
+ */
+static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
+		unsigned long action, void *hcpu)
+{
+	long cpu = (long)hcpu;
+	struct kmem_cache *s;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		down_read(&slqb_lock);
+		list_for_each_entry(s, &slab_caches, list) {
+			if (s->cpu_slab[cpu]) /* could be leftover from last online */
+				continue;
+			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu);
+			if (!s->cpu_slab[cpu]) {
+				up_read(&slqb_lock);
+				return NOTIFY_BAD;
+			}
+		}
+		up_read(&slqb_lock);
+		break;
+
+	case CPU_ONLINE:
+	case CPU_ONLINE_FROZEN:
+	case CPU_DOWN_FAILED:
+	case CPU_DOWN_FAILED_FROZEN:
+		start_cpu_timer(cpu);
+		break;
+
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		cancel_rearming_delayed_work(&per_cpu(cache_trim_work, cpu));
+		per_cpu(cache_trim_work, cpu).work.func = NULL;
+		break;
+
+	case CPU_UP_CANCELED:
+	case CPU_UP_CANCELED_FROZEN:
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+#if 0
+		down_read(&slqb_lock);
+		/* XXX: this doesn't work because objects can still be on this
+		 * CPU's list. periodic timer needs to check if a CPU is offline
+		 * and then try to cleanup from there. Same for node offline.
+		 */
+		list_for_each_entry(s, &slab_caches, list) {
+			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+			if (c) {
+				kmem_cache_free(&kmem_cpu_cache, c);
+				s->cpu_slab[cpu] = NULL;
+			}
+		}
+
+		up_read(&slqb_lock);
+#endif
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata slab_notifier = {
+	.notifier_call = slab_cpuup_callback
+};
+
+#endif
+
+#ifdef CONFIG_SLQB_DEBUG
+void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
+{
+	struct kmem_cache *s;
+	int node = -1;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+#ifdef CONFIG_NUMA
+	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+		node = alternate_nid(s, flags, node);
+#endif
+	return slab_alloc(s, flags, node, (void *)caller);
+}
+
+void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
+				unsigned long caller)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return slab_alloc(s, flags, node, (void *)caller);
+}
+#endif
+
+#if defined(CONFIG_SLQB_SYSFS) || defined(CONFIG_SLABINFO)
+struct stats_gather {
+	struct kmem_cache *s;
+	spinlock_t lock;
+	unsigned long nr_slabs;
+	unsigned long nr_partial;
+	unsigned long nr_inuse;
+	unsigned long nr_objects;
+
+#ifdef CONFIG_SLQB_STATS
+	unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+};
+
+static void __gather_stats(void *arg)
+{
+	unsigned long nr_slabs;
+	unsigned long nr_partial;
+	unsigned long nr_inuse;
+	struct stats_gather *gather = arg;
+	int cpu = smp_processor_id();
+	struct kmem_cache *s = gather->s;
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_list *l = &c->list;
+	struct slqb_page *page;
+#ifdef CONFIG_SLQB_STATS
+	int i;
+#endif
+
+	nr_slabs = l->nr_slabs;
+	nr_partial = l->nr_partial;
+	nr_inuse = (nr_slabs - nr_partial) * s->objects;
+
+	list_for_each_entry(page, &l->partial, lru) {
+		nr_inuse += page->inuse;
+	}
+
+	spin_lock(&gather->lock);
+	gather->nr_slabs += nr_slabs;
+	gather->nr_partial += nr_partial;
+	gather->nr_inuse += nr_inuse;
+#ifdef CONFIG_SLQB_STATS
+	for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+		gather->stats[i] += l->stats[i];
+	}
+#endif
+	spin_unlock(&gather->lock);
+}
+
+static void gather_stats(struct kmem_cache *s, struct stats_gather *stats)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+
+	memset(stats, 0, sizeof(struct stats_gather));
+	stats->s = s;
+	spin_lock_init(&stats->lock);
+
+	on_each_cpu(__gather_stats, stats, 1);
+
+#ifdef CONFIG_NUMA
+	for_each_online_node(node) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+		struct slqb_page *page;
+		unsigned long flags;
+#ifdef CONFIG_SLQB_STATS
+		int i;
+#endif
+
+		spin_lock_irqsave(&n->list_lock, flags);
+#ifdef CONFIG_SLQB_STATS
+		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+			stats->stats[i] += l->stats[i];
+		}
+#endif
+		stats->nr_slabs += l->nr_slabs;
+		stats->nr_partial += l->nr_partial;
+		stats->nr_inuse += (l->nr_slabs - l->nr_partial) * s->objects;
+
+		list_for_each_entry(page, &l->partial, lru) {
+			stats->nr_inuse += page->inuse;
+		}
+		spin_unlock_irqrestore(&n->list_lock, flags);
+	}
+#endif
+
+	stats->nr_objects = stats->nr_slabs * s->objects;
+}
+#endif
+
+/*
+ * The /proc/slabinfo ABI
+ */
+#ifdef CONFIG_SLABINFO
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+ssize_t slabinfo_write(struct file *file, const char __user * buffer,
+		       size_t count, loff_t *ppos)
+{
+	return -EINVAL;
+}
+
+static void print_slabinfo_header(struct seq_file *m)
+{
+	seq_puts(m, "slabinfo - version: 2.1\n");
+	seq_puts(m, "# name	    <active_objs> <num_objs> <objsize> "
+		 "<objperslab> <pagesperslab>");
+	seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
+	seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
+	seq_putc(m, '\n');
+}
+
+static void *s_start(struct seq_file *m, loff_t *pos)
+{
+	loff_t n = *pos;
+
+	down_read(&slqb_lock);
+	if (!n)
+		print_slabinfo_header(m);
+
+	return seq_list_start(&slab_caches, *pos);
+}
+
+static void *s_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	return seq_list_next(p, &slab_caches, pos);
+}
+
+static void s_stop(struct seq_file *m, void *p)
+{
+	up_read(&slqb_lock);
+}
+
+static int s_show(struct seq_file *m, void *p)
+{
+	struct stats_gather stats;
+	struct kmem_cache *s;
+
+	s = list_entry(p, struct kmem_cache, list);
+
+	gather_stats(s, &stats);
+
+	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, stats.nr_inuse,
+		   stats.nr_objects, s->size, s->objects, (1 << s->order));
+	seq_printf(m, " : tunables %4u %4u %4u",
+		   slab_hiwater(s), slab_freebatch(s), 0);
+	seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs, stats.nr_slabs,
+		   0UL);
+	seq_putc(m, '\n');
+	return 0;
+}
+
+static const struct seq_operations slabinfo_op = {
+	.start = s_start,
+	.next = s_next,
+	.stop = s_stop,
+	.show = s_show,
+};
+
+static int slabinfo_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &slabinfo_op);
+}
+
+static const struct file_operations proc_slabinfo_operations = {
+	.open		= slabinfo_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static int __init slab_proc_init(void)
+{
+	proc_create("slabinfo", S_IWUSR | S_IRUGO, NULL,
+			&proc_slabinfo_operations);
+	return 0;
+}
+module_init(slab_proc_init);
+#endif /* CONFIG_SLABINFO */
+
+#ifdef CONFIG_SLQB_SYSFS
+/*
+ * sysfs API
+ */
+#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
+#define to_slab(n) container_of(n, struct kmem_cache, kobj)
+
+struct slab_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct kmem_cache *s, char *buf);
+	ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
+};
+
+#define SLAB_ATTR_RO(_name) \
+	static struct slab_attribute _name##_attr = __ATTR_RO(_name)
+
+#define SLAB_ATTR(_name) \
+	static struct slab_attribute _name##_attr =  \
+	__ATTR(_name, 0644, _name##_show, _name##_store)
+
+static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->size);
+}
+SLAB_ATTR_RO(slab_size);
+
+static ssize_t align_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->align);
+}
+SLAB_ATTR_RO(align);
+
+static ssize_t object_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->objsize);
+}
+SLAB_ATTR_RO(object_size);
+
+static ssize_t objs_per_slab_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->objects);
+}
+SLAB_ATTR_RO(objs_per_slab);
+
+static ssize_t order_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->order);
+}
+SLAB_ATTR_RO(order);
+
+static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+{
+	if (s->ctor) {
+		int n = sprint_symbol(buf, (unsigned long)s->ctor);
+
+		return n + sprintf(buf + n, "\n");
+	}
+	return 0;
+}
+SLAB_ATTR_RO(ctor);
+
+static ssize_t slabs_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+	gather_stats(s, &stats);
+	return sprintf(buf, "%lu\n", stats.nr_slabs);
+}
+SLAB_ATTR_RO(slabs);
+
+static ssize_t objects_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+	gather_stats(s, &stats);
+	return sprintf(buf, "%lu\n", stats.nr_inuse);
+}
+SLAB_ATTR_RO(objects);
+
+static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+	gather_stats(s, &stats);
+	return sprintf(buf, "%lu\n", stats.nr_objects);
+}
+SLAB_ATTR_RO(total_objects);
+
+static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
+}
+SLAB_ATTR_RO(reclaim_account);
+
+static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_HWCACHE_ALIGN));
+}
+SLAB_ATTR_RO(hwcache_align);
+
+#ifdef CONFIG_ZONE_DMA
+static ssize_t cache_dma_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA));
+}
+SLAB_ATTR_RO(cache_dma);
+#endif
+
+static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
+}
+SLAB_ATTR_RO(destroy_by_rcu);
+
+static ssize_t red_zone_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE));
+}
+SLAB_ATTR_RO(red_zone);
+
+static ssize_t poison_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_POISON));
+}
+SLAB_ATTR_RO(poison);
+
+static ssize_t store_user_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_STORE_USER));
+}
+SLAB_ATTR_RO(store_user);
+
+static ssize_t hiwater_store(struct kmem_cache *s, const char *buf, size_t length)
+{
+	long hiwater;
+	int err;
+
+	err = strict_strtol(buf, 10, &hiwater);
+	if (err)
+		return err;
+
+	if (hiwater < 0)
+		return -EINVAL;
+
+	s->hiwater = hiwater;
+
+	return length;
+}
+
+static ssize_t hiwater_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", slab_hiwater(s));
+}
+SLAB_ATTR(hiwater);
+
+static ssize_t freebatch_store(struct kmem_cache *s, const char *buf, size_t length)
+{
+	long freebatch;
+	int err;
+
+	err = strict_strtol(buf, 10, &freebatch);
+	if (err)
+		return err;
+
+	if (freebatch <= 0 || freebatch - 1 > s->hiwater)
+		return -EINVAL;
+
+	s->freebatch = freebatch;
+
+	return length;
+}
+
+static ssize_t freebatch_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", slab_freebatch(s));
+}
+SLAB_ATTR(freebatch);
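+
+/*
+ * The hiwater and freebatch files appear under /sys/kernel/slab/<cache>/ and
+ * can be tuned at runtime, e.g.:
+ *	echo 1024 > /sys/kernel/slab/kmalloc-64/hiwater
+ */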
+#ifdef CONFIG_SLQB_STATS
+static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
+{
+	struct stats_gather stats;
+	int len;
+#ifdef CONFIG_SMP
+	int cpu;
+#endif
+
+	gather_stats(s, &stats);
+
+	len = sprintf(buf, "%lu", stats.stats[si]);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+		if (len < PAGE_SIZE - 20)
+			len += sprintf(buf + len, " C%d=%lu", cpu, l->stats[si]);
+	}
+#endif
+	return len + sprintf(buf + len, "\n");
+}
+
+#define STAT_ATTR(si, text) 					\
+static ssize_t text##_show(struct kmem_cache *s, char *buf)	\
+{								\
+	return show_stat(s, buf, si);				\
+}								\
+SLAB_ATTR_RO(text);
+
+STAT_ATTR(ALLOC, alloc);
+STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
+STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
+STAT_ATTR(FREE, free);
+STAT_ATTR(FREE_REMOTE, free_remote);
+STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
+STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
+STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
+STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
+STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
+STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
+STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
+STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
+STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
+#endif
+
+static struct attribute *slab_attrs[] = {
+	&slab_size_attr.attr,
+	&object_size_attr.attr,
+	&objs_per_slab_attr.attr,
+	&order_attr.attr,
+	&objects_attr.attr,
+	&total_objects_attr.attr,
+	&slabs_attr.attr,
+	&ctor_attr.attr,
+	&align_attr.attr,
+	&hwcache_align_attr.attr,
+	&reclaim_account_attr.attr,
+	&destroy_by_rcu_attr.attr,
+	&red_zone_attr.attr,
+	&poison_attr.attr,
+	&store_user_attr.attr,
+	&hiwater_attr.attr,
+	&freebatch_attr.attr,
+#ifdef CONFIG_ZONE_DMA
+	&cache_dma_attr.attr,
+#endif
+#ifdef CONFIG_SLQB_STATS
+	&alloc_attr.attr,
+	&alloc_slab_fill_attr.attr,
+	&alloc_slab_new_attr.attr,
+	&free_attr.attr,
+	&free_remote_attr.attr,
+	&flush_free_list_attr.attr,
+	&flush_free_list_objects_attr.attr,
+	&flush_free_list_remote_attr.attr,
+	&flush_slab_partial_attr.attr,
+	&flush_slab_free_attr.attr,
+	&flush_rfree_list_attr.attr,
+	&flush_rfree_list_objects_attr.attr,
+	&claim_remote_list_attr.attr,
+	&claim_remote_list_objects_attr.attr,
+#endif
+	NULL
+};
+
+static struct attribute_group slab_attr_group = {
+	.attrs = slab_attrs,
+};
+
+static ssize_t slab_attr_show(struct kobject *kobj,
+				struct attribute *attr,
+				char *buf)
+{
+	struct slab_attribute *attribute;
+	struct kmem_cache *s;
+	int err;
+
+	attribute = to_slab_attr(attr);
+	s = to_slab(kobj);
+
+	if (!attribute->show)
+		return -EIO;
+
+	err = attribute->show(s, buf);
+
+	return err;
+}
+
+static ssize_t slab_attr_store(struct kobject *kobj,
+				struct attribute *attr,
+				const char *buf, size_t len)
+{
+	struct slab_attribute *attribute;
+	struct kmem_cache *s;
+	int err;
+
+	attribute = to_slab_attr(attr);
+	s = to_slab(kobj);
+
+	if (!attribute->store)
+		return -EIO;
+
+	err = attribute->store(s, buf, len);
+
+	return err;
+}
+
+static void kmem_cache_release(struct kobject *kobj)
+{
+	struct kmem_cache *s = to_slab(kobj);
+
+	kmem_cache_free(&kmem_cache_cache, s);
+}
+
+static struct sysfs_ops slab_sysfs_ops = {
+	.show = slab_attr_show,
+	.store = slab_attr_store,
+};
+
+static struct kobj_type slab_ktype = {
+	.sysfs_ops = &slab_sysfs_ops,
+	.release = kmem_cache_release
+};
+
+static int uevent_filter(struct kset *kset, struct kobject *kobj)
+{
+	struct kobj_type *ktype = get_ktype(kobj);
+
+	if (ktype == &slab_ktype)
+		return 1;
+	return 0;
+}
+
+static struct kset_uevent_ops slab_uevent_ops = {
+	.filter = uevent_filter,
+};
+
+static struct kset *slab_kset;
+
+static int sysfs_available __read_mostly = 0;
+
+static int sysfs_slab_add(struct kmem_cache *s)
+{
+	int err;
+
+	if (!sysfs_available)
+		return 0;
+
+	s->kobj.kset = slab_kset;
+	err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, s->name);
+	if (err) {
+		kobject_put(&s->kobj);
+		return err;
+	}
+
+	err = sysfs_create_group(&s->kobj, &slab_attr_group);
+	if (err)
+		return err;
+	kobject_uevent(&s->kobj, KOBJ_ADD);
+
+	return 0;
+}
+
+static void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kobject_uevent(&s->kobj, KOBJ_REMOVE);
+	kobject_del(&s->kobj);
+	kobject_put(&s->kobj);
+}
+
+static int __init slab_sysfs_init(void)
+{
+	struct kmem_cache *s;
+	int err;
+
+	slab_kset = kset_create_and_add("slab", &slab_uevent_ops, kernel_kobj);
+	if (!slab_kset) {
+		printk(KERN_ERR "Cannot register slab subsystem.\n");
+		return -ENOSYS;
+	}
+
+	down_write(&slqb_lock);
+	sysfs_available = 1;
+	list_for_each_entry(s, &slab_caches, list) {
+		err = sysfs_slab_add(s);
+		if (err)
+			printk(KERN_ERR "SLQB: Unable to add boot slab %s"
+						" to sysfs\n", s->name);
+	}
+	up_write(&slqb_lock);
+
+	return 0;
+}
+
+__initcall(slab_sysfs_init);
+#endif
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -150,6 +150,8 @@ size_t ksize(const void *);
  */
 #ifdef CONFIG_SLUB
 #include <linux/slub_def.h>
+#elif defined(CONFIG_SLQB)
+#include <linux/slqb_def.h>
 #elif defined(CONFIG_SLOB)
 #include <linux/slob_def.h>
 #else
@@ -252,7 +254,7 @@ static inline void *kmem_cache_alloc_nod
  * allocator where we care about the real place the memory allocation
  * request comes from.
  */
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
 extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
 #define kmalloc_track_caller(size, flags) \
 	__kmalloc_track_caller(size, flags, _RET_IP_)
@@ -270,7 +272,7 @@ extern void *__kmalloc_track_caller(size
  * standard allocator where we care about the real place the memory
  * allocation request comes from.
  */
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
 extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long);
 #define kmalloc_node_track_caller(size, flags, node) \
 	__kmalloc_node_track_caller(size, flags, node, \
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
+obj-$(CONFIG_SLQB) += slqb.o
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
Index: linux-2.6/include/linux/rcu_types.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rcu_types.h
@@ -0,0 +1,18 @@
+#ifndef __LINUX_RCU_TYPES_H
+#define __LINUX_RCU_TYPES_H
+
+#ifdef __KERNEL__
+
+/**
+ * struct rcu_head - callback structure for use with RCU
+ * @next: next update requests in a list
+ * @func: actual update function to call after the grace period.
+ */
+struct rcu_head {
+	struct rcu_head *next;
+	void (*func)(struct rcu_head *head);
+};
+
+#endif
+
+#endif
Index: linux-2.6/arch/x86/include/asm/page.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/page.h
+++ linux-2.6/arch/x86/include/asm/page.h
@@ -194,6 +194,7 @@ static inline pteval_t native_pte_flags(
  * virt_addr_valid(kaddr) returns true.
  */
 #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
+#define virt_to_page_fast(kaddr) pfn_to_page(((unsigned long)(kaddr) - PAGE_OFFSET) >> PAGE_SHIFT)
 #define pfn_to_kaddr(pfn)      __va((pfn) << PAGE_SHIFT)
 extern bool __virt_addr_valid(unsigned long kaddr);
 #define virt_addr_valid(kaddr)	__virt_addr_valid((unsigned long) (kaddr))
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -305,7 +305,11 @@ static inline void get_page(struct page
 
 static inline struct page *virt_to_head_page(const void *x)
 {
+#ifdef virt_to_page_fast
+	struct page *page = virt_to_page_fast(x);
+#else
 	struct page *page = virt_to_page(x);
+#endif
 	return compound_head(page);
 }
 
Index: linux-2.6/Documentation/vm/slqbinfo.c
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/vm/slqbinfo.c
@@ -0,0 +1,1054 @@
+/*
+ * Slabinfo: Tool to get reports about slabs
+ *
+ * (C) 2007 sgi, Christoph Lameter
+ *
+ * Reworked by Lin Ming <ming.m.lin@intel.com> for SLQB
+ *
+ * Compile by:
+ *
+ * gcc -o slqbinfo slqbinfo.c
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <strings.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <getopt.h>
+#include <regex.h>
+#include <errno.h>
+
+#define MAX_SLABS 500
+#define MAX_ALIASES 500
+#define MAX_NODES 1024
+
+struct slabinfo {
+	char *name;
+	int align, cache_dma, destroy_by_rcu;
+	int hwcache_align, object_size, objs_per_slab;
+	int slab_size, store_user;
+	int order, poison, reclaim_account, red_zone;
+	int batch;
+	unsigned long objects, slabs, total_objects;
+	unsigned long alloc, alloc_slab_fill, alloc_slab_new;
+	unsigned long free, free_remote;
+	unsigned long claim_remote_list, claim_remote_list_objects;
+	unsigned long flush_free_list, flush_free_list_objects, flush_free_list_remote;
+	unsigned long flush_rfree_list, flush_rfree_list_objects;
+	unsigned long flush_slab_free, flush_slab_partial;
+	int numa[MAX_NODES];
+	int numa_partial[MAX_NODES];
+} slabinfo[MAX_SLABS];
+
+int slabs = 0;
+int actual_slabs = 0;
+int highest_node = 0;
+
+char buffer[4096];
+
+int show_empty = 0;
+int show_report = 0;
+int show_slab = 0;
+int skip_zero = 1;
+int show_numa = 0;
+int show_track = 0;
+int validate = 0;
+int shrink = 0;
+int show_inverted = 0;
+int show_totals = 0;
+int sort_size = 0;
+int sort_active = 0;
+int set_debug = 0;
+int show_ops = 0;
+int show_activity = 0;
+
+/* Debug options */
+int sanity = 0;
+int redzone = 0;
+int poison = 0;
+int tracking = 0;
+int tracing = 0;
+
+int page_size;
+
+regex_t pattern;
+
+void fatal(const char *x, ...)
+{
+	va_list ap;
+
+	va_start(ap, x);
+	vfprintf(stderr, x, ap);
+	va_end(ap);
+	exit(EXIT_FAILURE);
+}
+
+void usage(void)
+{
+	printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
+		"slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+		"-A|--activity          Most active slabs first\n"
+		"-d<options>|--debug=<options> Set/Clear Debug options\n"
+		"-D|--display-active    Switch line format to activity\n"
+		"-e|--empty             Show empty slabs\n"
+		"-h|--help              Show usage information\n"
+		"-i|--inverted          Inverted list\n"
+		"-l|--slabs             Show slabs\n"
+		"-n|--numa              Show NUMA information\n"
+		"-o|--ops		Show kmem_cache_ops\n"
+		"-s|--shrink            Shrink slabs\n"
+		"-r|--report		Detailed report on single slabs\n"
+		"-S|--Size              Sort by size\n"
+		"-t|--tracking          Show alloc/free information\n"
+		"-T|--Totals            Show summary information\n"
+		"-v|--validate          Validate slabs\n"
+		"-z|--zero              Include empty slabs\n"
+		"\nValid debug options (FZPUT may be combined)\n"
+		"a / A          Switch on all debug options (=FZUP)\n"
+		"-              Switch off all debug options\n"
+		"f / F          Sanity Checks (SLAB_DEBUG_FREE)\n"
+		"z / Z          Redzoning\n"
+		"p / P          Poisoning\n"
+		"u / U          Tracking\n"
+		"t / T          Tracing\n"
+	);
+}
+
+unsigned long read_obj(const char *name)
+{
+	FILE *f = fopen(name, "r");
+
+	if (!f)
+		buffer[0] = 0;
+	else {
+		if (!fgets(buffer, sizeof(buffer), f))
+			buffer[0] = 0;
+		fclose(f);
+		if (strlen(buffer) && buffer[strlen(buffer) - 1] == '\n')
+			buffer[strlen(buffer) - 1] = 0;
+	}
+	return strlen(buffer);
+}
+
+
+/*
+ * Get the contents of an attribute
+ */
+unsigned long get_obj(const char *name)
+{
+	if (!read_obj(name))
+		return 0;
+
+	return atol(buffer);
+}
+
+unsigned long get_obj_and_str(const char *name, char **x)
+{
+	unsigned long result = 0;
+	char *p;
+
+	*x = NULL;
+
+	if (!read_obj(name)) {
+		x = NULL;
+		return 0;
+	}
+	result = strtoul(buffer, &p, 10);
+	while (*p == ' ')
+		p++;
+	if (*p)
+		*x = strdup(p);
+	return result;
+}
+
+void set_obj(struct slabinfo *s, const char *name, int n)
+{
+	char x[100];
+	FILE *f;
+
+	snprintf(x, 100, "%s/%s", s->name, name);
+	f = fopen(x, "w");
+	if (!f)
+		fatal("Cannot write to %s\n", x);
+
+	fprintf(f, "%d\n", n);
+	fclose(f);
+}
+
+unsigned long read_slab_obj(struct slabinfo *s, const char *name)
+{
+	char x[100];
+	FILE *f;
+	size_t l;
+
+	snprintf(x, 100, "%s/%s", s->name, name);
+	f = fopen(x, "r");
+	if (!f) {
+		buffer[0] = 0;
+		l = 0;
+	} else {
+		l = fread(buffer, 1, sizeof(buffer) - 1, f);
+		buffer[l] = 0;
+		fclose(f);
+	}
+	return l;
+}
+
+
+/*
+ * Put a size string together
+ */
+int store_size(char *buffer, unsigned long value)
+{
+	unsigned long divisor = 1;
+	char trailer = 0;
+	int n;
+
+	if (value > 1000000000UL) {
+		divisor = 100000000UL;
+		trailer = 'G';
+	} else if (value > 1000000UL) {
+		divisor = 100000UL;
+		trailer = 'M';
+	} else if (value > 1000UL) {
+		divisor = 100;
+		trailer = 'K';
+	}
+
+	value /= divisor;
+	n = sprintf(buffer, "%ld",value);
+	if (trailer) {
+		buffer[n] = trailer;
+		n++;
+		buffer[n] = 0;
+	}
+	if (divisor != 1) {
+		memmove(buffer + n - 2, buffer + n - 3, 4);
+		buffer[n-2] = '.';
+		n++;
+	}
+	return n;
+}
+
+void decode_numa_list(int *numa, char *t)
+{
+	int node;
+	int nr;
+
+	memset(numa, 0, MAX_NODES * sizeof(int));
+
+	if (!t)
+		return;
+
+	while (*t == 'N') {
+		t++;
+		node = strtoul(t, &t, 10);
+		if (*t == '=') {
+			t++;
+			nr = strtoul(t, &t, 10);
+			numa[node] = nr;
+			if (node > highest_node)
+				highest_node = node;
+		}
+		while (*t == ' ')
+			t++;
+	}
+}
+
+void slab_validate(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	set_obj(s, "validate", 1);
+}
+
+void slab_shrink(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	set_obj(s, "shrink", 1);
+}
+
+int line = 0;
+
+void first_line(void)
+{
+	if (show_activity)
+		printf("Name                   Objects      Alloc       Free   %%Fill %%New  "
+			"FlushR %%FlushR FlushR_Objs O\n");
+	else
+		printf("Name                   Objects Objsize    Space "
+			" O/S O %%Ef Batch Flg\n");
+}
+
+unsigned long slab_size(struct slabinfo *s)
+{
+	return 	s->slabs * (page_size << s->order);
+}
+
+unsigned long slab_activity(struct slabinfo *s)
+{
+	return 	s->alloc + s->free;
+}
+
+void slab_numa(struct slabinfo *s, int mode)
+{
+	int node;
+
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (!highest_node) {
+		printf("\n%s: No NUMA information available.\n", s->name);
+		return;
+	}
+
+	if (skip_zero && !s->slabs)
+		return;
+
+	if (!line) {
+		printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+		for(node = 0; node <= highest_node; node++)
+			printf(" %4d", node);
+		printf("\n----------------------");
+		for(node = 0; node <= highest_node; node++)
+			printf("-----");
+		printf("\n");
+	}
+	printf("%-21s ", mode ? "All slabs" : s->name);
+	for(node = 0; node <= highest_node; node++) {
+		char b[20];
+
+		store_size(b, s->numa[node]);
+		printf(" %4s", b);
+	}
+	printf("\n");
+	if (mode) {
+		printf("%-21s ", "Partial slabs");
+		for(node = 0; node <= highest_node; node++) {
+			char b[20];
+
+			store_size(b, s->numa_partial[node]);
+			printf(" %4s", b);
+		}
+		printf("\n");
+	}
+	line++;
+}
+
+void show_tracking(struct slabinfo *s)
+{
+	printf("\n%s: Kernel object allocation\n", s->name);
+	printf("-----------------------------------------------------------------------\n");
+	if (read_slab_obj(s, "alloc_calls"))
+		printf(buffer);
+	else
+		printf("No Data\n");
+
+	printf("\n%s: Kernel object freeing\n", s->name);
+	printf("------------------------------------------------------------------------\n");
+	if (read_slab_obj(s, "free_calls"))
+		printf(buffer);
+	else
+		printf("No Data\n");
+
+}
+
+void ops(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (read_slab_obj(s, "ops")) {
+		printf("\n%s: kmem_cache operations\n", s->name);
+		printf("--------------------------------------------\n");
+		printf(buffer);
+	} else
+		printf("\n%s has no kmem_cache operations\n", s->name);
+}
+
+const char *onoff(int x)
+{
+	if (x)
+		return "On ";
+	return "Off";
+}
+
+void slab_stats(struct slabinfo *s)
+{
+	unsigned long total_alloc;
+	unsigned long total_free;
+	unsigned long total;
+
+	total_alloc = s->alloc;
+	total_free = s->free;
+
+	if (!total_alloc)
+		return;
+
+	printf("\n");
+	printf("Slab Perf Counter\n");
+	printf("------------------------------------------------------------------------\n");
+	printf("Alloc: %8lu, partial %8lu, page allocator %8lu\n",
+		total_alloc,
+		s->alloc_slab_fill, s->alloc_slab_new);
+	printf("Free:  %8lu, partial %8lu, page allocator %8lu, remote %5lu\n",
+		total_free,
+		s->flush_slab_partial,
+		s->flush_slab_free,
+		s->free_remote);
+	printf("Claim: %8lu, objects %8lu\n",
+		s->claim_remote_list,
+		s->claim_remote_list_objects);
+	printf("Flush: %8lu, objects %8lu, remote: %8lu\n",
+		s->flush_free_list,
+		s->flush_free_list_objects,
+		s->flush_free_list_remote);
+	printf("FlushR:%8lu, objects %8lu\n",
+		s->flush_rfree_list,
+		s->flush_rfree_list_objects);
+}
+
+void report(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	printf("\nSlabcache: %-20s  Order : %2d Objects: %lu\n",
+		s->name, s->order, s->objects);
+	if (s->hwcache_align)
+		printf("** Hardware cacheline aligned\n");
+	if (s->cache_dma)
+		printf("** Memory is allocated in a special DMA zone\n");
+	if (s->destroy_by_rcu)
+		printf("** Slabs are destroyed via RCU\n");
+	if (s->reclaim_account)
+		printf("** Reclaim accounting active\n");
+
+	printf("\nSizes (bytes)     Slabs              Debug                Memory\n");
+	printf("------------------------------------------------------------------------\n");
+	printf("Object : %7d  Total  : %7ld   Sanity Checks : %s  Total: %7ld\n",
+			s->object_size, s->slabs, "N/A",
+			s->slabs * (page_size << s->order));
+	printf("SlabObj: %7d  Full   : %7s   Redzoning     : %s  Used : %7ld\n",
+			s->slab_size, "N/A",
+			onoff(s->red_zone), s->objects * s->object_size);
+	printf("SlabSiz: %7d  Partial: %7s   Poisoning     : %s  Loss : %7ld\n",
+			page_size << s->order, "N/A", onoff(s->poison),
+			s->slabs * (page_size << s->order) - s->objects * s->object_size);
+	printf("Loss   : %7d  CpuSlab: %7s   Tracking      : %s  Lalig: %7ld\n",
+			s->slab_size - s->object_size, "N/A", onoff(s->store_user),
+			(s->slab_size - s->object_size) * s->objects);
+	printf("Align  : %7d  Objects: %7d   Tracing       : %s  Lpadd: %7ld\n",
+			s->align, s->objs_per_slab, "N/A",
+			((page_size << s->order) - s->objs_per_slab * s->slab_size) *
+			s->slabs);
+
+	ops(s);
+	show_tracking(s);
+	slab_numa(s, 1);
+	slab_stats(s);
+}
+
+void slabcache(struct slabinfo *s)
+{
+	char size_str[20];
+	char flags[20];
+	char *p = flags;
+
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (actual_slabs == 1) {
+		report(s);
+		return;
+	}
+
+	if (skip_zero && !show_empty && !s->slabs)
+		return;
+
+	if (show_empty && s->slabs)
+		return;
+
+	store_size(size_str, slab_size(s));
+
+	if (!line++)
+		first_line();
+
+	if (s->cache_dma)
+		*p++ = 'd';
+	if (s->hwcache_align)
+		*p++ = 'A';
+	if (s->poison)
+		*p++ = 'P';
+	if (s->reclaim_account)
+		*p++ = 'a';
+	if (s->red_zone)
+		*p++ = 'Z';
+	if (s->store_user)
+		*p++ = 'U';
+
+	*p = 0;
+	if (show_activity) {
+		unsigned long total_alloc;
+		unsigned long total_free;
+
+		total_alloc = s->alloc;
+		total_free = s->free;
+
+		printf("%-21s %8ld %10ld %10ld %5ld %5ld %7ld %5d %7ld %8d\n",
+			s->name, s->objects,
+			total_alloc, total_free,
+			total_alloc ? (s->alloc_slab_fill * 100 / total_alloc) : 0,
+			total_alloc ? (s->alloc_slab_new * 100 / total_alloc) : 0,
+			s->flush_rfree_list,
+			(total_alloc + total_free) ? s->flush_rfree_list * 100 / (total_alloc + total_free) : 0,
+			s->flush_rfree_list_objects,
+			s->order);
+	}
+	else
+		printf("%-21s %8ld %7d %8s %4d %1d %3ld %4ld %s\n",
+			s->name, s->objects, s->object_size, size_str,
+			s->objs_per_slab, s->order,
+			s->slabs ? (s->objects * s->object_size * 100) /
+				(s->slabs * (page_size << s->order)) : 100,
+			s->batch, flags);
+}
+
+/*
+ * Analyze debug options. Return false if something is amiss.
+ */
+int debug_opt_scan(char *opt)
+{
+	if (!opt || !opt[0] || strcmp(opt, "-") == 0)
+		return 1;
+
+	if (strcasecmp(opt, "a") == 0) {
+		sanity = 1;
+		poison = 1;
+		redzone = 1;
+		tracking = 1;
+		return 1;
+	}
+
+	for ( ; *opt; opt++)
+	 	switch (*opt) {
+		case 'F' : case 'f':
+			if (sanity)
+				return 0;
+			sanity = 1;
+			break;
+		case 'P' : case 'p':
+			if (poison)
+				return 0;
+			poison = 1;
+			break;
+
+		case 'Z' : case 'z':
+			if (redzone)
+				return 0;
+			redzone = 1;
+			break;
+
+		case 'U' : case 'u':
+			if (tracking)
+				return 0;
+			tracking = 1;
+			break;
+
+		case 'T' : case 't':
+			if (tracing)
+				return 0;
+			tracing = 1;
+			break;
+		default:
+			return 0;
+		}
+	return 1;
+}
+
+int slab_empty(struct slabinfo *s)
+{
+	if (s->objects > 0)
+		return 0;
+
+	/*
+	 * We may still have slabs even if there are no objects. Shrinking will
+	 * remove them.
+	 */
+	if (s->slabs != 0)
+		set_obj(s, "shrink", 1);
+
+	return 1;
+}
+
+void slab_debug(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (redzone && !s->red_zone) {
+		if (slab_empty(s))
+			set_obj(s, "red_zone", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
+	}
+	if (!redzone && s->red_zone) {
+		if (slab_empty(s))
+			set_obj(s, "red_zone", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
+	}
+	if (poison && !s->poison) {
+		if (slab_empty(s))
+			set_obj(s, "poison", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
+	}
+	if (!poison && s->poison) {
+		if (slab_empty(s))
+			set_obj(s, "poison", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
+	}
+	if (tracking && !s->store_user) {
+		if (slab_empty(s))
+			set_obj(s, "store_user", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
+	}
+	if (!tracking && s->store_user) {
+		if (slab_empty(s))
+			set_obj(s, "store_user", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
+	}
+}
+
+void totals(void)
+{
+	struct slabinfo *s;
+
+	int used_slabs = 0;
+	char b1[20], b2[20], b3[20], b4[20];
+	unsigned long long max = 1ULL << 63;
+
+	/* Object size */
+	unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
+
+	/* Number of partial slabs in a slabcache */
+	unsigned long long min_partial = max, max_partial = 0,
+				avg_partial, total_partial = 0;
+
+	/* Number of slabs in a slab cache */
+	unsigned long long min_slabs = max, max_slabs = 0,
+				avg_slabs, total_slabs = 0;
+
+	/* Size of the whole slab */
+	unsigned long long min_size = max, max_size = 0,
+				avg_size, total_size = 0;
+
+	/* Bytes used for object storage in a slab */
+	unsigned long long min_used = max, max_used = 0,
+				avg_used, total_used = 0;
+
+	/* Waste: Bytes used for alignment and padding */
+	unsigned long long min_waste = max, max_waste = 0,
+				avg_waste, total_waste = 0;
+	/* Number of objects in a slab */
+	unsigned long long min_objects = max, max_objects = 0,
+				avg_objects, total_objects = 0;
+	/* Waste per object */
+	unsigned long long min_objwaste = max,
+				max_objwaste = 0, avg_objwaste,
+				total_objwaste = 0;
+
+	/* Memory per object */
+	unsigned long long min_memobj = max,
+				max_memobj = 0, avg_memobj,
+				total_objsize = 0;
+
+	for (s = slabinfo; s < slabinfo + slabs; s++) {
+		unsigned long long size;
+		unsigned long used;
+		unsigned long long wasted;
+		unsigned long long objwaste;
+
+		if (!s->slabs || !s->objects)
+			continue;
+
+		used_slabs++;
+
+		size = slab_size(s);
+		used = s->objects * s->object_size;
+		wasted = size - used;
+		objwaste = s->slab_size - s->object_size;
+
+		if (s->object_size < min_objsize)
+			min_objsize = s->object_size;
+		if (s->slabs < min_slabs)
+			min_slabs = s->slabs;
+		if (size < min_size)
+			min_size = size;
+		if (wasted < min_waste)
+			min_waste = wasted;
+		if (objwaste < min_objwaste)
+			min_objwaste = objwaste;
+		if (s->objects < min_objects)
+			min_objects = s->objects;
+		if (used < min_used)
+			min_used = used;
+		if (s->slab_size < min_memobj)
+			min_memobj = s->slab_size;
+
+		if (s->object_size > max_objsize)
+			max_objsize = s->object_size;
+		if (s->slabs > max_slabs)
+			max_slabs = s->slabs;
+		if (size > max_size)
+			max_size = size;
+		if (wasted > max_waste)
+			max_waste = wasted;
+		if (objwaste > max_objwaste)
+			max_objwaste = objwaste;
+		if (s->objects > max_objects)
+			max_objects = s->objects;
+		if (used > max_used)
+			max_used = used;
+		if (s->slab_size > max_memobj)
+			max_memobj = s->slab_size;
+
+		total_slabs += s->slabs;
+		total_size += size;
+		total_waste += wasted;
+
+		total_objects += s->objects;
+		total_used += used;
+
+		total_objwaste += s->objects * objwaste;
+		total_objsize += s->objects * s->slab_size;
+	}
+
+	if (!total_objects) {
+		printf("No objects\n");
+		return;
+	}
+	if (!used_slabs) {
+		printf("No slabs\n");
+		return;
+	}
+
+	/* Per slab averages */
+	avg_slabs = total_slabs / used_slabs;
+	avg_size = total_size / used_slabs;
+	avg_waste = total_waste / used_slabs;
+
+	avg_objects = total_objects / used_slabs;
+	avg_used = total_used / used_slabs;
+
+	/* Per object object sizes */
+	avg_objsize = total_used / total_objects;
+	avg_objwaste = total_objwaste / total_objects;
+	avg_memobj = total_objsize / total_objects;
+
+	printf("Slabcache Totals\n");
+	printf("----------------\n");
+	printf("Slabcaches : %3d      Active: %3d\n",
+			slabs, used_slabs);
+
+	store_size(b1, total_size);store_size(b2, total_waste);
+	store_size(b3, total_waste * 100 / total_used);
+	printf("Memory used: %6s   # Loss   : %6s   MRatio:%6s%%\n", b1, b2, b3);
+
+	store_size(b1, total_objects);
+	printf("# Objects  : %6s\n", b1);
+
+	printf("\n");
+	printf("Per Cache    Average         Min         Max       Total\n");
+	printf("---------------------------------------------------------\n");
+
+	store_size(b1, avg_objects);store_size(b2, min_objects);
+	store_size(b3, max_objects);store_size(b4, total_objects);
+	printf("#Objects  %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_slabs);store_size(b2, min_slabs);
+	store_size(b3, max_slabs);store_size(b4, total_slabs);
+	printf("#Slabs    %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_size);store_size(b2, min_size);
+	store_size(b3, max_size);store_size(b4, total_size);
+	printf("Memory    %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_used);store_size(b2, min_used);
+	store_size(b3, max_used);store_size(b4, total_used);
+	printf("Used      %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_waste);store_size(b2, min_waste);
+	store_size(b3, max_waste);store_size(b4, total_waste);
+	printf("Loss      %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	printf("\n");
+	printf("Per Object   Average         Min         Max\n");
+	printf("---------------------------------------------\n");
+
+	store_size(b1, avg_memobj);store_size(b2, min_memobj);
+	store_size(b3, max_memobj);
+	printf("Memory    %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+	store_size(b1, avg_objsize);store_size(b2, min_objsize);
+	store_size(b3, max_objsize);
+	printf("User      %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+
+	store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
+	store_size(b3, max_objwaste);
+	printf("Loss      %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+}
+
+void sort_slabs(void)
+{
+	struct slabinfo *s1,*s2;
+
+	for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
+		for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
+			int result;
+
+			if (sort_size)
+				result = slab_size(s1) < slab_size(s2);
+			else if (sort_active)
+				result = slab_activity(s1) < slab_activity(s2);
+			else
+				result = strcasecmp(s1->name, s2->name);
+
+			if (show_inverted)
+				result = -result;
+
+			if (result > 0) {
+				struct slabinfo t;
+
+				memcpy(&t, s1, sizeof(struct slabinfo));
+				memcpy(s1, s2, sizeof(struct slabinfo));
+				memcpy(s2, &t, sizeof(struct slabinfo));
+			}
+		}
+	}
+}
+
+int slab_mismatch(char *slab)
+{
+	return regexec(&pattern, slab, 0, NULL, 0);
+}
+
+void read_slab_dir(void)
+{
+	DIR *dir;
+	struct dirent *de;
+	struct slabinfo *slab = slabinfo;
+	char *p;
+	char *t;
+	int count;
+
+	if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
+		fatal("SYSFS support for SLQB not active\n");
+
+	dir = opendir(".");
+	while ((de = readdir(dir))) {
+		if (de->d_name[0] == '.' ||
+			(de->d_name[0] != ':' && slab_mismatch(de->d_name)))
+				continue;
+		switch (de->d_type) {
+		   case DT_DIR:
+			if (chdir(de->d_name))
+				fatal("Unable to access slab %s\n", de->d_name);
+		   	slab->name = strdup(de->d_name);
+			slab->align = get_obj("align");
+			slab->cache_dma = get_obj("cache_dma");
+			slab->destroy_by_rcu = get_obj("destroy_by_rcu");
+			slab->hwcache_align = get_obj("hwcache_align");
+			slab->object_size = get_obj("object_size");
+			slab->objects = get_obj("objects");
+			slab->total_objects = get_obj("total_objects");
+			slab->objs_per_slab = get_obj("objs_per_slab");
+			slab->order = get_obj("order");
+			slab->poison = get_obj("poison");
+			slab->reclaim_account = get_obj("reclaim_account");
+			slab->red_zone = get_obj("red_zone");
+			slab->slab_size = get_obj("slab_size");
+			slab->slabs = get_obj_and_str("slabs", &t);
+			decode_numa_list(slab->numa, t);
+			free(t);
+			slab->store_user = get_obj("store_user");
+			slab->batch = get_obj("batch");
+			slab->alloc = get_obj("alloc");
+			slab->alloc_slab_fill = get_obj("alloc_slab_fill");
+			slab->alloc_slab_new = get_obj("alloc_slab_new");
+			slab->free = get_obj("free");
+			slab->free_remote = get_obj("free_remote");
+			slab->claim_remote_list = get_obj("claim_remote_list");
+			slab->claim_remote_list_objects = get_obj("claim_remote_list_objects");
+			slab->flush_free_list = get_obj("flush_free_list");
+			slab->flush_free_list_objects = get_obj("flush_free_list_objects");
+			slab->flush_free_list_remote = get_obj("flush_free_list_remote");
+			slab->flush_rfree_list = get_obj("flush_rfree_list");
+			slab->flush_rfree_list_objects = get_obj("flush_rfree_list_objects");
+			slab->flush_slab_free = get_obj("flush_slab_free");
+			slab->flush_slab_partial = get_obj("flush_slab_partial");
+			
+			chdir("..");
+			slab++;
+			break;
+		   default :
+			fatal("Unknown file type %x\n", de->d_type);
+		}
+	}
+	closedir(dir);
+	slabs = slab - slabinfo;
+	actual_slabs = slabs;
+	if (slabs > MAX_SLABS)
+		fatal("Too many slabs\n");
+}
+
+void output_slabs(void)
+{
+	struct slabinfo *slab;
+
+	for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
+
+		if (show_numa)
+			slab_numa(slab, 0);
+		else if (show_track)
+			show_tracking(slab);
+		else if (validate)
+			slab_validate(slab);
+		else if (shrink)
+			slab_shrink(slab);
+		else if (set_debug)
+			slab_debug(slab);
+		else if (show_ops)
+			ops(slab);
+		else if (show_slab)
+			slabcache(slab);
+		else if (show_report)
+			report(slab);
+	}
+}
+
+struct option opts[] = {
+	{ "activity", 0, NULL, 'A' },
+	{ "debug", 2, NULL, 'd' },
+	{ "display-activity", 0, NULL, 'D' },
+	{ "empty", 0, NULL, 'e' },
+	{ "help", 0, NULL, 'h' },
+	{ "inverted", 0, NULL, 'i'},
+	{ "numa", 0, NULL, 'n' },
+	{ "ops", 0, NULL, 'o' },
+	{ "report", 0, NULL, 'r' },
+	{ "shrink", 0, NULL, 's' },
+	{ "slabs", 0, NULL, 'l' },
+	{ "track", 0, NULL, 't'},
+	{ "validate", 0, NULL, 'v' },
+	{ "zero", 0, NULL, 'z' },
+	{ "1ref", 0, NULL, '1'},
+	{ NULL, 0, NULL, 0 }
+};
+
+int main(int argc, char *argv[])
+{
+	int c;
+	int err;
+	char *pattern_source;
+
+	page_size = getpagesize();
+
+	while ((c = getopt_long(argc, argv, "Ad::Dehil1noprstvzTS",
+						opts, NULL)) != -1)
+		switch (c) {
+		case 'A':
+			sort_active = 1;
+			break;
+		case 'd':
+			set_debug = 1;
+			if (!debug_opt_scan(optarg))
+				fatal("Invalid debug option '%s'\n", optarg);
+			break;
+		case 'D':
+			show_activity = 1;
+			break;
+		case 'e':
+			show_empty = 1;
+			break;
+		case 'h':
+			usage();
+			return 0;
+		case 'i':
+			show_inverted = 1;
+			break;
+		case 'n':
+			show_numa = 1;
+			break;
+		case 'o':
+			show_ops = 1;
+			break;
+		case 'r':
+			show_report = 1;
+			break;
+		case 's':
+			shrink = 1;
+			break;
+		case 'l':
+			show_slab = 1;
+			break;
+		case 't':
+			show_track = 1;
+			break;
+		case 'v':
+			validate = 1;
+			break;
+		case 'z':
+			skip_zero = 0;
+			break;
+		case 'T':
+			show_totals = 1;
+			break;
+		case 'S':
+			sort_size = 1;
+			break;
+
+		default:
+			fatal("%s: Invalid option '%c'\n", argv[0], optopt);
+
+	}
+
+	if (!show_slab && !show_track && !show_report
+		&& !validate && !shrink && !set_debug && !show_ops)
+			show_slab = 1;
+
+	if (argc > optind)
+		pattern_source = argv[optind];
+	else
+		pattern_source = ".*";
+
+	err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
+	if (err)
+		fatal("%s: Invalid pattern '%s' code %d\n",
+			argv[0], pattern_source, err);
+	read_slab_dir();
+	if (show_totals)
+		totals();
+	else {
+		sort_slabs();
+		output_slabs();
+	}
+	return 0;
+}


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-21 14:30 ` Nick Piggin
@ 2009-01-21 14:59   ` Ingo Molnar
  -1 siblings, 0 replies; 197+ messages in thread
From: Ingo Molnar @ 2009-01-21 14:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter


* Nick Piggin <npiggin@suse.de> wrote:

> +/*
> + * Management object for a slab cache.
> + */
> +struct kmem_cache {
> +	unsigned long flags;
> +	int hiwater;		/* LIFO list high watermark */
> +	int freebatch;		/* LIFO freelist batch flush size */
> +	int objsize;		/* The size of an object without meta data */
> +	int offset;		/* Free pointer offset. */
> +	int objects;		/* Number of objects in slab */
> +
> +	int size;		/* The size of an object including meta data */
> +	int order;		/* Allocation order */
> +	gfp_t allocflags;	/* gfp flags to use on allocation */
> +	unsigned int colour_range;	/* range of colour counter */
> +	unsigned int colour_off;		/* offset per colour */
> +	void (*ctor)(void *);
> +

Mind if i nitpick a bit about minor style issues? Since this is going to 
be the next Linux SLAB allocator we might as well do it perfectly :-)

When introducing new structures it makes sense to properly vertically align 
them, like:

> +	unsigned long		flags;
> +	int			hiwater;	/* LIFO list high watermark  */
> +	int			freebatch;	/* LIFO freelist batch flush size */
> +	int			objsize;	/* Object size without meta data  */
> +	int			offset;		/* Free pointer offset       */
> +	int			objects;	/* Number of objects in slab */
> +	const char		*name;		/* Name (only for display!)  */
> +	struct list_head	list;		/* List of slab caches       */
> +
> +	int			align;		/* Alignment                 */
> +	int			inuse;		/* Offset to metadata        */

because proper vertical alignment/lineup can really help readability.
Like you do it yourself here:

> +	if (size <=	  8) return 3;
> +	if (size <=	 16) return 4;
> +	if (size <=	 32) return 5;
> +	if (size <=	 64) return 6;
> +	if (size <=	128) return 7;
> +	if (size <=	256) return 8;
> +	if (size <=	512) return 9;
> +	if (size <=       1024) return 10;
> +	if (size <=   2 * 1024) return 11;
> +	if (size <=   4 * 1024) return 12;
> +	if (size <=   8 * 1024) return 13;
> +	if (size <=  16 * 1024) return 14;
> +	if (size <=  32 * 1024) return 15;
> +	if (size <=  64 * 1024) return 16;
> +	if (size <= 128 * 1024) return 17;
> +	if (size <= 256 * 1024) return 18;
> +	if (size <= 512 * 1024) return 19;
> +	if (size <= 1024 * 1024) return 20;
> +	if (size <=  2 * 1024 * 1024) return 21;

> +static void slab_err(struct kmem_cache *s, struct slqb_page *page, char *fmt, ...)
> +{
> +	va_list args;
> +	char buf[100];

magic constant.

> +	if (s->flags & SLAB_RED_ZONE)
> +		memset(p + s->objsize,
> +			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
> +			s->inuse - s->objsize);

We tend to add curly braces in such multi-line statement situations i 
guess.
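
i.e. something like this (untested, just to illustrate the shape):

	if (s->flags & SLAB_RED_ZONE) {
		memset(p + s->objsize,
			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
			s->inuse - s->objsize);
	}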

> +static void trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
> +{
> +	if (s->flags & SLAB_TRACE) {
> +		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
> +			s->name,
> +			alloc ? "alloc" : "free",
> +			object, page->inuse,
> +			page->freelist);

Could use ftrace_printk() here i guess. That way it goes into a fast 
ringbuffer and not printk and it also gets embedded into whatever tracer 
plugin there is active. (for example kmemtrace)
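
Roughly -- assuming the current printf-style ftrace_printk() interface,
untested:

	if (s->flags & SLAB_TRACE)
		ftrace_printk("TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
			s->name, alloc ? "alloc" : "free",
			object, page->inuse, page->freelist);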


> +static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
> +								void *object)

there's a trick that can be done here to avoid the col-80 artifact:

static void
setup_object_debug(struct kmem_cache *s, struct slqb_page *page, void *object)

ditto all these prototypes:

> +static int alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
> +static int free_debug_processing(struct kmem_cache *s, void *object, void *addr)
> +static unsigned long kmem_cache_flags(unsigned long objsize,
> +	unsigned long flags, const char *name,
> +	void (*ctor)(void *))
> +static inline void setup_object_debug(struct kmem_cache *s,
> +			struct slqb_page *page, void *object) {}
> +static inline int alloc_debug_processing(struct kmem_cache *s,
> +	void *object, void *addr) { return 0; }
> +static inline int free_debug_processing(struct kmem_cache *s,
> +	void *object, void *addr) { return 0; }
> +static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
> +			void *object, int active) { return 1; }
> +static inline unsigned long kmem_cache_flags(unsigned long objsize,
> +	unsigned long flags, const char *name, void (*ctor)(void *))

> +#define slqb_debug 0

should be 'static const int slqb_debug;' i guess?

more function prototype inconsistencies:

> +static struct slqb_page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> +static void setup_object(struct kmem_cache *s, struct slqb_page *page,
> +				void *object)
> +static struct slqb_page *new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
> +static int free_object_to_page(struct kmem_cache *s, struct kmem_cache_list *l, struct slqb_page *page, void *object)

> +#ifdef CONFIG_SMP
> +static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c);
> +#endif

does noinline have to be declared?

i almost missed the lock taking here:

> +	spin_lock(&l->remote_free.lock);
> +	l->remote_free.list.head = NULL;
> +	tail = l->remote_free.list.tail;
> +	l->remote_free.list.tail = NULL;
> +	nr = l->remote_free.list.nr;
> +	l->remote_free.list.nr = 0;
> +	spin_unlock(&l->remote_free.lock);

Putting an extra newline after the spin_lock() and one extra newline 
before the spin_unlock() really helps raise attention to critical 
sections.
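
i.e.:

	spin_lock(&l->remote_free.lock);

	l->remote_free.list.head = NULL;
	tail = l->remote_free.list.tail;
	l->remote_free.list.tail = NULL;
	nr = l->remote_free.list.nr;
	l->remote_free.list.nr = 0;

	spin_unlock(&l->remote_free.lock);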

various leftover bits:

> +//		if (next)
> +//			prefetchw(next);

> +//			if (next)
> +//				prefetchw(next);

> +		list_del(&page->lru);
> +/*XXX		list_move(&page->lru, &l->full); */

> +//	VM_BUG_ON(node != -1 && node != slqb_page_to_nid(page));

overlong prototype:

> +static noinline void *__slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)

putting the newline elsewhere would improve this too:

> +static noinline void *__remote_slab_alloc(struct kmem_cache *s,
> +		gfp_t gfpflags, int node)

leftover:

> +//	if (unlikely(!(l->freelist.nr | l->nr_partial | l->remote_free_check)))
> +//		return NULL;

newline in wrong place:

> +static __always_inline void *__slab_alloc(struct kmem_cache *s,
> +		gfp_t gfpflags, int node)

> +static __always_inline void *slab_alloc(struct kmem_cache *s,
> +		gfp_t gfpflags, int node, void *addr)

> +static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags, void *caller)

> +#ifdef CONFIG_SLQB_STATS
> +	{
> +		struct kmem_cache_list *l = &c->list;
> +		slqb_stat_inc(l, FLUSH_RFREE_LIST);
> +		slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);

Please put a newline after local variable declarations.
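
i.e. (the closing brace and #endif are assumed from context):

#ifdef CONFIG_SLQB_STATS
	{
		struct kmem_cache_list *l = &c->list;

		slqb_stat_inc(l, FLUSH_RFREE_LIST);
		slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);
	}
#endif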

newline in another place could improve this:

> +static __always_inline void __slab_free(struct kmem_cache *s,
> +		struct slqb_page *page, void *object)

> +#ifdef CONFIG_NUMA
> +	} else {
> +		/*
> +		 * Freeing an object that was allocated on a remote node.
> +		 */
> +		slab_free_to_remote(s, page, object, c);
> +		slqb_stat_inc(l, FREE_REMOTE);
> +#endif
> +	}

while it's correct code, the CONFIG_NUMA ifdef begs to be placed one line 
further down.
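
i.e. something like:

	} else {
#ifdef CONFIG_NUMA
		/*
		 * Freeing an object that was allocated on a remote node.
		 */
		slab_free_to_remote(s, page, object, c);
		slqb_stat_inc(l, FREE_REMOTE);
#endif
	}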

newline in another place could improve this:

> +static __always_inline void slab_free(struct kmem_cache *s,
> +		struct slqb_page *page, void *object)

> +void kmem_cache_free(struct kmem_cache *s, void *object)
> +{
> +	struct slqb_page *page = NULL;
> +	if (numa_platform)
> +		page = virt_to_head_slqb_page(object);

newline after local variable definition please.

> +static inline int slab_order(int size, int max_order, int frac)
> +{
> +	int order;
> +
> +	if (fls(size - 1) <= PAGE_SHIFT)
> +		order = 0;
> +	else
> +		order = fls(size - 1) - PAGE_SHIFT;
> +	while (order <= max_order) {

Please put a newline before loops, so that they stand out better.
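
For the quoted hunk that would read:

	if (fls(size - 1) <= PAGE_SHIFT)
		order = 0;
	else
		order = fls(size - 1) - PAGE_SHIFT;

	while (order <= max_order) {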

> +static inline int calculate_order(int size)
> +{
> +	int order;
> +
> +	/*
> +	 * Attempt to find best configuration for a slab. This
> +	 * works by first attempting to generate a layout with
> +	 * the best configuration and backing off gradually.
> +	 */
> +	order = slab_order(size, 1, 4);
> +	if (order <= 1)
> +		return order;
> +
> +	/*
> +	 * This size cannot fit in order-1. Allow bigger orders, but
> +	 * forget about trying to save space.
> +	 */
> +	order = slab_order(size, MAX_ORDER, 0);
> +	if (order <= MAX_ORDER)
> +		return order;
> +
> +	return -ENOSYS;
> +}

function with very nice typography. All should be like this.

> +	if (flags & SLAB_HWCACHE_ALIGN) {
> +		unsigned long ralign = cache_line_size();
> +		while (size <= ralign / 2)
> +			ralign /= 2;

newline after variables please.

> +static void init_kmem_cache_list(struct kmem_cache *s, struct kmem_cache_list *l)
> +{
> +	l->cache = s;
> +	l->freelist.nr = 0;
> +	l->freelist.head = NULL;
> +	l->freelist.tail = NULL;
> +	l->nr_partial = 0;
> +	l->nr_slabs = 0;
> +	INIT_LIST_HEAD(&l->partial);
> +//	INIT_LIST_HEAD(&l->full);

leftover. Also, initializations tend to read nicer if they are aligned 
like this:

> +	l->cache			= s;
> +	l->freelist.nr			= 0;
> +	l->freelist.head		= NULL;
> +	l->freelist.tail		= NULL;
> +	l->nr_partial			= 0;
> +	l->nr_slabs			= 0;
> +
> +#ifdef CONFIG_SMP
> +	l->remote_free_check		= 0;
> +	spin_lock_init(&l->remote_free.lock);
> +	l->remote_free.list.nr		= 0;
> +	l->remote_free.list.head	= NULL;
> +	l->remote_free.list.tail	= NULL;
> +#endif

This way it really stands out that the only relevant non-zero 
initializations are l->cache and the spinlock init.

> +static void init_kmem_cache_cpu(struct kmem_cache *s,
> +			struct kmem_cache_cpu *c)

prototype newline.

dead code:

> +#if 0 // XXX: see cpu offline comment
> +	down_read(&slqb_lock);
> +	list_for_each_entry(s, &slab_caches, list) {
> +		struct kmem_cache_node *n;
> +		n = s->node[nid];
> +		if (n) {
> +			s->node[nid] = NULL;
> +			kmem_cache_free(&kmem_node_cache, n);
> +		}
> +	}
> +	up_read(&slqb_lock);
> +#endif

... and many more similar instances are in the patch in other places.

	Ingo

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-21 14:59   ` Ingo Molnar
@ 2009-01-21 15:17     ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-21 15:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Wed, Jan 21, 2009 at 03:59:18PM +0100, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > +/*
> > + * Management object for a slab cache.
> > + */
> > +struct kmem_cache {
> > +	unsigned long flags;
> > +	int hiwater;		/* LIFO list high watermark */
> > +	int freebatch;		/* LIFO freelist batch flush size */
> > +	int objsize;		/* The size of an object without meta data */
> > +	int offset;		/* Free pointer offset. */
> > +	int objects;		/* Number of objects in slab */
> > +
> > +	int size;		/* The size of an object including meta data */
> > +	int order;		/* Allocation order */
> > +	gfp_t allocflags;	/* gfp flags to use on allocation */
> > +	unsigned int colour_range;	/* range of colour counter */
> > +	unsigned int colour_off;		/* offset per colour */
> > +	void (*ctor)(void *);
> > +
> 
> Mind if i nitpick a bit about minor style issues? Since this is going to 
> be the next Linux SLAB allocator we might as well do it perfectly :-)

Well, let's not get ahead of ourselves :) But it's very appreciated.

I think most if not all of your suggestions are good ones, although
I probably won't convert to ftrace just for the moment.

I'll come up with an incremental patch....

Thanks,
Nick

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-21 14:59   ` Ingo Molnar
@ 2009-01-21 16:56     ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-21 16:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Wed, Jan 21, 2009 at 03:59:18PM +0100, Ingo Molnar wrote:
> 
> Mind if i nitpick a bit about minor style issues? Since this is going to 
> be the next Linux SLAB allocator we might as well do it perfectly :-)

Well here is an incremental patch which should get most of the issues you
pointed out, most of the sane ones that checkpatch pointed out, and a
few of my own ;)

---
 include/linux/slqb_def.h |   90 +++++-----
 mm/slqb.c                |  386 +++++++++++++++++++++++++----------------------
 2 files changed, 261 insertions(+), 215 deletions(-)

Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- linux-2.6.orig/include/linux/slqb_def.h
+++ linux-2.6/include/linux/slqb_def.h
@@ -37,8 +37,9 @@ enum stat_item {
  * Singly-linked list with head, tail, and nr
  */
 struct kmlist {
-	unsigned long nr;
-	void **head, **tail;
+	unsigned long	nr;
+	void 		**head;
+	void		**tail;
 };
 
 /*
@@ -46,8 +47,8 @@ struct kmlist {
  * objects can be returned to the kmem_cache_list from remote CPUs.
  */
 struct kmem_cache_remote_free {
-	spinlock_t lock;
-	struct kmlist list;
+	spinlock_t	lock;
+	struct kmlist	list;
 } ____cacheline_aligned;
 
 /*
@@ -56,18 +57,23 @@ struct kmem_cache_remote_free {
  * kmem_cache_lists allow off-node allocations (but require locking).
  */
 struct kmem_cache_list {
-	struct kmlist freelist;	/* Fastpath LIFO freelist of objects */
+				/* Fastpath LIFO freelist of objects */
+	struct kmlist		freelist;
 #ifdef CONFIG_SMP
-	int remote_free_check;	/* remote_free has reached a watermark */
+				/* remote_free has reached a watermark */
+	int			remote_free_check;
 #endif
-	struct kmem_cache *cache; /* kmem_cache corresponding to this list */
+				/* kmem_cache corresponding to this list */
+	struct kmem_cache	*cache;
 
-	unsigned long nr_partial; /* Number of partial slabs (pages) */
-	struct list_head partial; /* Slabs which have some free objects */
+				/* Number of partial slabs (pages) */
+	unsigned long		nr_partial;
 
-	unsigned long nr_slabs;	/* Total number of slabs allocated */
+				/* Slabs which have some free objects */
+	struct list_head	partial;
 
-	//struct list_head full;
+				/* Total number of slabs allocated */
+	unsigned long		nr_slabs;
 
 #ifdef CONFIG_SMP
 	/*
@@ -79,7 +85,7 @@ struct kmem_cache_list {
 #endif
 
 #ifdef CONFIG_SLQB_STATS
-	unsigned long stats[NR_SLQB_STAT_ITEMS];
+	unsigned long		stats[NR_SLQB_STAT_ITEMS];
 #endif
 } ____cacheline_aligned;
 
@@ -87,9 +93,8 @@ struct kmem_cache_list {
  * Primary per-cpu, per-kmem_cache structure.
  */
 struct kmem_cache_cpu {
-	struct kmem_cache_list list; /* List for node-local slabs. */
-
-	unsigned int colour_next;
+	struct kmem_cache_list	list;		/* List for node-local slabs */
+	unsigned int		colour_next;	/* Next colour offset to use */
 
 #ifdef CONFIG_SMP
 	/*
@@ -101,53 +106,53 @@ struct kmem_cache_cpu {
 	 * An NR_CPUS or MAX_NUMNODES array would be nice here, but then we
 	 * get to O(NR_CPUS^2) memory consumption situation.
 	 */
-	struct kmlist rlist;
-	struct kmem_cache_list *remote_cache_list;
+	struct kmlist		rlist;
+	struct kmem_cache_list	*remote_cache_list;
 #endif
 } ____cacheline_aligned;
 
 /*
- * Per-node, per-kmem_cache structure.
+ * Per-node, per-kmem_cache structure. Used for node-specific allocations.
  */
 struct kmem_cache_node {
-	struct kmem_cache_list list;
-	spinlock_t list_lock; /* protects access to list */
+	struct kmem_cache_list	list;
+	spinlock_t		list_lock;	/* protects access to list */
 } ____cacheline_aligned;
 
 /*
  * Management object for a slab cache.
  */
 struct kmem_cache {
-	unsigned long flags;
-	int hiwater;		/* LIFO list high watermark */
-	int freebatch;		/* LIFO freelist batch flush size */
-	int objsize;		/* The size of an object without meta data */
-	int offset;		/* Free pointer offset. */
-	int objects;		/* Number of objects in slab */
-
-	int size;		/* The size of an object including meta data */
-	int order;		/* Allocation order */
-	gfp_t allocflags;	/* gfp flags to use on allocation */
-	unsigned int colour_range;	/* range of colour counter */
-	unsigned int colour_off;		/* offset per colour */
-	void (*ctor)(void *);
+	unsigned long	flags;
+	int		hiwater;	/* LIFO list high watermark */
+	int		freebatch;	/* LIFO freelist batch flush size */
+	int		objsize;	/* Size of object without meta data */
+	int		offset;		/* Free pointer offset. */
+	int		objects;	/* Number of objects in slab */
+
+	int		size;		/* Size of object including meta data */
+	int		order;		/* Allocation order */
+	gfp_t		allocflags;	/* gfp flags to use on allocation */
+	unsigned int	colour_range;	/* range of colour counter */
+	unsigned int	colour_off;	/* offset per colour */
+	void		(*ctor)(void *);
 
-	const char *name;	/* Name (only for display!) */
-	struct list_head list;	/* List of slab caches */
+	const char	*name;		/* Name (only for display!) */
+	struct list_head list;		/* List of slab caches */
 
-	int align;		/* Alignment */
-	int inuse;		/* Offset to metadata */
+	int		align;		/* Alignment */
+	int		inuse;		/* Offset to metadata */
 
 #ifdef CONFIG_SLQB_SYSFS
-	struct kobject kobj;	/* For sysfs */
+	struct kobject	kobj;		/* For sysfs */
 #endif
 #ifdef CONFIG_NUMA
-	struct kmem_cache_node *node[MAX_NUMNODES];
+	struct kmem_cache_node	*node[MAX_NUMNODES];
 #endif
 #ifdef CONFIG_SMP
-	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
+	struct kmem_cache_cpu	*cpu_slab[NR_CPUS];
 #else
-	struct kmem_cache_cpu cpu_slab;
+	struct kmem_cache_cpu	cpu_slab;
 #endif
 };
 
@@ -245,7 +250,8 @@ void *__kmalloc(size_t size, gfp_t flags
 #define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
 #endif
 
-#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? sizeof(void *) : ARCH_KMALLOC_MINALIGN)
+#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ?	\
+				sizeof(void *) : ARCH_KMALLOC_MINALIGN)
 
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
Index: linux-2.6/mm/slqb.c
===================================================================
--- linux-2.6.orig/mm/slqb.c
+++ linux-2.6/mm/slqb.c
@@ -40,13 +40,13 @@
 struct slqb_page {
 	union {
 		struct {
-			unsigned long flags;	/* mandatory */
-			atomic_t _count;	/* mandatory */
-			unsigned int inuse;	/* Nr of objects */
-		   	struct kmem_cache_list *list; /* Pointer to list */
-			void **freelist;	/* freelist req. slab lock */
+			unsigned long	flags;		/* mandatory */
+			atomic_t	_count;		/* mandatory */
+			unsigned int	inuse;		/* Nr of objects */
+			struct kmem_cache_list *list;	/* Pointer to list */
+			void		 **freelist;	/* LIFO freelist */
 			union {
-				struct list_head lru; /* misc. list */
+				struct list_head lru;	/* misc. list */
 				struct rcu_head rcu_head; /* for rcu freeing */
 			};
 		};
@@ -62,7 +62,7 @@ static int kmem_size __read_mostly;
 #ifdef CONFIG_NUMA
 static int numa_platform __read_mostly;
 #else
-#define numa_platform 0
+static const int numa_platform = 0;
 #endif
 
 static inline int slab_hiwater(struct kmem_cache *s)
@@ -120,15 +120,16 @@ static inline int slab_freebatch(struct
  * - There is no remote free queue. Nodes don't free objects, CPUs do.
  */
 
-static inline void slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
+static inline void slqb_stat_inc(struct kmem_cache_list *list,
+				enum stat_item si)
 {
 #ifdef CONFIG_SLQB_STATS
 	list->stats[si]++;
 #endif
 }
 
-static inline void slqb_stat_add(struct kmem_cache_list *list, enum stat_item si,
-					unsigned long nr)
+static inline void slqb_stat_add(struct kmem_cache_list *list,
+				enum stat_item si, unsigned long nr)
 {
 #ifdef CONFIG_SLQB_STATS
 	list->stats[si] += nr;
@@ -433,10 +434,11 @@ static void print_page_info(struct slqb_
 
 }
 
+#define MAX_ERR_STR 100
 static void slab_bug(struct kmem_cache *s, char *fmt, ...)
 {
 	va_list args;
-	char buf[100];
+	char buf[MAX_ERR_STR];
 
 	va_start(args, fmt);
 	vsnprintf(buf, sizeof(buf), fmt, args);
@@ -477,8 +479,7 @@ static void print_trailer(struct kmem_ca
 	print_section("Object", p, min(s->objsize, 128));
 
 	if (s->flags & SLAB_RED_ZONE)
-		print_section("Redzone", p + s->objsize,
-			s->inuse - s->objsize);
+		print_section("Redzone", p + s->objsize, s->inuse - s->objsize);
 
 	if (s->offset)
 		off = s->offset + sizeof(void *);
@@ -488,9 +489,10 @@ static void print_trailer(struct kmem_ca
 	if (s->flags & SLAB_STORE_USER)
 		off += 2 * sizeof(struct track);
 
-	if (off != s->size)
+	if (off != s->size) {
 		/* Beginning of the filler is the free pointer */
 		print_section("Padding", p + off, s->size - off);
+	}
 
 	dump_stack();
 }
@@ -502,14 +504,9 @@ static void object_err(struct kmem_cache
 	print_trailer(s, page, object);
 }
 
-static void slab_err(struct kmem_cache *s, struct slqb_page *page, char *fmt, ...)
+static void slab_err(struct kmem_cache *s, struct slqb_page *page,
+			char *fmt, ...)
 {
-	va_list args;
-	char buf[100];
-
-	va_start(args, fmt);
-	vsnprintf(buf, sizeof(buf), fmt, args);
-	va_end(args);
 	slab_bug(s, fmt);
 	print_page_info(page);
 	dump_stack();
@@ -524,10 +521,11 @@ static void init_object(struct kmem_cach
 		p[s->objsize - 1] = POISON_END;
 	}
 
-	if (s->flags & SLAB_RED_ZONE)
+	if (s->flags & SLAB_RED_ZONE) {
 		memset(p + s->objsize,
 			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
 			s->inuse - s->objsize);
+	}
 }
 
 static u8 *check_bytes(u8 *start, unsigned int value, unsigned int bytes)
@@ -542,7 +540,7 @@ static u8 *check_bytes(u8 *start, unsign
 }
 
 static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
-						void *from, void *to)
+				void *from, void *to)
 {
 	slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
 	memset(from, data, to - from);
@@ -610,13 +608,15 @@ static int check_pad_bytes(struct kmem_c
 {
 	unsigned long off = s->inuse;	/* The end of info */
 
-	if (s->offset)
+	if (s->offset) {
 		/* Freepointer is placed after the object. */
 		off += sizeof(void *);
+	}
 
-	if (s->flags & SLAB_STORE_USER)
+	if (s->flags & SLAB_STORE_USER) {
 		/* We also have user information there */
 		off += 2 * sizeof(struct track);
+	}
 
 	if (s->size == off)
 		return 1;
@@ -646,6 +646,7 @@ static int slab_pad_check(struct kmem_ca
 	fault = check_bytes(start + length, POISON_INUSE, remainder);
 	if (!fault)
 		return 1;
+
 	while (end > fault && end[-1] == POISON_INUSE)
 		end--;
 
@@ -677,12 +678,16 @@ static int check_object(struct kmem_cach
 	}
 
 	if (s->flags & SLAB_POISON) {
-		if (!active && (s->flags & __OBJECT_POISON) &&
-			(!check_bytes_and_report(s, page, p, "Poison", p,
-					POISON_FREE, s->objsize - 1) ||
-			 !check_bytes_and_report(s, page, p, "Poison",
-				p + s->objsize - 1, POISON_END, 1)))
-			return 0;
+		if (!active && (s->flags & __OBJECT_POISON)) {
+			if (!check_bytes_and_report(s, page, p, "Poison", p,
+					POISON_FREE, s->objsize - 1))
+				return 0;
+
+			if (!check_bytes_and_report(s, page, p, "Poison",
+					p + s->objsize - 1, POISON_END, 1))
+				return 0;
+		}
+
 		/*
 		 * check_pad_bytes cleans up on its own.
 		 */
@@ -712,7 +717,8 @@ static int check_slab(struct kmem_cache
 	return 1;
 }
 
-static void trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
+static void trace(struct kmem_cache *s, struct slqb_page *page,
+			void *object, int alloc)
 {
 	if (s->flags & SLAB_TRACE) {
 		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
@@ -729,7 +735,7 @@ static void trace(struct kmem_cache *s,
 }
 
 static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
-								void *object)
+				void *object)
 {
 	if (!slab_debug(s))
 		return;
@@ -741,7 +747,8 @@ static void setup_object_debug(struct km
 	init_tracking(s, object);
 }
 
-static int alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
+static int alloc_debug_processing(struct kmem_cache *s,
+					void *object, void *addr)
 {
 	struct slqb_page *page;
 	page = virt_to_head_slqb_page(object);
@@ -768,7 +775,8 @@ bad:
 	return 0;
 }
 
-static int free_debug_processing(struct kmem_cache *s, void *object, void *addr)
+static int free_debug_processing(struct kmem_cache *s,
+					void *object, void *addr)
 {
 	struct slqb_page *page;
 	page = virt_to_head_slqb_page(object);
@@ -799,25 +807,28 @@ fail:
 static int __init setup_slqb_debug(char *str)
 {
 	slqb_debug = DEBUG_DEFAULT_FLAGS;
-	if (*str++ != '=' || !*str)
+	if (*str++ != '=' || !*str) {
 		/*
 		 * No options specified. Switch on full debugging.
 		 */
 		goto out;
+	}
 
-	if (*str == ',')
+	if (*str == ',') {
 		/*
 		 * No options but restriction on slabs. This means full
 		 * debugging for slabs matching a pattern.
 		 */
 		goto check_slabs;
+	}
 
 	slqb_debug = 0;
-	if (*str == '-')
+	if (*str == '-') {
 		/*
 		 * Switch off all debugging measures.
 		 */
 		goto out;
+	}
 
 	/*
 	 * Determine which debug features should be switched on
@@ -855,8 +866,8 @@ out:
 __setup("slqb_debug", setup_slqb_debug);
 
 static unsigned long kmem_cache_flags(unsigned long objsize,
-	unsigned long flags, const char *name,
-	void (*ctor)(void *))
+				unsigned long flags, const char *name,
+				void (*ctor)(void *))
 {
 	/*
 	 * Enable debugging if selected on the kernel commandline.
@@ -870,31 +881,51 @@ static unsigned long kmem_cache_flags(un
 }
 #else
 static inline void setup_object_debug(struct kmem_cache *s,
-			struct slqb_page *page, void *object) {}
+			struct slqb_page *page, void *object)
+{
+}
 
 static inline int alloc_debug_processing(struct kmem_cache *s,
-	void *object, void *addr) { return 0; }
+			void *object, void *addr)
+{
+	return 0;
+}
 
 static inline int free_debug_processing(struct kmem_cache *s,
-	void *object, void *addr) { return 0; }
+			void *object, void *addr)
+{
+	return 0;
+}
 
 static inline int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
-			{ return 1; }
+{
+	return 1;
+}
+
 static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
-			void *object, int active) { return 1; }
-static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page) {}
+			void *object, int active)
+{
+	return 1;
+}
+
+static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page)
+{
+}
+
 static inline unsigned long kmem_cache_flags(unsigned long objsize,
 	unsigned long flags, const char *name, void (*ctor)(void *))
 {
 	return flags;
 }
-#define slqb_debug 0
+
+static const int slqb_debug = 0;
 #endif
 
 /*
  * allocate a new slab (return its corresponding struct slqb_page)
  */
-static struct slqb_page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct slqb_page *allocate_slab(struct kmem_cache *s,
+					gfp_t flags, int node)
 {
 	struct slqb_page *page;
 	int pages = 1 << s->order;
@@ -916,8 +947,8 @@ static struct slqb_page *allocate_slab(s
 /*
  * Called once for each object on a new slab page
  */
-static void setup_object(struct kmem_cache *s, struct slqb_page *page,
-				void *object)
+static void setup_object(struct kmem_cache *s,
+				struct slqb_page *page, void *object)
 {
 	setup_object_debug(s, page, object);
 	if (unlikely(s->ctor))
@@ -927,7 +958,8 @@ static void setup_object(struct kmem_cac
 /*
  * Allocate a new slab, set up its object list.
  */
-static struct slqb_page *new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
+static struct slqb_page *new_slab_page(struct kmem_cache *s,
+				gfp_t flags, int node, unsigned int colour)
 {
 	struct slqb_page *page;
 	void *start;
@@ -1010,7 +1042,9 @@ static void free_slab(struct kmem_cache
  * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
  * list_lock in the case of per-node list.
  */
-static int free_object_to_page(struct kmem_cache *s, struct kmem_cache_list *l, struct slqb_page *page, void *object)
+static int free_object_to_page(struct kmem_cache *s,
+			struct kmem_cache_list *l, struct slqb_page *page,
+			void *object)
 {
 	VM_BUG_ON(page->list != l);
 
@@ -1027,6 +1061,7 @@ static int free_object_to_page(struct km
 		free_slab(s, page);
 		slqb_stat_inc(l, FLUSH_SLAB_FREE);
 		return 1;
+
 	} else if (page->inuse + 1 == s->objects) {
 		l->nr_partial++;
 		list_add(&page->lru, &l->partial);
@@ -1037,7 +1072,8 @@ static int free_object_to_page(struct km
 }
 
 #ifdef CONFIG_SMP
-static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c);
+static void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page,
+				void *object, struct kmem_cache_cpu *c);
 #endif
 
 /*
@@ -1110,7 +1146,8 @@ static void flush_free_list_all(struct k
  * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
  * list_lock in the case of per-node list.
  */
-static void claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+static void claim_remote_free_list(struct kmem_cache *s,
+					struct kmem_cache_list *l)
 {
 	void **head, **tail;
 	int nr;
@@ -1126,11 +1163,13 @@ static void claim_remote_free_list(struc
 	prefetchw(head);
 
 	spin_lock(&l->remote_free.lock);
+
 	l->remote_free.list.head = NULL;
 	tail = l->remote_free.list.tail;
 	l->remote_free.list.tail = NULL;
 	nr = l->remote_free.list.nr;
 	l->remote_free.list.nr = 0;
+
 	spin_unlock(&l->remote_free.lock);
 
 	if (!l->freelist.nr)
@@ -1153,18 +1192,19 @@ static void claim_remote_free_list(struc
  * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
  * list_lock in the case of per-node list.
  */
-static __always_inline void *__cache_list_get_object(struct kmem_cache *s, struct kmem_cache_list *l)
+static __always_inline void *__cache_list_get_object(struct kmem_cache *s,
+						struct kmem_cache_list *l)
 {
 	void *object;
 
 	object = l->freelist.head;
 	if (likely(object)) {
 		void *next = get_freepointer(s, object);
+
 		VM_BUG_ON(!l->freelist.nr);
 		l->freelist.nr--;
 		l->freelist.head = next;
-//		if (next)
-//			prefetchw(next);
+
 		return object;
 	}
 	VM_BUG_ON(l->freelist.nr);
@@ -1180,11 +1220,11 @@ static __always_inline void *__cache_lis
 		object = l->freelist.head;
 		if (likely(object)) {
 			void *next = get_freepointer(s, object);
+
 			VM_BUG_ON(!l->freelist.nr);
 			l->freelist.nr--;
 			l->freelist.head = next;
-//			if (next)
-//				prefetchw(next);
+
 			return object;
 		}
 		VM_BUG_ON(l->freelist.nr);
@@ -1203,7 +1243,8 @@ static __always_inline void *__cache_lis
  * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
  * list_lock in the case of per-node list.
  */
-static noinline void *__cache_list_get_page(struct kmem_cache *s, struct kmem_cache_list *l)
+static noinline void *__cache_list_get_page(struct kmem_cache *s,
+				struct kmem_cache_list *l)
 {
 	struct slqb_page *page;
 	void *object;
@@ -1216,15 +1257,12 @@ static noinline void *__cache_list_get_p
 	if (page->inuse + 1 == s->objects) {
 		l->nr_partial--;
 		list_del(&page->lru);
-/*XXX		list_move(&page->lru, &l->full); */
 	}
 
 	VM_BUG_ON(!page->freelist);
 
 	page->inuse++;
 
-//	VM_BUG_ON(node != -1 && node != slqb_page_to_nid(page));
-
 	object = page->freelist;
 	page->freelist = get_freepointer(s, object);
 	if (page->freelist)
@@ -1244,7 +1282,8 @@ static noinline void *__cache_list_get_p
  *
  * Must be called with interrupts disabled.
  */
-static noinline void *__slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)
+static noinline void *__slab_alloc_page(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
 {
 	struct slqb_page *page;
 	struct kmem_cache_list *l;
@@ -1285,8 +1324,8 @@ static noinline void *__slab_alloc_page(
 		slqb_stat_inc(l, ALLOC);
 		slqb_stat_inc(l, ALLOC_SLAB_NEW);
 		object = __cache_list_get_page(s, l);
-#ifdef CONFIG_NUMA
 	} else {
+#ifdef CONFIG_NUMA
 		struct kmem_cache_node *n;
 
 		n = s->node[slqb_page_to_nid(page)];
@@ -1308,7 +1347,8 @@ static noinline void *__slab_alloc_page(
 }
 
 #ifdef CONFIG_NUMA
-static noinline int alternate_nid(struct kmem_cache *s, gfp_t gfpflags, int node)
+static noinline int alternate_nid(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
 {
 	if (in_interrupt() || (gfpflags & __GFP_THISNODE))
 		return node;
@@ -1326,7 +1366,7 @@ static noinline int alternate_nid(struct
  * Must be called with interrupts disabled.
  */
 static noinline void *__remote_slab_alloc(struct kmem_cache *s,
-		gfp_t gfpflags, int node)
+				gfp_t gfpflags, int node)
 {
 	struct kmem_cache_node *n;
 	struct kmem_cache_list *l;
@@ -1337,9 +1377,6 @@ static noinline void *__remote_slab_allo
 		return NULL;
 	l = &n->list;
 
-//	if (unlikely(!(l->freelist.nr | l->nr_partial | l->remote_free_check)))
-//		return NULL;
-
 	spin_lock(&n->list_lock);
 
 	object = __cache_list_get_object(s, l);
@@ -1363,7 +1400,7 @@ static noinline void *__remote_slab_allo
  * Must be called with interrupts disabled.
  */
 static __always_inline void *__slab_alloc(struct kmem_cache *s,
-		gfp_t gfpflags, int node)
+				gfp_t gfpflags, int node)
 {
 	void *object;
 	struct kmem_cache_cpu *c;
@@ -1393,7 +1430,7 @@ static __always_inline void *__slab_allo
  * (debug checking and memset()ing).
  */
 static __always_inline void *slab_alloc(struct kmem_cache *s,
-		gfp_t gfpflags, int node, void *addr)
+				gfp_t gfpflags, int node, void *addr)
 {
 	void *object;
 	unsigned long flags;
@@ -1414,7 +1451,8 @@ again:
 	return object;
 }
 
-static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags, void *caller)
+static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s,
+				gfp_t gfpflags, void *caller)
 {
 	int node = -1;
 #ifdef CONFIG_NUMA
@@ -1449,7 +1487,8 @@ EXPORT_SYMBOL(kmem_cache_alloc_node);
  *
  * Must be called with interrupts disabled.
  */
-static void flush_remote_free_cache(struct kmem_cache *s, struct kmem_cache_cpu *c)
+static void flush_remote_free_cache(struct kmem_cache *s,
+				struct kmem_cache_cpu *c)
 {
 	struct kmlist *src;
 	struct kmem_cache_list *dst;
@@ -1464,6 +1503,7 @@ static void flush_remote_free_cache(stru
 #ifdef CONFIG_SLQB_STATS
 	{
 		struct kmem_cache_list *l = &c->list;
+
 		slqb_stat_inc(l, FLUSH_RFREE_LIST);
 		slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);
 	}
@@ -1472,6 +1512,7 @@ static void flush_remote_free_cache(stru
 	dst = c->remote_cache_list;
 
 	spin_lock(&dst->remote_free.lock);
+
 	if (!dst->remote_free.list.head)
 		dst->remote_free.list.head = src->head;
 	else
@@ -1500,7 +1541,9 @@ static void flush_remote_free_cache(stru
  *
  * Must be called with interrupts disabled.
  */
-static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c)
+static noinline void slab_free_to_remote(struct kmem_cache *s,
+				struct slqb_page *page, void *object,
+				struct kmem_cache_cpu *c)
 {
 	struct kmlist *r;
 
@@ -1526,14 +1569,14 @@ static noinline void slab_free_to_remote
 		flush_remote_free_cache(s, c);
 }
 #endif
- 
+
 /*
  * Main freeing path. Return an object, or NULL on allocation failure.
  *
  * Must be called with interrupts disabled.
  */
 static __always_inline void __slab_free(struct kmem_cache *s,
-		struct slqb_page *page, void *object)
+				struct slqb_page *page, void *object)
 {
 	struct kmem_cache_cpu *c;
 	struct kmem_cache_list *l;
@@ -1561,8 +1604,8 @@ static __always_inline void __slab_free(
 		if (unlikely(l->freelist.nr > slab_hiwater(s)))
 			flush_free_list(s, l);
 
-#ifdef CONFIG_NUMA
 	} else {
+#ifdef CONFIG_NUMA
 		/*
 		 * Freeing an object that was allocated on a remote node.
 		 */
@@ -1577,7 +1620,7 @@ static __always_inline void __slab_free(
  * (debug checking).
  */
 static __always_inline void slab_free(struct kmem_cache *s,
-		struct slqb_page *page, void *object)
+				struct slqb_page *page, void *object)
 {
 	unsigned long flags;
 
@@ -1597,6 +1640,7 @@ static __always_inline void slab_free(st
 void kmem_cache_free(struct kmem_cache *s, void *object)
 {
 	struct slqb_page *page = NULL;
+
 	if (numa_platform)
 		page = virt_to_head_slqb_page(object);
 	slab_free(s, page, object);
@@ -1610,7 +1654,7 @@ EXPORT_SYMBOL(kmem_cache_free);
  * in the page allocator, and they have fastpaths in the page allocator. But
  * also minimise external fragmentation with large objects.
  */
-static inline int slab_order(int size, int max_order, int frac)
+static int slab_order(int size, int max_order, int frac)
 {
 	int order;
 
@@ -1618,6 +1662,7 @@ static inline int slab_order(int size, i
 		order = 0;
 	else
 		order = fls(size - 1) - PAGE_SHIFT;
+
 	while (order <= max_order) {
 		unsigned long slab_size = PAGE_SIZE << order;
 		unsigned long objects;
@@ -1638,7 +1683,7 @@ static inline int slab_order(int size, i
 	return order;
 }
 
-static inline int calculate_order(int size)
+static int calculate_order(int size)
 {
 	int order;
 
@@ -1666,7 +1711,7 @@ static inline int calculate_order(int si
  * Figure out what the alignment of the objects will be.
  */
 static unsigned long calculate_alignment(unsigned long flags,
-		unsigned long align, unsigned long size)
+				unsigned long align, unsigned long size)
 {
 	/*
 	 * If the user wants hardware cache aligned objects then follow that
@@ -1677,6 +1722,7 @@ static unsigned long calculate_alignment
 	 */
 	if (flags & SLAB_HWCACHE_ALIGN) {
 		unsigned long ralign = cache_line_size();
+
 		while (size <= ralign / 2)
 			ralign /= 2;
 		align = max(align, ralign);
@@ -1688,21 +1734,21 @@ static unsigned long calculate_alignment
 	return ALIGN(align, sizeof(void *));
 }
 
-static void init_kmem_cache_list(struct kmem_cache *s, struct kmem_cache_list *l)
+static void init_kmem_cache_list(struct kmem_cache *s,
+				struct kmem_cache_list *l)
 {
-	l->cache = s;
-	l->freelist.nr = 0;
-	l->freelist.head = NULL;
-	l->freelist.tail = NULL;
-	l->nr_partial = 0;
-	l->nr_slabs = 0;
+	l->cache		= s;
+	l->freelist.nr		= 0;
+	l->freelist.head	= NULL;
+	l->freelist.tail	= NULL;
+	l->nr_partial		= 0;
+	l->nr_slabs		= 0;
 	INIT_LIST_HEAD(&l->partial);
-//	INIT_LIST_HEAD(&l->full);
 
 #ifdef CONFIG_SMP
-	l->remote_free_check = 0;
+	l->remote_free_check	= 0;
 	spin_lock_init(&l->remote_free.lock);
-	l->remote_free.list.nr = 0;
+	l->remote_free.list.nr	= 0;
 	l->remote_free.list.head = NULL;
 	l->remote_free.list.tail = NULL;
 #endif
@@ -1713,21 +1759,22 @@ static void init_kmem_cache_list(struct
 }
 
 static void init_kmem_cache_cpu(struct kmem_cache *s,
-			struct kmem_cache_cpu *c)
+				struct kmem_cache_cpu *c)
 {
 	init_kmem_cache_list(s, &c->list);
 
-	c->colour_next = 0;
+	c->colour_next		= 0;
 #ifdef CONFIG_SMP
-	c->rlist.nr = 0;
-	c->rlist.head = NULL;
-	c->rlist.tail = NULL;
-	c->remote_cache_list = NULL;
+	c->rlist.nr		= 0;
+	c->rlist.head		= NULL;
+	c->rlist.tail		= NULL;
+	c->remote_cache_list	= NULL;
 #endif
 }
 
 #ifdef CONFIG_NUMA
-static void init_kmem_cache_node(struct kmem_cache *s, struct kmem_cache_node *n)
+static void init_kmem_cache_node(struct kmem_cache *s,
+				struct kmem_cache_node *n)
 {
 	spin_lock_init(&n->list_lock);
 	init_kmem_cache_list(s, &n->list);
@@ -1757,7 +1804,8 @@ static struct kmem_cache_node kmem_node_
 #endif
 
 #ifdef CONFIG_SMP
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
+				int cpu)
 {
 	struct kmem_cache_cpu *c;
 
@@ -1918,14 +1966,15 @@ static int calculate_sizes(struct kmem_c
 	}
 
 #ifdef CONFIG_SLQB_DEBUG
-	if (flags & SLAB_STORE_USER)
+	if (flags & SLAB_STORE_USER) {
 		/*
 		 * Need to store information about allocs and frees after
 		 * the object.
 		 */
 		size += 2 * sizeof(struct track);
+	}
 
-	if (flags & SLAB_RED_ZONE)
+	if (flags & SLAB_RED_ZONE) {
 		/*
 		 * Add some empty padding so that we can catch
 		 * overwrites from earlier objects rather than let
@@ -1934,6 +1983,7 @@ static int calculate_sizes(struct kmem_c
 		 * of the object.
 		 */
 		size += sizeof(void *);
+	}
 #endif
 
 	/*
@@ -1970,7 +2020,8 @@ static int calculate_sizes(struct kmem_c
 	 */
 	s->objects = (PAGE_SIZE << s->order) / size;
 
-	s->freebatch = max(4UL*PAGE_SIZE / size, min(256UL, 64*PAGE_SIZE / size));
+	s->freebatch = max(4UL*PAGE_SIZE / size,
+				min(256UL, 64*PAGE_SIZE / size));
 	if (!s->freebatch)
 		s->freebatch = 1;
 	s->hiwater = s->freebatch << 2;
@@ -1980,9 +2031,8 @@ static int calculate_sizes(struct kmem_c
 }
 
 static int kmem_cache_open(struct kmem_cache *s,
-		const char *name, size_t size,
-		size_t align, unsigned long flags,
-		void (*ctor)(void *), int alloc)
+			const char *name, size_t size, size_t align,
+			unsigned long flags, void (*ctor)(void *), int alloc)
 {
 	unsigned int left_over;
 
@@ -2024,7 +2074,7 @@ error_nodes:
 	free_kmem_cache_nodes(s);
 error:
 	if (flags & SLAB_PANIC)
-		panic("kmem_cache_create(): failed to create slab `%s'\n",name);
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
 	return 0;
 }
 
@@ -2141,7 +2191,7 @@ EXPORT_SYMBOL(kmalloc_caches_dma);
 #endif
 
 static struct kmem_cache *open_kmalloc_cache(struct kmem_cache *s,
-		const char *name, int size, gfp_t gfp_flags)
+				const char *name, int size, gfp_t gfp_flags)
 {
 	unsigned int flags = ARCH_KMALLOC_FLAGS | SLAB_PANIC;
 
@@ -2446,10 +2496,10 @@ static int __init cpucache_init(void)
 
 	for_each_online_cpu(cpu)
 		start_cpu_timer(cpu);
+
 	return 0;
 }
-__initcall(cpucache_init);
-
+device_initcall(cpucache_init);
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
 static void slab_mem_going_offline_callback(void *arg)
@@ -2459,29 +2509,7 @@ static void slab_mem_going_offline_callb
 
 static void slab_mem_offline_callback(void *arg)
 {
-	struct kmem_cache *s;
-	struct memory_notify *marg = arg;
-	int nid = marg->status_change_nid;
-
-	/*
-	 * If the node still has available memory. we need kmem_cache_node
-	 * for it yet.
-	 */
-	if (nid < 0)
-		return;
-
-#if 0 // XXX: see cpu offline comment
-	down_read(&slqb_lock);
-	list_for_each_entry(s, &slab_caches, list) {
-		struct kmem_cache_node *n;
-		n = s->node[nid];
-		if (n) {
-			s->node[nid] = NULL;
-			kmem_cache_free(&kmem_node_cache, n);
-		}
-	}
-	up_read(&slqb_lock);
-#endif
+	/* XXX: should release structures, see CPU offline comment */
 }
 
 static int slab_mem_going_online_callback(void *arg)
@@ -2562,6 +2590,10 @@ void __init kmem_cache_init(void)
 	int i;
 	unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
 
+	/*
+	 * All the ifdefs are rather ugly here, but it's just the setup code,
+	 * so it doesn't have to be too readable :)
+	 */
 #ifdef CONFIG_NUMA
 	if (num_possible_nodes() == 1)
 		numa_platform = 0;
@@ -2576,12 +2608,15 @@ void __init kmem_cache_init(void)
 	kmem_size = sizeof(struct kmem_cache);
 #endif
 
-	kmem_cache_open(&kmem_cache_cache, "kmem_cache", kmem_size, 0, flags, NULL, 0);
+	kmem_cache_open(&kmem_cache_cache, "kmem_cache",
+			kmem_size, 0, flags, NULL, 0);
 #ifdef CONFIG_SMP
-	kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu", sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
+	kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu",
+			sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
 #endif
 #ifdef CONFIG_NUMA
-	kmem_cache_open(&kmem_node_cache, "kmem_cache_node", sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
+	kmem_cache_open(&kmem_node_cache, "kmem_cache_node",
+			sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
 #endif
 
 #ifdef CONFIG_SMP
@@ -2634,14 +2669,13 @@ void __init kmem_cache_init(void)
 
 	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
 		open_kmalloc_cache(&kmalloc_caches[i],
-			"kmalloc", 1 << i, GFP_KERNEL);
+				"kmalloc", 1 << i, GFP_KERNEL);
 #ifdef CONFIG_ZONE_DMA
 		open_kmalloc_cache(&kmalloc_caches_dma[i],
 				"kmalloc_dma", 1 << i, GFP_KERNEL|SLQB_DMA);
 #endif
 	}
 
-
 	/*
 	 * Patch up the size_index table if we have strange large alignment
 	 * requirements for the kmalloc array. This is only the case for
@@ -2697,10 +2731,12 @@ static int kmem_cache_create_ok(const ch
 		printk(KERN_ERR "kmem_cache_create(): early error in slab %s\n",
 				name);
 		dump_stack();
+
 		return 0;
 	}
 
 	down_read(&slqb_lock);
+
 	list_for_each_entry(tmp, &slab_caches, list) {
 		char x;
 		int res;
@@ -2723,9 +2759,11 @@ static int kmem_cache_create_ok(const ch
 			       "kmem_cache_create(): duplicate cache %s\n", name);
 			dump_stack();
 			up_read(&slqb_lock);
+
 			return 0;
 		}
 	}
+
 	up_read(&slqb_lock);
 
 	WARN_ON(strchr(name, ' '));	/* It confuses parsers */
@@ -2754,7 +2792,8 @@ struct kmem_cache *kmem_cache_create(con
 
 err:
 	if (flags & SLAB_PANIC)
-		panic("kmem_cache_create(): failed to create slab `%s'\n",name);
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+
 	return NULL;
 }
 EXPORT_SYMBOL(kmem_cache_create);
@@ -2765,7 +2804,7 @@ EXPORT_SYMBOL(kmem_cache_create);
  * necessary.
  */
 static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
-		unsigned long action, void *hcpu)
+				unsigned long action, void *hcpu)
 {
 	long cpu = (long)hcpu;
 	struct kmem_cache *s;
@@ -2803,23 +2842,12 @@ static int __cpuinit slab_cpuup_callback
 	case CPU_UP_CANCELED_FROZEN:
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
-#if 0
-		down_read(&slqb_lock);
-		/* XXX: this doesn't work because objects can still be on this
-		 * CPU's list. periodic timer needs to check if a CPU is offline
-		 * and then try to cleanup from there. Same for node offline.
+		/*
+		 * XXX: Freeing here doesn't work because objects can still be
+		 * on this CPU's list. periodic timer needs to check if a CPU
+		 * is offline and then try to cleanup from there. Same for node
+		 * offline.
 		 */
-		list_for_each_entry(s, &slab_caches, list) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-			if (c) {
-				kmem_cache_free(&kmem_cpu_cache, c);
-				s->cpu_slab[cpu] = NULL;
-			}
-		}
-
-		up_read(&slqb_lock);
-#endif
-		break;
 	default:
 		break;
 	}
@@ -2904,9 +2932,8 @@ static void __gather_stats(void *arg)
 	gather->nr_partial += nr_partial;
 	gather->nr_inuse += nr_inuse;
 #ifdef CONFIG_SLQB_STATS
-	for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+	for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
 		gather->stats[i] += l->stats[i];
-	}
 #endif
 	spin_unlock(&gather->lock);
 }
@@ -2935,9 +2962,8 @@ static void gather_stats(struct kmem_cac
 
 		spin_lock_irqsave(&n->list_lock, flags);
 #ifdef CONFIG_SLQB_STATS
-		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
 			stats->stats[i] += l->stats[i];
-		}
 #endif
 		stats->nr_slabs += l->nr_slabs;
 		stats->nr_partial += l->nr_partial;
@@ -3007,10 +3033,11 @@ static int s_show(struct seq_file *m, vo
 	gather_stats(s, &stats);
 
 	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, stats.nr_inuse,
-		   stats.nr_objects, s->size, s->objects, (1 << s->order));
-	seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s), slab_freebatch(s), 0);
-	seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs, stats.nr_slabs,
-		   0UL);
+			stats.nr_objects, s->size, s->objects, (1 << s->order));
+	seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s),
+			slab_freebatch(s), 0);
+	seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs,
+			stats.nr_slabs, 0UL);
 	seq_putc(m, '\n');
 	return 0;
 }
@@ -3036,7 +3063,8 @@ static const struct file_operations proc
 
 static int __init slab_proc_init(void)
 {
-	proc_create("slabinfo",S_IWUSR|S_IRUGO,NULL,&proc_slabinfo_operations);
+	proc_create("slabinfo", S_IWUSR|S_IRUGO, NULL,
+			&proc_slabinfo_operations);
 	return 0;
 }
 module_init(slab_proc_init);
@@ -3106,7 +3134,9 @@ SLAB_ATTR_RO(ctor);
 static ssize_t slabs_show(struct kmem_cache *s, char *buf)
 {
 	struct stats_gather stats;
+
 	gather_stats(s, &stats);
+
 	return sprintf(buf, "%lu\n", stats.nr_slabs);
 }
 SLAB_ATTR_RO(slabs);
@@ -3114,7 +3144,9 @@ SLAB_ATTR_RO(slabs);
 static ssize_t objects_show(struct kmem_cache *s, char *buf)
 {
 	struct stats_gather stats;
+
 	gather_stats(s, &stats);
+
 	return sprintf(buf, "%lu\n", stats.nr_inuse);
 }
 SLAB_ATTR_RO(objects);
@@ -3122,7 +3154,9 @@ SLAB_ATTR_RO(objects);
 static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
 {
 	struct stats_gather stats;
+
 	gather_stats(s, &stats);
+
 	return sprintf(buf, "%lu\n", stats.nr_objects);
 }
 SLAB_ATTR_RO(total_objects);
@@ -3171,7 +3205,8 @@ static ssize_t store_user_show(struct km
 }
 SLAB_ATTR_RO(store_user);
 
-static ssize_t hiwater_store(struct kmem_cache *s, const char *buf, size_t length)
+static ssize_t hiwater_store(struct kmem_cache *s,
+				const char *buf, size_t length)
 {
 	long hiwater;
 	int err;
@@ -3194,7 +3229,8 @@ static ssize_t hiwater_show(struct kmem_
 }
 SLAB_ATTR(hiwater);
 
-static ssize_t freebatch_store(struct kmem_cache *s, const char *buf, size_t length)
+static ssize_t freebatch_store(struct kmem_cache *s,
+				const char *buf, size_t length)
 {
 	long freebatch;
 	int err;
@@ -3216,6 +3252,7 @@ static ssize_t freebatch_show(struct kme
 	return sprintf(buf, "%d\n", slab_freebatch(s));
 }
 SLAB_ATTR(freebatch);
+
 #ifdef CONFIG_SLQB_STATS
 static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
 {
@@ -3233,8 +3270,9 @@ static int show_stat(struct kmem_cache *
 	for_each_online_cpu(cpu) {
 		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
 		struct kmem_cache_list *l = &c->list;
+
 		if (len < PAGE_SIZE - 20)
-			len += sprintf(buf + len, " C%d=%lu", cpu, l->stats[si]);
+			len += sprintf(buf+len, " C%d=%lu", cpu, l->stats[si]);
 	}
 #endif
 	return len + sprintf(buf + len, "\n");
@@ -3308,8 +3346,7 @@ static struct attribute_group slab_attr_
 };
 
 static ssize_t slab_attr_show(struct kobject *kobj,
-				struct attribute *attr,
-				char *buf)
+				struct attribute *attr, char *buf)
 {
 	struct slab_attribute *attribute;
 	struct kmem_cache *s;
@@ -3327,8 +3364,7 @@ static ssize_t slab_attr_show(struct kob
 }
 
 static ssize_t slab_attr_store(struct kobject *kobj,
-				struct attribute *attr,
-				const char *buf, size_t len)
+			struct attribute *attr, const char *buf, size_t len)
 {
 	struct slab_attribute *attribute;
 	struct kmem_cache *s;
@@ -3396,6 +3432,7 @@ static int sysfs_slab_add(struct kmem_ca
 	err = sysfs_create_group(&s->kobj, &slab_attr_group);
 	if (err)
 		return err;
+
 	kobject_uevent(&s->kobj, KOBJ_ADD);
 
 	return 0;
@@ -3420,17 +3457,20 @@ static int __init slab_sysfs_init(void)
 	}
 
 	down_write(&slqb_lock);
+
 	sysfs_available = 1;
+
 	list_for_each_entry(s, &slab_caches, list) {
 		err = sysfs_slab_add(s);
 		if (err)
 			printk(KERN_ERR "SLQB: Unable to add boot slab %s"
 						" to sysfs\n", s->name);
 	}
+
 	up_write(&slqb_lock);
 
 	return 0;
 }
+device_initcall(slab_sysfs_init);
 
-__initcall(slab_sysfs_init);
 #endif

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
@ 2009-01-21 16:56     ` Nick Piggin
  0 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-21 16:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Wed, Jan 21, 2009 at 03:59:18PM +0100, Ingo Molnar wrote:
> 
> Mind if i nitpick a bit about minor style issues? Since this is going to 
> be the next Linux SLAB allocator we might as well do it perfectly :-)

Well, here is an incremental patch which should address most of the issues
you pointed out, most of the sane ones that checkpatch pointed out, and a
few of my own ;)

---
 include/linux/slqb_def.h |   90 +++++-----
 mm/slqb.c                |  386 +++++++++++++++++++++++++----------------------
 2 files changed, 261 insertions(+), 215 deletions(-)

Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- linux-2.6.orig/include/linux/slqb_def.h
+++ linux-2.6/include/linux/slqb_def.h
@@ -37,8 +37,9 @@ enum stat_item {
  * Singly-linked list with head, tail, and nr
  */
 struct kmlist {
-	unsigned long nr;
-	void **head, **tail;
+	unsigned long	nr;
+	void 		**head;
+	void		**tail;
 };
 
 /*
@@ -46,8 +47,8 @@ struct kmlist {
  * objects can be returned to the kmem_cache_list from remote CPUs.
  */
 struct kmem_cache_remote_free {
-	spinlock_t lock;
-	struct kmlist list;
+	spinlock_t	lock;
+	struct kmlist	list;
 } ____cacheline_aligned;
 
 /*
@@ -56,18 +57,23 @@ struct kmem_cache_remote_free {
  * kmem_cache_lists allow off-node allocations (but require locking).
  */
 struct kmem_cache_list {
-	struct kmlist freelist;	/* Fastpath LIFO freelist of objects */
+				/* Fastpath LIFO freelist of objects */
+	struct kmlist		freelist;
 #ifdef CONFIG_SMP
-	int remote_free_check;	/* remote_free has reached a watermark */
+				/* remote_free has reached a watermark */
+	int			remote_free_check;
 #endif
-	struct kmem_cache *cache; /* kmem_cache corresponding to this list */
+				/* kmem_cache corresponding to this list */
+	struct kmem_cache	*cache;
 
-	unsigned long nr_partial; /* Number of partial slabs (pages) */
-	struct list_head partial; /* Slabs which have some free objects */
+				/* Number of partial slabs (pages) */
+	unsigned long		nr_partial;
 
-	unsigned long nr_slabs;	/* Total number of slabs allocated */
+				/* Slabs which have some free objects */
+	struct list_head	partial;
 
-	//struct list_head full;
+				/* Total number of slabs allocated */
+	unsigned long		nr_slabs;
 
 #ifdef CONFIG_SMP
 	/*
@@ -79,7 +85,7 @@ struct kmem_cache_list {
 #endif
 
 #ifdef CONFIG_SLQB_STATS
-	unsigned long stats[NR_SLQB_STAT_ITEMS];
+	unsigned long		stats[NR_SLQB_STAT_ITEMS];
 #endif
 } ____cacheline_aligned;
 
@@ -87,9 +93,8 @@ struct kmem_cache_list {
  * Primary per-cpu, per-kmem_cache structure.
  */
 struct kmem_cache_cpu {
-	struct kmem_cache_list list; /* List for node-local slabs. */
-
-	unsigned int colour_next;
+	struct kmem_cache_list	list;		/* List for node-local slabs */
+	unsigned int		colour_next;	/* Next colour offset to use */
 
 #ifdef CONFIG_SMP
 	/*
@@ -101,53 +106,53 @@ struct kmem_cache_cpu {
 	 * An NR_CPUS or MAX_NUMNODES array would be nice here, but then we
 	 * get to O(NR_CPUS^2) memory consumption situation.
 	 */
-	struct kmlist rlist;
-	struct kmem_cache_list *remote_cache_list;
+	struct kmlist		rlist;
+	struct kmem_cache_list	*remote_cache_list;
 #endif
 } ____cacheline_aligned;
 
 /*
- * Per-node, per-kmem_cache structure.
+ * Per-node, per-kmem_cache structure. Used for node-specific allocations.
  */
 struct kmem_cache_node {
-	struct kmem_cache_list list;
-	spinlock_t list_lock; /* protects access to list */
+	struct kmem_cache_list	list;
+	spinlock_t		list_lock;	/* protects access to list */
 } ____cacheline_aligned;
 
 /*
  * Management object for a slab cache.
  */
 struct kmem_cache {
-	unsigned long flags;
-	int hiwater;		/* LIFO list high watermark */
-	int freebatch;		/* LIFO freelist batch flush size */
-	int objsize;		/* The size of an object without meta data */
-	int offset;		/* Free pointer offset. */
-	int objects;		/* Number of objects in slab */
-
-	int size;		/* The size of an object including meta data */
-	int order;		/* Allocation order */
-	gfp_t allocflags;	/* gfp flags to use on allocation */
-	unsigned int colour_range;	/* range of colour counter */
-	unsigned int colour_off;		/* offset per colour */
-	void (*ctor)(void *);
+	unsigned long	flags;
+	int		hiwater;	/* LIFO list high watermark */
+	int		freebatch;	/* LIFO freelist batch flush size */
+	int		objsize;	/* Size of object without meta data */
+	int		offset;		/* Free pointer offset. */
+	int		objects;	/* Number of objects in slab */
+
+	int		size;		/* Size of object including meta data */
+	int		order;		/* Allocation order */
+	gfp_t		allocflags;	/* gfp flags to use on allocation */
+	unsigned int	colour_range;	/* range of colour counter */
+	unsigned int	colour_off;	/* offset per colour */
+	void		(*ctor)(void *);
 
-	const char *name;	/* Name (only for display!) */
-	struct list_head list;	/* List of slab caches */
+	const char	*name;		/* Name (only for display!) */
+	struct list_head list;		/* List of slab caches */
 
-	int align;		/* Alignment */
-	int inuse;		/* Offset to metadata */
+	int		align;		/* Alignment */
+	int		inuse;		/* Offset to metadata */
 
 #ifdef CONFIG_SLQB_SYSFS
-	struct kobject kobj;	/* For sysfs */
+	struct kobject	kobj;		/* For sysfs */
 #endif
 #ifdef CONFIG_NUMA
-	struct kmem_cache_node *node[MAX_NUMNODES];
+	struct kmem_cache_node	*node[MAX_NUMNODES];
 #endif
 #ifdef CONFIG_SMP
-	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
+	struct kmem_cache_cpu	*cpu_slab[NR_CPUS];
 #else
-	struct kmem_cache_cpu cpu_slab;
+	struct kmem_cache_cpu	cpu_slab;
 #endif
 };
 
@@ -245,7 +250,8 @@ void *__kmalloc(size_t size, gfp_t flags
 #define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
 #endif
 
-#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? sizeof(void *) : ARCH_KMALLOC_MINALIGN)
+#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ?	\
+				sizeof(void *) : ARCH_KMALLOC_MINALIGN)
 
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
Index: linux-2.6/mm/slqb.c
===================================================================
--- linux-2.6.orig/mm/slqb.c
+++ linux-2.6/mm/slqb.c
@@ -40,13 +40,13 @@
 struct slqb_page {
 	union {
 		struct {
-			unsigned long flags;	/* mandatory */
-			atomic_t _count;	/* mandatory */
-			unsigned int inuse;	/* Nr of objects */
-		   	struct kmem_cache_list *list; /* Pointer to list */
-			void **freelist;	/* freelist req. slab lock */
+			unsigned long	flags;		/* mandatory */
+			atomic_t	_count;		/* mandatory */
+			unsigned int	inuse;		/* Nr of objects */
+			struct kmem_cache_list *list;	/* Pointer to list */
+			void		 **freelist;	/* LIFO freelist */
 			union {
-				struct list_head lru; /* misc. list */
+				struct list_head lru;	/* misc. list */
 				struct rcu_head rcu_head; /* for rcu freeing */
 			};
 		};
@@ -62,7 +62,7 @@ static int kmem_size __read_mostly;
 #ifdef CONFIG_NUMA
 static int numa_platform __read_mostly;
 #else
-#define numa_platform 0
+static const int numa_platform = 0;
 #endif
 
 static inline int slab_hiwater(struct kmem_cache *s)
@@ -120,15 +120,16 @@ static inline int slab_freebatch(struct
  * - There is no remote free queue. Nodes don't free objects, CPUs do.
  */
 
-static inline void slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
+static inline void slqb_stat_inc(struct kmem_cache_list *list,
+				enum stat_item si)
 {
 #ifdef CONFIG_SLQB_STATS
 	list->stats[si]++;
 #endif
 }
 
-static inline void slqb_stat_add(struct kmem_cache_list *list, enum stat_item si,
-					unsigned long nr)
+static inline void slqb_stat_add(struct kmem_cache_list *list,
+				enum stat_item si, unsigned long nr)
 {
 #ifdef CONFIG_SLQB_STATS
 	list->stats[si] += nr;
@@ -433,10 +434,11 @@ static void print_page_info(struct slqb_
 
 }
 
+#define MAX_ERR_STR 100
 static void slab_bug(struct kmem_cache *s, char *fmt, ...)
 {
 	va_list args;
-	char buf[100];
+	char buf[MAX_ERR_STR];
 
 	va_start(args, fmt);
 	vsnprintf(buf, sizeof(buf), fmt, args);
@@ -477,8 +479,7 @@ static void print_trailer(struct kmem_ca
 	print_section("Object", p, min(s->objsize, 128));
 
 	if (s->flags & SLAB_RED_ZONE)
-		print_section("Redzone", p + s->objsize,
-			s->inuse - s->objsize);
+		print_section("Redzone", p + s->objsize, s->inuse - s->objsize);
 
 	if (s->offset)
 		off = s->offset + sizeof(void *);
@@ -488,9 +489,10 @@ static void print_trailer(struct kmem_ca
 	if (s->flags & SLAB_STORE_USER)
 		off += 2 * sizeof(struct track);
 
-	if (off != s->size)
+	if (off != s->size) {
 		/* Beginning of the filler is the free pointer */
 		print_section("Padding", p + off, s->size - off);
+	}
 
 	dump_stack();
 }
@@ -502,14 +504,9 @@ static void object_err(struct kmem_cache
 	print_trailer(s, page, object);
 }
 
-static void slab_err(struct kmem_cache *s, struct slqb_page *page, char *fmt, ...)
+static void slab_err(struct kmem_cache *s, struct slqb_page *page,
+			char *fmt, ...)
 {
-	va_list args;
-	char buf[100];
-
-	va_start(args, fmt);
-	vsnprintf(buf, sizeof(buf), fmt, args);
-	va_end(args);
 	slab_bug(s, fmt);
 	print_page_info(page);
 	dump_stack();
@@ -524,10 +521,11 @@ static void init_object(struct kmem_cach
 		p[s->objsize - 1] = POISON_END;
 	}
 
-	if (s->flags & SLAB_RED_ZONE)
+	if (s->flags & SLAB_RED_ZONE) {
 		memset(p + s->objsize,
 			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
 			s->inuse - s->objsize);
+	}
 }
 
 static u8 *check_bytes(u8 *start, unsigned int value, unsigned int bytes)
@@ -542,7 +540,7 @@ static u8 *check_bytes(u8 *start, unsign
 }
 
 static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
-						void *from, void *to)
+				void *from, void *to)
 {
 	slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
 	memset(from, data, to - from);
@@ -610,13 +608,15 @@ static int check_pad_bytes(struct kmem_c
 {
 	unsigned long off = s->inuse;	/* The end of info */
 
-	if (s->offset)
+	if (s->offset) {
 		/* Freepointer is placed after the object. */
 		off += sizeof(void *);
+	}
 
-	if (s->flags & SLAB_STORE_USER)
+	if (s->flags & SLAB_STORE_USER) {
 		/* We also have user information there */
 		off += 2 * sizeof(struct track);
+	}
 
 	if (s->size == off)
 		return 1;
@@ -646,6 +646,7 @@ static int slab_pad_check(struct kmem_ca
 	fault = check_bytes(start + length, POISON_INUSE, remainder);
 	if (!fault)
 		return 1;
+
 	while (end > fault && end[-1] == POISON_INUSE)
 		end--;
 
@@ -677,12 +678,16 @@ static int check_object(struct kmem_cach
 	}
 
 	if (s->flags & SLAB_POISON) {
-		if (!active && (s->flags & __OBJECT_POISON) &&
-			(!check_bytes_and_report(s, page, p, "Poison", p,
-					POISON_FREE, s->objsize - 1) ||
-			 !check_bytes_and_report(s, page, p, "Poison",
-				p + s->objsize - 1, POISON_END, 1)))
-			return 0;
+		if (!active && (s->flags & __OBJECT_POISON)) {
+			if (!check_bytes_and_report(s, page, p, "Poison", p,
+					POISON_FREE, s->objsize - 1))
+				return 0;
+
+			if (!check_bytes_and_report(s, page, p, "Poison",
+					p + s->objsize - 1, POISON_END, 1))
+				return 0;
+		}
+
 		/*
 		 * check_pad_bytes cleans up on its own.
 		 */
@@ -712,7 +717,8 @@ static int check_slab(struct kmem_cache
 	return 1;
 }
 
-static void trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
+static void trace(struct kmem_cache *s, struct slqb_page *page,
+			void *object, int alloc)
 {
 	if (s->flags & SLAB_TRACE) {
 		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
@@ -729,7 +735,7 @@ static void trace(struct kmem_cache *s,
 }
 
 static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
-								void *object)
+				void *object)
 {
 	if (!slab_debug(s))
 		return;
@@ -741,7 +747,8 @@ static void setup_object_debug(struct km
 	init_tracking(s, object);
 }
 
-static int alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
+static int alloc_debug_processing(struct kmem_cache *s,
+					void *object, void *addr)
 {
 	struct slqb_page *page;
 	page = virt_to_head_slqb_page(object);
@@ -768,7 +775,8 @@ bad:
 	return 0;
 }
 
-static int free_debug_processing(struct kmem_cache *s, void *object, void *addr)
+static int free_debug_processing(struct kmem_cache *s,
+					void *object, void *addr)
 {
 	struct slqb_page *page;
 	page = virt_to_head_slqb_page(object);
@@ -799,25 +807,28 @@ fail:
 static int __init setup_slqb_debug(char *str)
 {
 	slqb_debug = DEBUG_DEFAULT_FLAGS;
-	if (*str++ != '=' || !*str)
+	if (*str++ != '=' || !*str) {
 		/*
 		 * No options specified. Switch on full debugging.
 		 */
 		goto out;
+	}
 
-	if (*str == ',')
+	if (*str == ',') {
 		/*
 		 * No options but restriction on slabs. This means full
 		 * debugging for slabs matching a pattern.
 		 */
 		goto check_slabs;
+	}
 
 	slqb_debug = 0;
-	if (*str == '-')
+	if (*str == '-') {
 		/*
 		 * Switch off all debugging measures.
 		 */
 		goto out;
+	}
 
 	/*
 	 * Determine which debug features should be switched on
@@ -855,8 +866,8 @@ out:
 __setup("slqb_debug", setup_slqb_debug);
 
 static unsigned long kmem_cache_flags(unsigned long objsize,
-	unsigned long flags, const char *name,
-	void (*ctor)(void *))
+				unsigned long flags, const char *name,
+				void (*ctor)(void *))
 {
 	/*
 	 * Enable debugging if selected on the kernel commandline.
@@ -870,31 +881,51 @@ static unsigned long kmem_cache_flags(un
 }
 #else
 static inline void setup_object_debug(struct kmem_cache *s,
-			struct slqb_page *page, void *object) {}
+			struct slqb_page *page, void *object)
+{
+}
 
 static inline int alloc_debug_processing(struct kmem_cache *s,
-	void *object, void *addr) { return 0; }
+			void *object, void *addr)
+{
+	return 0;
+}
 
 static inline int free_debug_processing(struct kmem_cache *s,
-	void *object, void *addr) { return 0; }
+			void *object, void *addr)
+{
+	return 0;
+}
 
 static inline int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
-			{ return 1; }
+{
+	return 1;
+}
+
 static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
-			void *object, int active) { return 1; }
-static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page) {}
+			void *object, int active)
+{
+	return 1;
+}
+
+static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page)
+{
+}
+
 static inline unsigned long kmem_cache_flags(unsigned long objsize,
 	unsigned long flags, const char *name, void (*ctor)(void *))
 {
 	return flags;
 }
-#define slqb_debug 0
+
+static const int slqb_debug = 0;
 #endif
 
 /*
  * allocate a new slab (return its corresponding struct slqb_page)
  */
-static struct slqb_page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct slqb_page *allocate_slab(struct kmem_cache *s,
+					gfp_t flags, int node)
 {
 	struct slqb_page *page;
 	int pages = 1 << s->order;
@@ -916,8 +947,8 @@ static struct slqb_page *allocate_slab(s
 /*
  * Called once for each object on a new slab page
  */
-static void setup_object(struct kmem_cache *s, struct slqb_page *page,
-				void *object)
+static void setup_object(struct kmem_cache *s,
+				struct slqb_page *page, void *object)
 {
 	setup_object_debug(s, page, object);
 	if (unlikely(s->ctor))
@@ -927,7 +958,8 @@ static void setup_object(struct kmem_cac
 /*
  * Allocate a new slab, set up its object list.
  */
-static struct slqb_page *new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
+static struct slqb_page *new_slab_page(struct kmem_cache *s,
+				gfp_t flags, int node, unsigned int colour)
 {
 	struct slqb_page *page;
 	void *start;
@@ -1010,7 +1042,9 @@ static void free_slab(struct kmem_cache
  * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
  * list_lock in the case of per-node list.
  */
-static int free_object_to_page(struct kmem_cache *s, struct kmem_cache_list *l, struct slqb_page *page, void *object)
+static int free_object_to_page(struct kmem_cache *s,
+			struct kmem_cache_list *l, struct slqb_page *page,
+			void *object)
 {
 	VM_BUG_ON(page->list != l);
 
@@ -1027,6 +1061,7 @@ static int free_object_to_page(struct km
 		free_slab(s, page);
 		slqb_stat_inc(l, FLUSH_SLAB_FREE);
 		return 1;
+
 	} else if (page->inuse + 1 == s->objects) {
 		l->nr_partial++;
 		list_add(&page->lru, &l->partial);
@@ -1037,7 +1072,8 @@ static int free_object_to_page(struct km
 }
 
 #ifdef CONFIG_SMP
-static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c);
+static void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page,
+				void *object, struct kmem_cache_cpu *c);
 #endif
 
 /*
@@ -1110,7 +1146,8 @@ static void flush_free_list_all(struct k
  * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
  * list_lock in the case of per-node list.
  */
-static void claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+static void claim_remote_free_list(struct kmem_cache *s,
+					struct kmem_cache_list *l)
 {
 	void **head, **tail;
 	int nr;
@@ -1126,11 +1163,13 @@ static void claim_remote_free_list(struc
 	prefetchw(head);
 
 	spin_lock(&l->remote_free.lock);
+
 	l->remote_free.list.head = NULL;
 	tail = l->remote_free.list.tail;
 	l->remote_free.list.tail = NULL;
 	nr = l->remote_free.list.nr;
 	l->remote_free.list.nr = 0;
+
 	spin_unlock(&l->remote_free.lock);
 
 	if (!l->freelist.nr)
@@ -1153,18 +1192,19 @@ static void claim_remote_free_list(struc
  * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
  * list_lock in the case of per-node list.
  */
-static __always_inline void *__cache_list_get_object(struct kmem_cache *s, struct kmem_cache_list *l)
+static __always_inline void *__cache_list_get_object(struct kmem_cache *s,
+						struct kmem_cache_list *l)
 {
 	void *object;
 
 	object = l->freelist.head;
 	if (likely(object)) {
 		void *next = get_freepointer(s, object);
+
 		VM_BUG_ON(!l->freelist.nr);
 		l->freelist.nr--;
 		l->freelist.head = next;
-//		if (next)
-//			prefetchw(next);
+
 		return object;
 	}
 	VM_BUG_ON(l->freelist.nr);
@@ -1180,11 +1220,11 @@ static __always_inline void *__cache_lis
 		object = l->freelist.head;
 		if (likely(object)) {
 			void *next = get_freepointer(s, object);
+
 			VM_BUG_ON(!l->freelist.nr);
 			l->freelist.nr--;
 			l->freelist.head = next;
-//			if (next)
-//				prefetchw(next);
+
 			return object;
 		}
 		VM_BUG_ON(l->freelist.nr);
@@ -1203,7 +1243,8 @@ static __always_inline void *__cache_lis
  * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
  * list_lock in the case of per-node list.
  */
-static noinline void *__cache_list_get_page(struct kmem_cache *s, struct kmem_cache_list *l)
+static noinline void *__cache_list_get_page(struct kmem_cache *s,
+				struct kmem_cache_list *l)
 {
 	struct slqb_page *page;
 	void *object;
@@ -1216,15 +1257,12 @@ static noinline void *__cache_list_get_p
 	if (page->inuse + 1 == s->objects) {
 		l->nr_partial--;
 		list_del(&page->lru);
-/*XXX		list_move(&page->lru, &l->full); */
 	}
 
 	VM_BUG_ON(!page->freelist);
 
 	page->inuse++;
 
-//	VM_BUG_ON(node != -1 && node != slqb_page_to_nid(page));
-
 	object = page->freelist;
 	page->freelist = get_freepointer(s, object);
 	if (page->freelist)
@@ -1244,7 +1282,8 @@ static noinline void *__cache_list_get_p
  *
  * Must be called with interrupts disabled.
  */
-static noinline void *__slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)
+static noinline void *__slab_alloc_page(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
 {
 	struct slqb_page *page;
 	struct kmem_cache_list *l;
@@ -1285,8 +1324,8 @@ static noinline void *__slab_alloc_page(
 		slqb_stat_inc(l, ALLOC);
 		slqb_stat_inc(l, ALLOC_SLAB_NEW);
 		object = __cache_list_get_page(s, l);
-#ifdef CONFIG_NUMA
 	} else {
+#ifdef CONFIG_NUMA
 		struct kmem_cache_node *n;
 
 		n = s->node[slqb_page_to_nid(page)];
@@ -1308,7 +1347,8 @@ static noinline void *__slab_alloc_page(
 }
 
 #ifdef CONFIG_NUMA
-static noinline int alternate_nid(struct kmem_cache *s, gfp_t gfpflags, int node)
+static noinline int alternate_nid(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
 {
 	if (in_interrupt() || (gfpflags & __GFP_THISNODE))
 		return node;
@@ -1326,7 +1366,7 @@ static noinline int alternate_nid(struct
  * Must be called with interrupts disabled.
  */
 static noinline void *__remote_slab_alloc(struct kmem_cache *s,
-		gfp_t gfpflags, int node)
+				gfp_t gfpflags, int node)
 {
 	struct kmem_cache_node *n;
 	struct kmem_cache_list *l;
@@ -1337,9 +1377,6 @@ static noinline void *__remote_slab_allo
 		return NULL;
 	l = &n->list;
 
-//	if (unlikely(!(l->freelist.nr | l->nr_partial | l->remote_free_check)))
-//		return NULL;
-
 	spin_lock(&n->list_lock);
 
 	object = __cache_list_get_object(s, l);
@@ -1363,7 +1400,7 @@ static noinline void *__remote_slab_allo
  * Must be called with interrupts disabled.
  */
 static __always_inline void *__slab_alloc(struct kmem_cache *s,
-		gfp_t gfpflags, int node)
+				gfp_t gfpflags, int node)
 {
 	void *object;
 	struct kmem_cache_cpu *c;
@@ -1393,7 +1430,7 @@ static __always_inline void *__slab_allo
  * (debug checking and memset()ing).
  */
 static __always_inline void *slab_alloc(struct kmem_cache *s,
-		gfp_t gfpflags, int node, void *addr)
+				gfp_t gfpflags, int node, void *addr)
 {
 	void *object;
 	unsigned long flags;
@@ -1414,7 +1451,8 @@ again:
 	return object;
 }
 
-static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags, void *caller)
+static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s,
+				gfp_t gfpflags, void *caller)
 {
 	int node = -1;
 #ifdef CONFIG_NUMA
@@ -1449,7 +1487,8 @@ EXPORT_SYMBOL(kmem_cache_alloc_node);
  *
  * Must be called with interrupts disabled.
  */
-static void flush_remote_free_cache(struct kmem_cache *s, struct kmem_cache_cpu *c)
+static void flush_remote_free_cache(struct kmem_cache *s,
+				struct kmem_cache_cpu *c)
 {
 	struct kmlist *src;
 	struct kmem_cache_list *dst;
@@ -1464,6 +1503,7 @@ static void flush_remote_free_cache(stru
 #ifdef CONFIG_SLQB_STATS
 	{
 		struct kmem_cache_list *l = &c->list;
+
 		slqb_stat_inc(l, FLUSH_RFREE_LIST);
 		slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);
 	}
@@ -1472,6 +1512,7 @@ static void flush_remote_free_cache(stru
 	dst = c->remote_cache_list;
 
 	spin_lock(&dst->remote_free.lock);
+
 	if (!dst->remote_free.list.head)
 		dst->remote_free.list.head = src->head;
 	else
@@ -1500,7 +1541,9 @@ static void flush_remote_free_cache(stru
  *
  * Must be called with interrupts disabled.
  */
-static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c)
+static noinline void slab_free_to_remote(struct kmem_cache *s,
+				struct slqb_page *page, void *object,
+				struct kmem_cache_cpu *c)
 {
 	struct kmlist *r;
 
@@ -1526,14 +1569,14 @@ static noinline void slab_free_to_remote
 		flush_remote_free_cache(s, c);
 }
 #endif
- 
+
 /*
  * Main freeing path. Return an object, or NULL on allocation failure.
  *
  * Must be called with interrupts disabled.
  */
 static __always_inline void __slab_free(struct kmem_cache *s,
-		struct slqb_page *page, void *object)
+				struct slqb_page *page, void *object)
 {
 	struct kmem_cache_cpu *c;
 	struct kmem_cache_list *l;
@@ -1561,8 +1604,8 @@ static __always_inline void __slab_free(
 		if (unlikely(l->freelist.nr > slab_hiwater(s)))
 			flush_free_list(s, l);
 
-#ifdef CONFIG_NUMA
 	} else {
+#ifdef CONFIG_NUMA
 		/*
 		 * Freeing an object that was allocated on a remote node.
 		 */
@@ -1577,7 +1620,7 @@ static __always_inline void __slab_free(
  * (debug checking).
  */
 static __always_inline void slab_free(struct kmem_cache *s,
-		struct slqb_page *page, void *object)
+				struct slqb_page *page, void *object)
 {
 	unsigned long flags;
 
@@ -1597,6 +1640,7 @@ static __always_inline void slab_free(st
 void kmem_cache_free(struct kmem_cache *s, void *object)
 {
 	struct slqb_page *page = NULL;
+
 	if (numa_platform)
 		page = virt_to_head_slqb_page(object);
 	slab_free(s, page, object);
@@ -1610,7 +1654,7 @@ EXPORT_SYMBOL(kmem_cache_free);
  * in the page allocator, and they have fastpaths in the page allocator. But
  * also minimise external fragmentation with large objects.
  */
-static inline int slab_order(int size, int max_order, int frac)
+static int slab_order(int size, int max_order, int frac)
 {
 	int order;
 
@@ -1618,6 +1662,7 @@ static inline int slab_order(int size, i
 		order = 0;
 	else
 		order = fls(size - 1) - PAGE_SHIFT;
+
 	while (order <= max_order) {
 		unsigned long slab_size = PAGE_SIZE << order;
 		unsigned long objects;
@@ -1638,7 +1683,7 @@ static inline int slab_order(int size, i
 	return order;
 }
 
-static inline int calculate_order(int size)
+static int calculate_order(int size)
 {
 	int order;
 
@@ -1666,7 +1711,7 @@ static inline int calculate_order(int si
  * Figure out what the alignment of the objects will be.
  */
 static unsigned long calculate_alignment(unsigned long flags,
-		unsigned long align, unsigned long size)
+				unsigned long align, unsigned long size)
 {
 	/*
 	 * If the user wants hardware cache aligned objects then follow that
@@ -1677,6 +1722,7 @@ static unsigned long calculate_alignment
 	 */
 	if (flags & SLAB_HWCACHE_ALIGN) {
 		unsigned long ralign = cache_line_size();
+
 		while (size <= ralign / 2)
 			ralign /= 2;
 		align = max(align, ralign);
@@ -1688,21 +1734,21 @@ static unsigned long calculate_alignment
 	return ALIGN(align, sizeof(void *));
 }
 
-static void init_kmem_cache_list(struct kmem_cache *s, struct kmem_cache_list *l)
+static void init_kmem_cache_list(struct kmem_cache *s,
+				struct kmem_cache_list *l)
 {
-	l->cache = s;
-	l->freelist.nr = 0;
-	l->freelist.head = NULL;
-	l->freelist.tail = NULL;
-	l->nr_partial = 0;
-	l->nr_slabs = 0;
+	l->cache		= s;
+	l->freelist.nr		= 0;
+	l->freelist.head	= NULL;
+	l->freelist.tail	= NULL;
+	l->nr_partial		= 0;
+	l->nr_slabs		= 0;
 	INIT_LIST_HEAD(&l->partial);
-//	INIT_LIST_HEAD(&l->full);
 
 #ifdef CONFIG_SMP
-	l->remote_free_check = 0;
+	l->remote_free_check	= 0;
 	spin_lock_init(&l->remote_free.lock);
-	l->remote_free.list.nr = 0;
+	l->remote_free.list.nr	= 0;
 	l->remote_free.list.head = NULL;
 	l->remote_free.list.tail = NULL;
 #endif
@@ -1713,21 +1759,22 @@ static void init_kmem_cache_list(struct
 }
 
 static void init_kmem_cache_cpu(struct kmem_cache *s,
-			struct kmem_cache_cpu *c)
+				struct kmem_cache_cpu *c)
 {
 	init_kmem_cache_list(s, &c->list);
 
-	c->colour_next = 0;
+	c->colour_next		= 0;
 #ifdef CONFIG_SMP
-	c->rlist.nr = 0;
-	c->rlist.head = NULL;
-	c->rlist.tail = NULL;
-	c->remote_cache_list = NULL;
+	c->rlist.nr		= 0;
+	c->rlist.head		= NULL;
+	c->rlist.tail		= NULL;
+	c->remote_cache_list	= NULL;
 #endif
 }
 
 #ifdef CONFIG_NUMA
-static void init_kmem_cache_node(struct kmem_cache *s, struct kmem_cache_node *n)
+static void init_kmem_cache_node(struct kmem_cache *s,
+				struct kmem_cache_node *n)
 {
 	spin_lock_init(&n->list_lock);
 	init_kmem_cache_list(s, &n->list);
@@ -1757,7 +1804,8 @@ static struct kmem_cache_node kmem_node_
 #endif
 
 #ifdef CONFIG_SMP
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
+				int cpu)
 {
 	struct kmem_cache_cpu *c;
 
@@ -1918,14 +1966,15 @@ static int calculate_sizes(struct kmem_c
 	}
 
 #ifdef CONFIG_SLQB_DEBUG
-	if (flags & SLAB_STORE_USER)
+	if (flags & SLAB_STORE_USER) {
 		/*
 		 * Need to store information about allocs and frees after
 		 * the object.
 		 */
 		size += 2 * sizeof(struct track);
+	}
 
-	if (flags & SLAB_RED_ZONE)
+	if (flags & SLAB_RED_ZONE) {
 		/*
 		 * Add some empty padding so that we can catch
 		 * overwrites from earlier objects rather than let
@@ -1934,6 +1983,7 @@ static int calculate_sizes(struct kmem_c
 		 * of the object.
 		 */
 		size += sizeof(void *);
+	}
 #endif
 
 	/*
@@ -1970,7 +2020,8 @@ static int calculate_sizes(struct kmem_c
 	 */
 	s->objects = (PAGE_SIZE << s->order) / size;
 
-	s->freebatch = max(4UL*PAGE_SIZE / size, min(256UL, 64*PAGE_SIZE / size));
+	s->freebatch = max(4UL*PAGE_SIZE / size,
+				min(256UL, 64*PAGE_SIZE / size));
 	if (!s->freebatch)
 		s->freebatch = 1;
 	s->hiwater = s->freebatch << 2;
@@ -1980,9 +2031,8 @@ static int calculate_sizes(struct kmem_c
 }
 
 static int kmem_cache_open(struct kmem_cache *s,
-		const char *name, size_t size,
-		size_t align, unsigned long flags,
-		void (*ctor)(void *), int alloc)
+			const char *name, size_t size, size_t align,
+			unsigned long flags, void (*ctor)(void *), int alloc)
 {
 	unsigned int left_over;
 
@@ -2024,7 +2074,7 @@ error_nodes:
 	free_kmem_cache_nodes(s);
 error:
 	if (flags & SLAB_PANIC)
-		panic("kmem_cache_create(): failed to create slab `%s'\n",name);
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
 	return 0;
 }
 
@@ -2141,7 +2191,7 @@ EXPORT_SYMBOL(kmalloc_caches_dma);
 #endif
 
 static struct kmem_cache *open_kmalloc_cache(struct kmem_cache *s,
-		const char *name, int size, gfp_t gfp_flags)
+				const char *name, int size, gfp_t gfp_flags)
 {
 	unsigned int flags = ARCH_KMALLOC_FLAGS | SLAB_PANIC;
 
@@ -2446,10 +2496,10 @@ static int __init cpucache_init(void)
 
 	for_each_online_cpu(cpu)
 		start_cpu_timer(cpu);
+
 	return 0;
 }
-__initcall(cpucache_init);
-
+device_initcall(cpucache_init);
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
 static void slab_mem_going_offline_callback(void *arg)
@@ -2459,29 +2509,7 @@ static void slab_mem_going_offline_callb
 
 static void slab_mem_offline_callback(void *arg)
 {
-	struct kmem_cache *s;
-	struct memory_notify *marg = arg;
-	int nid = marg->status_change_nid;
-
-	/*
-	 * If the node still has available memory. we need kmem_cache_node
-	 * for it yet.
-	 */
-	if (nid < 0)
-		return;
-
-#if 0 // XXX: see cpu offline comment
-	down_read(&slqb_lock);
-	list_for_each_entry(s, &slab_caches, list) {
-		struct kmem_cache_node *n;
-		n = s->node[nid];
-		if (n) {
-			s->node[nid] = NULL;
-			kmem_cache_free(&kmem_node_cache, n);
-		}
-	}
-	up_read(&slqb_lock);
-#endif
+	/* XXX: should release structures, see CPU offline comment */
 }
 
 static int slab_mem_going_online_callback(void *arg)
@@ -2562,6 +2590,10 @@ void __init kmem_cache_init(void)
 	int i;
 	unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
 
+	/*
+	 * All the ifdefs are rather ugly here, but it's just the setup code,
+	 * so it doesn't have to be too readable :)
+	 */
 #ifdef CONFIG_NUMA
 	if (num_possible_nodes() == 1)
 		numa_platform = 0;
@@ -2576,12 +2608,15 @@ void __init kmem_cache_init(void)
 	kmem_size = sizeof(struct kmem_cache);
 #endif
 
-	kmem_cache_open(&kmem_cache_cache, "kmem_cache", kmem_size, 0, flags, NULL, 0);
+	kmem_cache_open(&kmem_cache_cache, "kmem_cache",
+			kmem_size, 0, flags, NULL, 0);
 #ifdef CONFIG_SMP
-	kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu", sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
+	kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu",
+			sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
 #endif
 #ifdef CONFIG_NUMA
-	kmem_cache_open(&kmem_node_cache, "kmem_cache_node", sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
+	kmem_cache_open(&kmem_node_cache, "kmem_cache_node",
+			sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
 #endif
 
 #ifdef CONFIG_SMP
@@ -2634,14 +2669,13 @@ void __init kmem_cache_init(void)
 
 	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
 		open_kmalloc_cache(&kmalloc_caches[i],
-			"kmalloc", 1 << i, GFP_KERNEL);
+				"kmalloc", 1 << i, GFP_KERNEL);
 #ifdef CONFIG_ZONE_DMA
 		open_kmalloc_cache(&kmalloc_caches_dma[i],
 				"kmalloc_dma", 1 << i, GFP_KERNEL|SLQB_DMA);
 #endif
 	}
 
-
 	/*
 	 * Patch up the size_index table if we have strange large alignment
 	 * requirements for the kmalloc array. This is only the case for
@@ -2697,10 +2731,12 @@ static int kmem_cache_create_ok(const ch
 		printk(KERN_ERR "kmem_cache_create(): early error in slab %s\n",
 				name);
 		dump_stack();
+
 		return 0;
 	}
 
 	down_read(&slqb_lock);
+
 	list_for_each_entry(tmp, &slab_caches, list) {
 		char x;
 		int res;
@@ -2723,9 +2759,11 @@ static int kmem_cache_create_ok(const ch
 			       "kmem_cache_create(): duplicate cache %s\n", name);
 			dump_stack();
 			up_read(&slqb_lock);
+
 			return 0;
 		}
 	}
+
 	up_read(&slqb_lock);
 
 	WARN_ON(strchr(name, ' '));	/* It confuses parsers */
@@ -2754,7 +2792,8 @@ struct kmem_cache *kmem_cache_create(con
 
 err:
 	if (flags & SLAB_PANIC)
-		panic("kmem_cache_create(): failed to create slab `%s'\n",name);
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+
 	return NULL;
 }
 EXPORT_SYMBOL(kmem_cache_create);
@@ -2765,7 +2804,7 @@ EXPORT_SYMBOL(kmem_cache_create);
  * necessary.
  */
 static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
-		unsigned long action, void *hcpu)
+				unsigned long action, void *hcpu)
 {
 	long cpu = (long)hcpu;
 	struct kmem_cache *s;
@@ -2803,23 +2842,12 @@ static int __cpuinit slab_cpuup_callback
 	case CPU_UP_CANCELED_FROZEN:
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
-#if 0
-		down_read(&slqb_lock);
-		/* XXX: this doesn't work because objects can still be on this
-		 * CPU's list. periodic timer needs to check if a CPU is offline
-		 * and then try to cleanup from there. Same for node offline.
+		/*
+		 * XXX: Freeing here doesn't work because objects can still be
+		 * on this CPU's list. periodic timer needs to check if a CPU
+		 * is offline and then try to cleanup from there. Same for node
+		 * offline.
 		 */
-		list_for_each_entry(s, &slab_caches, list) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-			if (c) {
-				kmem_cache_free(&kmem_cpu_cache, c);
-				s->cpu_slab[cpu] = NULL;
-			}
-		}
-
-		up_read(&slqb_lock);
-#endif
-		break;
 	default:
 		break;
 	}
@@ -2904,9 +2932,8 @@ static void __gather_stats(void *arg)
 	gather->nr_partial += nr_partial;
 	gather->nr_inuse += nr_inuse;
 #ifdef CONFIG_SLQB_STATS
-	for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+	for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
 		gather->stats[i] += l->stats[i];
-	}
 #endif
 	spin_unlock(&gather->lock);
 }
@@ -2935,9 +2962,8 @@ static void gather_stats(struct kmem_cac
 
 		spin_lock_irqsave(&n->list_lock, flags);
 #ifdef CONFIG_SLQB_STATS
-		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
 			stats->stats[i] += l->stats[i];
-		}
 #endif
 		stats->nr_slabs += l->nr_slabs;
 		stats->nr_partial += l->nr_partial;
@@ -3007,10 +3033,11 @@ static int s_show(struct seq_file *m, vo
 	gather_stats(s, &stats);
 
 	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, stats.nr_inuse,
-		   stats.nr_objects, s->size, s->objects, (1 << s->order));
-	seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s), slab_freebatch(s), 0);
-	seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs, stats.nr_slabs,
-		   0UL);
+			stats.nr_objects, s->size, s->objects, (1 << s->order));
+	seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s),
+			slab_freebatch(s), 0);
+	seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs,
+			stats.nr_slabs, 0UL);
 	seq_putc(m, '\n');
 	return 0;
 }
@@ -3036,7 +3063,8 @@ static const struct file_operations proc
 
 static int __init slab_proc_init(void)
 {
-	proc_create("slabinfo",S_IWUSR|S_IRUGO,NULL,&proc_slabinfo_operations);
+	proc_create("slabinfo", S_IWUSR|S_IRUGO, NULL,
+			&proc_slabinfo_operations);
 	return 0;
 }
 module_init(slab_proc_init);
@@ -3106,7 +3134,9 @@ SLAB_ATTR_RO(ctor);
 static ssize_t slabs_show(struct kmem_cache *s, char *buf)
 {
 	struct stats_gather stats;
+
 	gather_stats(s, &stats);
+
 	return sprintf(buf, "%lu\n", stats.nr_slabs);
 }
 SLAB_ATTR_RO(slabs);
@@ -3114,7 +3144,9 @@ SLAB_ATTR_RO(slabs);
 static ssize_t objects_show(struct kmem_cache *s, char *buf)
 {
 	struct stats_gather stats;
+
 	gather_stats(s, &stats);
+
 	return sprintf(buf, "%lu\n", stats.nr_inuse);
 }
 SLAB_ATTR_RO(objects);
@@ -3122,7 +3154,9 @@ SLAB_ATTR_RO(objects);
 static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
 {
 	struct stats_gather stats;
+
 	gather_stats(s, &stats);
+
 	return sprintf(buf, "%lu\n", stats.nr_objects);
 }
 SLAB_ATTR_RO(total_objects);
@@ -3171,7 +3205,8 @@ static ssize_t store_user_show(struct km
 }
 SLAB_ATTR_RO(store_user);
 
-static ssize_t hiwater_store(struct kmem_cache *s, const char *buf, size_t length)
+static ssize_t hiwater_store(struct kmem_cache *s,
+				const char *buf, size_t length)
 {
 	long hiwater;
 	int err;
@@ -3194,7 +3229,8 @@ static ssize_t hiwater_show(struct kmem_
 }
 SLAB_ATTR(hiwater);
 
-static ssize_t freebatch_store(struct kmem_cache *s, const char *buf, size_t length)
+static ssize_t freebatch_store(struct kmem_cache *s,
+				const char *buf, size_t length)
 {
 	long freebatch;
 	int err;
@@ -3216,6 +3252,7 @@ static ssize_t freebatch_show(struct kme
 	return sprintf(buf, "%d\n", slab_freebatch(s));
 }
 SLAB_ATTR(freebatch);
+
 #ifdef CONFIG_SLQB_STATS
 static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
 {
@@ -3233,8 +3270,9 @@ static int show_stat(struct kmem_cache *
 	for_each_online_cpu(cpu) {
 		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
 		struct kmem_cache_list *l = &c->list;
+
 		if (len < PAGE_SIZE - 20)
-			len += sprintf(buf + len, " C%d=%lu", cpu, l->stats[si]);
+			len += sprintf(buf+len, " C%d=%lu", cpu, l->stats[si]);
 	}
 #endif
 	return len + sprintf(buf + len, "\n");
@@ -3308,8 +3346,7 @@ static struct attribute_group slab_attr_
 };
 
 static ssize_t slab_attr_show(struct kobject *kobj,
-				struct attribute *attr,
-				char *buf)
+				struct attribute *attr, char *buf)
 {
 	struct slab_attribute *attribute;
 	struct kmem_cache *s;
@@ -3327,8 +3364,7 @@ static ssize_t slab_attr_show(struct kob
 }
 
 static ssize_t slab_attr_store(struct kobject *kobj,
-				struct attribute *attr,
-				const char *buf, size_t len)
+			struct attribute *attr, const char *buf, size_t len)
 {
 	struct slab_attribute *attribute;
 	struct kmem_cache *s;
@@ -3396,6 +3432,7 @@ static int sysfs_slab_add(struct kmem_ca
 	err = sysfs_create_group(&s->kobj, &slab_attr_group);
 	if (err)
 		return err;
+
 	kobject_uevent(&s->kobj, KOBJ_ADD);
 
 	return 0;
@@ -3420,17 +3457,20 @@ static int __init slab_sysfs_init(void)
 	}
 
 	down_write(&slqb_lock);
+
 	sysfs_available = 1;
+
 	list_for_each_entry(s, &slab_caches, list) {
 		err = sysfs_slab_add(s);
 		if (err)
 			printk(KERN_ERR "SLQB: Unable to add boot slab %s"
 						" to sysfs\n", s->name);
 	}
+
 	up_write(&slqb_lock);
 
 	return 0;
 }
+device_initcall(slab_sysfs_init);
 
-__initcall(slab_sysfs_init);
 #endif


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-21 16:56     ` Nick Piggin
@ 2009-01-21 17:40       ` Ingo Molnar
  -1 siblings, 0 replies; 197+ messages in thread
From: Ingo Molnar @ 2009-01-21 17:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter


* Nick Piggin <npiggin@suse.de> wrote:

> On Wed, Jan 21, 2009 at 03:59:18PM +0100, Ingo Molnar wrote:
> > 
> > Mind if i nitpick a bit about minor style issues? Since this is going to 
> > be the next Linux SLAB allocator we might as well do it perfectly :-)
> 
> Well here is an incremental patch which should get most of the issues 
> you pointed out, most of the sane ones that checkpatch pointed out, and 
> a few of my own ;)

here's an incremental one on top of your incremental patch, addressing some
more issues. I now find the code very readable! :-)
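
For readers skimming the diff below, two of the recurring patterns account for
most of the churn: prototypes that would overflow 80 columns get their return
type moved onto its own line, and structure initializers/assignments are
aligned into columns. A minimal standalone sketch of that style (every
identifier here is invented for illustration and does not exist in mm/slqb.c):

#include <stddef.h>

/* Hypothetical ops table, only to demonstrate column-aligned members: */
struct demo_ops {
	void	*(*alloc)(size_t size);
	void	 (*free)(void *object);
};

/*
 * Pattern 1: the return type sits on its own line when the prototype
 * would otherwise need an awkwardly wrapped argument list.
 */
static void *
demo_cache_alloc(struct demo_ops *ops, size_t size, unsigned long gfpflags)
{
	(void)gfpflags;			/* unused in this sketch */
	return ops->alloc(size);
}

/* Pattern 2: designated initializers lined up in a column. */
static struct demo_ops demo_default_ops = {
	.alloc		= NULL,
	.free		= NULL,
};

The changes being purely cosmetic is consistent with the near-identical
mm/slqb.o size quoted in the changelog below.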

( In case you are wondering about the placement of bit_spinlock.h: that
  file needs fixing; move it to the top of the file and watch the build
  break. But that's a separate patch.)

	Ingo

------------------->
Subject: slqb: cleanup
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Jan 21 18:10:20 CET 2009

mm/slqb.o:

   text	   data	    bss	    dec	    hex	filename
  17655	  54159	 200456	 272270	  4278e	slqb.o.before
  17653	  54159	 200456	 272268	  4278c	slqb.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 mm/slqb.c |  588 ++++++++++++++++++++++++++++++++------------------------------
 1 file changed, 308 insertions(+), 280 deletions(-)

Index: linux/mm/slqb.c
===================================================================
--- linux.orig/mm/slqb.c
+++ linux/mm/slqb.c
@@ -7,19 +7,20 @@
  * Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
  */
 
-#include <linux/mm.h>
-#include <linux/module.h>
-#include <linux/bit_spinlock.h>
 #include <linux/interrupt.h>
-#include <linux/bitops.h>
-#include <linux/slab.h>
-#include <linux/seq_file.h>
-#include <linux/cpu.h>
-#include <linux/cpuset.h>
 #include <linux/mempolicy.h>
-#include <linux/ctype.h>
 #include <linux/kallsyms.h>
+#include <linux/seq_file.h>
+#include <linux/bitops.h>
+#include <linux/cpuset.h>
 #include <linux/memory.h>
+#include <linux/module.h>
+#include <linux/ctype.h>
+#include <linux/slab.h>
+#include <linux/cpu.h>
+#include <linux/mm.h>
+
+#include <linux/bit_spinlock.h>
 
 /*
  * TODO
@@ -40,14 +41,14 @@
 struct slqb_page {
 	union {
 		struct {
-			unsigned long	flags;		/* mandatory */
-			atomic_t	_count;		/* mandatory */
-			unsigned int	inuse;		/* Nr of objects */
+			unsigned long	flags;		/* mandatory	   */
+			atomic_t	_count;		/* mandatory	   */
+			unsigned int	inuse;		/* Nr of objects   */
 			struct kmem_cache_list *list;	/* Pointer to list */
-			void		 **freelist;	/* LIFO freelist */
+			void		 **freelist;	/* LIFO freelist   */
 			union {
-				struct list_head lru;	/* misc. list */
-				struct rcu_head rcu_head; /* for rcu freeing */
+				struct list_head lru;	/* misc. list	   */
+				struct rcu_head	rcu_head; /* for rcu freeing */
 			};
 		};
 		struct page page;
@@ -120,16 +121,16 @@ static inline int slab_freebatch(struct 
  * - There is no remote free queue. Nodes don't free objects, CPUs do.
  */
 
-static inline void slqb_stat_inc(struct kmem_cache_list *list,
-				enum stat_item si)
+static inline void
+slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
 {
 #ifdef CONFIG_SLQB_STATS
 	list->stats[si]++;
 #endif
 }
 
-static inline void slqb_stat_add(struct kmem_cache_list *list,
-				enum stat_item si, unsigned long nr)
+static inline void
+slqb_stat_add(struct kmem_cache_list *list, enum stat_item si, unsigned long nr)
 {
 #ifdef CONFIG_SLQB_STATS
 	list->stats[si] += nr;
@@ -196,12 +197,12 @@ static inline void __free_slqb_pages(str
 #ifdef CONFIG_SLQB_DEBUG
 static inline int slab_debug(struct kmem_cache *s)
 {
-	return (s->flags &
+	return s->flags &
 			(SLAB_DEBUG_FREE |
 			 SLAB_RED_ZONE |
 			 SLAB_POISON |
 			 SLAB_STORE_USER |
-			 SLAB_TRACE));
+			 SLAB_TRACE);
 }
 static inline int slab_poison(struct kmem_cache *s)
 {
@@ -574,34 +575,34 @@ static int check_bytes_and_report(struct
  * Object layout:
  *
  * object address
- * 	Bytes of the object to be managed.
- * 	If the freepointer may overlay the object then the free
- * 	pointer is the first word of the object.
+ *	Bytes of the object to be managed.
+ *	If the freepointer may overlay the object then the free
+ *	pointer is the first word of the object.
  *
- * 	Poisoning uses 0x6b (POISON_FREE) and the last byte is
- * 	0xa5 (POISON_END)
+ *	Poisoning uses 0x6b (POISON_FREE) and the last byte is
+ *	0xa5 (POISON_END)
  *
  * object + s->objsize
- * 	Padding to reach word boundary. This is also used for Redzoning.
- * 	Padding is extended by another word if Redzoning is enabled and
- * 	objsize == inuse.
+ *	Padding to reach word boundary. This is also used for Redzoning.
+ *	Padding is extended by another word if Redzoning is enabled and
+ *	objsize == inuse.
  *
- * 	We fill with 0xbb (RED_INACTIVE) for inactive objects and with
- * 	0xcc (RED_ACTIVE) for objects in use.
+ *	We fill with 0xbb (RED_INACTIVE) for inactive objects and with
+ *	0xcc (RED_ACTIVE) for objects in use.
  *
  * object + s->inuse
- * 	Meta data starts here.
+ *	Meta data starts here.
  *
- * 	A. Free pointer (if we cannot overwrite object on free)
- * 	B. Tracking data for SLAB_STORE_USER
- * 	C. Padding to reach required alignment boundary or at mininum
- * 		one word if debuggin is on to be able to detect writes
- * 		before the word boundary.
+ *	A. Free pointer (if we cannot overwrite object on free)
+ *	B. Tracking data for SLAB_STORE_USER
+ *	C. Padding to reach required alignment boundary or at mininum
+ *		one word if debuggin is on to be able to detect writes
+ *		before the word boundary.
  *
  *	Padding is done using 0x5a (POISON_INUSE)
  *
  * object + s->size
- * 	Nothing is used beyond s->size.
+ *	Nothing is used beyond s->size.
  */
 
 static int check_pad_bytes(struct kmem_cache *s, struct slqb_page *page, u8 *p)
@@ -717,25 +718,26 @@ static int check_slab(struct kmem_cache 
 	return 1;
 }
 
-static void trace(struct kmem_cache *s, struct slqb_page *page,
-			void *object, int alloc)
+static void
+trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
 {
-	if (s->flags & SLAB_TRACE) {
-		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
-			s->name,
-			alloc ? "alloc" : "free",
-			object, page->inuse,
-			page->freelist);
+	if (likely(!(s->flags & SLAB_TRACE)))
+		return;
 
-		if (!alloc)
-			print_section("Object", (void *)object, s->objsize);
+	printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+		s->name,
+		alloc ? "alloc" : "free",
+		object, page->inuse,
+		page->freelist);
 
-		dump_stack();
-	}
+	if (!alloc)
+		print_section("Object", (void *)object, s->objsize);
+
+	dump_stack();
 }
 
-static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
-				void *object)
+static void
+setup_object_debug(struct kmem_cache *s, struct slqb_page *page, void *object)
 {
 	if (!slab_debug(s))
 		return;
@@ -747,11 +749,10 @@ static void setup_object_debug(struct km
 	init_tracking(s, object);
 }
 
-static int alloc_debug_processing(struct kmem_cache *s,
-					void *object, void *addr)
+static int
+alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
 {
-	struct slqb_page *page;
-	page = virt_to_head_slqb_page(object);
+	struct slqb_page *page = virt_to_head_slqb_page(object);
 
 	if (!check_slab(s, page))
 		goto bad;
@@ -767,6 +768,7 @@ static int alloc_debug_processing(struct
 	/* Success perform special debug activities for allocs */
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_ALLOC, addr);
+
 	trace(s, page, object, 1);
 	init_object(s, object, 1);
 	return 1;
@@ -775,11 +777,9 @@ bad:
 	return 0;
 }
 
-static int free_debug_processing(struct kmem_cache *s,
-					void *object, void *addr)
+static int free_debug_processing(struct kmem_cache *s, void *object, void *addr)
 {
-	struct slqb_page *page;
-	page = virt_to_head_slqb_page(object);
+	struct slqb_page *page = virt_to_head_slqb_page(object);
 
 	if (!check_slab(s, page))
 		goto fail;
@@ -870,29 +870,34 @@ static unsigned long kmem_cache_flags(un
 				void (*ctor)(void *))
 {
 	/*
-	 * Enable debugging if selected on the kernel commandline.
+	 * Enable debugging if selected on the kernel commandline:
 	 */
-	if (slqb_debug && (!slqb_debug_slabs ||
-	    strncmp(slqb_debug_slabs, name,
-		strlen(slqb_debug_slabs)) == 0))
-			flags |= slqb_debug;
+
+	if (!slqb_debug)
+		return flags;
+
+	if (slqb_debug_slabs)
+		return flags | slqb_debug;
+
+	if (!strncmp(slqb_debug_slabs, name, strlen(slqb_debug_slabs)))
+		return flags | slqb_debug;
 
 	return flags;
 }
 #else
-static inline void setup_object_debug(struct kmem_cache *s,
-			struct slqb_page *page, void *object)
+static inline void
+setup_object_debug(struct kmem_cache *s, struct slqb_page *page, void *object)
 {
 }
 
-static inline int alloc_debug_processing(struct kmem_cache *s,
-			void *object, void *addr)
+static inline int
+alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
 {
 	return 0;
 }
 
-static inline int free_debug_processing(struct kmem_cache *s,
-			void *object, void *addr)
+static inline int
+free_debug_processing(struct kmem_cache *s, void *object, void *addr)
 {
 	return 0;
 }
@@ -903,7 +908,7 @@ static inline int slab_pad_check(struct 
 }
 
 static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
-			void *object, int active)
+			       void *object, int active)
 {
 	return 1;
 }
@@ -924,11 +929,11 @@ static const int slqb_debug = 0;
 /*
  * allocate a new slab (return its corresponding struct slqb_page)
  */
-static struct slqb_page *allocate_slab(struct kmem_cache *s,
-					gfp_t flags, int node)
+static struct slqb_page *
+allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
-	struct slqb_page *page;
 	int pages = 1 << s->order;
+	struct slqb_page *page;
 
 	flags |= s->allocflags;
 
@@ -947,8 +952,8 @@ static struct slqb_page *allocate_slab(s
 /*
  * Called once for each object on a new slab page
  */
-static void setup_object(struct kmem_cache *s,
-				struct slqb_page *page, void *object)
+static void
+setup_object(struct kmem_cache *s, struct slqb_page *page, void *object)
 {
 	setup_object_debug(s, page, object);
 	if (unlikely(s->ctor))
@@ -958,8 +963,8 @@ static void setup_object(struct kmem_cac
 /*
  * Allocate a new slab, set up its object list.
  */
-static struct slqb_page *new_slab_page(struct kmem_cache *s,
-				gfp_t flags, int node, unsigned int colour)
+static struct slqb_page *
+new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
 {
 	struct slqb_page *page;
 	void *start;
@@ -1030,6 +1035,7 @@ static void rcu_free_slab(struct rcu_hea
 static void free_slab(struct kmem_cache *s, struct slqb_page *page)
 {
 	VM_BUG_ON(page->inuse);
+
 	if (unlikely(s->flags & SLAB_DESTROY_BY_RCU))
 		call_rcu(&page->rcu_head, rcu_free_slab);
 	else
@@ -1060,12 +1066,14 @@ static int free_object_to_page(struct km
 		l->nr_slabs--;
 		free_slab(s, page);
 		slqb_stat_inc(l, FLUSH_SLAB_FREE);
+
 		return 1;
 
 	} else if (page->inuse + 1 == s->objects) {
 		l->nr_partial++;
 		list_add(&page->lru, &l->partial);
 		slqb_stat_inc(l, FLUSH_SLAB_PARTIAL);
+
 		return 0;
 	}
 	return 0;
@@ -1146,8 +1154,8 @@ static void flush_free_list_all(struct k
  * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
  * list_lock in the case of per-node list.
  */
-static void claim_remote_free_list(struct kmem_cache *s,
-					struct kmem_cache_list *l)
+static void
+claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
 {
 	void **head, **tail;
 	int nr;
@@ -1192,8 +1200,8 @@ static void claim_remote_free_list(struc
  * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
  * list_lock in the case of per-node list.
  */
-static __always_inline void *__cache_list_get_object(struct kmem_cache *s,
-						struct kmem_cache_list *l)
+static __always_inline void *
+__cache_list_get_object(struct kmem_cache *s, struct kmem_cache_list *l)
 {
 	void *object;
 
@@ -1243,8 +1251,8 @@ static __always_inline void *__cache_lis
  * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
  * list_lock in the case of per-node list.
  */
-static noinline void *__cache_list_get_page(struct kmem_cache *s,
-				struct kmem_cache_list *l)
+static noinline void *
+__cache_list_get_page(struct kmem_cache *s, struct kmem_cache_list *l)
 {
 	struct slqb_page *page;
 	void *object;
@@ -1282,12 +1290,12 @@ static noinline void *__cache_list_get_p
  *
  * Must be called with interrupts disabled.
  */
-static noinline void *__slab_alloc_page(struct kmem_cache *s,
-				gfp_t gfpflags, int node)
+static noinline void *
+__slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)
 {
-	struct slqb_page *page;
 	struct kmem_cache_list *l;
 	struct kmem_cache_cpu *c;
+	struct slqb_page *page;
 	unsigned int colour;
 	void *object;
 
@@ -1347,15 +1355,19 @@ static noinline void *__slab_alloc_page(
 }
 
 #ifdef CONFIG_NUMA
-static noinline int alternate_nid(struct kmem_cache *s,
-				gfp_t gfpflags, int node)
+static noinline int
+alternate_nid(struct kmem_cache *s, gfp_t gfpflags, int node)
 {
 	if (in_interrupt() || (gfpflags & __GFP_THISNODE))
 		return node;
-	if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD))
+
+	if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD)) {
 		return cpuset_mem_spread_node();
-	else if (current->mempolicy)
-		return slab_node(current->mempolicy);
+	} else {
+		if (current->mempolicy)
+			return slab_node(current->mempolicy);
+	}
+
 	return node;
 }
 
@@ -1365,8 +1377,8 @@ static noinline int alternate_nid(struct
  *
  * Must be called with interrupts disabled.
  */
-static noinline void *__remote_slab_alloc(struct kmem_cache *s,
-				gfp_t gfpflags, int node)
+static noinline void *
+__remote_slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node)
 {
 	struct kmem_cache_node *n;
 	struct kmem_cache_list *l;
@@ -1375,6 +1387,7 @@ static noinline void *__remote_slab_allo
 	n = s->node[node];
 	if (unlikely(!n)) /* node has no memory */
 		return NULL;
+
 	l = &n->list;
 
 	spin_lock(&n->list_lock);
@@ -1389,7 +1402,9 @@ static noinline void *__remote_slab_allo
 	}
 	if (likely(object))
 		slqb_stat_inc(l, ALLOC);
+
 	spin_unlock(&n->list_lock);
+
 	return object;
 }
 #endif
@@ -1399,12 +1414,12 @@ static noinline void *__remote_slab_allo
  *
  * Must be called with interrupts disabled.
  */
-static __always_inline void *__slab_alloc(struct kmem_cache *s,
-				gfp_t gfpflags, int node)
+static __always_inline void *
+__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node)
 {
-	void *object;
-	struct kmem_cache_cpu *c;
 	struct kmem_cache_list *l;
+	struct kmem_cache_cpu *c;
+	void *object;
 
 #ifdef CONFIG_NUMA
 	if (unlikely(node != -1) && unlikely(node != numa_node_id()))
@@ -1422,6 +1437,7 @@ static __always_inline void *__slab_allo
 	}
 	if (likely(object))
 		slqb_stat_inc(l, ALLOC);
+
 	return object;
 }
 
@@ -1429,11 +1445,11 @@ static __always_inline void *__slab_allo
  * Perform some interrupts-on processing around the main allocation path
  * (debug checking and memset()ing).
  */
-static __always_inline void *slab_alloc(struct kmem_cache *s,
-				gfp_t gfpflags, int node, void *addr)
+static __always_inline void *
+slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, void *addr)
 {
-	void *object;
 	unsigned long flags;
+	void *object;
 
 again:
 	local_irq_save(flags);
@@ -1451,10 +1467,11 @@ again:
 	return object;
 }
 
-static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s,
-				gfp_t gfpflags, void *caller)
+static __always_inline void *
+__kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags, void *caller)
 {
 	int node = -1;
+
 #ifdef CONFIG_NUMA
 	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
 		node = alternate_nid(s, gfpflags, node);
@@ -1487,8 +1504,8 @@ EXPORT_SYMBOL(kmem_cache_alloc_node);
  *
  * Must be called with interrupts disabled.
  */
-static void flush_remote_free_cache(struct kmem_cache *s,
-				struct kmem_cache_cpu *c)
+static void
+flush_remote_free_cache(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	struct kmlist *src;
 	struct kmem_cache_list *dst;
@@ -1575,12 +1592,12 @@ static noinline void slab_free_to_remote
  *
  * Must be called with interrupts disabled.
  */
-static __always_inline void __slab_free(struct kmem_cache *s,
-				struct slqb_page *page, void *object)
+static __always_inline void
+__slab_free(struct kmem_cache *s, struct slqb_page *page, void *object)
 {
-	struct kmem_cache_cpu *c;
-	struct kmem_cache_list *l;
 	int thiscpu = smp_processor_id();
+	struct kmem_cache_list *l;
+	struct kmem_cache_cpu *c;
 
 	c = get_cpu_slab(s, thiscpu);
 	l = &c->list;
@@ -1619,8 +1636,8 @@ static __always_inline void __slab_free(
  * Perform some interrupts-on processing around the main freeing path
  * (debug checking).
  */
-static __always_inline void slab_free(struct kmem_cache *s,
-				struct slqb_page *page, void *object)
+static __always_inline void
+slab_free(struct kmem_cache *s, struct slqb_page *page, void *object)
 {
 	unsigned long flags;
 
@@ -1683,7 +1700,7 @@ static int slab_order(int size, int max_
 	return order;
 }
 
-static int calculate_order(int size)
+static int calc_order(int size)
 {
 	int order;
 
@@ -1710,8 +1727,8 @@ static int calculate_order(int size)
 /*
  * Figure out what the alignment of the objects will be.
  */
-static unsigned long calculate_alignment(unsigned long flags,
-				unsigned long align, unsigned long size)
+static unsigned long
+calc_alignment(unsigned long flags, unsigned long align, unsigned long size)
 {
 	/*
 	 * If the user wants hardware cache aligned objects then follow that
@@ -1737,18 +1754,18 @@ static unsigned long calculate_alignment
 static void init_kmem_cache_list(struct kmem_cache *s,
 				struct kmem_cache_list *l)
 {
-	l->cache		= s;
-	l->freelist.nr		= 0;
-	l->freelist.head	= NULL;
-	l->freelist.tail	= NULL;
-	l->nr_partial		= 0;
-	l->nr_slabs		= 0;
+	l->cache		 = s;
+	l->freelist.nr		 = 0;
+	l->freelist.head	 = NULL;
+	l->freelist.tail	 = NULL;
+	l->nr_partial		 = 0;
+	l->nr_slabs		 = 0;
 	INIT_LIST_HEAD(&l->partial);
 
 #ifdef CONFIG_SMP
-	l->remote_free_check	= 0;
+	l->remote_free_check	 = 0;
 	spin_lock_init(&l->remote_free.lock);
-	l->remote_free.list.nr	= 0;
+	l->remote_free.list.nr	 = 0;
 	l->remote_free.list.head = NULL;
 	l->remote_free.list.tail = NULL;
 #endif
@@ -1758,8 +1775,7 @@ static void init_kmem_cache_list(struct 
 #endif
 }
 
-static void init_kmem_cache_cpu(struct kmem_cache *s,
-				struct kmem_cache_cpu *c)
+static void init_kmem_cache_cpu(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	init_kmem_cache_list(s, &c->list);
 
@@ -1773,8 +1789,8 @@ static void init_kmem_cache_cpu(struct k
 }
 
 #ifdef CONFIG_NUMA
-static void init_kmem_cache_node(struct kmem_cache *s,
-				struct kmem_cache_node *n)
+static void
+init_kmem_cache_node(struct kmem_cache *s, struct kmem_cache_node *n)
 {
 	spin_lock_init(&n->list_lock);
 	init_kmem_cache_list(s, &n->list);
@@ -1804,8 +1820,8 @@ static struct kmem_cache_node kmem_node_
 #endif
 
 #ifdef CONFIG_SMP
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
-				int cpu)
+static struct kmem_cache_cpu *
+alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
 {
 	struct kmem_cache_cpu *c;
 
@@ -1910,7 +1926,7 @@ static int alloc_kmem_cache_nodes(struct
 #endif
 
 /*
- * calculate_sizes() determines the order and the distribution of data within
+ * calc_sizes() determines the order and the distribution of data within
  * a slab object.
  */
 static int calculate_sizes(struct kmem_cache *s)
@@ -1991,7 +2007,7 @@ static int calculate_sizes(struct kmem_c
 	 * user specified and the dynamic determination of cache line size
 	 * on bootup.
 	 */
-	align = calculate_alignment(flags, align, s->objsize);
+	align = calc_alignment(flags, align, s->objsize);
 
 	/*
 	 * SLQB stores one object immediately after another beginning from
@@ -2000,7 +2016,7 @@ static int calculate_sizes(struct kmem_c
 	 */
 	size = ALIGN(size, align);
 	s->size = size;
-	s->order = calculate_order(size);
+	s->order = calc_order(size);
 
 	if (s->order < 0)
 		return 0;
@@ -2210,38 +2226,38 @@ static struct kmem_cache *open_kmalloc_c
  * fls.
  */
 static s8 size_index[24] __cacheline_aligned = {
-	3,	/* 8 */
-	4,	/* 16 */
-	5,	/* 24 */
-	5,	/* 32 */
-	6,	/* 40 */
-	6,	/* 48 */
-	6,	/* 56 */
-	6,	/* 64 */
+	 3,	/* 8 */
+	 4,	/* 16 */
+	 5,	/* 24 */
+	 5,	/* 32 */
+	 6,	/* 40 */
+	 6,	/* 48 */
+	 6,	/* 56 */
+	 6,	/* 64 */
 #if L1_CACHE_BYTES < 64
-	1,	/* 72 */
-	1,	/* 80 */
-	1,	/* 88 */
-	1,	/* 96 */
+	 1,	/* 72 */
+	 1,	/* 80 */
+	 1,	/* 88 */
+	 1,	/* 96 */
 #else
-	7,
-	7,
-	7,
-	7,
-#endif
-	7,	/* 104 */
-	7,	/* 112 */
-	7,	/* 120 */
-	7,	/* 128 */
+	 7,
+	 7,
+	 7,
+	 7,
+#endif
+	 7,	/* 104 */
+	 7,	/* 112 */
+	 7,	/* 120 */
+	 7,	/* 128 */
 #if L1_CACHE_BYTES < 128
-	2,	/* 136 */
-	2,	/* 144 */
-	2,	/* 152 */
-	2,	/* 160 */
-	2,	/* 168 */
-	2,	/* 176 */
-	2,	/* 184 */
-	2	/* 192 */
+	 2,	/* 136 */
+	 2,	/* 144 */
+	 2,	/* 152 */
+	 2,	/* 160 */
+	 2,	/* 168 */
+	 2,	/* 176 */
+	 2,	/* 184 */
+	 2	/* 192 */
 #else
 	-1,
 	-1,
@@ -2278,9 +2294,8 @@ static struct kmem_cache *get_slab(size_
 
 void *__kmalloc(size_t size, gfp_t flags)
 {
-	struct kmem_cache *s;
+	struct kmem_cache *s = get_slab(size, flags);
 
-	s = get_slab(size, flags);
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
 		return s;
 
@@ -2291,9 +2306,8 @@ EXPORT_SYMBOL(__kmalloc);
 #ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
-	struct kmem_cache *s;
+	struct kmem_cache *s = get_slab(size, flags);
 
-	s = get_slab(size, flags);
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
 		return s;
 
@@ -2340,8 +2354,8 @@ EXPORT_SYMBOL(ksize);
 
 void kfree(const void *object)
 {
-	struct kmem_cache *s;
 	struct slqb_page *page;
+	struct kmem_cache *s;
 
 	if (unlikely(ZERO_OR_NULL_PTR(object)))
 		return;
@@ -2371,21 +2385,21 @@ static void kmem_cache_trim_percpu(void 
 
 int kmem_cache_shrink(struct kmem_cache *s)
 {
-#ifdef CONFIG_NUMA
-	int node;
-#endif
-
 	on_each_cpu(kmem_cache_trim_percpu, s, 1);
 
 #ifdef CONFIG_NUMA
-	for_each_node_state(node, N_NORMAL_MEMORY) {
-		struct kmem_cache_node *n = s->node[node];
-		struct kmem_cache_list *l = &n->list;
+	{
+		int node;
 
-		spin_lock_irq(&n->list_lock);
-		claim_remote_free_list(s, l);
-		flush_free_list(s, l);
-		spin_unlock_irq(&n->list_lock);
+		for_each_node_state(node, N_NORMAL_MEMORY) {
+			struct kmem_cache_node *n = s->node[node];
+			struct kmem_cache_list *l = &n->list;
+
+			spin_lock_irq(&n->list_lock);
+			claim_remote_free_list(s, l);
+			flush_free_list(s, l);
+			spin_unlock_irq(&n->list_lock);
+		}
 	}
 #endif
 
@@ -2397,8 +2411,8 @@ EXPORT_SYMBOL(kmem_cache_shrink);
 static void kmem_cache_reap_percpu(void *arg)
 {
 	int cpu = smp_processor_id();
-	struct kmem_cache *s;
 	long phase = (long)arg;
+	struct kmem_cache *s;
 
 	list_for_each_entry(s, &slab_caches, list) {
 		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
@@ -2442,8 +2456,7 @@ static void kmem_cache_reap(void)
 
 static void cache_trim_worker(struct work_struct *w)
 {
-	struct delayed_work *work =
-		container_of(w, struct delayed_work, work);
+	struct delayed_work *work;
 	struct kmem_cache *s;
 	int node;
 
@@ -2469,6 +2482,7 @@ static void cache_trim_worker(struct wor
 
 	up_read(&slqb_lock);
 out:
+	work = container_of(w, struct delayed_work, work);
 	schedule_delayed_work(work, round_jiffies_relative(3*HZ));
 }
 
@@ -2587,8 +2601,8 @@ static int slab_memory_callback(struct n
 
 void __init kmem_cache_init(void)
 {
-	int i;
 	unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
+	int i;
 
 	/*
 	 * All the ifdefs are rather ugly here, but it's just the setup code,
@@ -2719,8 +2733,9 @@ void __init kmem_cache_init(void)
 /*
  * Some basic slab creation sanity checks
  */
-static int kmem_cache_create_ok(const char *name, size_t size,
-		size_t align, unsigned long flags)
+static int
+kmem_cache_create_ok(const char *name, size_t size,
+		     size_t align, unsigned long flags)
 {
 	struct kmem_cache *tmp;
 
@@ -2773,8 +2788,9 @@ static int kmem_cache_create_ok(const ch
 	return 1;
 }
 
-struct kmem_cache *kmem_cache_create(const char *name, size_t size,
-		size_t align, unsigned long flags, void (*ctor)(void *))
+struct kmem_cache *
+kmem_cache_create(const char *name, size_t size,
+		  size_t align, unsigned long flags, void (*ctor)(void *))
 {
 	struct kmem_cache *s;
 
@@ -2804,7 +2820,7 @@ EXPORT_SYMBOL(kmem_cache_create);
  * necessary.
  */
 static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
-				unsigned long action, void *hcpu)
+					 unsigned long action, void *hcpu)
 {
 	long cpu = (long)hcpu;
 	struct kmem_cache *s;
@@ -2855,7 +2871,7 @@ static int __cpuinit slab_cpuup_callback
 }
 
 static struct notifier_block __cpuinitdata slab_notifier = {
-	.notifier_call = slab_cpuup_callback
+	.notifier_call	= slab_cpuup_callback
 };
 
 #endif
@@ -2878,11 +2894,10 @@ void *__kmalloc_track_caller(size_t size
 }
 
 void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
-				unsigned long caller)
+				  unsigned long caller)
 {
-	struct kmem_cache *s;
+	struct kmem_cache *s = get_slab(size, flags);
 
-	s = get_slab(size, flags);
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
 		return s;
 
@@ -2892,12 +2907,17 @@ void *__kmalloc_node_track_caller(size_t
 
 #if defined(CONFIG_SLQB_SYSFS) || defined(CONFIG_SLABINFO)
 struct stats_gather {
-	struct kmem_cache *s;
-	spinlock_t lock;
-	unsigned long nr_slabs;
-	unsigned long nr_partial;
-	unsigned long nr_inuse;
-	unsigned long nr_objects;
+	/*
+	 * Serialize on_each_cpu() instances updating the summary
+	 * stats structure:
+	 */
+	spinlock_t		lock;
+
+	struct kmem_cache	*s;
+	unsigned long		nr_slabs;
+	unsigned long		nr_partial;
+	unsigned long		nr_inuse;
+	unsigned long		nr_objects;
 
 #ifdef CONFIG_SLQB_STATS
 	unsigned long stats[NR_SLQB_STAT_ITEMS];
@@ -2915,25 +2935,25 @@ static void __gather_stats(void *arg)
 	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
 	struct kmem_cache_list *l = &c->list;
 	struct slqb_page *page;
-#ifdef CONFIG_SLQB_STATS
-	int i;
-#endif
 
 	nr_slabs = l->nr_slabs;
 	nr_partial = l->nr_partial;
 	nr_inuse = (nr_slabs - nr_partial) * s->objects;
 
-	list_for_each_entry(page, &l->partial, lru) {
+	list_for_each_entry(page, &l->partial, lru)
 		nr_inuse += page->inuse;
-	}
 
 	spin_lock(&gather->lock);
 	gather->nr_slabs += nr_slabs;
 	gather->nr_partial += nr_partial;
 	gather->nr_inuse += nr_inuse;
 #ifdef CONFIG_SLQB_STATS
-	for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
-		gather->stats[i] += l->stats[i];
+	{
+		int i;
+
+		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
+			gather->stats[i] += l->stats[i];
+	}
 #endif
 	spin_unlock(&gather->lock);
 }
@@ -2956,14 +2976,15 @@ static void gather_stats(struct kmem_cac
 		struct kmem_cache_list *l = &n->list;
 		struct slqb_page *page;
 		unsigned long flags;
-#ifdef CONFIG_SLQB_STATS
-		int i;
-#endif
 
 		spin_lock_irqsave(&n->list_lock, flags);
 #ifdef CONFIG_SLQB_STATS
-		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
-			stats->stats[i] += l->stats[i];
+		{
+			int i;
+
+			for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
+				stats->stats[i] += l->stats[i];
+		}
 #endif
 		stats->nr_slabs += l->nr_slabs;
 		stats->nr_partial += l->nr_partial;
@@ -3039,14 +3060,15 @@ static int s_show(struct seq_file *m, vo
 	seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs,
 			stats.nr_slabs, 0UL);
 	seq_putc(m, '\n');
+
 	return 0;
 }
 
 static const struct seq_operations slabinfo_op = {
-	.start = s_start,
-	.next = s_next,
-	.stop = s_stop,
-	.show = s_show,
+	.start		= s_start,
+	.next		= s_next,
+	.stop		= s_stop,
+	.show		= s_show,
 };
 
 static int slabinfo_open(struct inode *inode, struct file *file)
@@ -3205,8 +3227,8 @@ static ssize_t store_user_show(struct km
 }
 SLAB_ATTR_RO(store_user);
 
-static ssize_t hiwater_store(struct kmem_cache *s,
-				const char *buf, size_t length)
+static ssize_t
+hiwater_store(struct kmem_cache *s, const char *buf, size_t length)
 {
 	long hiwater;
 	int err;
@@ -3229,8 +3251,8 @@ static ssize_t hiwater_show(struct kmem_
 }
 SLAB_ATTR(hiwater);
 
-static ssize_t freebatch_store(struct kmem_cache *s,
-				const char *buf, size_t length)
+static ssize_t
+freebatch_store(struct kmem_cache *s, const char *buf, size_t length)
 {
 	long freebatch;
 	int err;
@@ -3258,91 +3280,95 @@ static int show_stat(struct kmem_cache *
 {
 	struct stats_gather stats;
 	int len;
-#ifdef CONFIG_SMP
-	int cpu;
-#endif
 
 	gather_stats(s, &stats);
 
 	len = sprintf(buf, "%lu", stats.stats[si]);
 
 #ifdef CONFIG_SMP
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-		struct kmem_cache_list *l = &c->list;
+	{
+		int cpu;
 
-		if (len < PAGE_SIZE - 20)
-			len += sprintf(buf+len, " C%d=%lu", cpu, l->stats[si]);
+		for_each_online_cpu(cpu) {
+			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+			struct kmem_cache_list *l = &c->list;
+
+			if (len < PAGE_SIZE - 20) {
+				len += sprintf(buf+len,
+						" C%d=%lu", cpu, l->stats[si]);
+			}
+		}
 	}
 #endif
 	return len + sprintf(buf + len, "\n");
 }
 
-#define STAT_ATTR(si, text) 					\
+#define STAT_ATTR(si, text)					\
 static ssize_t text##_show(struct kmem_cache *s, char *buf)	\
 {								\
 	return show_stat(s, buf, si);				\
 }								\
 SLAB_ATTR_RO(text);						\
 
-STAT_ATTR(ALLOC, alloc);
-STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
-STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
-STAT_ATTR(FREE, free);
-STAT_ATTR(FREE_REMOTE, free_remote);
-STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
-STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
-STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
-STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
-STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
-STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
-STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
-STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
-STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
+STAT_ATTR(ALLOC,			alloc);
+STAT_ATTR(ALLOC_SLAB_FILL,		alloc_slab_fill);
+STAT_ATTR(ALLOC_SLAB_NEW,		alloc_slab_new);
+STAT_ATTR(FREE,				free);
+STAT_ATTR(FREE_REMOTE,			free_remote);
+STAT_ATTR(FLUSH_FREE_LIST,		flush_free_list);
+STAT_ATTR(FLUSH_FREE_LIST_OBJECTS,	flush_free_list_objects);
+STAT_ATTR(FLUSH_FREE_LIST_REMOTE,	flush_free_list_remote);
+STAT_ATTR(FLUSH_SLAB_PARTIAL,		flush_slab_partial);
+STAT_ATTR(FLUSH_SLAB_FREE,		flush_slab_free);
+STAT_ATTR(FLUSH_RFREE_LIST,		flush_rfree_list);
+STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS,	flush_rfree_list_objects);
+STAT_ATTR(CLAIM_REMOTE_LIST,		claim_remote_list);
+STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS,	claim_remote_list_objects);
 #endif
 
 static struct attribute *slab_attrs[] = {
-	&slab_size_attr.attr,
-	&object_size_attr.attr,
-	&objs_per_slab_attr.attr,
-	&order_attr.attr,
-	&objects_attr.attr,
-	&total_objects_attr.attr,
-	&slabs_attr.attr,
-	&ctor_attr.attr,
-	&align_attr.attr,
-	&hwcache_align_attr.attr,
-	&reclaim_account_attr.attr,
-	&destroy_by_rcu_attr.attr,
-	&red_zone_attr.attr,
-	&poison_attr.attr,
-	&store_user_attr.attr,
-	&hiwater_attr.attr,
-	&freebatch_attr.attr,
+
+	&                 slab_size_attr.attr,
+	&               object_size_attr.attr,
+	&             objs_per_slab_attr.attr,
+	&                     order_attr.attr,
+	&                   objects_attr.attr,
+	&             total_objects_attr.attr,
+	&                     slabs_attr.attr,
+	&                      ctor_attr.attr,
+	&                     align_attr.attr,
+	&             hwcache_align_attr.attr,
+	&           reclaim_account_attr.attr,
+	&            destroy_by_rcu_attr.attr,
+	&                  red_zone_attr.attr,
+	&                    poison_attr.attr,
+	&                store_user_attr.attr,
+	&                   hiwater_attr.attr,
+	&                 freebatch_attr.attr,
 #ifdef CONFIG_ZONE_DMA
-	&cache_dma_attr.attr,
+	&                 cache_dma_attr.attr,
 #endif
 #ifdef CONFIG_SLQB_STATS
-	&alloc_attr.attr,
-	&alloc_slab_fill_attr.attr,
-	&alloc_slab_new_attr.attr,
-	&free_attr.attr,
-	&free_remote_attr.attr,
-	&flush_free_list_attr.attr,
-	&flush_free_list_objects_attr.attr,
-	&flush_free_list_remote_attr.attr,
-	&flush_slab_partial_attr.attr,
-	&flush_slab_free_attr.attr,
-	&flush_rfree_list_attr.attr,
-	&flush_rfree_list_objects_attr.attr,
-	&claim_remote_list_attr.attr,
-	&claim_remote_list_objects_attr.attr,
+	&                     alloc_attr.attr,
+	&           alloc_slab_fill_attr.attr,
+	&            alloc_slab_new_attr.attr,
+	&                      free_attr.attr,
+	&               free_remote_attr.attr,
+	&           flush_free_list_attr.attr,
+	&   flush_free_list_objects_attr.attr,
+	&    flush_free_list_remote_attr.attr,
+	&        flush_slab_partial_attr.attr,
+	&           flush_slab_free_attr.attr,
+	&          flush_rfree_list_attr.attr,
+	&  flush_rfree_list_objects_attr.attr,
+	&         claim_remote_list_attr.attr,
+	& claim_remote_list_objects_attr.attr,
 #endif
 	NULL
 };
 
 static struct attribute_group slab_attr_group = {
-	.attrs = slab_attrs,
+	.attrs		= slab_attrs,
 };
 
 static ssize_t slab_attr_show(struct kobject *kobj,
@@ -3389,13 +3415,13 @@ static void kmem_cache_release(struct ko
 }
 
 static struct sysfs_ops slab_sysfs_ops = {
-	.show = slab_attr_show,
-	.store = slab_attr_store,
+	.show		= slab_attr_show,
+	.store		= slab_attr_store,
 };
 
 static struct kobj_type slab_ktype = {
-	.sysfs_ops = &slab_sysfs_ops,
-	.release = kmem_cache_release
+	.sysfs_ops	= &slab_sysfs_ops,
+	.release	= kmem_cache_release
 };
 
 static int uevent_filter(struct kset *kset, struct kobject *kobj)
@@ -3413,7 +3439,7 @@ static struct kset_uevent_ops slab_ueven
 
 static struct kset *slab_kset;
 
-static int sysfs_available __read_mostly = 0;
+static int sysfs_available __read_mostly;
 
 static int sysfs_slab_add(struct kmem_cache *s)
 {
@@ -3462,9 +3488,11 @@ static int __init slab_sysfs_init(void)
 
 	list_for_each_entry(s, &slab_caches, list) {
 		err = sysfs_slab_add(s);
-		if (err)
-			printk(KERN_ERR "SLQB: Unable to add boot slab %s"
-						" to sysfs\n", s->name);
+		if (!err)
+			continue;
+
+		printk(KERN_ERR
+			"SLQB: Unable to add boot slab %s to sysfs\n", s->name);
 	}
 
 	up_write(&slqb_lock);

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
@ 2009-01-21 17:40       ` Ingo Molnar
  0 siblings, 0 replies; 197+ messages in thread
From: Ingo Molnar @ 2009-01-21 17:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter


* Nick Piggin <npiggin@suse.de> wrote:

> On Wed, Jan 21, 2009 at 03:59:18PM +0100, Ingo Molnar wrote:
> > 
> > Mind if i nitpick a bit about minor style issues? Since this is going to 
> > be the next Linux SLAB allocator we might as well do it perfectly :-)
> 
> Well here is an incremental patch which should get most of the issues 
> you pointed out, most of the sane ones that checkpatch pointed out, and 
> a few of my own ;)

here's an incremental one on top of your incremental patch, addressing some
more issues. I now find the code very readable! :-)

( In case you are wondering about the placement of bit_spinlock.h: that
  file needs fixing; move it to the top of the file and watch the build
  break. But that's a separate patch.)

	Ingo

------------------->
Subject: slqb: cleanup
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Jan 21 18:10:20 CET 2009

mm/slqb.o:

   text	   data	    bss	    dec	    hex	filename
  17655	  54159	 200456	 272270	  4278e	slqb.o.before
  17653	  54159	 200456	 272268	  4278c	slqb.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 mm/slqb.c |  588 ++++++++++++++++++++++++++++++++------------------------------
 1 file changed, 308 insertions(+), 280 deletions(-)

Index: linux/mm/slqb.c
===================================================================
--- linux.orig/mm/slqb.c
+++ linux/mm/slqb.c
@@ -7,19 +7,20 @@
  * Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
  */
 
-#include <linux/mm.h>
-#include <linux/module.h>
-#include <linux/bit_spinlock.h>
 #include <linux/interrupt.h>
-#include <linux/bitops.h>
-#include <linux/slab.h>
-#include <linux/seq_file.h>
-#include <linux/cpu.h>
-#include <linux/cpuset.h>
 #include <linux/mempolicy.h>
-#include <linux/ctype.h>
 #include <linux/kallsyms.h>
+#include <linux/seq_file.h>
+#include <linux/bitops.h>
+#include <linux/cpuset.h>
 #include <linux/memory.h>
+#include <linux/module.h>
+#include <linux/ctype.h>
+#include <linux/slab.h>
+#include <linux/cpu.h>
+#include <linux/mm.h>
+
+#include <linux/bit_spinlock.h>
 
 /*
  * TODO
@@ -40,14 +41,14 @@
 struct slqb_page {
 	union {
 		struct {
-			unsigned long	flags;		/* mandatory */
-			atomic_t	_count;		/* mandatory */
-			unsigned int	inuse;		/* Nr of objects */
+			unsigned long	flags;		/* mandatory	   */
+			atomic_t	_count;		/* mandatory	   */
+			unsigned int	inuse;		/* Nr of objects   */
 			struct kmem_cache_list *list;	/* Pointer to list */
-			void		 **freelist;	/* LIFO freelist */
+			void		 **freelist;	/* LIFO freelist   */
 			union {
-				struct list_head lru;	/* misc. list */
-				struct rcu_head rcu_head; /* for rcu freeing */
+				struct list_head lru;	/* misc. list	   */
+				struct rcu_head	rcu_head; /* for rcu freeing */
 			};
 		};
 		struct page page;
@@ -120,16 +121,16 @@ static inline int slab_freebatch(struct 
  * - There is no remote free queue. Nodes don't free objects, CPUs do.
  */
 
-static inline void slqb_stat_inc(struct kmem_cache_list *list,
-				enum stat_item si)
+static inline void
+slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
 {
 #ifdef CONFIG_SLQB_STATS
 	list->stats[si]++;
 #endif
 }
 
-static inline void slqb_stat_add(struct kmem_cache_list *list,
-				enum stat_item si, unsigned long nr)
+static inline void
+slqb_stat_add(struct kmem_cache_list *list, enum stat_item si, unsigned long nr)
 {
 #ifdef CONFIG_SLQB_STATS
 	list->stats[si] += nr;
@@ -196,12 +197,12 @@ static inline void __free_slqb_pages(str
 #ifdef CONFIG_SLQB_DEBUG
 static inline int slab_debug(struct kmem_cache *s)
 {
-	return (s->flags &
+	return s->flags &
 			(SLAB_DEBUG_FREE |
 			 SLAB_RED_ZONE |
 			 SLAB_POISON |
 			 SLAB_STORE_USER |
-			 SLAB_TRACE));
+			 SLAB_TRACE);
 }
 static inline int slab_poison(struct kmem_cache *s)
 {
@@ -574,34 +575,34 @@ static int check_bytes_and_report(struct
  * Object layout:
  *
  * object address
- * 	Bytes of the object to be managed.
- * 	If the freepointer may overlay the object then the free
- * 	pointer is the first word of the object.
+ *	Bytes of the object to be managed.
+ *	If the freepointer may overlay the object then the free
+ *	pointer is the first word of the object.
  *
- * 	Poisoning uses 0x6b (POISON_FREE) and the last byte is
- * 	0xa5 (POISON_END)
+ *	Poisoning uses 0x6b (POISON_FREE) and the last byte is
+ *	0xa5 (POISON_END)
  *
  * object + s->objsize
- * 	Padding to reach word boundary. This is also used for Redzoning.
- * 	Padding is extended by another word if Redzoning is enabled and
- * 	objsize == inuse.
+ *	Padding to reach word boundary. This is also used for Redzoning.
+ *	Padding is extended by another word if Redzoning is enabled and
+ *	objsize == inuse.
  *
- * 	We fill with 0xbb (RED_INACTIVE) for inactive objects and with
- * 	0xcc (RED_ACTIVE) for objects in use.
+ *	We fill with 0xbb (RED_INACTIVE) for inactive objects and with
+ *	0xcc (RED_ACTIVE) for objects in use.
  *
  * object + s->inuse
- * 	Meta data starts here.
+ *	Meta data starts here.
  *
- * 	A. Free pointer (if we cannot overwrite object on free)
- * 	B. Tracking data for SLAB_STORE_USER
- * 	C. Padding to reach required alignment boundary or at mininum
- * 		one word if debuggin is on to be able to detect writes
- * 		before the word boundary.
+ *	A. Free pointer (if we cannot overwrite object on free)
+ *	B. Tracking data for SLAB_STORE_USER
+ *	C. Padding to reach required alignment boundary or at minimum
+ *		one word if debugging is on to be able to detect writes
+ *		before the word boundary.
  *
  *	Padding is done using 0x5a (POISON_INUSE)
  *
  * object + s->size
- * 	Nothing is used beyond s->size.
+ *	Nothing is used beyond s->size.
  */
 
 static int check_pad_bytes(struct kmem_cache *s, struct slqb_page *page, u8 *p)
@@ -717,25 +718,26 @@ static int check_slab(struct kmem_cache 
 	return 1;
 }
 
-static void trace(struct kmem_cache *s, struct slqb_page *page,
-			void *object, int alloc)
+static void
+trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
 {
-	if (s->flags & SLAB_TRACE) {
-		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
-			s->name,
-			alloc ? "alloc" : "free",
-			object, page->inuse,
-			page->freelist);
+	if (likely(!(s->flags & SLAB_TRACE)))
+		return;
 
-		if (!alloc)
-			print_section("Object", (void *)object, s->objsize);
+	printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+		s->name,
+		alloc ? "alloc" : "free",
+		object, page->inuse,
+		page->freelist);
 
-		dump_stack();
-	}
+	if (!alloc)
+		print_section("Object", (void *)object, s->objsize);
+
+	dump_stack();
 }
 
-static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
-				void *object)
+static void
+setup_object_debug(struct kmem_cache *s, struct slqb_page *page, void *object)
 {
 	if (!slab_debug(s))
 		return;
@@ -747,11 +749,10 @@ static void setup_object_debug(struct km
 	init_tracking(s, object);
 }
 
-static int alloc_debug_processing(struct kmem_cache *s,
-					void *object, void *addr)
+static int
+alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
 {
-	struct slqb_page *page;
-	page = virt_to_head_slqb_page(object);
+	struct slqb_page *page = virt_to_head_slqb_page(object);
 
 	if (!check_slab(s, page))
 		goto bad;
@@ -767,6 +768,7 @@ static int alloc_debug_processing(struct
 	/* Success perform special debug activities for allocs */
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_ALLOC, addr);
+
 	trace(s, page, object, 1);
 	init_object(s, object, 1);
 	return 1;
@@ -775,11 +777,9 @@ bad:
 	return 0;
 }
 
-static int free_debug_processing(struct kmem_cache *s,
-					void *object, void *addr)
+static int free_debug_processing(struct kmem_cache *s, void *object, void *addr)
 {
-	struct slqb_page *page;
-	page = virt_to_head_slqb_page(object);
+	struct slqb_page *page = virt_to_head_slqb_page(object);
 
 	if (!check_slab(s, page))
 		goto fail;
@@ -870,29 +870,34 @@ static unsigned long kmem_cache_flags(un
 				void (*ctor)(void *))
 {
 	/*
-	 * Enable debugging if selected on the kernel commandline.
+	 * Enable debugging if selected on the kernel commandline:
 	 */
-	if (slqb_debug && (!slqb_debug_slabs ||
-	    strncmp(slqb_debug_slabs, name,
-		strlen(slqb_debug_slabs)) == 0))
-			flags |= slqb_debug;
+
+	if (!slqb_debug)
+		return flags;
+
+	if (!slqb_debug_slabs)
+		return flags | slqb_debug;
+
+	if (!strncmp(slqb_debug_slabs, name, strlen(slqb_debug_slabs)))
+		return flags | slqb_debug;
 
 	return flags;
 }
 #else
-static inline void setup_object_debug(struct kmem_cache *s,
-			struct slqb_page *page, void *object)
+static inline void
+setup_object_debug(struct kmem_cache *s, struct slqb_page *page, void *object)
 {
 }
 
-static inline int alloc_debug_processing(struct kmem_cache *s,
-			void *object, void *addr)
+static inline int
+alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
 {
 	return 0;
 }
 
-static inline int free_debug_processing(struct kmem_cache *s,
-			void *object, void *addr)
+static inline int
+free_debug_processing(struct kmem_cache *s, void *object, void *addr)
 {
 	return 0;
 }
@@ -903,7 +908,7 @@ static inline int slab_pad_check(struct 
 }
 
 static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
-			void *object, int active)
+			       void *object, int active)
 {
 	return 1;
 }
@@ -924,11 +929,11 @@ static const int slqb_debug = 0;
 /*
  * allocate a new slab (return its corresponding struct slqb_page)
  */
-static struct slqb_page *allocate_slab(struct kmem_cache *s,
-					gfp_t flags, int node)
+static struct slqb_page *
+allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
-	struct slqb_page *page;
 	int pages = 1 << s->order;
+	struct slqb_page *page;
 
 	flags |= s->allocflags;
 
@@ -947,8 +952,8 @@ static struct slqb_page *allocate_slab(s
 /*
  * Called once for each object on a new slab page
  */
-static void setup_object(struct kmem_cache *s,
-				struct slqb_page *page, void *object)
+static void
+setup_object(struct kmem_cache *s, struct slqb_page *page, void *object)
 {
 	setup_object_debug(s, page, object);
 	if (unlikely(s->ctor))
@@ -958,8 +963,8 @@ static void setup_object(struct kmem_cac
 /*
  * Allocate a new slab, set up its object list.
  */
-static struct slqb_page *new_slab_page(struct kmem_cache *s,
-				gfp_t flags, int node, unsigned int colour)
+static struct slqb_page *
+new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
 {
 	struct slqb_page *page;
 	void *start;
@@ -1030,6 +1035,7 @@ static void rcu_free_slab(struct rcu_hea
 static void free_slab(struct kmem_cache *s, struct slqb_page *page)
 {
 	VM_BUG_ON(page->inuse);
+
 	if (unlikely(s->flags & SLAB_DESTROY_BY_RCU))
 		call_rcu(&page->rcu_head, rcu_free_slab);
 	else
@@ -1060,12 +1066,14 @@ static int free_object_to_page(struct km
 		l->nr_slabs--;
 		free_slab(s, page);
 		slqb_stat_inc(l, FLUSH_SLAB_FREE);
+
 		return 1;
 
 	} else if (page->inuse + 1 == s->objects) {
 		l->nr_partial++;
 		list_add(&page->lru, &l->partial);
 		slqb_stat_inc(l, FLUSH_SLAB_PARTIAL);
+
 		return 0;
 	}
 	return 0;
@@ -1146,8 +1154,8 @@ static void flush_free_list_all(struct k
  * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
  * list_lock in the case of per-node list.
  */
-static void claim_remote_free_list(struct kmem_cache *s,
-					struct kmem_cache_list *l)
+static void
+claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
 {
 	void **head, **tail;
 	int nr;
@@ -1192,8 +1200,8 @@ static void claim_remote_free_list(struc
  * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
  * list_lock in the case of per-node list.
  */
-static __always_inline void *__cache_list_get_object(struct kmem_cache *s,
-						struct kmem_cache_list *l)
+static __always_inline void *
+__cache_list_get_object(struct kmem_cache *s, struct kmem_cache_list *l)
 {
 	void *object;
 
@@ -1243,8 +1251,8 @@ static __always_inline void *__cache_lis
  * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
  * list_lock in the case of per-node list.
  */
-static noinline void *__cache_list_get_page(struct kmem_cache *s,
-				struct kmem_cache_list *l)
+static noinline void *
+__cache_list_get_page(struct kmem_cache *s, struct kmem_cache_list *l)
 {
 	struct slqb_page *page;
 	void *object;
@@ -1282,12 +1290,12 @@ static noinline void *__cache_list_get_p
  *
  * Must be called with interrupts disabled.
  */
-static noinline void *__slab_alloc_page(struct kmem_cache *s,
-				gfp_t gfpflags, int node)
+static noinline void *
+__slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)
 {
-	struct slqb_page *page;
 	struct kmem_cache_list *l;
 	struct kmem_cache_cpu *c;
+	struct slqb_page *page;
 	unsigned int colour;
 	void *object;
 
@@ -1347,15 +1355,19 @@ static noinline void *__slab_alloc_page(
 }
 
 #ifdef CONFIG_NUMA
-static noinline int alternate_nid(struct kmem_cache *s,
-				gfp_t gfpflags, int node)
+static noinline int
+alternate_nid(struct kmem_cache *s, gfp_t gfpflags, int node)
 {
 	if (in_interrupt() || (gfpflags & __GFP_THISNODE))
 		return node;
-	if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD))
+
+	if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD)) {
 		return cpuset_mem_spread_node();
-	else if (current->mempolicy)
-		return slab_node(current->mempolicy);
+	} else {
+		if (current->mempolicy)
+			return slab_node(current->mempolicy);
+	}
+
 	return node;
 }
 
@@ -1365,8 +1377,8 @@ static noinline int alternate_nid(struct
  *
  * Must be called with interrupts disabled.
  */
-static noinline void *__remote_slab_alloc(struct kmem_cache *s,
-				gfp_t gfpflags, int node)
+static noinline void *
+__remote_slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node)
 {
 	struct kmem_cache_node *n;
 	struct kmem_cache_list *l;
@@ -1375,6 +1387,7 @@ static noinline void *__remote_slab_allo
 	n = s->node[node];
 	if (unlikely(!n)) /* node has no memory */
 		return NULL;
+
 	l = &n->list;
 
 	spin_lock(&n->list_lock);
@@ -1389,7 +1402,9 @@ static noinline void *__remote_slab_allo
 	}
 	if (likely(object))
 		slqb_stat_inc(l, ALLOC);
+
 	spin_unlock(&n->list_lock);
+
 	return object;
 }
 #endif
@@ -1399,12 +1414,12 @@ static noinline void *__remote_slab_allo
  *
  * Must be called with interrupts disabled.
  */
-static __always_inline void *__slab_alloc(struct kmem_cache *s,
-				gfp_t gfpflags, int node)
+static __always_inline void *
+__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node)
 {
-	void *object;
-	struct kmem_cache_cpu *c;
 	struct kmem_cache_list *l;
+	struct kmem_cache_cpu *c;
+	void *object;
 
 #ifdef CONFIG_NUMA
 	if (unlikely(node != -1) && unlikely(node != numa_node_id()))
@@ -1422,6 +1437,7 @@ static __always_inline void *__slab_allo
 	}
 	if (likely(object))
 		slqb_stat_inc(l, ALLOC);
+
 	return object;
 }
 
@@ -1429,11 +1445,11 @@ static __always_inline void *__slab_allo
  * Perform some interrupts-on processing around the main allocation path
  * (debug checking and memset()ing).
  */
-static __always_inline void *slab_alloc(struct kmem_cache *s,
-				gfp_t gfpflags, int node, void *addr)
+static __always_inline void *
+slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, void *addr)
 {
-	void *object;
 	unsigned long flags;
+	void *object;
 
 again:
 	local_irq_save(flags);
@@ -1451,10 +1467,11 @@ again:
 	return object;
 }
 
-static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s,
-				gfp_t gfpflags, void *caller)
+static __always_inline void *
+__kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags, void *caller)
 {
 	int node = -1;
+
 #ifdef CONFIG_NUMA
 	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
 		node = alternate_nid(s, gfpflags, node);
@@ -1487,8 +1504,8 @@ EXPORT_SYMBOL(kmem_cache_alloc_node);
  *
  * Must be called with interrupts disabled.
  */
-static void flush_remote_free_cache(struct kmem_cache *s,
-				struct kmem_cache_cpu *c)
+static void
+flush_remote_free_cache(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	struct kmlist *src;
 	struct kmem_cache_list *dst;
@@ -1575,12 +1592,12 @@ static noinline void slab_free_to_remote
  *
  * Must be called with interrupts disabled.
  */
-static __always_inline void __slab_free(struct kmem_cache *s,
-				struct slqb_page *page, void *object)
+static __always_inline void
+__slab_free(struct kmem_cache *s, struct slqb_page *page, void *object)
 {
-	struct kmem_cache_cpu *c;
-	struct kmem_cache_list *l;
 	int thiscpu = smp_processor_id();
+	struct kmem_cache_list *l;
+	struct kmem_cache_cpu *c;
 
 	c = get_cpu_slab(s, thiscpu);
 	l = &c->list;
@@ -1619,8 +1636,8 @@ static __always_inline void __slab_free(
  * Perform some interrupts-on processing around the main freeing path
  * (debug checking).
  */
-static __always_inline void slab_free(struct kmem_cache *s,
-				struct slqb_page *page, void *object)
+static __always_inline void
+slab_free(struct kmem_cache *s, struct slqb_page *page, void *object)
 {
 	unsigned long flags;
 
@@ -1683,7 +1700,7 @@ static int slab_order(int size, int max_
 	return order;
 }
 
-static int calculate_order(int size)
+static int calc_order(int size)
 {
 	int order;
 
@@ -1710,8 +1727,8 @@ static int calculate_order(int size)
 /*
  * Figure out what the alignment of the objects will be.
  */
-static unsigned long calculate_alignment(unsigned long flags,
-				unsigned long align, unsigned long size)
+static unsigned long
+calc_alignment(unsigned long flags, unsigned long align, unsigned long size)
 {
 	/*
 	 * If the user wants hardware cache aligned objects then follow that
@@ -1737,18 +1754,18 @@ static unsigned long calculate_alignment
 static void init_kmem_cache_list(struct kmem_cache *s,
 				struct kmem_cache_list *l)
 {
-	l->cache		= s;
-	l->freelist.nr		= 0;
-	l->freelist.head	= NULL;
-	l->freelist.tail	= NULL;
-	l->nr_partial		= 0;
-	l->nr_slabs		= 0;
+	l->cache		 = s;
+	l->freelist.nr		 = 0;
+	l->freelist.head	 = NULL;
+	l->freelist.tail	 = NULL;
+	l->nr_partial		 = 0;
+	l->nr_slabs		 = 0;
 	INIT_LIST_HEAD(&l->partial);
 
 #ifdef CONFIG_SMP
-	l->remote_free_check	= 0;
+	l->remote_free_check	 = 0;
 	spin_lock_init(&l->remote_free.lock);
-	l->remote_free.list.nr	= 0;
+	l->remote_free.list.nr	 = 0;
 	l->remote_free.list.head = NULL;
 	l->remote_free.list.tail = NULL;
 #endif
@@ -1758,8 +1775,7 @@ static void init_kmem_cache_list(struct 
 #endif
 }
 
-static void init_kmem_cache_cpu(struct kmem_cache *s,
-				struct kmem_cache_cpu *c)
+static void init_kmem_cache_cpu(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	init_kmem_cache_list(s, &c->list);
 
@@ -1773,8 +1789,8 @@ static void init_kmem_cache_cpu(struct k
 }
 
 #ifdef CONFIG_NUMA
-static void init_kmem_cache_node(struct kmem_cache *s,
-				struct kmem_cache_node *n)
+static void
+init_kmem_cache_node(struct kmem_cache *s, struct kmem_cache_node *n)
 {
 	spin_lock_init(&n->list_lock);
 	init_kmem_cache_list(s, &n->list);
@@ -1804,8 +1820,8 @@ static struct kmem_cache_node kmem_node_
 #endif
 
 #ifdef CONFIG_SMP
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
-				int cpu)
+static struct kmem_cache_cpu *
+alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
 {
 	struct kmem_cache_cpu *c;
 
@@ -1910,7 +1926,7 @@ static int alloc_kmem_cache_nodes(struct
 #endif
 
 /*
- * calculate_sizes() determines the order and the distribution of data within
+ * calc_sizes() determines the order and the distribution of data within
  * a slab object.
  */
 static int calculate_sizes(struct kmem_cache *s)
@@ -1991,7 +2007,7 @@ static int calculate_sizes(struct kmem_c
 	 * user specified and the dynamic determination of cache line size
 	 * on bootup.
 	 */
-	align = calculate_alignment(flags, align, s->objsize);
+	align = calc_alignment(flags, align, s->objsize);
 
 	/*
 	 * SLQB stores one object immediately after another beginning from
@@ -2000,7 +2016,7 @@ static int calculate_sizes(struct kmem_c
 	 */
 	size = ALIGN(size, align);
 	s->size = size;
-	s->order = calculate_order(size);
+	s->order = calc_order(size);
 
 	if (s->order < 0)
 		return 0;
@@ -2210,38 +2226,38 @@ static struct kmem_cache *open_kmalloc_c
  * fls.
  */
 static s8 size_index[24] __cacheline_aligned = {
-	3,	/* 8 */
-	4,	/* 16 */
-	5,	/* 24 */
-	5,	/* 32 */
-	6,	/* 40 */
-	6,	/* 48 */
-	6,	/* 56 */
-	6,	/* 64 */
+	 3,	/* 8 */
+	 4,	/* 16 */
+	 5,	/* 24 */
+	 5,	/* 32 */
+	 6,	/* 40 */
+	 6,	/* 48 */
+	 6,	/* 56 */
+	 6,	/* 64 */
 #if L1_CACHE_BYTES < 64
-	1,	/* 72 */
-	1,	/* 80 */
-	1,	/* 88 */
-	1,	/* 96 */
+	 1,	/* 72 */
+	 1,	/* 80 */
+	 1,	/* 88 */
+	 1,	/* 96 */
 #else
-	7,
-	7,
-	7,
-	7,
-#endif
-	7,	/* 104 */
-	7,	/* 112 */
-	7,	/* 120 */
-	7,	/* 128 */
+	 7,
+	 7,
+	 7,
+	 7,
+#endif
+	 7,	/* 104 */
+	 7,	/* 112 */
+	 7,	/* 120 */
+	 7,	/* 128 */
 #if L1_CACHE_BYTES < 128
-	2,	/* 136 */
-	2,	/* 144 */
-	2,	/* 152 */
-	2,	/* 160 */
-	2,	/* 168 */
-	2,	/* 176 */
-	2,	/* 184 */
-	2	/* 192 */
+	 2,	/* 136 */
+	 2,	/* 144 */
+	 2,	/* 152 */
+	 2,	/* 160 */
+	 2,	/* 168 */
+	 2,	/* 176 */
+	 2,	/* 184 */
+	 2	/* 192 */
 #else
 	-1,
 	-1,
@@ -2278,9 +2294,8 @@ static struct kmem_cache *get_slab(size_
 
 void *__kmalloc(size_t size, gfp_t flags)
 {
-	struct kmem_cache *s;
+	struct kmem_cache *s = get_slab(size, flags);
 
-	s = get_slab(size, flags);
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
 		return s;
 
@@ -2291,9 +2306,8 @@ EXPORT_SYMBOL(__kmalloc);
 #ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
-	struct kmem_cache *s;
+	struct kmem_cache *s = get_slab(size, flags);
 
-	s = get_slab(size, flags);
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
 		return s;
 
@@ -2340,8 +2354,8 @@ EXPORT_SYMBOL(ksize);
 
 void kfree(const void *object)
 {
-	struct kmem_cache *s;
 	struct slqb_page *page;
+	struct kmem_cache *s;
 
 	if (unlikely(ZERO_OR_NULL_PTR(object)))
 		return;
@@ -2371,21 +2385,21 @@ static void kmem_cache_trim_percpu(void 
 
 int kmem_cache_shrink(struct kmem_cache *s)
 {
-#ifdef CONFIG_NUMA
-	int node;
-#endif
-
 	on_each_cpu(kmem_cache_trim_percpu, s, 1);
 
 #ifdef CONFIG_NUMA
-	for_each_node_state(node, N_NORMAL_MEMORY) {
-		struct kmem_cache_node *n = s->node[node];
-		struct kmem_cache_list *l = &n->list;
+	{
+		int node;
 
-		spin_lock_irq(&n->list_lock);
-		claim_remote_free_list(s, l);
-		flush_free_list(s, l);
-		spin_unlock_irq(&n->list_lock);
+		for_each_node_state(node, N_NORMAL_MEMORY) {
+			struct kmem_cache_node *n = s->node[node];
+			struct kmem_cache_list *l = &n->list;
+
+			spin_lock_irq(&n->list_lock);
+			claim_remote_free_list(s, l);
+			flush_free_list(s, l);
+			spin_unlock_irq(&n->list_lock);
+		}
 	}
 #endif
 
@@ -2397,8 +2411,8 @@ EXPORT_SYMBOL(kmem_cache_shrink);
 static void kmem_cache_reap_percpu(void *arg)
 {
 	int cpu = smp_processor_id();
-	struct kmem_cache *s;
 	long phase = (long)arg;
+	struct kmem_cache *s;
 
 	list_for_each_entry(s, &slab_caches, list) {
 		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
@@ -2442,8 +2456,7 @@ static void kmem_cache_reap(void)
 
 static void cache_trim_worker(struct work_struct *w)
 {
-	struct delayed_work *work =
-		container_of(w, struct delayed_work, work);
+	struct delayed_work *work;
 	struct kmem_cache *s;
 	int node;
 
@@ -2469,6 +2482,7 @@ static void cache_trim_worker(struct wor
 
 	up_read(&slqb_lock);
 out:
+	work = container_of(w, struct delayed_work, work);
 	schedule_delayed_work(work, round_jiffies_relative(3*HZ));
 }
 
@@ -2587,8 +2601,8 @@ static int slab_memory_callback(struct n
 
 void __init kmem_cache_init(void)
 {
-	int i;
 	unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
+	int i;
 
 	/*
 	 * All the ifdefs are rather ugly here, but it's just the setup code,
@@ -2719,8 +2733,9 @@ void __init kmem_cache_init(void)
 /*
  * Some basic slab creation sanity checks
  */
-static int kmem_cache_create_ok(const char *name, size_t size,
-		size_t align, unsigned long flags)
+static int
+kmem_cache_create_ok(const char *name, size_t size,
+		     size_t align, unsigned long flags)
 {
 	struct kmem_cache *tmp;
 
@@ -2773,8 +2788,9 @@ static int kmem_cache_create_ok(const ch
 	return 1;
 }
 
-struct kmem_cache *kmem_cache_create(const char *name, size_t size,
-		size_t align, unsigned long flags, void (*ctor)(void *))
+struct kmem_cache *
+kmem_cache_create(const char *name, size_t size,
+		  size_t align, unsigned long flags, void (*ctor)(void *))
 {
 	struct kmem_cache *s;
 
@@ -2804,7 +2820,7 @@ EXPORT_SYMBOL(kmem_cache_create);
  * necessary.
  */
 static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
-				unsigned long action, void *hcpu)
+					 unsigned long action, void *hcpu)
 {
 	long cpu = (long)hcpu;
 	struct kmem_cache *s;
@@ -2855,7 +2871,7 @@ static int __cpuinit slab_cpuup_callback
 }
 
 static struct notifier_block __cpuinitdata slab_notifier = {
-	.notifier_call = slab_cpuup_callback
+	.notifier_call	= slab_cpuup_callback
 };
 
 #endif
@@ -2878,11 +2894,10 @@ void *__kmalloc_track_caller(size_t size
 }
 
 void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
-				unsigned long caller)
+				  unsigned long caller)
 {
-	struct kmem_cache *s;
+	struct kmem_cache *s = get_slab(size, flags);
 
-	s = get_slab(size, flags);
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
 		return s;
 
@@ -2892,12 +2907,17 @@ void *__kmalloc_node_track_caller(size_t
 
 #if defined(CONFIG_SLQB_SYSFS) || defined(CONFIG_SLABINFO)
 struct stats_gather {
-	struct kmem_cache *s;
-	spinlock_t lock;
-	unsigned long nr_slabs;
-	unsigned long nr_partial;
-	unsigned long nr_inuse;
-	unsigned long nr_objects;
+	/*
+	 * Serialize on_each_cpu() instances updating the summary
+	 * stats structure:
+	 */
+	spinlock_t		lock;
+
+	struct kmem_cache	*s;
+	unsigned long		nr_slabs;
+	unsigned long		nr_partial;
+	unsigned long		nr_inuse;
+	unsigned long		nr_objects;
 
 #ifdef CONFIG_SLQB_STATS
 	unsigned long stats[NR_SLQB_STAT_ITEMS];
@@ -2915,25 +2935,25 @@ static void __gather_stats(void *arg)
 	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
 	struct kmem_cache_list *l = &c->list;
 	struct slqb_page *page;
-#ifdef CONFIG_SLQB_STATS
-	int i;
-#endif
 
 	nr_slabs = l->nr_slabs;
 	nr_partial = l->nr_partial;
 	nr_inuse = (nr_slabs - nr_partial) * s->objects;
 
-	list_for_each_entry(page, &l->partial, lru) {
+	list_for_each_entry(page, &l->partial, lru)
 		nr_inuse += page->inuse;
-	}
 
 	spin_lock(&gather->lock);
 	gather->nr_slabs += nr_slabs;
 	gather->nr_partial += nr_partial;
 	gather->nr_inuse += nr_inuse;
 #ifdef CONFIG_SLQB_STATS
-	for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
-		gather->stats[i] += l->stats[i];
+	{
+		int i;
+
+		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
+			gather->stats[i] += l->stats[i];
+	}
 #endif
 	spin_unlock(&gather->lock);
 }
@@ -2956,14 +2976,15 @@ static void gather_stats(struct kmem_cac
 		struct kmem_cache_list *l = &n->list;
 		struct slqb_page *page;
 		unsigned long flags;
-#ifdef CONFIG_SLQB_STATS
-		int i;
-#endif
 
 		spin_lock_irqsave(&n->list_lock, flags);
 #ifdef CONFIG_SLQB_STATS
-		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
-			stats->stats[i] += l->stats[i];
+		{
+			int i;
+
+			for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
+				stats->stats[i] += l->stats[i];
+		}
 #endif
 		stats->nr_slabs += l->nr_slabs;
 		stats->nr_partial += l->nr_partial;
@@ -3039,14 +3060,15 @@ static int s_show(struct seq_file *m, vo
 	seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs,
 			stats.nr_slabs, 0UL);
 	seq_putc(m, '\n');
+
 	return 0;
 }
 
 static const struct seq_operations slabinfo_op = {
-	.start = s_start,
-	.next = s_next,
-	.stop = s_stop,
-	.show = s_show,
+	.start		= s_start,
+	.next		= s_next,
+	.stop		= s_stop,
+	.show		= s_show,
 };
 
 static int slabinfo_open(struct inode *inode, struct file *file)
@@ -3205,8 +3227,8 @@ static ssize_t store_user_show(struct km
 }
 SLAB_ATTR_RO(store_user);
 
-static ssize_t hiwater_store(struct kmem_cache *s,
-				const char *buf, size_t length)
+static ssize_t
+hiwater_store(struct kmem_cache *s, const char *buf, size_t length)
 {
 	long hiwater;
 	int err;
@@ -3229,8 +3251,8 @@ static ssize_t hiwater_show(struct kmem_
 }
 SLAB_ATTR(hiwater);
 
-static ssize_t freebatch_store(struct kmem_cache *s,
-				const char *buf, size_t length)
+static ssize_t
+freebatch_store(struct kmem_cache *s, const char *buf, size_t length)
 {
 	long freebatch;
 	int err;
@@ -3258,91 +3280,95 @@ static int show_stat(struct kmem_cache *
 {
 	struct stats_gather stats;
 	int len;
-#ifdef CONFIG_SMP
-	int cpu;
-#endif
 
 	gather_stats(s, &stats);
 
 	len = sprintf(buf, "%lu", stats.stats[si]);
 
 #ifdef CONFIG_SMP
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-		struct kmem_cache_list *l = &c->list;
+	{
+		int cpu;
 
-		if (len < PAGE_SIZE - 20)
-			len += sprintf(buf+len, " C%d=%lu", cpu, l->stats[si]);
+		for_each_online_cpu(cpu) {
+			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+			struct kmem_cache_list *l = &c->list;
+
+			if (len < PAGE_SIZE - 20) {
+				len += sprintf(buf+len,
+						" C%d=%lu", cpu, l->stats[si]);
+			}
+		}
 	}
 #endif
 	return len + sprintf(buf + len, "\n");
 }
 
-#define STAT_ATTR(si, text) 					\
+#define STAT_ATTR(si, text)					\
 static ssize_t text##_show(struct kmem_cache *s, char *buf)	\
 {								\
 	return show_stat(s, buf, si);				\
 }								\
 SLAB_ATTR_RO(text);						\
 
-STAT_ATTR(ALLOC, alloc);
-STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
-STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
-STAT_ATTR(FREE, free);
-STAT_ATTR(FREE_REMOTE, free_remote);
-STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
-STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
-STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
-STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
-STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
-STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
-STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
-STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
-STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
+STAT_ATTR(ALLOC,			alloc);
+STAT_ATTR(ALLOC_SLAB_FILL,		alloc_slab_fill);
+STAT_ATTR(ALLOC_SLAB_NEW,		alloc_slab_new);
+STAT_ATTR(FREE,				free);
+STAT_ATTR(FREE_REMOTE,			free_remote);
+STAT_ATTR(FLUSH_FREE_LIST,		flush_free_list);
+STAT_ATTR(FLUSH_FREE_LIST_OBJECTS,	flush_free_list_objects);
+STAT_ATTR(FLUSH_FREE_LIST_REMOTE,	flush_free_list_remote);
+STAT_ATTR(FLUSH_SLAB_PARTIAL,		flush_slab_partial);
+STAT_ATTR(FLUSH_SLAB_FREE,		flush_slab_free);
+STAT_ATTR(FLUSH_RFREE_LIST,		flush_rfree_list);
+STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS,	flush_rfree_list_objects);
+STAT_ATTR(CLAIM_REMOTE_LIST,		claim_remote_list);
+STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS,	claim_remote_list_objects);
 #endif
 
 static struct attribute *slab_attrs[] = {
-	&slab_size_attr.attr,
-	&object_size_attr.attr,
-	&objs_per_slab_attr.attr,
-	&order_attr.attr,
-	&objects_attr.attr,
-	&total_objects_attr.attr,
-	&slabs_attr.attr,
-	&ctor_attr.attr,
-	&align_attr.attr,
-	&hwcache_align_attr.attr,
-	&reclaim_account_attr.attr,
-	&destroy_by_rcu_attr.attr,
-	&red_zone_attr.attr,
-	&poison_attr.attr,
-	&store_user_attr.attr,
-	&hiwater_attr.attr,
-	&freebatch_attr.attr,
+
+	&                 slab_size_attr.attr,
+	&               object_size_attr.attr,
+	&             objs_per_slab_attr.attr,
+	&                     order_attr.attr,
+	&                   objects_attr.attr,
+	&             total_objects_attr.attr,
+	&                     slabs_attr.attr,
+	&                      ctor_attr.attr,
+	&                     align_attr.attr,
+	&             hwcache_align_attr.attr,
+	&           reclaim_account_attr.attr,
+	&            destroy_by_rcu_attr.attr,
+	&                  red_zone_attr.attr,
+	&                    poison_attr.attr,
+	&                store_user_attr.attr,
+	&                   hiwater_attr.attr,
+	&                 freebatch_attr.attr,
 #ifdef CONFIG_ZONE_DMA
-	&cache_dma_attr.attr,
+	&                 cache_dma_attr.attr,
 #endif
 #ifdef CONFIG_SLQB_STATS
-	&alloc_attr.attr,
-	&alloc_slab_fill_attr.attr,
-	&alloc_slab_new_attr.attr,
-	&free_attr.attr,
-	&free_remote_attr.attr,
-	&flush_free_list_attr.attr,
-	&flush_free_list_objects_attr.attr,
-	&flush_free_list_remote_attr.attr,
-	&flush_slab_partial_attr.attr,
-	&flush_slab_free_attr.attr,
-	&flush_rfree_list_attr.attr,
-	&flush_rfree_list_objects_attr.attr,
-	&claim_remote_list_attr.attr,
-	&claim_remote_list_objects_attr.attr,
+	&                     alloc_attr.attr,
+	&           alloc_slab_fill_attr.attr,
+	&            alloc_slab_new_attr.attr,
+	&                      free_attr.attr,
+	&               free_remote_attr.attr,
+	&           flush_free_list_attr.attr,
+	&   flush_free_list_objects_attr.attr,
+	&    flush_free_list_remote_attr.attr,
+	&        flush_slab_partial_attr.attr,
+	&           flush_slab_free_attr.attr,
+	&          flush_rfree_list_attr.attr,
+	&  flush_rfree_list_objects_attr.attr,
+	&         claim_remote_list_attr.attr,
+	& claim_remote_list_objects_attr.attr,
 #endif
 	NULL
 };
 
 static struct attribute_group slab_attr_group = {
-	.attrs = slab_attrs,
+	.attrs		= slab_attrs,
 };
 
 static ssize_t slab_attr_show(struct kobject *kobj,
@@ -3389,13 +3415,13 @@ static void kmem_cache_release(struct ko
 }
 
 static struct sysfs_ops slab_sysfs_ops = {
-	.show = slab_attr_show,
-	.store = slab_attr_store,
+	.show		= slab_attr_show,
+	.store		= slab_attr_store,
 };
 
 static struct kobj_type slab_ktype = {
-	.sysfs_ops = &slab_sysfs_ops,
-	.release = kmem_cache_release
+	.sysfs_ops	= &slab_sysfs_ops,
+	.release	= kmem_cache_release
 };
 
 static int uevent_filter(struct kset *kset, struct kobject *kobj)
@@ -3413,7 +3439,7 @@ static struct kset_uevent_ops slab_ueven
 
 static struct kset *slab_kset;
 
-static int sysfs_available __read_mostly = 0;
+static int sysfs_available __read_mostly;
 
 static int sysfs_slab_add(struct kmem_cache *s)
 {
@@ -3462,9 +3488,11 @@ static int __init slab_sysfs_init(void)
 
 	list_for_each_entry(s, &slab_caches, list) {
 		err = sysfs_slab_add(s);
-		if (err)
-			printk(KERN_ERR "SLQB: Unable to add boot slab %s"
-						" to sysfs\n", s->name);
+		if (!err)
+			continue;
+
+		printk(KERN_ERR
+			"SLQB: Unable to add boot slab %s to sysfs\n", s->name);
 	}
 
 	up_write(&slqb_lock);

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-21 14:30 ` Nick Piggin
@ 2009-01-21 17:59   ` Joe Perches
  -1 siblings, 0 replies; 197+ messages in thread
From: Joe Perches @ 2009-01-21 17:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

One thing you might consider is that
Q is visually close enough to O to be
misread.

Perhaps a different letter would be good.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-21 14:30 ` Nick Piggin
@ 2009-01-21 18:10   ` Hugh Dickins
  -1 siblings, 0 replies; 197+ messages in thread
From: Hugh Dickins @ 2009-01-21 18:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Wed, 21 Jan 2009, Nick Piggin wrote:
> 
> Since last posted, I've cleaned up a few bits and pieces, (hopefully)
> fixed a known bug where it wouldn't boot on memoryless nodes (I don't
> have a system to test with), and improved performance and reduced
> locking somewhat for node-specific and interleaved allocations.

I haven't reviewed your postings, but I did give the previous version
of your patch a try on all my machines.  Some observations and one patch.

I was initially _very_ impressed by how well it did on my venerable
tmpfs loop swapping loads, where I'd expected next to no effect; but
that turned out to be because on three machines I'd been using SLUB,
without remembering how default slub_max_order got raised from 1 to 3
in 2.6.26 (hmm, and Documentation/vm/slub.txt not updated).

That's been making SLUB behave pretty badly (e.g. elapsed time 30%
more than SLAB) with swapping loads on most of my machines.  Though
oddly one seems immune, and another takes four times as long: guess
it depends on how close to thrashing, but probably more to investigate
there.  I think my original SLUB versus SLAB comparisons were done on
the immune one: as I remember, SLUB and SLAB were equivalent on those
loads when SLUB came in, but even with boot option slub_max_order=1,
SLUB is still slower than SLAB on such tests (e.g. 2% slower).
FWIW - swapping loads are not what anybody should tune for.

So in fact SLQB comes in very much like SLAB, as I think you'd expect:
slightly ahead of it on most of the machines, but probably in the noise.
(SLOB behaves decently: not a winner, but no catastrophic behaviour.)

What I love most about SLUB is the way you can reasonably build with
CONFIG_SLUB_DEBUG=y, very little impact, then switch on the specific
debugging you want with a boot option when you want it.  That was a
great stride forward, which you've followed in SLQB: so I'd have to
prefer SLQB to SLAB (on debuggability) and to SLUB (on high orders).
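
(For anyone following along: the SLUB convention this mirrors is building
with CONFIG_SLUB_DEBUG=y and then turning the checks on from the boot line,
e.g.

	slub_debug=FZPU		sanity checks, red zoning, poisoning, user tracking
	slub_debug=F,dentry	sanity checks for the dentry cache only

SLQB's counterpart here looks to be a slqb_debug= option with
slqb_debug_slabs for the per-cache case, going by the variables in the
patch; I'm assuming the flag letters match SLUB's, I haven't checked the
parser.)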

I do hate the name SLQB.  Despite having no experience of databases,
I find it almost impossible to type, coming out as SQLB most times.
Wish you'd invented a plausible vowel instead of the Q; but probably
too late for that.

init/Kconfig describes it as "Qeued allocator": should say "Queued".

Documentation/vm/slqbinfo.c gives several compilation warnings:
I'd rather leave it to you to fix them, maybe the unused variables
are about to be used, or maybe there's much worse wrong with it
than a few compilation warnings, I didn't investigate.

The only bug I found is below (you'll probably want to change the patch,
which I've rediffed to today's slqb.c but not retested).

On fake NUMA I hit kernel BUG at mm/slqb.c:1107!  claim_remote_free_list()
is doing several things without remote_free.lock: that VM_BUG_ON is unsafe
for one, and even if others are somehow safe today, it will be more robust
to take the lock sooner.

I moved the prefetchw(head) down to where we know it's going to be the head,
and replaced the offending VM_BUG_ON by a later WARN_ON which you'd probably
better remove altogether: once we got the lock, it's hardly interesting.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 mm/slqb.c |   17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

--- slqb/mm/slqb.c.orig	2009-01-21 15:23:54.000000000 +0000
+++ slqb/mm/slqb.c	2009-01-21 15:32:44.000000000 +0000
@@ -1115,17 +1115,12 @@ static void claim_remote_free_list(struc
 	void **head, **tail;
 	int nr;
 
-	VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
-
 	if (!l->remote_free.list.nr)
 		return;
 
+	spin_lock(&l->remote_free.lock);
 	l->remote_free_check = 0;
 	head = l->remote_free.list.head;
-	/* Get the head hot for the likely subsequent allocation or flush */
-	prefetchw(head);
-
-	spin_lock(&l->remote_free.lock);
 	l->remote_free.list.head = NULL;
 	tail = l->remote_free.list.tail;
 	l->remote_free.list.tail = NULL;
@@ -1133,9 +1128,15 @@ static void claim_remote_free_list(struc
 	l->remote_free.list.nr = 0;
 	spin_unlock(&l->remote_free.lock);
 
-	if (!l->freelist.nr)
+	WARN_ON(!head + !tail != !nr + !nr);
+	if (!nr)
+		return;
+
+	if (!l->freelist.nr) {
+		/* Get head hot for likely subsequent allocation or flush */
+		prefetchw(head);
 		l->freelist.head = head;
-	else
+	} else
 		set_freepointer(s, l->freelist.tail, head);
 	l->freelist.tail = tail;
 

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-21 14:30 ` Nick Piggin
@ 2009-01-22  8:45   ` Zhang, Yanmin
  -1 siblings, 0 replies; 197+ messages in thread
From: Zhang, Yanmin @ 2009-01-22  8:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

On Wed, 2009-01-21 at 15:30 +0100, Nick Piggin wrote:
> Hi,
> 
> Since last posted, I've cleaned up a few bits and pieces, (hopefully)
> fixed a known bug where it wouldn't boot on memoryless nodes (I don't
> have a system to test with), 
It panics again on my Montvale Itanium NUMA machine if I start the kernel with
the parameter mem=2G.

The call chain is mnt_init => sysfs_init. kmem_cache_create fails, so later on,
when mnt_init uses the kmem_cache sysfs_dir_cache, the kernel panics
at __slab_alloc => get_cpu_slab because the parameter s is NULL.

__remote_slab_alloc returns NULL when s->node[node] == NULL. That is what makes
sysfs_init => kmem_cache_create fail.
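
A minimal userspace sketch of that failure path (illustrative only: the
struct layout, MAX_NODES and names are made up here, this is not the
kernel code):

#include <stdio.h>

#define MAX_NODES	4

struct kmem_cache_node { int unused; };

struct kmem_cache {
	struct kmem_cache_node *node[MAX_NODES];
};

/* stand-in for the node-targeted alloc path: no per-node struct, no object */
static void *remote_alloc_sketch(struct kmem_cache *s, int node)
{
	struct kmem_cache_node *n = s->node[node];

	if (!n)			/* node has no memory attached */
		return NULL;	/* caller sees the allocation fail */

	return n;		/* placeholder for a real object */
}

static struct kmem_cache_node node0;

int main(void)
{
	/* booted with mem=2G: only node 0 is populated, node 1 stays NULL */
	struct kmem_cache s = { .node = { [0] = &node0 } };

	printf("node 0: %p\n", remote_alloc_sketch(&s, 0));
	printf("node 1: %p\n", remote_alloc_sketch(&s, 1));

	return 0;
}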


------------------log----------------

Dentry cache hash table entries: 262144 (order: 7, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 6, 1048576 bytes)
Mount-cache hash table entries: 1024
mnt_init: sysfs_init error: -12
Unable to handle kernel NULL pointer dereference (address 0000000000002058)
swapper[0]: Oops 8813272891392 [1]
Modules linked in:

Pid: 0, CPU 0, comm:              swapper
psr : 00001010084a2018 ifs : 8000000000000690 ip  : [<a000000100180350>]    Not tainted (2.6.29-rc2slqb0121)
ip is at kmem_cache_alloc+0x150/0x4e0
unat: 0000000000000000 pfs : 0000000000000690 rsc : 0000000000000003
rnat: 0009804c8a70433f bsps: a000000100f484b0 pr  : 656960155aa65959
ldrs: 0000000000000000 ccv : 000000000000001a fpsr: 0009804c8a70433f
csd : 893fffff000f0000 ssd : 893fffff00090000
b0  : a000000100180270 b6  : a000000100507360 b7  : a000000100507360
f6  : 000000000000000000000 f7  : 1003e0000000000000800
f8  : 1003e0000000000000008 f9  : 1003e0000000000000001
f10 : 1003e0000000000000031 f11 : 1003e7d6343eb1a1f58d1
r1  : a0000001011bc810 r2  : 0000000000000008 r3  : ffffffffffffffff
r8  : 0000000000000000 r9  : a000000100ded800 r10 : 0000000000000000
r11 : a000000100ded800 r12 : a000000100db3d80 r13 : a000000100dac000
r14 : 0000000000000000 r15 : fffffffffffffffe r16 : a000000100fbcd30
r17 : a000000100dacc44 r18 : 0000000000002058 r19 : 0000000000000000
r20 : 0000000000000000 r21 : a000000100dacc44 r22 : 0000000000000002
r23 : 0000000000000066 r24 : 0000000000000073 r25 : 0000000000000000
r26 : e000000102014030 r27 : a0007fffffc9f120 r28 : 0000000000000000
r29 : 0000000000000000 r30 : 0000000000000008 r31 : 0000000000000001

Call Trace:
 [<a000000100016240>] show_stack+0x40/0xa0
                                sp=a000000100db3950 bsp=a000000100dad140
 [<a000000100016b50>] show_regs+0x850/0x8a0
                                sp=a000000100db3b20 bsp=a000000100dad0e8
 [<a00000010003a5f0>] die+0x230/0x360
                                sp=a000000100db3b20 bsp=a000000100dad0a0
 [<a00000010005e0e0>] ia64_do_page_fault+0x8e0/0xa40
                                sp=a000000100db3b20 bsp=a000000100dad050
 [<a00000010000c700>] ia64_native_leave_kernel+0x0/0x280
                                sp=a000000100db3bb0 bsp=a000000100dad050
 [<a000000100180350>] kmem_cache_alloc+0x150/0x4e0
                                sp=a000000100db3d80 bsp=a000000100dacfc8
 [<a000000100238610>] sysfs_new_dirent+0x90/0x240
                                sp=a000000100db3d80 bsp=a000000100dacf80
 [<a000000100239140>] create_dir+0x40/0x100
                                sp=a000000100db3d90 bsp=a000000100dacf48
 [<a0000001002392b0>] sysfs_create_dir+0xb0/0x100
                                sp=a000000100db3db0 bsp=a000000100dacf28
 [<a0000001004eca60>] kobject_add_internal+0x1e0/0x420
                                sp=a000000100db3dc0 bsp=a000000100dacee8
 [<a0000001004eceb0>] kobject_add_varg+0x90/0xc0
                                sp=a000000100db3dc0 bsp=a000000100daceb0
 [<a0000001004ed620>] kobject_add+0x100/0x140
                                sp=a000000100db3dc0 bsp=a000000100dace50
 [<a0000001004ed6b0>] kobject_create_and_add+0x50/0xc0
                                sp=a000000100db3e00 bsp=a000000100dace20
 [<a000000100c28ff0>] mnt_init+0x1b0/0x480
                                sp=a000000100db3e00 bsp=a000000100dacde0
 [<a000000100c28610>] vfs_caches_init+0x230/0x280
                                sp=a000000100db3e20 bsp=a000000100dacdb8
 [<a000000100c01410>] start_kernel+0x830/0x8c0
                                sp=a000000100db3e20 bsp=a000000100dacd40
 [<a0000001009d7b60>] __kprobes_text_end+0x760/0x780
                                sp=a000000100db3e30 bsp=a000000100dacca0
Kernel panic - not syncing: Attempted to kill the idle task!



^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-21 18:10   ` Hugh Dickins
@ 2009-01-22 10:01     ` Pekka Enberg
  -1 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-01-22 10:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

Hi Hugh,

On Wed, Jan 21, 2009 at 8:10 PM, Hugh Dickins <hugh@veritas.com> wrote:
> I was initially _very_ impressed by how well it did on my venerable
> tmpfs loop swapping loads, where I'd expected next to no effect; but
> that turned out to be because on three machines I'd been using SLUB,
> without remembering how default slub_max_order got raised from 1 to 3
> in 2.6.26 (hmm, and Documentation/vm/slub.txt not updated).
>
> That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> more than SLAB) with swapping loads on most of my machines.  Though
> oddly one seems immune, and another takes four times as long: guess
> it depends on how close to thrashing, but probably more to investigate
> there.  I think my original SLUB versus SLAB comparisons were done on
> the immune one: as I remember, SLUB and SLAB were equivalent on those
> loads when SLUB came in, but even with boot option slub_max_order=1,
> SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> FWIW - swapping loads are not what anybody should tune for.

What kind of machine are you seeing this on? It sounds like it could
be a side-effect from commit 9b2cd506e5f2117f94c28a0040bf5da058105316
("slub: Calculate min_objects based on number of processors").

                                Pekka

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-22 10:01     ` Pekka Enberg
@ 2009-01-22 12:47       ` Hugh Dickins
  -1 siblings, 0 replies; 197+ messages in thread
From: Hugh Dickins @ 2009-01-22 12:47 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Thu, 22 Jan 2009, Pekka Enberg wrote:
> On Wed, Jan 21, 2009 at 8:10 PM, Hugh Dickins <hugh@veritas.com> wrote:
> > I was initially _very_ impressed by how well it did on my venerable
> > tmpfs loop swapping loads, where I'd expected next to no effect; but
> > that turned out to be because on three machines I'd been using SLUB,
> > without remembering how default slub_max_order got raised from 1 to 3
> > in 2.6.26 (hmm, and Documentation/vm/slub.txt not updated).
> >
> > That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> > more than SLAB) with swapping loads on most of my machines.  Though
> > oddly one seems immune, and another takes four times as long: guess
> > it depends on how close to thrashing, but probably more to investigate
> > there.  I think my original SLUB versus SLAB comparisons were done on
> > the immune one: as I remember, SLUB and SLAB were equivalent on those
> > loads when SLUB came in, but even with boot option slub_max_order=1,
> > SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> > FWIW - swapping loads are not what anybody should tune for.
> 
> What kind of machine are you seeing this on? It sounds like it could
> be a side-effect from commit 9b2cd506e5f2117f94c28a0040bf5da058105316
> ("slub: Calculate min_objects based on number of processors").

Thanks, yes, that could well account for the residual difference: the
machines in question have 2 or 4 cpus, so the old slub_min_objects=4
has effectively become slub_min_objects=12 or slub_min_objects=16.
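
For concreteness (and assuming the heuristic that commit introduced is
min_objects = 4 * (fls(nr_cpu_ids) + 1), which is how I read it), here is a
throwaway userspace sketch of the arithmetic; it does give the 12 and 16
above for 2 and 4 cpus:

	#include <stdio.h>

	/* 1-based index of the highest set bit, like the kernel's fls() */
	static int fls_(unsigned int x)
	{
		int r = 0;

		while (x) {
			r++;
			x >>= 1;
		}
		return r;
	}

	int main(void)
	{
		unsigned int cpus;

		for (cpus = 1; cpus <= 16; cpus <<= 1)
			printf("%u cpus -> min_objects %d\n",
			       cpus, 4 * (fls_(cpus) + 1));
		return 0;
	}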

I'm now trying with slub_max_order=1 slub_min_objects=4 on the boot
lines (though I'll need to curtail tests on a couple of machines),
and will report back later.

It's great that SLUB provides these knobs; not so great that it needs them.

Hugh

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-21 17:40       ` Ingo Molnar
@ 2009-01-23  3:31         ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23  3:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Wed, Jan 21, 2009 at 06:40:10PM +0100, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Wed, Jan 21, 2009 at 03:59:18PM +0100, Ingo Molnar wrote:
> > > 
> > > Mind if i nitpick a bit about minor style issues? Since this is going to 
> > > be the next Linux SLAB allocator we might as well do it perfectly :-)
> > 
> > Well here is an incremental patch which should get most of the issues 
> > you pointed out, most of the sane ones that checkpatch pointed out, and 
> > a few of my own ;)
> 
> here's an incremental one ontop of your incremental patch, enhancing some 
> more issues. I now find the code very readable! :-)

Thanks! I'll go through it and apply it. I'll raise any issues if I
am particularly against them ;)

> ( in case you are wondering about the placement of bit_spinlock.h - that 
>   file needs fixing, just move it to the top of the file and see the build 
>   break. But that's a separate patch.)

Ah, SLQB doesn't use bit spinlocks anyway, so I'll just get rid of that.
I'll see if there are any other obviously unneeded headers too.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-21 17:59   ` Joe Perches
@ 2009-01-23  3:35     ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23  3:35 UTC (permalink / raw)
  To: Joe Perches
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Wed, Jan 21, 2009 at 09:59:30AM -0800, Joe Perches wrote:
> One thing you might consider is that
> Q is visually close enough to O to be
> misread.
> 
> Perhaps a different letter would be good.

That's a fair point. Hugh dislikes it too, I see ;) What to do... I
had been toying with the idea that if slqb (or slub) becomes "the"
allocator, then we could rename it all back to slAb after replacing
the existing slab?

Or I could make it a 128 bit allocator and call it SLZB, which would
definitely make it "the final" allocator ;)

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-21 18:10   ` Hugh Dickins
@ 2009-01-23  3:55     ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23  3:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Wed, Jan 21, 2009 at 06:10:12PM +0000, Hugh Dickins wrote:
> On Wed, 21 Jan 2009, Nick Piggin wrote:
> > 
> > Since last posted, I've cleaned up a few bits and pieces, (hopefully)
> > fixed a known bug where it wouldn't boot on memoryless nodes (I don't
> > have a system to test with), and improved performance and reduced
> > locking somewhat for node-specific and interleaved allocations.
> 
> I haven't reviewed your postings, but I did give the previous version
> of your patch a try on all my machines.  Some observations and one patch.

Great, thanks!

 
> I was initially _very_ impressed by how well it did on my venerable
> tmpfs loop swapping loads, where I'd expected next to no effect; but
> that turned out to be because on three machines I'd been using SLUB,
> without remembering how default slub_max_order got raised from 1 to 3
> in 2.6.26 (hmm, and Documentation/vm/slub.txt not updated).
> 
> That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> more than SLAB) with swapping loads on most of my machines.  Though
> oddly one seems immune, and another takes four times as long: guess
> it depends on how close to thrashing, but probably more to investigate
> there.  I think my original SLUB versus SLAB comparisons were done on
> the immune one: as I remember, SLUB and SLAB were equivalent on those
> loads when SLUB came in, but even with boot option slub_max_order=1,
> SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> FWIW - swapping loads are not what anybody should tune for.

Yeah, that's to be expected with higher order allocations, I think. Does
your immune machine simply have fewer CPUs, and thus not use such
high order allocations?

 
> So in fact SLQB comes in very much like SLAB, as I think you'd expect:
> slightly ahead of it on most of the machines, but probably in the noise.
> (SLOB behaves decently: not a winner, but no catastrophic behaviour.)
> 
> What I love most about SLUB is the way you can reasonably build with
> CONFIG_SLUB_DEBUG=y, very little impact, then switch on the specific
> debugging you want with a boot option when you want it.  That was a
> great stride forward, which you've followed in SLQB: so I'd have to
> prefer SLQB to SLAB (on debuggability) and to SLUB (on high orders).

It is nice. All credit to Christoph for that (and the fine grained
sysfs code). 

 
> I do hate the name SLQB.  Despite having no experience of databases,
> I find it almost impossible to type, coming out as SQLB most times.
> Wish you'd invented a plausible vowel instead of the Q; but probably
> too late for that.

Yeah, apologies for the name :P

 
> init/Kconfig describes it as "Qeued allocator": should say "Queued".

Thanks.


> Documentation/vm/slqbinfo.c gives several compilation warnings:
> I'd rather leave it to you to fix them, maybe the unused variables
> are about to be used, or maybe there's much worse wrong with it
> than a few compilation warnings, I didn't investigate.

OK.
 

> The only bug I found (but you'll probably want to change the patch
> - which I've rediffed to today's slqb.c, but not retested).
> 
> On fake NUMA I hit kernel BUG at mm/slqb.c:1107!  claim_remote_free_list()
> is doing several things without remote_free.lock: that VM_BUG_ON is unsafe
> for one, and even if others are somehow safe today, it will be more robust
> to take the lock sooner.
 
Good catch, thanks. The BUG should be OK where it is if we only
claim the remote free list when remote_free_check is set, but
some of the periodic reaping and teardown code calls it unconditionally.
But it's not critical so it should definitely go inside the lock.
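
To spell out the resulting shape (reconstructed from your diff plus its
context, so treat the exact lines as approximate): the cheap unlocked check
of ->nr stays as a hint, and the list is detached entirely under
remote_free.lock.

	if (!l->remote_free.list.nr)
		return;

	spin_lock(&l->remote_free.lock);
	l->remote_free_check = 0;
	head = l->remote_free.list.head;
	l->remote_free.list.head = NULL;
	tail = l->remote_free.list.tail;
	l->remote_free.list.tail = NULL;
	nr = l->remote_free.list.nr;
	l->remote_free.list.nr = 0;
	spin_unlock(&l->remote_free.lock);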


> I moved the prefetchw(head) down to where we know it's going to be the head,
> and replaced the offending VM_BUG_ON by a later WARN_ON which you'd probably
> better remove altogether: once we got the lock, it's hardly interesting.

Right, I'll probably do that. Thanks!

> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
> 
>  mm/slqb.c |   17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> --- slqb/mm/slqb.c.orig	2009-01-21 15:23:54.000000000 +0000
> +++ slqb/mm/slqb.c	2009-01-21 15:32:44.000000000 +0000
> @@ -1115,17 +1115,12 @@ static void claim_remote_free_list(struc
>  	void **head, **tail;
>  	int nr;
>  
> -	VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
> -
>  	if (!l->remote_free.list.nr)
>  		return;
>  
> +	spin_lock(&l->remote_free.lock);
>  	l->remote_free_check = 0;
>  	head = l->remote_free.list.head;
> -	/* Get the head hot for the likely subsequent allocation or flush */
> -	prefetchw(head);
> -
> -	spin_lock(&l->remote_free.lock);
>  	l->remote_free.list.head = NULL;
>  	tail = l->remote_free.list.tail;
>  	l->remote_free.list.tail = NULL;
> @@ -1133,9 +1128,15 @@ static void claim_remote_free_list(struc
>  	l->remote_free.list.nr = 0;
>  	spin_unlock(&l->remote_free.lock);
>  
> -	if (!l->freelist.nr)
> +	WARN_ON(!head + !tail != !nr + !nr);
> +	if (!nr)
> +		return;
> +
> +	if (!l->freelist.nr) {
> +		/* Get head hot for likely subsequent allocation or flush */
> +		prefetchw(head);
>  		l->freelist.head = head;
> -	else
> +	} else
>  		set_freepointer(s, l->freelist.tail, head);
>  	l->freelist.tail = tail;
>  

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-22  8:45   ` Zhang, Yanmin
@ 2009-01-23  3:57     ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23  3:57 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

On Thu, Jan 22, 2009 at 04:45:33PM +0800, Zhang, Yanmin wrote:
> On Wed, 2009-01-21 at 15:30 +0100, Nick Piggin wrote:
> > Hi,
> > 
> > Since last posted, I've cleaned up a few bits and pieces, (hopefully)
> > fixed a known bug where it wouldn't boot on memoryless nodes (I don't
> > have a system to test with), 
> Panic again on my Montvale Itanium NUMA machine if I start kernel with parameter
> mem=2G.
> 
> The call chain is mnt_init => sysfs_init. kmem_cache_create fails, so later on
> when mnt_init uses kmem_cache sysfs_dir_cache, kernel panic
> at __slab_alloc => get_cpu_slab because parameter s is equal to NULL.
> 
> Function __remote_slab_alloc return NULL when s->node[node]==NULL. That causes
> sysfs_init => kmem_cache_create fails.

Hmm, I'll probably have to add a bit more fallback logic. I'll have to
work out what semantics the callers require here. Thanks for the report.
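
Purely to illustrate the kind of fallback I mean (a hypothetical helper, not
the actual fix, since the right caller semantics are exactly what needs
working out): if the requested node is memoryless and so has no
kmem_cache_node, fall back to the local node rather than returning NULL.

	/*
	 * Hypothetical sketch only: slab_node_or_local() is a made-up name
	 * and the real fix may take a different shape entirely.
	 */
	static int slab_node_or_local(struct kmem_cache *s, int node)
	{
		if (node < 0 || node >= MAX_NUMNODES || !s->node[node])
			return numa_node_id();	/* no per-node list: use local */
		return node;
	}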

> 
> 
> ------------------log----------------
> 
> Dentry cache hash table entries: 262144 (order: 7, 2097152 bytes)
> Inode-cache hash table entries: 131072 (order: 6, 1048576 bytes)
> Mount-cache hash table entries: 1024
> mnt_init: sysfs_init error: -12
> Unable to handle kernel NULL pointer dereference (address 0000000000002058)
> swapper[0]: Oops 8813272891392 [1]
> Modules linked in:
> 
> Pid: 0, CPU 0, comm:              swapper
> psr : 00001010084a2018 ifs : 8000000000000690 ip  : [<a000000100180350>]    Not tainted (2.6.29-rc2slqb0121)
> ip is at kmem_cache_alloc+0x150/0x4e0
> unat: 0000000000000000 pfs : 0000000000000690 rsc : 0000000000000003
> rnat: 0009804c8a70433f bsps: a000000100f484b0 pr  : 656960155aa65959
> ldrs: 0000000000000000 ccv : 000000000000001a fpsr: 0009804c8a70433f
> csd : 893fffff000f0000 ssd : 893fffff00090000
> b0  : a000000100180270 b6  : a000000100507360 b7  : a000000100507360
> f6  : 000000000000000000000 f7  : 1003e0000000000000800
> f8  : 1003e0000000000000008 f9  : 1003e0000000000000001
> f10 : 1003e0000000000000031 f11 : 1003e7d6343eb1a1f58d1
> r1  : a0000001011bc810 r2  : 0000000000000008 r3  : ffffffffffffffff
> r8  : 0000000000000000 r9  : a000000100ded800 r10 : 0000000000000000
> r11 : a000000100ded800 r12 : a000000100db3d80 r13 : a000000100dac000
> r14 : 0000000000000000 r15 : fffffffffffffffe r16 : a000000100fbcd30
> r17 : a000000100dacc44 r18 : 0000000000002058 r19 : 0000000000000000
> r20 : 0000000000000000 r21 : a000000100dacc44 r22 : 0000000000000002
> r23 : 0000000000000066 r24 : 0000000000000073 r25 : 0000000000000000
> r26 : e000000102014030 r27 : a0007fffffc9f120 r28 : 0000000000000000
> r29 : 0000000000000000 r30 : 0000000000000008 r31 : 0000000000000001
> 
> Call Trace:
>  [<a000000100016240>] show_stack+0x40/0xa0
>                                 sp=a000000100db3950 bsp=a000000100dad140
>  [<a000000100016b50>] show_regs+0x850/0x8a0
>                                 sp=a000000100db3b20 bsp=a000000100dad0e8
>  [<a00000010003a5f0>] die+0x230/0x360
>                                 sp=a000000100db3b20 bsp=a000000100dad0a0
>  [<a00000010005e0e0>] ia64_do_page_fault+0x8e0/0xa40
>                                 sp=a000000100db3b20 bsp=a000000100dad050
>  [<a00000010000c700>] ia64_native_leave_kernel+0x0/0x280
>                                 sp=a000000100db3bb0 bsp=a000000100dad050
>  [<a000000100180350>] kmem_cache_alloc+0x150/0x4e0
>                                 sp=a000000100db3d80 bsp=a000000100dacfc8
>  [<a000000100238610>] sysfs_new_dirent+0x90/0x240
>                                 sp=a000000100db3d80 bsp=a000000100dacf80
>  [<a000000100239140>] create_dir+0x40/0x100
>                                 sp=a000000100db3d90 bsp=a000000100dacf48
>  [<a0000001002392b0>] sysfs_create_dir+0xb0/0x100
>                                 sp=a000000100db3db0 bsp=a000000100dacf28
>  [<a0000001004eca60>] kobject_add_internal+0x1e0/0x420
>                                 sp=a000000100db3dc0 bsp=a000000100dacee8
>  [<a0000001004eceb0>] kobject_add_varg+0x90/0xc0
>                                 sp=a000000100db3dc0 bsp=a000000100daceb0
>  [<a0000001004ed620>] kobject_add+0x100/0x140
>                                 sp=a000000100db3dc0 bsp=a000000100dace50
>  [<a0000001004ed6b0>] kobject_create_and_add+0x50/0xc0
>                                 sp=a000000100db3e00 bsp=a000000100dace20
>  [<a000000100c28ff0>] mnt_init+0x1b0/0x480
>                                 sp=a000000100db3e00 bsp=a000000100dacde0
>  [<a000000100c28610>] vfs_caches_init+0x230/0x280
>                                 sp=a000000100db3e20 bsp=a000000100dacdb8
>  [<a000000100c01410>] start_kernel+0x830/0x8c0
>                                 sp=a000000100db3e20 bsp=a000000100dacd40
>  [<a0000001009d7b60>] __kprobes_text_end+0x760/0x780
>                                 sp=a000000100db3e30 bsp=a000000100dacca0
> Kernel panic - not syncing: Attempted to kill the idle task!
> 

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23  3:35     ` Nick Piggin
@ 2009-01-23  4:00       ` Joe Perches
  -1 siblings, 0 replies; 197+ messages in thread
From: Joe Perches @ 2009-01-23  4:00 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Fri, 2009-01-23 at 04:35 +0100, Nick Piggin wrote:
> That's a fair point. Hugh dislikes it too, I see ;) What to do... I
> had been toying with the idea that if slqb (or slub) becomes "the"
> allocator, then we could rename it all back to slAb after replacing
> the existing slab?

maybe SLIB (slab-improved) or SLAB_NG or NSLAB or SLABX
Who says it has to be 4 letters?

> Or I could make it a 128 bit allocator and call it SLZB, which would
> definitely make it "the final" allocator ;)

That leads to the phone book game.

SLZZB - and a crystal bridge now spans the fissure.

Hmm, wrong game.

cheers, j


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-21 17:40       ` Ingo Molnar
@ 2009-01-23  6:14         ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23  6:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Wed, Jan 21, 2009 at 06:40:10PM +0100, Ingo Molnar wrote:
> -static inline void slqb_stat_inc(struct kmem_cache_list *list,
> -				enum stat_item si)
> +static inline void
> +slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
>  {

Hmm, I'm not entirely fond of this style. The former scales to longer lines
with just a single style change (putting args into new lines), whereas the
latter first moves its prefixes to a newline, then moves args as the
line grows even longer.

I guess it is a matter of taste, not wrong either way... but I think most
of the mm code I'm used to looking at uses the former. Do you feel strongly?


> +static void
> +trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
>  {
> -	if (s->flags & SLAB_TRACE) {
> -		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
> -			s->name,
> -			alloc ? "alloc" : "free",
> -			object, page->inuse,
> -			page->freelist);
> +	if (likely(!(s->flags & SLAB_TRACE)))
> +		return;

I think most of your flow control changes are improvements (others even
more than this, but this is the first one so I comment here). Thanks.


> @@ -1389,7 +1402,9 @@ static noinline void *__remote_slab_allo
>  	}
>  	if (likely(object))
>  		slqb_stat_inc(l, ALLOC);
> +
>  	spin_unlock(&n->list_lock);
> +
>  	return object;
>  }
>  #endif

Whitespace, I never really know if I'm "doing it right" or not :) And
often it is easy to tell a badly wrong one, but harder to tell what is
better between two reasonable ones. But I guess I'm the same way with
paragraphs in my writing...


> @@ -1399,12 +1414,12 @@ static noinline void *__remote_slab_allo
>   *
>   * Must be called with interrupts disabled.
>   */
> -static __always_inline void *__slab_alloc(struct kmem_cache *s,
> -				gfp_t gfpflags, int node)
> +static __always_inline void *
> +__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node)
>  {
> -	void *object;
> -	struct kmem_cache_cpu *c;
>  	struct kmem_cache_list *l;
> +	struct kmem_cache_cpu *c;
> +	void *object;

Same with order of local variables. You like longest lines to
shortest I know. I think I vaguely try to arrange them from the
most important or high level "actor" to the least, and then in
order of when they get discovered/used.

For example, in the above function, "object" is the raison d'etre.
kmem_cache_cpu is found first, and from that, kmem_cache_list is
found. Which slightly explains the order.


> +static __always_inline void *
> +slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, void *addr)
>  {
> -	void *object;
>  	unsigned long flags;
> +	void *object;

And here, eg. flags comes last because mostly inconsequential to
the bigger picture.

Your method is easier though, I'll grant you that :)


>  static void init_kmem_cache_list(struct kmem_cache *s,
>  				struct kmem_cache_list *l)
>  {
> -	l->cache		= s;
> -	l->freelist.nr		= 0;
> -	l->freelist.head	= NULL;
> -	l->freelist.tail	= NULL;
> -	l->nr_partial		= 0;
> -	l->nr_slabs		= 0;
> +	l->cache		 = s;
> +	l->freelist.nr		 = 0;
> +	l->freelist.head	 = NULL;
> +	l->freelist.tail	 = NULL;
> +	l->nr_partial		 = 0;
> +	l->nr_slabs		 = 0;
>  	INIT_LIST_HEAD(&l->partial);

Hmm, we seem to have gathered an extra space...

>  
>  #ifdef CONFIG_SMP
> -	l->remote_free_check	= 0;
> +	l->remote_free_check	 = 0;
>  	spin_lock_init(&l->remote_free.lock);
> -	l->remote_free.list.nr	= 0;
> +	l->remote_free.list.nr	 = 0;
>  	l->remote_free.list.head = NULL;
>  	l->remote_free.list.tail = NULL;
>  #endif

... ah, to line up with this guy. TBH, I prefer not to religiously
line things up like this. If there is the odd long-line, just give
it the normal single space. I find it just keeps it easier to
maintain. Although you might counter that of course it is easier to
keep something clean if one relaxes their definition of "clean".


>  static s8 size_index[24] __cacheline_aligned = {
> -	3,	/* 8 */
> -	4,	/* 16 */
> -	5,	/* 24 */
> -	5,	/* 32 */
> -	6,	/* 40 */
> -	6,	/* 48 */
> -	6,	/* 56 */
> -	6,	/* 64 */
> +	 3,	/* 8 */
> +	 4,	/* 16 */
> +	 5,	/* 24 */
> +	 5,	/* 32 */
> +	 6,	/* 40 */
> +	 6,	/* 48 */
> +	 6,	/* 56 */
> +	 6,	/* 64 */

However justifying numbers, like this, I'm happy to do (may as well
align the numbers in the comments too while we're here).


> @@ -2278,9 +2294,8 @@ static struct kmem_cache *get_slab(size_
>  
>  void *__kmalloc(size_t size, gfp_t flags)
>  {
> -	struct kmem_cache *s;
> +	struct kmem_cache *s = get_slab(size, flags);
>  
> -	s = get_slab(size, flags);
>  	if (unlikely(ZERO_OR_NULL_PTR(s)))
>  		return s;

I've got yet the same problem with these... I mostly try to avoid
doing this, although there are some cases where it works well
(eg. constants, or a simple assignment of an argument to a local).

At some point, you start putting real code in there, at which point
the space after the local vars doesn't seem to serve much purpose.
get_slab I feel logically belongs close to the subsequent check,
because that's basically sanitizing its return value / extracting
the error case from it and leaving the rest of the function to work
on the common case.


> -static int sysfs_available __read_mostly = 0;
> +static int sysfs_available __read_mostly;

These, I actually like initializing to zero explicitly. I'm pretty
sure gcc no longer makes it any more expensive than leaving it out.
Yes of course everybody who knows C has to know this, but.... I
just don't feel much harm in leaving it.

Lots of good stuff, lots I'm on the fence with, some I dislike ;)
I'll concentrate on picking up the obvious ones, and get the bugs
fixed. Will see where the discussion goes with the rest.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-22  8:45   ` Zhang, Yanmin
@ 2009-01-23  9:00     ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23  9:00 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

On Thu, Jan 22, 2009 at 04:45:33PM +0800, Zhang, Yanmin wrote:
> On Wed, 2009-01-21 at 15:30 +0100, Nick Piggin wrote:
> > Hi,
> > 
> > Since last posted, I've cleaned up a few bits and pieces, (hopefully)
> > fixed a known bug where it wouldn't boot on memoryless nodes (I don't
> > have a system to test with), 
> Panic again on my Montvale Itanium NUMA machine if I start kernel with parameter
> mem=2G.
> 
> The call chain is mnt_init => sysfs_init. kmem_cache_create fails, so later on
> when mnt_init uses kmem_cache sysfs_dir_cache, kernel panic
> at __slab_alloc => get_cpu_slab because parameter s is equal to NULL.
> 
> Function __remote_slab_alloc return NULL when s->node[node]==NULL. That causes
> sysfs_init => kmem_cache_create fails.

Booting with mem= is a good trick to create memoryless nodes easily.
Unfortunately it didn't trigger any bugs on my system, so I couldn't
actually verify that the fallback code solves your problem. Would
you be able to test with this updated patch (which also includes
Hugh's fix and some code style changes)?

The other thing is that this bug has uncovered a little buglet in the
sysfs setup code: if it is unable to continue in a degraded mode after
the allocation failure, it should be using SLAB_PANIC.
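
Something along these lines in sysfs_init(), that is (illustrative only; the
exact call site is from memory):

	sysfs_dir_cachep = kmem_cache_create("sysfs_dir_cache",
					     sizeof(struct sysfs_dirent),
					     0, SLAB_PANIC, NULL);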

Thanks,
Nick
---
Index: linux-2.6/include/linux/rcupdate.h
===================================================================
--- linux-2.6.orig/include/linux/rcupdate.h
+++ linux-2.6/include/linux/rcupdate.h
@@ -33,6 +33,7 @@
 #ifndef __LINUX_RCUPDATE_H
 #define __LINUX_RCUPDATE_H
 
+#include <linux/rcu_types.h>
 #include <linux/cache.h>
 #include <linux/spinlock.h>
 #include <linux/threads.h>
@@ -42,16 +43,6 @@
 #include <linux/lockdep.h>
 #include <linux/completion.h>
 
-/**
- * struct rcu_head - callback structure for use with RCU
- * @next: next update requests in a list
- * @func: actual update function to call after the grace period.
- */
-struct rcu_head {
-	struct rcu_head *next;
-	void (*func)(struct rcu_head *head);
-};
-
 #if defined(CONFIG_CLASSIC_RCU)
 #include <linux/rcuclassic.h>
 #elif defined(CONFIG_TREE_RCU)
Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/slqb_def.h
@@ -0,0 +1,289 @@
+#ifndef _LINUX_SLQB_DEF_H
+#define _LINUX_SLQB_DEF_H
+
+/*
+ * SLQB : A slab allocator with object queues.
+ *
+ * (C) 2008 Nick Piggin <npiggin@suse.de>
+ */
+#include <linux/types.h>
+#include <linux/gfp.h>
+#include <linux/workqueue.h>
+#include <linux/kobject.h>
+#include <linux/rcu_types.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+#include <linux/kobject.h>
+
+enum stat_item {
+	ALLOC,			/* Allocation count */
+	ALLOC_SLAB_FILL,	/* Fill freelist from page list */
+	ALLOC_SLAB_NEW,		/* New slab acquired from page allocator */
+	FREE,			/* Free count */
+	FREE_REMOTE,		/* NUMA: freeing to remote list */
+	FLUSH_FREE_LIST,	/* Freelist flushed */
+	FLUSH_FREE_LIST_OBJECTS, /* Objects flushed from freelist */
+	FLUSH_FREE_LIST_REMOTE,	/* Objects flushed from freelist to remote */
+	FLUSH_SLAB_PARTIAL,	/* Freeing moves slab to partial list */
+	FLUSH_SLAB_FREE,	/* Slab freed to the page allocator */
+	FLUSH_RFREE_LIST,	/* Rfree list flushed */
+	FLUSH_RFREE_LIST_OBJECTS, /* Rfree objects flushed */
+	CLAIM_REMOTE_LIST,	/* Remote freed list claimed */
+	CLAIM_REMOTE_LIST_OBJECTS, /* Remote freed objects claimed */
+	NR_SLQB_STAT_ITEMS
+};
+
+/*
+ * Singly-linked list with head, tail, and nr
+ */
+struct kmlist {
+	unsigned long	nr;
+	void 		**head;
+	void		**tail;
+};
+
+/*
+ * Every kmem_cache_list has a kmem_cache_remote_free structure, by which
+ * objects can be returned to the kmem_cache_list from remote CPUs.
+ */
+struct kmem_cache_remote_free {
+	spinlock_t	lock;
+	struct kmlist	list;
+} ____cacheline_aligned;
+
+/*
+ * A kmem_cache_list manages all the slabs and objects allocated from a given
+ * source. Per-cpu kmem_cache_lists allow node-local allocations. Per-node
+ * kmem_cache_lists allow off-node allocations (but require locking).
+ */
+struct kmem_cache_list {
+				/* Fastpath LIFO freelist of objects */
+	struct kmlist		freelist;
+#ifdef CONFIG_SMP
+				/* remote_free has reached a watermark */
+	int			remote_free_check;
+#endif
+				/* kmem_cache corresponding to this list */
+	struct kmem_cache	*cache;
+
+				/* Number of partial slabs (pages) */
+	unsigned long		nr_partial;
+
+				/* Slabs which have some free objects */
+	struct list_head	partial;
+
+				/* Total number of slabs allocated */
+	unsigned long		nr_slabs;
+
+#ifdef CONFIG_SMP
+	/*
+	 * In the case of per-cpu lists, remote_free is for objects freed by
+	 * non-owner CPU back to its home list. For per-node lists, remote_free
+	 * is always used to free objects.
+	 */
+	struct kmem_cache_remote_free remote_free;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+	unsigned long		stats[NR_SLQB_STAT_ITEMS];
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Primary per-cpu, per-kmem_cache structure.
+ */
+struct kmem_cache_cpu {
+	struct kmem_cache_list	list;		/* List for node-local slabs */
+	unsigned int		colour_next;	/* Next colour offset to use */
+
+#ifdef CONFIG_SMP
+	/*
+	 * rlist is a list of objects that don't fit on list.freelist (ie.
+	 * wrong node). The objects all correspond to a given kmem_cache_list,
+	 * remote_cache_list. To free objects to another list, we must first
+	 * flush the existing objects, then switch remote_cache_list.
+	 *
+	 * An NR_CPUS or MAX_NUMNODES array would be nice here, but then we
+	 * get to O(NR_CPUS^2) memory consumption situation.
+	 */
+	struct kmlist		rlist;
+	struct kmem_cache_list	*remote_cache_list;
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Per-node, per-kmem_cache structure. Used for node-specific allocations.
+ */
+struct kmem_cache_node {
+	struct kmem_cache_list	list;
+	spinlock_t		list_lock;	/* protects access to list */
+} ____cacheline_aligned;
+
+/*
+ * Management object for a slab cache.
+ */
+struct kmem_cache {
+	unsigned long	flags;
+	int		hiwater;	/* LIFO list high watermark */
+	int		freebatch;	/* LIFO freelist batch flush size */
+	int		objsize;	/* Size of object without meta data */
+	int		offset;		/* Free pointer offset. */
+	int		objects;	/* Number of objects in slab */
+
+	int		size;		/* Size of object including meta data */
+	int		order;		/* Allocation order */
+	gfp_t		allocflags;	/* gfp flags to use on allocation */
+	unsigned int	colour_range;	/* range of colour counter */
+	unsigned int	colour_off;	/* offset per colour */
+	void		(*ctor)(void *);
+
+	const char	*name;		/* Name (only for display!) */
+	struct list_head list;		/* List of slab caches */
+
+	int		align;		/* Alignment */
+	int		inuse;		/* Offset to metadata */
+
+#ifdef CONFIG_SLQB_SYSFS
+	struct kobject	kobj;		/* For sysfs */
+#endif
+#ifdef CONFIG_NUMA
+	struct kmem_cache_node	*node[MAX_NUMNODES];
+#endif
+#ifdef CONFIG_SMP
+	struct kmem_cache_cpu	*cpu_slab[NR_CPUS];
+#else
+	struct kmem_cache_cpu	cpu_slab;
+#endif
+};
+
+/*
+ * Kmalloc subsystem.
+ */
+#if defined(ARCH_KMALLOC_MINALIGN) && ARCH_KMALLOC_MINALIGN > 8
+#define KMALLOC_MIN_SIZE ARCH_KMALLOC_MINALIGN
+#else
+#define KMALLOC_MIN_SIZE 8
+#endif
+
+#define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
+#define KMALLOC_SHIFT_SLQB_HIGH (PAGE_SHIFT + 9)
+
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1];
+extern struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1];
+
+/*
+ * Constant size allocations use this path to find index into kmalloc caches
+ * arrays. get_slab() function is used for non-constant sizes.
+ */
+static __always_inline int kmalloc_index(size_t size)
+{
+	if (unlikely(!size))
+		return 0;
+	if (unlikely(size > 1UL << KMALLOC_SHIFT_SLQB_HIGH))
+		return 0;
+
+	if (unlikely(size <= KMALLOC_MIN_SIZE))
+		return KMALLOC_SHIFT_LOW;
+
+#if L1_CACHE_BYTES < 64
+	if (size > 64 && size <= 96)
+		return 1;
+#endif
+#if L1_CACHE_BYTES < 128
+	if (size > 128 && size <= 192)
+		return 2;
+#endif
+	if (size <=	  8) return 3;
+	if (size <=	 16) return 4;
+	if (size <=	 32) return 5;
+	if (size <=	 64) return 6;
+	if (size <=	128) return 7;
+	if (size <=	256) return 8;
+	if (size <=	512) return 9;
+	if (size <=       1024) return 10;
+	if (size <=   2 * 1024) return 11;
+	if (size <=   4 * 1024) return 12;
+	if (size <=   8 * 1024) return 13;
+	if (size <=  16 * 1024) return 14;
+	if (size <=  32 * 1024) return 15;
+	if (size <=  64 * 1024) return 16;
+	if (size <= 128 * 1024) return 17;
+	if (size <= 256 * 1024) return 18;
+	if (size <= 512 * 1024) return 19;
+	if (size <= 1024 * 1024) return 20;
+	if (size <=  2 * 1024 * 1024) return 21;
+	return -1;
+}
+
+#ifdef CONFIG_ZONE_DMA
+#define SLQB_DMA __GFP_DMA
+#else
+/* Disable "DMA slabs" */
+#define SLQB_DMA (__force gfp_t)0
+#endif
+
+/*
+ * Find the kmalloc slab cache for a given combination of allocation flags and
+ * size.
+ */
+static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
+{
+	int index = kmalloc_index(size);
+
+	if (unlikely(index == 0))
+		return NULL;
+
+	if (likely(!(flags & SLQB_DMA)))
+		return &kmalloc_caches[index];
+	else
+		return &kmalloc_caches_dma[index];
+}
+
+void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
+void *__kmalloc(size_t size, gfp_t flags);
+
+#ifndef ARCH_KMALLOC_MINALIGN
+#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#ifndef ARCH_SLAB_MINALIGN
+#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ?	\
+				sizeof(void *) : ARCH_KMALLOC_MINALIGN)
+
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
+{
+	if (__builtin_constant_p(size)) {
+		struct kmem_cache *s;
+
+		s = kmalloc_slab(size, flags);
+		if (unlikely(ZERO_OR_NULL_PTR(s)))
+			return s;
+
+		return kmem_cache_alloc(s, flags);
+	}
+	return __kmalloc(size, flags);
+}
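+
+/*
+ * Example (illustrative): a constant-size call such as kmalloc(128,
+ * GFP_KERNEL) resolves kmalloc_slab() at compile time to &kmalloc_caches[7]
+ * and becomes a direct kmem_cache_alloc() call, while a run-time size goes
+ * through __kmalloc() and the get_slab() lookup in mm/slqb.c.
+ */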
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node);
+void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
+
+static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+{
+	if (__builtin_constant_p(size)) {
+		struct kmem_cache *s;
+
+		s = kmalloc_slab(size, flags);
+		if (unlikely(ZERO_OR_NULL_PTR(s)))
+			return s;
+
+		return kmem_cache_alloc_node(s, flags, node);
+	}
+	return __kmalloc_node(size, flags, node);
+}
+#endif
+
+#endif /* _LINUX_SLQB_DEF_H */
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -806,7 +806,7 @@ config SLUB_DEBUG
 
 choice
 	prompt "Choose SLAB allocator"
-	default SLUB
+	default SLQB
 	help
 	   This option allows to select a slab allocator.
 
@@ -827,6 +827,11 @@ config SLUB
 	   and has enhanced diagnostics. SLUB is the default choice for
 	   a slab allocator.
 
+config SLQB
+	bool "SLQB (Queued allocator)"
+	help
+	  SLQB is a proposed new slab allocator.
+
 config SLOB
 	depends on EMBEDDED
 	bool "SLOB (Simple Allocator)"
@@ -868,7 +873,7 @@ config HAVE_GENERIC_DMA_COHERENT
 config SLABINFO
 	bool
 	depends on PROC_FS
-	depends on SLAB || SLUB_DEBUG
+	depends on SLAB || SLUB_DEBUG || SLQB
 	default y
 
 config RT_MUTEXES
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug
+++ linux-2.6/lib/Kconfig.debug
@@ -298,6 +298,26 @@ config SLUB_STATS
 	  out which slabs are relevant to a particular load.
 	  Try running: slabinfo -DA
 
+config SLQB_DEBUG
+	default y
+	bool "Enable SLQB debugging support"
+	depends on SLQB
+
+config SLQB_DEBUG_ON
+	default n
+	bool "SLQB debugging on by default"
+	depends on SLQB_DEBUG
+
+config SLQB_SYSFS
+	bool "Create SYSFS entries for slab caches"
+	default n
+	depends on SLQB
+
+config SLQB_STATS
+	bool "Enable SLQB performance statistics"
+	default n
+	depends on SLQB_SYSFS
+
 config DEBUG_PREEMPT
 	bool "Debug preemptible kernel"
 	depends on DEBUG_KERNEL && PREEMPT && (TRACE_IRQFLAGS_SUPPORT || PPC64)
Index: linux-2.6/mm/slqb.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/slqb.c
@@ -0,0 +1,3509 @@
+/*
+ * SLQB: A slab allocator that focuses on per-CPU scaling, and good performance
+ * with order-0 allocations. Fastpath emphasis is placed on local allocation
+ * and freeing, but with a secondary goal of good remote freeing (freeing on
+ * a CPU other than the one that allocated).
+ *
+ * Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/interrupt.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/ctype.h>
+#include <linux/kallsyms.h>
+#include <linux/memory.h>
+
+/*
+ * TODO
+ * - fix up releasing of offlined data structures. Not a big deal because
+ *   they don't get cumulatively leaked with successive online/offline cycles
+ * - improve fallback paths, allow OOM conditions to flush back per-CPU pages
+ *   to common lists to be reused by other CPUs.
+ * - investigate performance with memoryless nodes. Perhaps CPUs can be given
+ *   a default closest home node via which they can use fastpath functions.
+ *   Perhaps it is not a big problem.
+ */
+
+/*
+ * slqb_page overloads struct page, and is used to manage some slab allocation
+ * aspects, however to avoid the horrible mess in include/linux/mm_types.h,
+ * we'll just define our own struct slqb_page type variant here.
+ */
+struct slqb_page {
+	union {
+		struct {
+			unsigned long	flags;		/* mandatory */
+			atomic_t	_count;		/* mandatory */
+			unsigned int	inuse;		/* Nr of objects */
+			struct kmem_cache_list *list;	/* Pointer to list */
+			void		 **freelist;	/* LIFO freelist */
+			union {
+				struct list_head lru;	/* misc. list */
+				struct rcu_head rcu_head; /* for rcu freeing */
+			};
+		};
+		struct page page;
+	};
+};
+static inline void struct_slqb_page_wrong_size(void)
+{ BUILD_BUG_ON(sizeof(struct slqb_page) != sizeof(struct page)); }
+
+#define PG_SLQB_BIT (1 << PG_slab)
+
+static int kmem_size __read_mostly;
+#ifdef CONFIG_NUMA
+static int numa_platform __read_mostly;
+#else
+static const int numa_platform = 0;
+#endif
+
+static inline int slab_hiwater(struct kmem_cache *s)
+{
+	return s->hiwater;
+}
+
+static inline int slab_freebatch(struct kmem_cache *s)
+{
+	return s->freebatch;
+}
+
+/*
+ * Lock order:
+ * kmem_cache_node->list_lock
+ *   kmem_cache_remote_free->lock
+ *
+ * Data structures:
+ * SLQB is primarily per-cpu. For each kmem_cache, each CPU has:
+ *
+ * - A LIFO list of node-local objects. Allocation and freeing of node local
+ *   objects goes first to this list.
+ *
+ * - A list of partial slab pages. If an allocation misses the object list,
+ *   it is refilled from a partial page. After freeing an object to the object
+ *   list, if it is over a watermark, some objects are freed back to pages.
+ *   If an allocation misses the partial list too, a new slab page is
+ *   allocated from the page allocator. Slabs whose last object is freed are
+ *   returned to the page allocator immediately.
+ *
+ * - A remote free queue, where objects freed that did not come from the local
+ *   node are queued to. When this reaches a watermark, the objects are
+ *   flushed.
+ *
+ * - A remotely freed queue, where objects allocated from this CPU are flushed
+ *   to from other CPUs' remote free queues. kmem_cache_remote_free->lock is
+ *   used to protect access to this queue.
+ *
+ *   When the remotely freed queue reaches a watermark, a flag is set to tell
+ *   the owner CPU to check it. The owner CPU will then check the queue on the
+ *   next allocation that misses the object list. It will move all objects from
+ *   this list onto the object list and then allocate one.
+ *
+ *   This system of remote queueing is intended to reduce lock and remote
+ *   cacheline acquisitions, and give a cooling off period for remotely freed
+ *   objects before they are re-allocated.
+ *
+ * Node-specific allocations from somewhere other than the local node are
+ * handled by a per-node list which is the same as the above per-CPU data
+ * structures except for the following differences:
+ *
+ * - kmem_cache_node->list_lock is used to protect access for multiple CPUs to
+ *   allocate from a given node.
+ *
+ * - There is no remote free queue. Nodes don't free objects, CPUs do.
+ */
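+
+/*
+ * A rough per-CPU picture (illustrative, not a definition):
+ *
+ *   kmem_cache_cpu
+ *     ->list               (kmem_cache_list)
+ *        ->freelist        LIFO of node-local free objects
+ *        ->partial         pages with free objects remaining
+ *        ->remote_free     objects other CPUs freed back to this list
+ *     ->rlist              objects this CPU freed that belong to another list
+ *     ->remote_cache_list  the list that rlist will be flushed to
+ */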
+
+static inline void slqb_stat_inc(struct kmem_cache_list *list,
+				enum stat_item si)
+{
+#ifdef CONFIG_SLQB_STATS
+	list->stats[si]++;
+#endif
+}
+
+static inline void slqb_stat_add(struct kmem_cache_list *list,
+				enum stat_item si, unsigned long nr)
+{
+#ifdef CONFIG_SLQB_STATS
+	list->stats[si] += nr;
+#endif
+}
+
+static inline int slqb_page_to_nid(struct slqb_page *page)
+{
+	return page_to_nid(&page->page);
+}
+
+static inline void *slqb_page_address(struct slqb_page *page)
+{
+	return page_address(&page->page);
+}
+
+static inline struct zone *slqb_page_zone(struct slqb_page *page)
+{
+	return page_zone(&page->page);
+}
+
+static inline int virt_to_nid(const void *addr)
+{
+#ifdef virt_to_page_fast
+	return page_to_nid(virt_to_page_fast(addr));
+#else
+	return page_to_nid(virt_to_page(addr));
+#endif
+}
+
+static inline struct slqb_page *virt_to_head_slqb_page(const void *addr)
+{
+	struct page *p;
+
+	p = virt_to_head_page(addr);
+	return (struct slqb_page *)p;
+}
+
+static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
+						unsigned int order)
+{
+	struct page *p;
+
+	if (nid == -1)
+		p = alloc_pages(flags, order);
+	else
+		p = alloc_pages_node(nid, flags, order);
+
+	return (struct slqb_page *)p;
+}
+
+static inline void __free_slqb_pages(struct slqb_page *page, unsigned int order)
+{
+	struct page *p = &page->page;
+
+	reset_page_mapcount(p);
+	p->mapping = NULL;
+	VM_BUG_ON(!(p->flags & PG_SLQB_BIT));
+	p->flags &= ~PG_SLQB_BIT;
+
+	__free_pages(p, order);
+}
+
+#ifdef CONFIG_SLQB_DEBUG
+static inline int slab_debug(struct kmem_cache *s)
+{
+	return (s->flags &
+			(SLAB_DEBUG_FREE |
+			 SLAB_RED_ZONE |
+			 SLAB_POISON |
+			 SLAB_STORE_USER |
+			 SLAB_TRACE));
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+	return s->flags & SLAB_POISON;
+}
+#else
+static inline int slab_debug(struct kmem_cache *s)
+{
+	return 0;
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+	return 0;
+}
+#endif
+
+#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
+				SLAB_POISON | SLAB_STORE_USER)
+
+/* Internal SLQB flags */
+#define __OBJECT_POISON		0x80000000 /* Poison object */
+
+/* Not all arches define cache_line_size */
+#ifndef cache_line_size
+#define cache_line_size()	L1_CACHE_BYTES
+#endif
+
+#ifdef CONFIG_SMP
+static struct notifier_block slab_notifier;
+#endif
+
+/* A list of all slab caches on the system */
+static DECLARE_RWSEM(slqb_lock);
+static LIST_HEAD(slab_caches);
+
+/*
+ * Tracking user of a slab.
+ */
+struct track {
+	void *addr;		/* Called from address */
+	int cpu;		/* Was running on cpu */
+	int pid;		/* Pid context */
+	unsigned long when;	/* When did the operation occur */
+};
+
+enum track_item { TRACK_ALLOC, TRACK_FREE };
+
+static struct kmem_cache kmem_cache_cache;
+
+#ifdef CONFIG_SLQB_SYSFS
+static int sysfs_slab_add(struct kmem_cache *s);
+static void sysfs_slab_remove(struct kmem_cache *s);
+#else
+static inline int sysfs_slab_add(struct kmem_cache *s)
+{
+	return 0;
+}
+static inline void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kmem_cache_free(&kmem_cache_cache, s);
+}
+#endif
+
+/********************************************************************
+ * 			Core slab cache functions
+ *******************************************************************/
+
+static int __slab_is_available __read_mostly;
+int slab_is_available(void)
+{
+	return __slab_is_available;
+}
+
+static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
+{
+#ifdef CONFIG_SMP
+	VM_BUG_ON(!s->cpu_slab[cpu]);
+	return s->cpu_slab[cpu];
+#else
+	return &s->cpu_slab;
+#endif
+}
+
+static inline int check_valid_pointer(struct kmem_cache *s,
+				struct slqb_page *page, const void *object)
+{
+	void *base;
+
+	base = slqb_page_address(page);
+	if (object < base || object >= base + s->objects * s->size ||
+		(object - base) % s->size) {
+		return 0;
+	}
+
+	return 1;
+}
+
+static inline void *get_freepointer(struct kmem_cache *s, void *object)
+{
+	return *(void **)(object + s->offset);
+}
+
+static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
+{
+	*(void **)(object + s->offset) = fp;
+}
+
+/* Loop over all objects in a slab */
+#define for_each_object(__p, __s, __addr) \
+	for (__p = (__addr); __p < (__addr) + (__s)->objects * (__s)->size;\
+			__p += (__s)->size)
+
+/* Scan freelist */
+#define for_each_free_object(__p, __s, __free) \
+	for (__p = (__free); (__p) != NULL; __p = get_freepointer((__s),\
+		__p))
+
+#ifdef CONFIG_SLQB_DEBUG
+/*
+ * Debug settings:
+ */
+#ifdef CONFIG_SLQB_DEBUG_ON
+static int slqb_debug __read_mostly = DEBUG_DEFAULT_FLAGS;
+#else
+static int slqb_debug __read_mostly;
+#endif
+
+static char *slqb_debug_slabs;
+
+/*
+ * Object debugging
+ */
+static void print_section(char *text, u8 *addr, unsigned int length)
+{
+	int i, offset;
+	int newline = 1;
+	char ascii[17];
+
+	ascii[16] = 0;
+
+	for (i = 0; i < length; i++) {
+		if (newline) {
+			printk(KERN_ERR "%8s 0x%p: ", text, addr + i);
+			newline = 0;
+		}
+		printk(KERN_CONT " %02x", addr[i]);
+		offset = i % 16;
+		ascii[offset] = isgraph(addr[i]) ? addr[i] : '.';
+		if (offset == 15) {
+			printk(KERN_CONT " %s\n", ascii);
+			newline = 1;
+		}
+	}
+	if (!newline) {
+		i %= 16;
+		while (i < 16) {
+			printk(KERN_CONT "   ");
+			ascii[i] = ' ';
+			i++;
+		}
+		printk(KERN_CONT " %s\n", ascii);
+	}
+}
+
+static struct track *get_track(struct kmem_cache *s, void *object,
+	enum track_item alloc)
+{
+	struct track *p;
+
+	if (s->offset)
+		p = object + s->offset + sizeof(void *);
+	else
+		p = object + s->inuse;
+
+	return p + alloc;
+}
+
+static void set_track(struct kmem_cache *s, void *object,
+				enum track_item alloc, void *addr)
+{
+	struct track *p;
+
+	if (s->offset)
+		p = object + s->offset + sizeof(void *);
+	else
+		p = object + s->inuse;
+
+	p += alloc;
+	if (addr) {
+		p->addr = addr;
+		p->cpu = raw_smp_processor_id();
+		p->pid = current ? current->pid : -1;
+		p->when = jiffies;
+	} else
+		memset(p, 0, sizeof(struct track));
+}
+
+static void init_tracking(struct kmem_cache *s, void *object)
+{
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
+	set_track(s, object, TRACK_FREE, NULL);
+	set_track(s, object, TRACK_ALLOC, NULL);
+}
+
+static void print_track(const char *s, struct track *t)
+{
+	if (!t->addr)
+		return;
+
+	printk(KERN_ERR "INFO: %s in ", s);
+	__print_symbol("%s", (unsigned long)t->addr);
+	printk(" age=%lu cpu=%u pid=%d\n", jiffies - t->when, t->cpu, t->pid);
+}
+
+static void print_tracking(struct kmem_cache *s, void *object)
+{
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
+	print_track("Allocated", get_track(s, object, TRACK_ALLOC));
+	print_track("Freed", get_track(s, object, TRACK_FREE));
+}
+
+static void print_page_info(struct slqb_page *page)
+{
+	printk(KERN_ERR "INFO: Slab 0x%p used=%u fp=0x%p flags=0x%04lx\n",
+		page, page->inuse, page->freelist, page->flags);
+}
+
+#define MAX_ERR_STR 100
+static void slab_bug(struct kmem_cache *s, char *fmt, ...)
+{
+	va_list args;
+	char buf[MAX_ERR_STR];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printk(KERN_ERR "========================================"
+			"=====================================\n");
+	printk(KERN_ERR "BUG %s: %s\n", s->name, buf);
+	printk(KERN_ERR "----------------------------------------"
+			"-------------------------------------\n\n");
+}
+
+static void slab_fix(struct kmem_cache *s, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printk(KERN_ERR "FIX %s: %s\n", s->name, buf);
+}
+
+static void print_trailer(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+	unsigned int off;	/* Offset of last byte */
+	u8 *addr = slqb_page_address(page);
+
+	print_tracking(s, p);
+
+	print_page_info(page);
+
+	printk(KERN_ERR "INFO: Object 0x%p @offset=%tu fp=0x%p\n\n",
+			p, p - addr, get_freepointer(s, p));
+
+	if (p > addr + 16)
+		print_section("Bytes b4", p - 16, 16);
+
+	print_section("Object", p, min(s->objsize, 128));
+
+	if (s->flags & SLAB_RED_ZONE)
+		print_section("Redzone", p + s->objsize, s->inuse - s->objsize);
+
+	if (s->offset)
+		off = s->offset + sizeof(void *);
+	else
+		off = s->inuse;
+
+	if (s->flags & SLAB_STORE_USER)
+		off += 2 * sizeof(struct track);
+
+	if (off != s->size) {
+		/* Beginning of the filler is the free pointer */
+		print_section("Padding", p + off, s->size - off);
+	}
+
+	dump_stack();
+}
+
+static void object_err(struct kmem_cache *s, struct slqb_page *page,
+			u8 *object, char *reason)
+{
+	slab_bug(s, reason);
+	print_trailer(s, page, object);
+}
+
+static void slab_err(struct kmem_cache *s, struct slqb_page *page,
+			char *fmt, ...)
+{
+	slab_bug(s, fmt);
+	print_page_info(page);
+	dump_stack();
+}
+
+static void init_object(struct kmem_cache *s, void *object, int active)
+{
+	u8 *p = object;
+
+	if (s->flags & __OBJECT_POISON) {
+		memset(p, POISON_FREE, s->objsize - 1);
+		p[s->objsize - 1] = POISON_END;
+	}
+
+	if (s->flags & SLAB_RED_ZONE) {
+		memset(p + s->objsize,
+			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
+			s->inuse - s->objsize);
+	}
+}
+
+static u8 *check_bytes(u8 *start, unsigned int value, unsigned int bytes)
+{
+	while (bytes) {
+		if (*start != (u8)value)
+			return start;
+		start++;
+		bytes--;
+	}
+	return NULL;
+}
+
+static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
+				void *from, void *to)
+{
+	slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
+	memset(from, data, to - from);
+}
+
+static int check_bytes_and_report(struct kmem_cache *s, struct slqb_page *page,
+			u8 *object, char *what,
+			u8 *start, unsigned int value, unsigned int bytes)
+{
+	u8 *fault;
+	u8 *end;
+
+	fault = check_bytes(start, value, bytes);
+	if (!fault)
+		return 1;
+
+	end = start + bytes;
+	while (end > fault && end[-1] == value)
+		end--;
+
+	slab_bug(s, "%s overwritten", what);
+	printk(KERN_ERR "INFO: 0x%p-0x%p. First byte 0x%x instead of 0x%x\n",
+					fault, end - 1, fault[0], value);
+	print_trailer(s, page, object);
+
+	restore_bytes(s, what, value, fault, end);
+	return 0;
+}
+
+/*
+ * Object layout:
+ *
+ * object address
+ * 	Bytes of the object to be managed.
+ * 	If the freepointer may overlay the object then the free
+ * 	pointer is the first word of the object.
+ *
+ * 	Poisoning uses 0x6b (POISON_FREE) and the last byte is
+ * 	0xa5 (POISON_END)
+ *
+ * object + s->objsize
+ * 	Padding to reach word boundary. This is also used for Redzoning.
+ * 	Padding is extended by another word if Redzoning is enabled and
+ * 	objsize == inuse.
+ *
+ * 	We fill with 0xbb (RED_INACTIVE) for inactive objects and with
+ * 	0xcc (RED_ACTIVE) for objects in use.
+ *
+ * object + s->inuse
+ * 	Meta data starts here.
+ *
+ * 	A. Free pointer (if we cannot overwrite object on free)
+ * 	B. Tracking data for SLAB_STORE_USER
+ * 	C. Padding to reach required alignment boundary or at minimum
+ * 		one word if debugging is on to be able to detect writes
+ * 		before the word boundary.
+ *
+ *	Padding is done using 0x5a (POISON_INUSE)
+ *
+ * object + s->size
+ * 	Nothing is used beyond s->size.
+ */
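+
+/*
+ * Worked example (illustrative, CONFIG_SLQB_DEBUG, 8-byte words): objsize = 20
+ * with SLAB_RED_ZONE and no ctor rounds up to 24, so s->inuse = 24 and bytes
+ * 20..23 are the redzone. The free pointer overlays the first word of a free
+ * object (s->offset = 0), and the red-zone padding word brings s->size to 32,
+ * assuming an 8-byte minimum alignment.
+ */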
+
+static int check_pad_bytes(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+	unsigned long off = s->inuse;	/* The end of info */
+
+	if (s->offset) {
+		/* Freepointer is placed after the object. */
+		off += sizeof(void *);
+	}
+
+	if (s->flags & SLAB_STORE_USER) {
+		/* We also have user information there */
+		off += 2 * sizeof(struct track);
+	}
+
+	if (s->size == off)
+		return 1;
+
+	return check_bytes_and_report(s, page, p, "Object padding",
+				p + off, POISON_INUSE, s->size - off);
+}
+
+static int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+{
+	u8 *start;
+	u8 *fault;
+	u8 *end;
+	int length;
+	int remainder;
+
+	if (!(s->flags & SLAB_POISON))
+		return 1;
+
+	start = slqb_page_address(page);
+	end = start + (PAGE_SIZE << s->order);
+	length = s->objects * s->size;
+	remainder = end - (start + length);
+	if (!remainder)
+		return 1;
+
+	fault = check_bytes(start + length, POISON_INUSE, remainder);
+	if (!fault)
+		return 1;
+
+	while (end > fault && end[-1] == POISON_INUSE)
+		end--;
+
+	slab_err(s, page, "Padding overwritten. 0x%p-0x%p", fault, end - 1);
+	print_section("Padding", start, length);
+
+	restore_bytes(s, "slab padding", POISON_INUSE, start, end);
+	return 0;
+}
+
+static int check_object(struct kmem_cache *s, struct slqb_page *page,
+					void *object, int active)
+{
+	u8 *p = object;
+	u8 *endobject = object + s->objsize;
+
+	if (s->flags & SLAB_RED_ZONE) {
+		unsigned int red =
+			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE;
+
+		if (!check_bytes_and_report(s, page, object, "Redzone",
+			endobject, red, s->inuse - s->objsize))
+			return 0;
+	} else {
+		if ((s->flags & SLAB_POISON) && s->objsize < s->inuse) {
+			check_bytes_and_report(s, page, p, "Alignment padding",
+				endobject, POISON_INUSE, s->inuse - s->objsize);
+		}
+	}
+
+	if (s->flags & SLAB_POISON) {
+		if (!active && (s->flags & __OBJECT_POISON)) {
+			if (!check_bytes_and_report(s, page, p, "Poison", p,
+					POISON_FREE, s->objsize - 1))
+				return 0;
+
+			if (!check_bytes_and_report(s, page, p, "Poison",
+					p + s->objsize - 1, POISON_END, 1))
+				return 0;
+		}
+
+		/*
+		 * check_pad_bytes cleans up on its own.
+		 */
+		check_pad_bytes(s, page, p);
+	}
+
+	return 1;
+}
+
+static int check_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	if (!(page->flags & PG_SLQB_BIT)) {
+		slab_err(s, page, "Not a valid slab page");
+		return 0;
+	}
+	if (page->inuse == 0) {
+		slab_err(s, page, "inuse before free / after alloc");
+		return 0;
+	}
+	if (page->inuse > s->objects) {
+		slab_err(s, page, "inuse %u > max %u",
+			page->inuse, s->objects);
+		return 0;
+	}
+	/* Slab_pad_check fixes things up after itself */
+	slab_pad_check(s, page);
+	return 1;
+}
+
+static void trace(struct kmem_cache *s, struct slqb_page *page,
+			void *object, int alloc)
+{
+	if (s->flags & SLAB_TRACE) {
+		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+			s->name,
+			alloc ? "alloc" : "free",
+			object, page->inuse,
+			page->freelist);
+
+		if (!alloc)
+			print_section("Object", (void *)object, s->objsize);
+
+		dump_stack();
+	}
+}
+
+static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
+				void *object)
+{
+	if (!slab_debug(s))
+		return;
+
+	if (!(s->flags & (SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON)))
+		return;
+
+	init_object(s, object, 0);
+	init_tracking(s, object);
+}
+
+static int alloc_debug_processing(struct kmem_cache *s,
+					void *object, void *addr)
+{
+	struct slqb_page *page;
+	page = virt_to_head_slqb_page(object);
+
+	if (!check_slab(s, page))
+		goto bad;
+
+	if (!check_valid_pointer(s, page, object)) {
+		object_err(s, page, object, "Freelist Pointer check fails");
+		goto bad;
+	}
+
+	if (object && !check_object(s, page, object, 0))
+		goto bad;
+
+	/* Success. Perform special debug activities for allocs */
+	if (s->flags & SLAB_STORE_USER)
+		set_track(s, object, TRACK_ALLOC, addr);
+	trace(s, page, object, 1);
+	init_object(s, object, 1);
+	return 1;
+
+bad:
+	return 0;
+}
+
+static int free_debug_processing(struct kmem_cache *s,
+					void *object, void *addr)
+{
+	struct slqb_page *page;
+	page = virt_to_head_slqb_page(object);
+
+	if (!check_slab(s, page))
+		goto fail;
+
+	if (!check_valid_pointer(s, page, object)) {
+		slab_err(s, page, "Invalid object pointer 0x%p", object);
+		goto fail;
+	}
+
+	if (!check_object(s, page, object, 1))
+		return 0;
+
+	/* Special debug activities for freeing objects */
+	if (s->flags & SLAB_STORE_USER)
+		set_track(s, object, TRACK_FREE, addr);
+	trace(s, page, object, 0);
+	init_object(s, object, 0);
+	return 1;
+
+fail:
+	slab_fix(s, "Object at 0x%p not freed", object);
+	return 0;
+}
+
+static int __init setup_slqb_debug(char *str)
+{
+	slqb_debug = DEBUG_DEFAULT_FLAGS;
+	if (*str++ != '=' || !*str) {
+		/*
+		 * No options specified. Switch on full debugging.
+		 */
+		goto out;
+	}
+
+	if (*str == ',') {
+		/*
+		 * No options but restriction on slabs. This means full
+		 * debugging for slabs matching a pattern.
+		 */
+		goto check_slabs;
+	}
+
+	slqb_debug = 0;
+	if (*str == '-') {
+		/*
+		 * Switch off all debugging measures.
+		 */
+		goto out;
+	}
+
+	/*
+	 * Determine which debug features should be switched on
+	 */
+	for (; *str && *str != ','; str++) {
+		switch (tolower(*str)) {
+		case 'f':
+			slqb_debug |= SLAB_DEBUG_FREE;
+			break;
+		case 'z':
+			slqb_debug |= SLAB_RED_ZONE;
+			break;
+		case 'p':
+			slqb_debug |= SLAB_POISON;
+			break;
+		case 'u':
+			slqb_debug |= SLAB_STORE_USER;
+			break;
+		case 't':
+			slqb_debug |= SLAB_TRACE;
+			break;
+		default:
+			printk(KERN_ERR "slqb_debug option '%c' "
+				"unknown. skipped\n", *str);
+		}
+	}
+
+check_slabs:
+	if (*str == ',')
+		slqb_debug_slabs = str + 1;
+out:
+	return 1;
+}
+
+__setup("slqb_debug", setup_slqb_debug);
+
+static unsigned long kmem_cache_flags(unsigned long objsize,
+				unsigned long flags, const char *name,
+				void (*ctor)(void *))
+{
+	/*
+	 * Enable debugging if selected on the kernel commandline.
+	 */
+	if (slqb_debug && (!slqb_debug_slabs ||
+	    strncmp(slqb_debug_slabs, name,
+		strlen(slqb_debug_slabs)) == 0))
+			flags |= slqb_debug;
+
+	return flags;
+}
+#else
+static inline void setup_object_debug(struct kmem_cache *s,
+			struct slqb_page *page, void *object)
+{
+}
+
+static inline int alloc_debug_processing(struct kmem_cache *s,
+			void *object, void *addr)
+{
+	return 0;
+}
+
+static inline int free_debug_processing(struct kmem_cache *s,
+			void *object, void *addr)
+{
+	return 0;
+}
+
+static inline int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+{
+	return 1;
+}
+
+static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
+			void *object, int active)
+{
+	return 1;
+}
+
+static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page)
+{
+}
+
+static inline unsigned long kmem_cache_flags(unsigned long objsize,
+	unsigned long flags, const char *name, void (*ctor)(void *))
+{
+	return flags;
+}
+
+static const int slqb_debug = 0;
+#endif
+
+/*
+ * allocate a new slab (return its corresponding struct slqb_page)
+ */
+static struct slqb_page *allocate_slab(struct kmem_cache *s,
+					gfp_t flags, int node)
+{
+	struct slqb_page *page;
+	int pages = 1 << s->order;
+
+	flags |= s->allocflags;
+
+	page = alloc_slqb_pages_node(node, flags, s->order);
+	if (!page)
+		return NULL;
+
+	mod_zone_page_state(slqb_page_zone(page),
+		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+		pages);
+
+	return page;
+}
+
+/*
+ * Called once for each object on a new slab page
+ */
+static void setup_object(struct kmem_cache *s,
+				struct slqb_page *page, void *object)
+{
+	setup_object_debug(s, page, object);
+	if (unlikely(s->ctor))
+		s->ctor(object);
+}
+
+/*
+ * Allocate a new slab, set up its object list.
+ */
+static struct slqb_page *new_slab_page(struct kmem_cache *s,
+				gfp_t flags, int node, unsigned int colour)
+{
+	struct slqb_page *page;
+	void *start;
+	void *last;
+	void *p;
+
+	BUG_ON(flags & GFP_SLAB_BUG_MASK);
+
+	page = allocate_slab(s,
+		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
+	if (!page)
+		goto out;
+
+	page->flags |= PG_SLQB_BIT;
+
+	start = page_address(&page->page);
+
+	if (unlikely(slab_poison(s)))
+		memset(start, POISON_INUSE, PAGE_SIZE << s->order);
+
+	start += colour;
+
+	last = start;
+	for_each_object(p, s, start) {
+		setup_object(s, page, p);
+		set_freepointer(s, last, p);
+		last = p;
+	}
+	set_freepointer(s, last, NULL);
+
+	page->freelist = start;
+	page->inuse = 0;
+out:
+	return page;
+}
+
+/*
+ * Free a slab page back to the page allocator
+ */
+static void __free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	int pages = 1 << s->order;
+
+	if (unlikely(slab_debug(s))) {
+		void *p;
+
+		slab_pad_check(s, page);
+		for_each_free_object(p, s, page->freelist)
+			check_object(s, page, p, 0);
+	}
+
+	mod_zone_page_state(slqb_page_zone(page),
+		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+		-pages);
+
+	__free_slqb_pages(page, s->order);
+}
+
+static void rcu_free_slab(struct rcu_head *h)
+{
+	struct slqb_page *page;
+
+	page = container_of(h, struct slqb_page, rcu_head);
+	__free_slab(page->list->cache, page);
+}
+
+static void free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	VM_BUG_ON(page->inuse);
+	if (unlikely(s->flags & SLAB_DESTROY_BY_RCU))
+		call_rcu(&page->rcu_head, rcu_free_slab);
+	else
+		__free_slab(s, page);
+}
+
+/*
+ * Return an object to its slab.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static int free_object_to_page(struct kmem_cache *s,
+			struct kmem_cache_list *l, struct slqb_page *page,
+			void *object)
+{
+	VM_BUG_ON(page->list != l);
+
+	set_freepointer(s, object, page->freelist);
+	page->freelist = object;
+	page->inuse--;
+
+	if (!page->inuse) {
+		if (likely(s->objects > 1)) {
+			l->nr_partial--;
+			list_del(&page->lru);
+		}
+		l->nr_slabs--;
+		free_slab(s, page);
+		slqb_stat_inc(l, FLUSH_SLAB_FREE);
+		return 1;
+
+	} else if (page->inuse + 1 == s->objects) {
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, FLUSH_SLAB_PARTIAL);
+		return 0;
+	}
+	return 0;
+}
+
+#ifdef CONFIG_SMP
+static void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page,
+				void *object, struct kmem_cache_cpu *c);
+#endif
+
+/*
+ * Flush the LIFO list of objects on a list. They are sent back to their pages
+ * in case the pages also belong to the list, or to our CPU's remote-free list
+ * in the case they do not.
+ *
+ * Doesn't flush the entire list. flush_free_list_all does.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void flush_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	struct kmem_cache_cpu *c;
+	void **head;
+	int nr;
+
+	nr = l->freelist.nr;
+	if (unlikely(!nr))
+		return;
+
+	nr = min(slab_freebatch(s), nr);
+
+	slqb_stat_inc(l, FLUSH_FREE_LIST);
+	slqb_stat_add(l, FLUSH_FREE_LIST_OBJECTS, nr);
+
+	c = get_cpu_slab(s, smp_processor_id());
+
+	l->freelist.nr -= nr;
+	head = l->freelist.head;
+
+	do {
+		struct slqb_page *page;
+		void **object;
+
+		object = head;
+		VM_BUG_ON(!object);
+		head = get_freepointer(s, object);
+		page = virt_to_head_slqb_page(object);
+
+#ifdef CONFIG_SMP
+		if (page->list != l) {
+			slab_free_to_remote(s, page, object, c);
+			slqb_stat_inc(l, FLUSH_FREE_LIST_REMOTE);
+		} else
+#endif
+			free_object_to_page(s, l, page, object);
+
+		nr--;
+	} while (nr);
+
+	l->freelist.head = head;
+	if (!l->freelist.nr)
+		l->freelist.tail = NULL;
+}
+
+static void flush_free_list_all(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	while (l->freelist.nr)
+		flush_free_list(s, l);
+}
+
+#ifdef CONFIG_SMP
+/*
+ * If enough objects have been remotely freed back to this list,
+ * remote_free_check will be set, in which case we'll eventually come here
+ * to take those objects off our remote_free list and onto our LIFO freelist.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void claim_remote_free_list(struct kmem_cache *s,
+					struct kmem_cache_list *l)
+{
+	void **head, **tail;
+	int nr;
+
+	VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
+
+	if (!l->remote_free.list.nr)
+		return;
+
+	spin_lock(&l->remote_free.lock);
+
+	l->remote_free_check = 0;
+	head = l->remote_free.list.head;
+	l->remote_free.list.head = NULL;
+	tail = l->remote_free.list.tail;
+	l->remote_free.list.tail = NULL;
+	nr = l->remote_free.list.nr;
+	l->remote_free.list.nr = 0;
+
+	spin_unlock(&l->remote_free.lock);
+
+	VM_BUG_ON(!nr);
+
+	if (!l->freelist.nr) {
+		/* Get head hot for likely subsequent allocation or flush */
+		prefetchw(head);
+		l->freelist.head = head;
+	} else
+		set_freepointer(s, l->freelist.tail, head);
+	l->freelist.tail = tail;
+
+	l->freelist.nr += nr;
+
+	slqb_stat_inc(l, CLAIM_REMOTE_LIST);
+	slqb_stat_add(l, CLAIM_REMOTE_LIST_OBJECTS, nr);
+}
+#endif
+
+/*
+ * Allocation fastpath. Get an object from the list's LIFO freelist, or
+ * return NULL if it is empty.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static __always_inline void *__cache_list_get_object(struct kmem_cache *s,
+						struct kmem_cache_list *l)
+{
+	void *object;
+
+	object = l->freelist.head;
+	if (likely(object)) {
+		void *next = get_freepointer(s, object);
+
+		VM_BUG_ON(!l->freelist.nr);
+		l->freelist.nr--;
+		l->freelist.head = next;
+
+		return object;
+	}
+	VM_BUG_ON(l->freelist.nr);
+
+#ifdef CONFIG_SMP
+	if (unlikely(l->remote_free_check)) {
+		claim_remote_free_list(s, l);
+
+		if (l->freelist.nr > slab_hiwater(s))
+			flush_free_list(s, l);
+
+		/* repetition here helps gcc :( */
+		object = l->freelist.head;
+		if (likely(object)) {
+			void *next = get_freepointer(s, object);
+
+			VM_BUG_ON(!l->freelist.nr);
+			l->freelist.nr--;
+			l->freelist.head = next;
+
+			return object;
+		}
+		VM_BUG_ON(l->freelist.nr);
+	}
+#endif
+
+	return NULL;
+}
+
+/*
+ * Slow(er) path. Get a page from this list's existing pages. Will be a
+ * new empty page in the case that __slab_alloc_page has just been called
+ * (empty pages otherwise never get queued up on the lists), or a partial page
+ * already on the list.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static noinline void *__cache_list_get_page(struct kmem_cache *s,
+				struct kmem_cache_list *l)
+{
+	struct slqb_page *page;
+	void *object;
+
+	if (unlikely(!l->nr_partial))
+		return NULL;
+
+	page = list_first_entry(&l->partial, struct slqb_page, lru);
+	VM_BUG_ON(page->inuse == s->objects);
+	if (page->inuse + 1 == s->objects) {
+		l->nr_partial--;
+		list_del(&page->lru);
+	}
+
+	VM_BUG_ON(!page->freelist);
+
+	page->inuse++;
+
+	object = page->freelist;
+	page->freelist = get_freepointer(s, object);
+	if (page->freelist)
+		prefetchw(page->freelist);
+	VM_BUG_ON((page->inuse == s->objects) != (page->freelist == NULL));
+	slqb_stat_inc(l, ALLOC_SLAB_FILL);
+
+	return object;
+}
+
+/*
+ * Allocation slowpath. Allocate a new slab page from the page allocator, and
+ * put it on the list's partial list. Must be followed by an allocation so
+ * that we don't have dangling empty pages on the partial list.
+ *
+ * Returns 0 on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void *__slab_alloc_page(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	struct slqb_page *page;
+	struct kmem_cache_list *l;
+	struct kmem_cache_cpu *c;
+	unsigned int colour;
+	void *object;
+
+	c = get_cpu_slab(s, smp_processor_id());
+	colour = c->colour_next;
+	c->colour_next += s->colour_off;
+	if (c->colour_next >= s->colour_range)
+		c->colour_next = 0;
+
+	/* XXX: load any partial? */
+
+	/* Caller handles __GFP_ZERO */
+	gfpflags &= ~__GFP_ZERO;
+
+	if (gfpflags & __GFP_WAIT)
+		local_irq_enable();
+	page = new_slab_page(s, gfpflags, node, colour);
+	if (gfpflags & __GFP_WAIT)
+		local_irq_disable();
+	if (unlikely(!page))
+		return NULL;
+
+	if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == numa_node_id())) {
+		struct kmem_cache_cpu *c;
+		int cpu = smp_processor_id();
+
+		c = get_cpu_slab(s, cpu);
+		l = &c->list;
+		page->list = l;
+
+		l->nr_slabs++;
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, ALLOC);
+		slqb_stat_inc(l, ALLOC_SLAB_NEW);
+		object = __cache_list_get_page(s, l);
+	} else {
+#ifdef CONFIG_NUMA
+		struct kmem_cache_node *n;
+
+		n = s->node[slqb_page_to_nid(page)];
+		l = &n->list;
+		page->list = l;
+
+		spin_lock(&n->list_lock);
+		l->nr_slabs++;
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, ALLOC);
+		slqb_stat_inc(l, ALLOC_SLAB_NEW);
+		object = __cache_list_get_page(s, l);
+		spin_unlock(&n->list_lock);
+#endif
+	}
+	VM_BUG_ON(!object);
+	return object;
+}
+
+#ifdef CONFIG_NUMA
+static noinline int alternate_nid(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	if (in_interrupt() || (gfpflags & __GFP_THISNODE))
+		return node;
+	if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD))
+		return cpuset_mem_spread_node();
+	else if (current->mempolicy)
+		return slab_node(current->mempolicy);
+	return node;
+}
+
+/*
+ * Allocate an object from a remote node. Return NULL if none could be found
+ * (in which case, caller should allocate a new slab)
+ *
+ * Must be called with interrupts disabled.
+ */
+static void *__remote_slab_alloc_node(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	struct kmem_cache_node *n;
+	struct kmem_cache_list *l;
+	void *object;
+
+	n = s->node[node];
+	if (unlikely(!n)) /* node has no memory */
+		return NULL;
+	l = &n->list;
+
+	spin_lock(&n->list_lock);
+
+	object = __cache_list_get_object(s, l);
+	if (unlikely(!object)) {
+		object = __cache_list_get_page(s, l);
+		if (unlikely(!object)) {
+			spin_unlock(&n->list_lock);
+			return __slab_alloc_page(s, gfpflags, node);
+		}
+	}
+	if (likely(object))
+		slqb_stat_inc(l, ALLOC);
+	spin_unlock(&n->list_lock);
+	return object;
+}
+
+static noinline void *__remote_slab_alloc(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	void *object;
+	struct zonelist *zonelist;
+	struct zoneref *z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(gfpflags);
+
+	object = __remote_slab_alloc_node(s, gfpflags, node);
+	if (likely(object || (gfpflags & __GFP_THISNODE)))
+		return object;
+
+	zonelist = node_zonelist(slab_node(current->mempolicy), gfpflags);
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		if (!cpuset_zone_allowed_hardwall(zone, gfpflags))
+			continue;
+
+		node = zone_to_nid(zone);
+		object = __remote_slab_alloc_node(s, gfpflags, node);
+		if (likely(object))
+			return object;
+	}
+	return NULL;
+}
+#endif
+
+/*
+ * Main allocation path. Return an object, or NULL on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void *__slab_alloc(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	void *object;
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_list *l;
+
+#ifdef CONFIG_NUMA
+	if (unlikely(node != -1) && unlikely(node != numa_node_id())) {
+try_remote:
+		return __remote_slab_alloc(s, gfpflags, node);
+	}
+#endif
+
+	c = get_cpu_slab(s, smp_processor_id());
+	VM_BUG_ON(!c);
+	l = &c->list;
+	object = __cache_list_get_object(s, l);
+	if (unlikely(!object)) {
+		object = __cache_list_get_page(s, l);
+		if (unlikely(!object)) {
+			object = __slab_alloc_page(s, gfpflags, node);
+#ifdef CONFIG_NUMA
+			if (unlikely(!object))
+				goto try_remote;
+#endif
+			return object;
+		}
+	}
+	if (likely(object))
+		slqb_stat_inc(l, ALLOC);
+	return object;
+}
+
+/*
+ * Perform some interrupts-on processing around the main allocation path
+ * (debug checking and memset()ing).
+ */
+static __always_inline void *slab_alloc(struct kmem_cache *s,
+				gfp_t gfpflags, int node, void *addr)
+{
+	void *object;
+	unsigned long flags;
+
+again:
+	local_irq_save(flags);
+	object = __slab_alloc(s, gfpflags, node);
+	local_irq_restore(flags);
+
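+	/*
+	 * If the debug checks reject the object, it is dropped (effectively
+	 * leaked from the allocator's point of view) and the allocation is
+	 * retried.
+	 */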
+	if (unlikely(slab_debug(s)) && likely(object)) {
+		if (unlikely(!alloc_debug_processing(s, object, addr)))
+			goto again;
+	}
+
+	if (unlikely(gfpflags & __GFP_ZERO) && likely(object))
+		memset(object, 0, s->objsize);
+
+	return object;
+}
+
+static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s,
+				gfp_t gfpflags, void *caller)
+{
+	int node = -1;
+#ifdef CONFIG_NUMA
+	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+		node = alternate_nid(s, gfpflags, node);
+#endif
+	return slab_alloc(s, gfpflags, node, caller);
+}
+
+void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+{
+	return __kmem_cache_alloc(s, gfpflags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc);
+
+#ifdef CONFIG_NUMA
+void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+	return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node);
+#endif
+
+#ifdef CONFIG_SMP
+/*
+ * Flush this CPU's remote free list of objects back to the list from where
+ * they originate. They end up on that list's remotely freed list, and
+ * eventually we set its remote_free_check if there are enough objects on it.
+ *
+ * This seems convoluted, but it keeps us from stomping on the target CPU's
+ * fastpath cachelines.
+ *
+ * Must be called with interrupts disabled.
+ */
+static void flush_remote_free_cache(struct kmem_cache *s,
+				struct kmem_cache_cpu *c)
+{
+	struct kmlist *src;
+	struct kmem_cache_list *dst;
+	unsigned int nr;
+	int set;
+
+	src = &c->rlist;
+	nr = src->nr;
+	if (unlikely(!nr))
+		return;
+
+#ifdef CONFIG_SLQB_STATS
+	{
+		struct kmem_cache_list *l = &c->list;
+
+		slqb_stat_inc(l, FLUSH_RFREE_LIST);
+		slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);
+	}
+#endif
+
+	dst = c->remote_cache_list;
+
+	spin_lock(&dst->remote_free.lock);
+
+	if (!dst->remote_free.list.head)
+		dst->remote_free.list.head = src->head;
+	else
+		set_freepointer(s, dst->remote_free.list.tail, src->head);
+	dst->remote_free.list.tail = src->tail;
+
+	src->head = NULL;
+	src->tail = NULL;
+	src->nr = 0;
+
+	if (dst->remote_free.list.nr < slab_freebatch(s))
+		set = 1;
+	else
+		set = 0;
+
+	dst->remote_free.list.nr += nr;
+
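+	/*
+	 * Only nudge the owner CPU (by setting remote_free_check) on the
+	 * transition across the freebatch watermark, not on every flush.
+	 */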
+	if (unlikely(dst->remote_free.list.nr >= slab_freebatch(s) && set))
+		dst->remote_free_check = 1;
+
+	spin_unlock(&dst->remote_free.lock);
+}
+
+/*
+ * Free an object to this CPU's remote free list.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void slab_free_to_remote(struct kmem_cache *s,
+				struct slqb_page *page, void *object,
+				struct kmem_cache_cpu *c)
+{
+	struct kmlist *r;
+
+	/*
+	 * Our remote free list corresponds to a different list. Must
+	 * flush it and switch.
+	 */
+	if (page->list != c->remote_cache_list) {
+		flush_remote_free_cache(s, c);
+		c->remote_cache_list = page->list;
+	}
+
+	r = &c->rlist;
+	if (!r->head)
+		r->head = object;
+	else
+		set_freepointer(s, r->tail, object);
+	set_freepointer(s, object, NULL);
+	r->tail = object;
+	r->nr++;
+
+	if (unlikely(r->nr > slab_freebatch(s)))
+		flush_remote_free_cache(s, c);
+}
+#endif
+
+/*
+ * Main freeing path. Returns the object to this CPU's freelist, or queues it
+ * for remote freeing if its slab belongs to another node.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void __slab_free(struct kmem_cache *s,
+				struct slqb_page *page, void *object)
+{
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_list *l;
+	int thiscpu = smp_processor_id();
+
+	c = get_cpu_slab(s, thiscpu);
+	l = &c->list;
+
+	slqb_stat_inc(l, FREE);
+
+	if (!NUMA_BUILD || !numa_platform ||
+			likely(slqb_page_to_nid(page) == numa_node_id())) {
+		/*
+		 * Freeing fastpath. Collects all local-node objects, not
+		 * just those allocated from our per-CPU list. This allows
+		 * fast transfer of objects from one CPU to another within
+		 * a given node.
+		 */
+		set_freepointer(s, object, l->freelist.head);
+		l->freelist.head = object;
+		if (!l->freelist.nr)
+			l->freelist.tail = object;
+		l->freelist.nr++;
+
+		if (unlikely(l->freelist.nr > slab_hiwater(s)))
+			flush_free_list(s, l);
+
+	} else {
+#ifdef CONFIG_NUMA
+		/*
+		 * Freeing an object that was allocated on a remote node.
+		 */
+		slab_free_to_remote(s, page, object, c);
+		slqb_stat_inc(l, FREE_REMOTE);
+#endif
+	}
+}
+
+/*
+ * Perform some interrupts-on processing around the main freeing path
+ * (debug checking).
+ */
+static __always_inline void slab_free(struct kmem_cache *s,
+				struct slqb_page *page, void *object)
+{
+	unsigned long flags;
+
+	prefetchw(object);
+
+	debug_check_no_locks_freed(object, s->objsize);
+	if (likely(object) && unlikely(slab_debug(s))) {
+		if (unlikely(!free_debug_processing(s, object, __builtin_return_address(0))))
+			return;
+	}
+
+	local_irq_save(flags);
+	__slab_free(s, page, object);
+	local_irq_restore(flags);
+}
+
+void kmem_cache_free(struct kmem_cache *s, void *object)
+{
+	struct slqb_page *page = NULL;
+
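+	/*
+	 * On non-NUMA platforms __slab_free() never looks at the page, so the
+	 * virt_to_head_slqb_page() lookup can be skipped.
+	 */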
+	if (numa_platform)
+		page = virt_to_head_slqb_page(object);
+	slab_free(s, page, object);
+}
+EXPORT_SYMBOL(kmem_cache_free);
+
+/*
+ * Calculate the allocation order for a given slab object size.
+ *
+ * Order-0 allocations are preferred, since order 0 does not cause
+ * fragmentation in the page allocator and has fastpaths there. Higher orders
+ * are used only when needed to limit the space wasted by large objects.
+ */
+static int slab_order(int size, int max_order, int frac)
+{
+	int order;
+
+	if (fls(size - 1) <= PAGE_SHIFT)
+		order = 0;
+	else
+		order = fls(size - 1) - PAGE_SHIFT;
+
+	while (order <= max_order) {
+		unsigned long slab_size = PAGE_SIZE << order;
+		unsigned long objects;
+		unsigned long waste;
+
+		objects = slab_size / size;
+		if (!objects) {
+			order++;
+			continue;
+		}
+
+		waste = slab_size - (objects * size);
+
+		if (waste * frac <= slab_size)
+			break;
+
+		order++;
+	}
+
+	return order;
+}
+
+static int calculate_order(int size)
+{
+	int order;
+
+	/*
+	 * Attempt to find best configuration for a slab. This
+	 * works by first attempting to generate a layout with
+	 * the best configuration and backing off gradually.
+	 */
+	order = slab_order(size, 1, 4);
+	if (order <= 1)
+		return order;
+
+	/*
+	 * This size cannot fit in order-1. Allow bigger orders, but
+	 * forget about trying to save space.
+	 */
+	order = slab_order(size, MAX_ORDER, 0);
+	if (order <= MAX_ORDER)
+		return order;
+
+	return -ENOSYS;
+}
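+
+/*
+ * Worked example (illustrative, 4K pages): for a 700-byte object,
+ * slab_order(700, 1, 4) starts at order 0, fits 5 objects with 596 bytes of
+ * waste, and since 596 * 4 <= 4096 an order-0 slab is accepted.
+ */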
+
+/*
+ * Figure out what the alignment of the objects will be.
+ */
+static unsigned long calculate_alignment(unsigned long flags,
+				unsigned long align, unsigned long size)
+{
+	/*
+	 * If the user wants hardware cache aligned objects then follow that
+	 * suggestion if the object is sufficiently large.
+	 *
+	 * The hardware cache alignment cannot override the specified
+	 * alignment, though. If that is greater, then use it.
+	 */
+	if (flags & SLAB_HWCACHE_ALIGN) {
+		unsigned long ralign = cache_line_size();
+
+		while (size <= ralign / 2)
+			ralign /= 2;
+		align = max(align, ralign);
+	}
+
+	if (align < ARCH_SLAB_MINALIGN)
+		align = ARCH_SLAB_MINALIGN;
+
+	return ALIGN(align, sizeof(void *));
+}
+
+static void init_kmem_cache_list(struct kmem_cache *s,
+				struct kmem_cache_list *l)
+{
+	l->cache		= s;
+	l->freelist.nr		= 0;
+	l->freelist.head	= NULL;
+	l->freelist.tail	= NULL;
+	l->nr_partial		= 0;
+	l->nr_slabs		= 0;
+	INIT_LIST_HEAD(&l->partial);
+
+#ifdef CONFIG_SMP
+	l->remote_free_check	= 0;
+	spin_lock_init(&l->remote_free.lock);
+	l->remote_free.list.nr	= 0;
+	l->remote_free.list.head = NULL;
+	l->remote_free.list.tail = NULL;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+	memset(l->stats, 0, sizeof(l->stats));
+#endif
+}
+
+static void init_kmem_cache_cpu(struct kmem_cache *s,
+				struct kmem_cache_cpu *c)
+{
+	init_kmem_cache_list(s, &c->list);
+
+	c->colour_next		= 0;
+#ifdef CONFIG_SMP
+	c->rlist.nr		= 0;
+	c->rlist.head		= NULL;
+	c->rlist.tail		= NULL;
+	c->remote_cache_list	= NULL;
+#endif
+}
+
+#ifdef CONFIG_NUMA
+static void init_kmem_cache_node(struct kmem_cache *s,
+				struct kmem_cache_node *n)
+{
+	spin_lock_init(&n->list_lock);
+	init_kmem_cache_list(s, &n->list);
+}
+#endif
+
+/* Initial slabs */
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
+#endif
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache kmem_cpu_cache;
+static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
+#endif
+#endif
+
+#ifdef CONFIG_NUMA
+static struct kmem_cache kmem_node_cache;
+static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
+static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
+				int cpu)
+{
+	struct kmem_cache_cpu *c;
+
+	c = kmem_cache_alloc_node(&kmem_cpu_cache, GFP_KERNEL, cpu_to_node(cpu));
+	if (!c)
+		return NULL;
+
+	init_kmem_cache_cpu(s, c);
+	return c;
+}
+
+static void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+
+		c = s->cpu_slab[cpu];
+		if (c) {
+			kmem_cache_free(&kmem_cpu_cache, c);
+			s->cpu_slab[cpu] = NULL;
+		}
+	}
+}
+
+static int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+
+		c = s->cpu_slab[cpu];
+		if (c)
+			continue;
+
+		c = alloc_kmem_cache_cpu(s, cpu);
+		if (!c) {
+			free_kmem_cache_cpus(s);
+			return 0;
+		}
+		s->cpu_slab[cpu] = c;
+	}
+	return 1;
+}
+
+#else
+static inline void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+	init_kmem_cache_cpu(s, &s->cpu_slab);
+	return 1;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = s->node[node];
+		if (n) {
+			kmem_cache_free(&kmem_node_cache, n);
+			s->node[node] = NULL;
+		}
+	}
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = kmem_cache_alloc_node(&kmem_node_cache, GFP_KERNEL, node);
+		if (!n) {
+			free_kmem_cache_nodes(s);
+			return 0;
+		}
+		init_kmem_cache_node(s, n);
+		s->node[node] = n;
+	}
+	return 1;
+}
+#else
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+	return 1;
+}
+#endif
+
+/*
+ * calculate_sizes() determines the order and the distribution of data within
+ * a slab object.
+ */
+static int calculate_sizes(struct kmem_cache *s)
+{
+	unsigned long flags = s->flags;
+	unsigned long size = s->objsize;
+	unsigned long align = s->align;
+
+	/*
+	 * Determine if we can poison the object itself. If the user of
+	 * the slab may touch the object after free or before allocation
+	 * then we should never poison the object itself.
+	 */
+	if (slab_poison(s) && !(flags & SLAB_DESTROY_BY_RCU) && !s->ctor)
+		s->flags |= __OBJECT_POISON;
+	else
+		s->flags &= ~__OBJECT_POISON;
+
+	/*
+	 * Round up object size to the next word boundary. We can only
+	 * place the free pointer at word boundaries and this determines
+	 * the possible location of the free pointer.
+	 */
+	size = ALIGN(size, sizeof(void *));
+
+#ifdef CONFIG_SLQB_DEBUG
+	/*
+	 * If we are Redzoning then check if there is some space between the
+	 * end of the object and the free pointer. If not then add an
+	 * additional word to have some bytes to store Redzone information.
+	 */
+	if ((flags & SLAB_RED_ZONE) && size == s->objsize)
+		size += sizeof(void *);
+#endif
+
+	/*
+	 * With that we have determined the number of bytes in actual use
+	 * by the object. This is the potential offset to the free pointer.
+	 */
+	s->inuse = size;
+
+	if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) || s->ctor)) {
+		/*
+		 * Relocate free pointer after the object if it is not
+		 * permitted to overwrite the first word of the object on
+		 * kmem_cache_free.
+		 *
+		 * This is the case if we do RCU, have a constructor or
+		 * destructor or are poisoning the objects.
+		 */
+		s->offset = size;
+		size += sizeof(void *);
+	}
+
+#ifdef CONFIG_SLQB_DEBUG
+	if (flags & SLAB_STORE_USER) {
+		/*
+		 * Need to store information about allocs and frees after
+		 * the object.
+		 */
+		size += 2 * sizeof(struct track);
+	}
+
+	if (flags & SLAB_RED_ZONE) {
+		/*
+		 * Add some empty padding so that we can catch
+		 * overwrites from earlier objects rather than let
+		 * tracking information or the free pointer be
+		 * corrupted if a user writes before the start
+		 * of the object.
+		 */
+		size += sizeof(void *);
+	}
+#endif
+
+	/*
+	 * Determine the alignment based on various parameters that the
+	 * user specified and the dynamic determination of cache line size
+	 * on bootup.
+	 */
+	align = calculate_alignment(flags, align, s->objsize);
+
+	/*
+	 * SLQB stores one object immediately after another beginning from
+	 * offset 0. In order to align the objects we have to simply size
+	 * each object to conform to the alignment.
+	 */
+	size = ALIGN(size, align);
+	s->size = size;
+	s->order = calculate_order(size);
+
+	if (s->order < 0)
+		return 0;
+
+	s->allocflags = 0;
+	if (s->order)
+		s->allocflags |= __GFP_COMP;
+
+	if (s->flags & SLAB_CACHE_DMA)
+		s->allocflags |= SLQB_DMA;
+
+	if (s->flags & SLAB_RECLAIM_ACCOUNT)
+		s->allocflags |= __GFP_RECLAIMABLE;
+
+	/*
+	 * Determine the number of objects per slab
+	 */
+	s->objects = (PAGE_SIZE << s->order) / size;
+
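+	/*
+	 * freebatch is how many objects are moved per freelist flush and
+	 * hiwater (freebatch * 4) is the LIFO freelist size that triggers a
+	 * flush back to pages; both are sized relative to the object size.
+	 */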
+	s->freebatch = max(4UL*PAGE_SIZE / size,
+				min(256UL, 64*PAGE_SIZE / size));
+	if (!s->freebatch)
+		s->freebatch = 1;
+	s->hiwater = s->freebatch << 2;
+
+	return !!s->objects;
+}
+
+static int kmem_cache_open(struct kmem_cache *s,
+			const char *name, size_t size, size_t align,
+			unsigned long flags, void (*ctor)(void *), int alloc)
+{
+	unsigned int left_over;
+
+	memset(s, 0, kmem_size);
+	s->name = name;
+	s->ctor = ctor;
+	s->objsize = size;
+	s->align = align;
+	s->flags = kmem_cache_flags(size, flags, name, ctor);
+
+	if (!calculate_sizes(s))
+		goto error;
+
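+	/*
+	 * Slab colouring: space left over after packing s->objects objects
+	 * into the slab is handed out as a per-slab starting offset (see
+	 * colour_next in __slab_alloc_page), so objects in different slabs
+	 * don't all land on the same cache lines. Debug caches skip this.
+	 */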
+	if (!slab_debug(s)) {
+		left_over = (PAGE_SIZE << s->order) - (s->objects * s->size);
+		s->colour_off = max(cache_line_size(), s->align);
+		s->colour_range = left_over;
+	} else {
+		s->colour_off = 0;
+		s->colour_range = 0;
+	}
+
+	if (likely(alloc)) {
+		if (!alloc_kmem_cache_nodes(s))
+			goto error;
+
+		if (!alloc_kmem_cache_cpus(s))
+			goto error_nodes;
+	}
+
+	down_write(&slqb_lock);
+	sysfs_slab_add(s);
+	list_add(&s->list, &slab_caches);
+	up_write(&slqb_lock);
+
+	return 1;
+
+error_nodes:
+	free_kmem_cache_nodes(s);
+error:
+	if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+	return 0;
+}
+
+/*
+ * Check if a given pointer is valid
+ */
+int kmem_ptr_validate(struct kmem_cache *s, const void *object)
+{
+	struct slqb_page *page = virt_to_head_slqb_page(object);
+
+	if (!(page->flags & PG_SLQB_BIT))
+		return 0;
+
+	/*
+	 * We could also check if the object is on the slabs freelist.
+	 * But this would be too expensive and it seems that the main
+	 * purpose of kmem_ptr_validate is to check if the object belongs
+	 * to a certain slab.
+	 */
+	return 1;
+}
+EXPORT_SYMBOL(kmem_ptr_validate);
+
+/*
+ * Determine the size of a slab object
+ */
+unsigned int kmem_cache_size(struct kmem_cache *s)
+{
+	return s->objsize;
+}
+EXPORT_SYMBOL(kmem_cache_size);
+
+const char *kmem_cache_name(struct kmem_cache *s)
+{
+	return s->name;
+}
+EXPORT_SYMBOL(kmem_cache_name);
+
+/*
+ * Release all resources used by a slab cache. No more concurrency on the
+ * slab, so we can touch remote kmem_cache_cpu structures.
+ */
+void kmem_cache_destroy(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+	int cpu;
+
+	down_write(&slqb_lock);
+	list_del(&s->list);
+	up_write(&slqb_lock);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		flush_free_list_all(s, l);
+		flush_remote_free_cache(s, c);
+	}
+#endif
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+		claim_remote_free_list(s, l);
+#endif
+		flush_free_list_all(s, l);
+
+		WARN_ON(l->freelist.nr);
+		WARN_ON(l->nr_slabs);
+		WARN_ON(l->nr_partial);
+	}
+
+	free_kmem_cache_cpus(s);
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		claim_remote_free_list(s, l);
+		flush_free_list_all(s, l);
+
+		WARN_ON(l->freelist.nr);
+		WARN_ON(l->nr_slabs);
+		WARN_ON(l->nr_partial);
+	}
+
+	free_kmem_cache_nodes(s);
+#endif
+
+	sysfs_slab_remove(s);
+}
+EXPORT_SYMBOL(kmem_cache_destroy);
+
+/********************************************************************
+ *		Kmalloc subsystem
+ *******************************************************************/
+
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches);
+
+#ifdef CONFIG_ZONE_DMA
+struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches_dma);
+#endif
+
+#ifndef ARCH_KMALLOC_FLAGS
+#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
+#endif
+
+static struct kmem_cache *open_kmalloc_cache(struct kmem_cache *s,
+				const char *name, int size, gfp_t gfp_flags)
+{
+	unsigned int flags = ARCH_KMALLOC_FLAGS | SLAB_PANIC;
+
+	if (gfp_flags & SLQB_DMA)
+		flags |= SLAB_CACHE_DMA;
+
+	kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN, flags, NULL, 1);
+
+	return s;
+}
+
+/*
+ * Conversion table for small slabs sizes / 8 to the index in the
+ * kmalloc array. This is necessary for slabs < 192 since we have non power
+ * of two cache sizes there. The size of larger slabs can be determined using
+ * fls.
+ */
+static s8 size_index[24] __cacheline_aligned = {
+	3,	/* 8 */
+	4,	/* 16 */
+	5,	/* 24 */
+	5,	/* 32 */
+	6,	/* 40 */
+	6,	/* 48 */
+	6,	/* 56 */
+	6,	/* 64 */
+#if L1_CACHE_BYTES < 64
+	1,	/* 72 */
+	1,	/* 80 */
+	1,	/* 88 */
+	1,	/* 96 */
+#else
+	7,
+	7,
+	7,
+	7,
+#endif
+	7,	/* 104 */
+	7,	/* 112 */
+	7,	/* 120 */
+	7,	/* 128 */
+#if L1_CACHE_BYTES < 128
+	2,	/* 136 */
+	2,	/* 144 */
+	2,	/* 152 */
+	2,	/* 160 */
+	2,	/* 168 */
+	2,	/* 176 */
+	2,	/* 184 */
+	2	/* 192 */
+#else
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1
+#endif
+};
+
+static struct kmem_cache *get_slab(size_t size, gfp_t flags)
+{
+	int index;
+
+#if L1_CACHE_BYTES >= 128
+	if (size <= 128) {
+#else
+	if (size <= 192) {
+#endif
+		if (unlikely(!size))
+			return ZERO_SIZE_PTR;
+
+		index = size_index[(size - 1) / 8];
+	} else
+		index = fls(size - 1);
+
+	if (unlikely((flags & SLQB_DMA)))
+		return &kmalloc_caches_dma[index];
+	else
+		return &kmalloc_caches[index];
+}
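+
+/*
+ * Worked example (illustrative, assuming L1_CACHE_BYTES == 64):
+ * kmalloc(100, GFP_KERNEL) computes (100 - 1) / 8 == 12 and
+ * size_index[12] == 7, so the object comes from the kmalloc-128 cache;
+ * kmalloc(1000, GFP_KERNEL) takes the fls() path, fls(999) == 10, so
+ * the object comes from the kmalloc-1024 cache.
+ */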
+
+void *__kmalloc(size_t size, gfp_t flags)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return __kmem_cache_alloc(s, flags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(__kmalloc);
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return kmem_cache_alloc_node(s, flags, node);
+}
+EXPORT_SYMBOL(__kmalloc_node);
+#endif
+
+size_t ksize(const void *object)
+{
+	struct slqb_page *page;
+	struct kmem_cache *s;
+
+	BUG_ON(!object);
+	if (unlikely(object == ZERO_SIZE_PTR))
+		return 0;
+
+	page = virt_to_head_slqb_page(object);
+	BUG_ON(!(page->flags & PG_SLQB_BIT));
+
+	s = page->list->cache;
+
+	/*
+	 * Debugging requires use of the padding between object
+	 * and whatever may come after it.
+	 */
+	if (s->flags & (SLAB_RED_ZONE | SLAB_POISON))
+		return s->objsize;
+
+	/*
+	 * If we have the need to store the freelist pointer
+	 * back there or track user information then we can
+	 * only use the space before that information.
+	 */
+	if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
+		return s->inuse;
+
+	/*
+	 * Else we can use all the padding etc for the allocation
+	 */
+	return s->size;
+}
+EXPORT_SYMBOL(ksize);
+
+void kfree(const void *object)
+{
+	struct kmem_cache *s;
+	struct slqb_page *page;
+
+	if (unlikely(ZERO_OR_NULL_PTR(object)))
+		return;
+
+	page = virt_to_head_slqb_page(object);
+	s = page->list->cache;
+
+	slab_free(s, page, (void *)object);
+}
+EXPORT_SYMBOL(kfree);
+
+static void kmem_cache_trim_percpu(void *arg)
+{
+	int cpu = smp_processor_id();
+	struct kmem_cache *s = arg;
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+	claim_remote_free_list(s, l);
+#endif
+	flush_free_list(s, l);
+#ifdef CONFIG_SMP
+	flush_remote_free_cache(s, c);
+#endif
+}
+
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+
+	on_each_cpu(kmem_cache_trim_percpu, s, 1);
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		spin_lock_irq(&n->list_lock);
+		claim_remote_free_list(s, l);
+		flush_free_list(s, l);
+		spin_unlock_irq(&n->list_lock);
+	}
+#endif
+
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_shrink);
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void kmem_cache_reap_percpu(void *arg)
+{
+	int cpu = smp_processor_id();
+	struct kmem_cache *s;
+	long phase = (long)arg;
+
+	list_for_each_entry(s, &slab_caches, list) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		if (phase == 0) {
+			flush_free_list_all(s, l);
+			flush_remote_free_cache(s, c);
+		}
+
+		if (phase == 1) {
+			claim_remote_free_list(s, l);
+			flush_free_list_all(s, l);
+		}
+	}
+}
+
+static void kmem_cache_reap(void)
+{
+	struct kmem_cache *s;
+	int node;
+
+	down_read(&slqb_lock);
+	on_each_cpu(kmem_cache_reap_percpu, (void *)0, 1);
+	on_each_cpu(kmem_cache_reap_percpu, (void *)1, 1);
+
+	list_for_each_entry(s, &slab_caches, list) {
+		for_each_node_state(node, N_NORMAL_MEMORY) {
+			struct kmem_cache_node *n = s->node[node];
+			struct kmem_cache_list *l = &n->list;
+
+			spin_lock_irq(&n->list_lock);
+			claim_remote_free_list(s, l);
+			flush_free_list_all(s, l);
+			spin_unlock_irq(&n->list_lock);
+		}
+	}
+	up_read(&slqb_lock);
+}
+#endif
+
+static void cache_trim_worker(struct work_struct *w)
+{
+	struct delayed_work *work =
+		container_of(w, struct delayed_work, work);
+	struct kmem_cache *s;
+	int node;
+
+	if (!down_read_trylock(&slqb_lock))
+		goto out;
+
+	node = numa_node_id();
+	list_for_each_entry(s, &slab_caches, list) {
+#ifdef CONFIG_NUMA
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		spin_lock_irq(&n->list_lock);
+		claim_remote_free_list(s, l);
+		flush_free_list(s, l);
+		spin_unlock_irq(&n->list_lock);
+#endif
+
+		local_irq_disable();
+		kmem_cache_trim_percpu(s);
+		local_irq_enable();
+	}
+
+	up_read(&slqb_lock);
+out:
+	schedule_delayed_work(work, round_jiffies_relative(3*HZ));
+}
+
+static DEFINE_PER_CPU(struct delayed_work, cache_trim_work);
+
+static void __cpuinit start_cpu_timer(int cpu)
+{
+	struct delayed_work *cache_trim_work = &per_cpu(cache_trim_work, cpu);
+
+	/*
+	 * When this gets called from do_initcalls() via cpucache_init(),
+	 * init_workqueues() has already run, so keventd will have been
+	 * set up by then.
+	 */
+	if (keventd_up() && cache_trim_work->work.func == NULL) {
+		INIT_DELAYED_WORK(cache_trim_work, cache_trim_worker);
+		schedule_delayed_work_on(cpu, cache_trim_work,
+					__round_jiffies_relative(HZ, cpu));
+	}
+}
+
+static int __init cpucache_init(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		start_cpu_timer(cpu);
+
+	return 0;
+}
+device_initcall(cpucache_init);
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void slab_mem_going_offline_callback(void *arg)
+{
+	kmem_cache_reap();
+}
+
+static void slab_mem_offline_callback(void *arg)
+{
+	/* XXX: should release structures, see CPU offline comment */
+}
+
+static int slab_mem_going_online_callback(void *arg)
+{
+	struct kmem_cache *s;
+	struct kmem_cache_node *n;
+	struct memory_notify *marg = arg;
+	int nid = marg->status_change_nid;
+	int ret = 0;
+
+	/*
+	 * If the node's memory is already available, then kmem_cache_node is
+	 * already created. Nothing to do.
+	 */
+	if (nid < 0)
+		return 0;
+
+	/*
+	 * We are bringing a node online. No memory is available yet. We must
+	 * allocate a kmem_cache_node structure in order to bring the node
+	 * online.
+	 */
+	down_read(&slqb_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		/*
+		 * XXX: kmem_cache_alloc_node will fall back to other nodes
+		 *      since memory is not yet available from the node that
+		 *      is being brought up.
+		 */
+		if (s->node[nid]) /* could be leftover from last online */
+			continue;
+		n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
+		if (!n) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		init_kmem_cache_node(s, n);
+		s->node[nid] = n;
+	}
+out:
+	up_read(&slqb_lock);
+	return ret;
+}
+
+static int slab_memory_callback(struct notifier_block *self,
+				unsigned long action, void *arg)
+{
+	int ret = 0;
+
+	switch (action) {
+	case MEM_GOING_ONLINE:
+		ret = slab_mem_going_online_callback(arg);
+		break;
+	case MEM_GOING_OFFLINE:
+		slab_mem_going_offline_callback(arg);
+		break;
+	case MEM_OFFLINE:
+	case MEM_CANCEL_ONLINE:
+		slab_mem_offline_callback(arg);
+		break;
+	case MEM_ONLINE:
+	case MEM_CANCEL_OFFLINE:
+		break;
+	}
+
+	ret = notifier_from_errno(ret);
+	return ret;
+}
+
+#endif /* CONFIG_NUMA && CONFIG_MEMORY_HOTPLUG */
+
+/********************************************************************
+ *			Basic setup of slabs
+ *******************************************************************/
+
+void __init kmem_cache_init(void)
+{
+	int i;
+	unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
+
+	/*
+	 * All the ifdefs are rather ugly here, but it's just the setup code,
+	 * so it doesn't have to be too readable :)
+	 */
+#ifdef CONFIG_NUMA
+	if (num_possible_nodes() == 1)
+		numa_platform = 0;
+	else
+		numa_platform = 1;
+#endif
+
+#ifdef CONFIG_SMP
+	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
+				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#else
+	kmem_size = sizeof(struct kmem_cache);
+#endif
+
+	kmem_cache_open(&kmem_cache_cache, "kmem_cache",
+			kmem_size, 0, flags, NULL, 0);
+#ifdef CONFIG_SMP
+	kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu",
+			sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_cache_open(&kmem_node_cache, "kmem_cache_node",
+			sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
+#endif
+
+#ifdef CONFIG_SMP
+	for_each_possible_cpu(i) {
+		init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cpus[i]);
+		kmem_cache_cache.cpu_slab[i] = &kmem_cache_cpus[i];
+
+		init_kmem_cache_cpu(&kmem_cpu_cache, &kmem_cpu_cpus[i]);
+		kmem_cpu_cache.cpu_slab[i] = &kmem_cpu_cpus[i];
+
+#ifdef CONFIG_NUMA
+		init_kmem_cache_cpu(&kmem_node_cache, &kmem_node_cpus[i]);
+		kmem_node_cache.cpu_slab[i] = &kmem_node_cpus[i];
+#endif
+	}
+#else
+	init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cache.cpu_slab);
+#endif
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(i, N_NORMAL_MEMORY) {
+		init_kmem_cache_node(&kmem_cache_cache, &kmem_cache_nodes[i]);
+		kmem_cache_cache.node[i] = &kmem_cache_nodes[i];
+
+		init_kmem_cache_node(&kmem_cpu_cache, &kmem_cpu_nodes[i]);
+		kmem_cpu_cache.node[i] = &kmem_cpu_nodes[i];
+
+		init_kmem_cache_node(&kmem_node_cache, &kmem_node_nodes[i]);
+		kmem_node_cache.node[i] = &kmem_node_nodes[i];
+	}
+#endif
+
+	/* Caches that are not of power-of-two size */
+	if (L1_CACHE_BYTES < 64 && KMALLOC_MIN_SIZE <= 64) {
+		open_kmalloc_cache(&kmalloc_caches[1],
+				"kmalloc-96", 96, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[1],
+				"kmalloc_dma-96", 96, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+	if (L1_CACHE_BYTES < 128 && KMALLOC_MIN_SIZE <= 128) {
+		open_kmalloc_cache(&kmalloc_caches[2],
+				"kmalloc-192", 192, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[2],
+				"kmalloc_dma-192", 192, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+		open_kmalloc_cache(&kmalloc_caches[i],
+				"kmalloc", 1 << i, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[i],
+				"kmalloc_dma", 1 << i, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+
+	/*
+	 * Patch up the size_index table if we have strange large alignment
+	 * requirements for the kmalloc array. This seems to be the case only
+	 * for MIPS. The standard arches will not generate any code here.
+	 *
+	 * Largest permitted alignment is 256 bytes due to the way we
+	 * handle the index determination for the smaller caches.
+	 *
+	 * Make sure that nothing crazy happens if someone starts tinkering
+	 * around with ARCH_KMALLOC_MINALIGN
+	 */
+	BUILD_BUG_ON(KMALLOC_MIN_SIZE > 256 ||
+		(KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
+
+	for (i = 8; i < KMALLOC_MIN_SIZE; i += 8)
+		size_index[(i - 1) / 8] = KMALLOC_SHIFT_LOW;
+
+	/* Provide the correct kmalloc names now that the caches are up */
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+		kmalloc_caches[i].name =
+			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
+#ifdef CONFIG_ZONE_DMA
+		kmalloc_caches_dma[i].name =
+			kasprintf(GFP_KERNEL, "kmalloc_dma-%d", 1 << i);
+#endif
+	}
+
+#ifdef CONFIG_SMP
+	register_cpu_notifier(&slab_notifier);
+#endif
+#ifdef CONFIG_NUMA
+	hotplug_memory_notifier(slab_memory_callback, 1);
+#endif
+	/*
+	 * smp_init() has not yet been called, so no worries about memory
+	 * ordering here (eg. slab_is_available vs numa_platform)
+	 */
+	__slab_is_available = 1;
+}
+
+/*
+ * Some basic slab creation sanity checks
+ */
+static int kmem_cache_create_ok(const char *name, size_t size,
+		size_t align, unsigned long flags)
+{
+	struct kmem_cache *tmp;
+
+	/*
+	 * Sanity checks... these are all serious usage bugs.
+	 */
+	if (!name || in_interrupt() || (size < sizeof(void *))) {
+		printk(KERN_ERR "kmem_cache_create(): early error in slab %s\n",
+				name);
+		dump_stack();
+
+		return 0;
+	}
+
+	down_read(&slqb_lock);
+
+	list_for_each_entry(tmp, &slab_caches, list) {
+		char x;
+		int res;
+
+		/*
+		 * This happens when the module gets unloaded and doesn't
+		 * destroy its slab cache and no-one else reuses the vmalloc
+		 * area of the module.  Print a warning.
+		 */
+		res = probe_kernel_address(tmp->name, x);
+		if (res) {
+			printk(KERN_ERR
+			       "SLAB: cache with size %d has lost its name\n",
+			       tmp->size);
+			continue;
+		}
+
+		if (!strcmp(tmp->name, name)) {
+			printk(KERN_ERR
+			       "kmem_cache_create(): duplicate cache %s\n", name);
+			dump_stack();
+			up_read(&slqb_lock);
+
+			return 0;
+		}
+	}
+
+	up_read(&slqb_lock);
+
+	WARN_ON(strchr(name, ' '));	/* It confuses parsers */
+	if (flags & SLAB_DESTROY_BY_RCU)
+		WARN_ON(flags & SLAB_POISON);
+
+	return 1;
+}
+
+struct kmem_cache *kmem_cache_create(const char *name, size_t size,
+		size_t align, unsigned long flags, void (*ctor)(void *))
+{
+	struct kmem_cache *s;
+
+	if (!kmem_cache_create_ok(name, size, align, flags))
+		goto err;
+
+	s = kmem_cache_alloc(&kmem_cache_cache, GFP_KERNEL);
+	if (!s)
+		goto err;
+
+	if (kmem_cache_open(s, name, size, align, flags, ctor, 1))
+		return s;
+
+	kmem_cache_free(&kmem_cache_cache, s);
+
+err:
+	if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+
+	return NULL;
+}
+EXPORT_SYMBOL(kmem_cache_create);
+
+#ifdef CONFIG_SMP
+/*
+ * Use the cpu notifier to ensure that the cpu slabs are flushed when
+ * necessary.
+ */
+static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
+				unsigned long action, void *hcpu)
+{
+	long cpu = (long)hcpu;
+	struct kmem_cache *s;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		down_read(&slqb_lock);
+		list_for_each_entry(s, &slab_caches, list) {
+			if (s->cpu_slab[cpu]) /* could be leftover from last online */
+				continue;
+			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu);
+			if (!s->cpu_slab[cpu]) {
+				up_read(&slqb_lock);
+				return NOTIFY_BAD;
+			}
+		}
+		up_read(&slqb_lock);
+		break;
+
+	case CPU_ONLINE:
+	case CPU_ONLINE_FROZEN:
+	case CPU_DOWN_FAILED:
+	case CPU_DOWN_FAILED_FROZEN:
+		start_cpu_timer(cpu);
+		break;
+
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		cancel_rearming_delayed_work(&per_cpu(cache_trim_work, cpu));
+		per_cpu(cache_trim_work, cpu).work.func = NULL;
+		break;
+
+	case CPU_UP_CANCELED:
+	case CPU_UP_CANCELED_FROZEN:
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+		/*
+		 * XXX: Freeing here doesn't work because objects can still be
+		 * on this CPU's list. The periodic timer needs to check if a
+		 * CPU is offline and then try to clean up from there. Same
+		 * for node offline.
+		 */
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata slab_notifier = {
+	.notifier_call = slab_cpuup_callback
+};
+
+#endif
+
+#ifdef CONFIG_SLQB_DEBUG
+void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
+{
+	struct kmem_cache *s;
+	int node = -1;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+#ifdef CONFIG_NUMA
+	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+		node = alternate_nid(s, flags, node);
+#endif
+	return slab_alloc(s, flags, node, (void *)caller);
+}
+
+void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
+				unsigned long caller)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return slab_alloc(s, flags, node, (void *)caller);
+}
+#endif
+
+#if defined(CONFIG_SLQB_SYSFS) || defined(CONFIG_SLABINFO)
+struct stats_gather {
+	struct kmem_cache *s;
+	spinlock_t lock;
+	unsigned long nr_slabs;
+	unsigned long nr_partial;
+	unsigned long nr_inuse;
+	unsigned long nr_objects;
+
+#ifdef CONFIG_SLQB_STATS
+	unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+};
+
+static void __gather_stats(void *arg)
+{
+	unsigned long nr_slabs;
+	unsigned long nr_partial;
+	unsigned long nr_inuse;
+	struct stats_gather *gather = arg;
+	int cpu = smp_processor_id();
+	struct kmem_cache *s = gather->s;
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_list *l = &c->list;
+	struct slqb_page *page;
+#ifdef CONFIG_SLQB_STATS
+	int i;
+#endif
+
+	nr_slabs = l->nr_slabs;
+	nr_partial = l->nr_partial;
+	nr_inuse = (nr_slabs - nr_partial) * s->objects;
+
+	list_for_each_entry(page, &l->partial, lru) {
+		nr_inuse += page->inuse;
+	}
+
+	spin_lock(&gather->lock);
+	gather->nr_slabs += nr_slabs;
+	gather->nr_partial += nr_partial;
+	gather->nr_inuse += nr_inuse;
+#ifdef CONFIG_SLQB_STATS
+	for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
+		gather->stats[i] += l->stats[i];
+#endif
+	spin_unlock(&gather->lock);
+}
+
+static void gather_stats(struct kmem_cache *s, struct stats_gather *stats)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+
+	memset(stats, 0, sizeof(struct stats_gather));
+	stats->s = s;
+	spin_lock_init(&stats->lock);
+
+	on_each_cpu(__gather_stats, stats, 1);
+
+#ifdef CONFIG_NUMA
+	for_each_online_node(node) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+		struct slqb_page *page;
+		unsigned long flags;
+#ifdef CONFIG_SLQB_STATS
+		int i;
+#endif
+
+		spin_lock_irqsave(&n->list_lock, flags);
+#ifdef CONFIG_SLQB_STATS
+		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
+			stats->stats[i] += l->stats[i];
+#endif
+		stats->nr_slabs += l->nr_slabs;
+		stats->nr_partial += l->nr_partial;
+		stats->nr_inuse += (l->nr_slabs - l->nr_partial) * s->objects;
+
+		list_for_each_entry(page, &l->partial, lru) {
+			stats->nr_inuse += page->inuse;
+		}
+		spin_unlock_irqrestore(&n->list_lock, flags);
+	}
+#endif
+
+	stats->nr_objects = stats->nr_slabs * s->objects;
+}
+#endif
+
+/*
+ * The /proc/slabinfo ABI
+ */
+#ifdef CONFIG_SLABINFO
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+ssize_t slabinfo_write(struct file *file, const char __user * buffer,
+		       size_t count, loff_t *ppos)
+{
+	return -EINVAL;
+}
+
+static void print_slabinfo_header(struct seq_file *m)
+{
+	seq_puts(m, "slabinfo - version: 2.1\n");
+	seq_puts(m, "# name	    <active_objs> <num_objs> <objsize> "
+		 "<objperslab> <pagesperslab>");
+	seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
+	seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
+	seq_putc(m, '\n');
+}
+
+static void *s_start(struct seq_file *m, loff_t *pos)
+{
+	loff_t n = *pos;
+
+	down_read(&slqb_lock);
+	if (!n)
+		print_slabinfo_header(m);
+
+	return seq_list_start(&slab_caches, *pos);
+}
+
+static void *s_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	return seq_list_next(p, &slab_caches, pos);
+}
+
+static void s_stop(struct seq_file *m, void *p)
+{
+	up_read(&slqb_lock);
+}
+
+static int s_show(struct seq_file *m, void *p)
+{
+	struct stats_gather stats;
+	struct kmem_cache *s;
+
+	s = list_entry(p, struct kmem_cache, list);
+
+	gather_stats(s, &stats);
+
+	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, stats.nr_inuse,
+			stats.nr_objects, s->size, s->objects, (1 << s->order));
+	seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s),
+			slab_freebatch(s), 0);
+	seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs,
+			stats.nr_slabs, 0UL);
+	seq_putc(m, '\n');
+	return 0;
+}
+
+static const struct seq_operations slabinfo_op = {
+	.start = s_start,
+	.next = s_next,
+	.stop = s_stop,
+	.show = s_show,
+};
+
+static int slabinfo_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &slabinfo_op);
+}
+
+static const struct file_operations proc_slabinfo_operations = {
+	.open		= slabinfo_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static int __init slab_proc_init(void)
+{
+	proc_create("slabinfo", S_IWUSR|S_IRUGO, NULL,
+			&proc_slabinfo_operations);
+	return 0;
+}
+module_init(slab_proc_init);
+#endif /* CONFIG_SLABINFO */
+
+#ifdef CONFIG_SLQB_SYSFS
+/*
+ * sysfs API
+ */
+#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
+#define to_slab(n) container_of(n, struct kmem_cache, kobj)
+
+struct slab_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct kmem_cache *s, char *buf);
+	ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
+};
+
+#define SLAB_ATTR_RO(_name) \
+	static struct slab_attribute _name##_attr = __ATTR_RO(_name)
+
+#define SLAB_ATTR(_name) \
+	static struct slab_attribute _name##_attr =  \
+	__ATTR(_name, 0644, _name##_show, _name##_store)
+
+static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->size);
+}
+SLAB_ATTR_RO(slab_size);
+
+static ssize_t align_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->align);
+}
+SLAB_ATTR_RO(align);
+
+static ssize_t object_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->objsize);
+}
+SLAB_ATTR_RO(object_size);
+
+static ssize_t objs_per_slab_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->objects);
+}
+SLAB_ATTR_RO(objs_per_slab);
+
+static ssize_t order_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->order);
+}
+SLAB_ATTR_RO(order);
+
+static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+{
+	if (s->ctor) {
+		int n = sprint_symbol(buf, (unsigned long)s->ctor);
+
+		return n + sprintf(buf + n, "\n");
+	}
+	return 0;
+}
+SLAB_ATTR_RO(ctor);
+
+static ssize_t slabs_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+
+	gather_stats(s, &stats);
+
+	return sprintf(buf, "%lu\n", stats.nr_slabs);
+}
+SLAB_ATTR_RO(slabs);
+
+static ssize_t objects_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+
+	gather_stats(s, &stats);
+
+	return sprintf(buf, "%lu\n", stats.nr_inuse);
+}
+SLAB_ATTR_RO(objects);
+
+static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+
+	gather_stats(s, &stats);
+
+	return sprintf(buf, "%lu\n", stats.nr_objects);
+}
+SLAB_ATTR_RO(total_objects);
+
+static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
+}
+SLAB_ATTR_RO(reclaim_account);
+
+static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_HWCACHE_ALIGN));
+}
+SLAB_ATTR_RO(hwcache_align);
+
+#ifdef CONFIG_ZONE_DMA
+static ssize_t cache_dma_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA));
+}
+SLAB_ATTR_RO(cache_dma);
+#endif
+
+static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
+}
+SLAB_ATTR_RO(destroy_by_rcu);
+
+static ssize_t red_zone_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE));
+}
+SLAB_ATTR_RO(red_zone);
+
+static ssize_t poison_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_POISON));
+}
+SLAB_ATTR_RO(poison);
+
+static ssize_t store_user_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_STORE_USER));
+}
+SLAB_ATTR_RO(store_user);
+
+static ssize_t hiwater_store(struct kmem_cache *s,
+				const char *buf, size_t length)
+{
+	long hiwater;
+	int err;
+
+	err = strict_strtol(buf, 10, &hiwater);
+	if (err)
+		return err;
+
+	if (hiwater < 0)
+		return -EINVAL;
+
+	s->hiwater = hiwater;
+
+	return length;
+}
+
+static ssize_t hiwater_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", slab_hiwater(s));
+}
+SLAB_ATTR(hiwater);
+
+static ssize_t freebatch_store(struct kmem_cache *s,
+				const char *buf, size_t length)
+{
+	long freebatch;
+	int err;
+
+	err = strict_strtol(buf, 10, &freebatch);
+	if (err)
+		return err;
+
+	if (freebatch <= 0 || freebatch - 1 > s->hiwater)
+		return -EINVAL;
+
+	s->freebatch = freebatch;
+
+	return length;
+}
+
+static ssize_t freebatch_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", slab_freebatch(s));
+}
+SLAB_ATTR(freebatch);
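+
+/*
+ * Example (assuming CONFIG_SLQB_SYSFS, so caches appear under
+ * /sys/kernel/slab): the tunables can be adjusted at runtime with e.g.
+ *	echo 1024 > /sys/kernel/slab/kmalloc-128/hiwater
+ *	echo 64 > /sys/kernel/slab/kmalloc-128/freebatch
+ */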
+
+#ifdef CONFIG_SLQB_STATS
+static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
+{
+	struct stats_gather stats;
+	int len;
+#ifdef CONFIG_SMP
+	int cpu;
+#endif
+
+	gather_stats(s, &stats);
+
+	len = sprintf(buf, "%lu", stats.stats[si]);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		if (len < PAGE_SIZE - 20)
+			len += sprintf(buf+len, " C%d=%lu", cpu, l->stats[si]);
+	}
+#endif
+	return len + sprintf(buf + len, "\n");
+}
+
+#define STAT_ATTR(si, text) 					\
+static ssize_t text##_show(struct kmem_cache *s, char *buf)	\
+{								\
+	return show_stat(s, buf, si);				\
+}								\
+SLAB_ATTR_RO(text)
+
+STAT_ATTR(ALLOC, alloc);
+STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
+STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
+STAT_ATTR(FREE, free);
+STAT_ATTR(FREE_REMOTE, free_remote);
+STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
+STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
+STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
+STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
+STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
+STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
+STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
+STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
+STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
+#endif
+
+static struct attribute *slab_attrs[] = {
+	&slab_size_attr.attr,
+	&object_size_attr.attr,
+	&objs_per_slab_attr.attr,
+	&order_attr.attr,
+	&objects_attr.attr,
+	&total_objects_attr.attr,
+	&slabs_attr.attr,
+	&ctor_attr.attr,
+	&align_attr.attr,
+	&hwcache_align_attr.attr,
+	&reclaim_account_attr.attr,
+	&destroy_by_rcu_attr.attr,
+	&red_zone_attr.attr,
+	&poison_attr.attr,
+	&store_user_attr.attr,
+	&hiwater_attr.attr,
+	&freebatch_attr.attr,
+#ifdef CONFIG_ZONE_DMA
+	&cache_dma_attr.attr,
+#endif
+#ifdef CONFIG_SLQB_STATS
+	&alloc_attr.attr,
+	&alloc_slab_fill_attr.attr,
+	&alloc_slab_new_attr.attr,
+	&free_attr.attr,
+	&free_remote_attr.attr,
+	&flush_free_list_attr.attr,
+	&flush_free_list_objects_attr.attr,
+	&flush_free_list_remote_attr.attr,
+	&flush_slab_partial_attr.attr,
+	&flush_slab_free_attr.attr,
+	&flush_rfree_list_attr.attr,
+	&flush_rfree_list_objects_attr.attr,
+	&claim_remote_list_attr.attr,
+	&claim_remote_list_objects_attr.attr,
+#endif
+	NULL
+};
+
+static struct attribute_group slab_attr_group = {
+	.attrs = slab_attrs,
+};
+
+static ssize_t slab_attr_show(struct kobject *kobj,
+				struct attribute *attr, char *buf)
+{
+	struct slab_attribute *attribute;
+	struct kmem_cache *s;
+	int err;
+
+	attribute = to_slab_attr(attr);
+	s = to_slab(kobj);
+
+	if (!attribute->show)
+		return -EIO;
+
+	err = attribute->show(s, buf);
+
+	return err;
+}
+
+static ssize_t slab_attr_store(struct kobject *kobj,
+			struct attribute *attr, const char *buf, size_t len)
+{
+	struct slab_attribute *attribute;
+	struct kmem_cache *s;
+	int err;
+
+	attribute = to_slab_attr(attr);
+	s = to_slab(kobj);
+
+	if (!attribute->store)
+		return -EIO;
+
+	err = attribute->store(s, buf, len);
+
+	return err;
+}
+
+static void kmem_cache_release(struct kobject *kobj)
+{
+	struct kmem_cache *s = to_slab(kobj);
+
+	kmem_cache_free(&kmem_cache_cache, s);
+}
+
+static struct sysfs_ops slab_sysfs_ops = {
+	.show = slab_attr_show,
+	.store = slab_attr_store,
+};
+
+static struct kobj_type slab_ktype = {
+	.sysfs_ops = &slab_sysfs_ops,
+	.release = kmem_cache_release
+};
+
+static int uevent_filter(struct kset *kset, struct kobject *kobj)
+{
+	struct kobj_type *ktype = get_ktype(kobj);
+
+	if (ktype == &slab_ktype)
+		return 1;
+	return 0;
+}
+
+static struct kset_uevent_ops slab_uevent_ops = {
+	.filter = uevent_filter,
+};
+
+static struct kset *slab_kset;
+
+static int sysfs_available __read_mostly = 0;
+
+static int sysfs_slab_add(struct kmem_cache *s)
+{
+	int err;
+
+	if (!sysfs_available)
+		return 0;
+
+	s->kobj.kset = slab_kset;
+	err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, s->name);
+	if (err) {
+		kobject_put(&s->kobj);
+		return err;
+	}
+
+	err = sysfs_create_group(&s->kobj, &slab_attr_group);
+	if (err)
+		return err;
+
+	kobject_uevent(&s->kobj, KOBJ_ADD);
+
+	return 0;
+}
+
+static void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kobject_uevent(&s->kobj, KOBJ_REMOVE);
+	kobject_del(&s->kobj);
+	kobject_put(&s->kobj);
+}
+
+static int __init slab_sysfs_init(void)
+{
+	struct kmem_cache *s;
+	int err;
+
+	slab_kset = kset_create_and_add("slab", &slab_uevent_ops, kernel_kobj);
+	if (!slab_kset) {
+		printk(KERN_ERR "Cannot register slab subsystem.\n");
+		return -ENOSYS;
+	}
+
+	down_write(&slqb_lock);
+
+	sysfs_available = 1;
+
+	list_for_each_entry(s, &slab_caches, list) {
+		err = sysfs_slab_add(s);
+		if (err)
+			printk(KERN_ERR "SLQB: Unable to add boot slab %s"
+						" to sysfs\n", s->name);
+	}
+
+	up_write(&slqb_lock);
+
+	return 0;
+}
+device_initcall(slab_sysfs_init);
+
+#endif
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -150,6 +150,8 @@ size_t ksize(const void *);
  */
 #ifdef CONFIG_SLUB
 #include <linux/slub_def.h>
+#elif defined(CONFIG_SLQB)
+#include <linux/slqb_def.h>
 #elif defined(CONFIG_SLOB)
 #include <linux/slob_def.h>
 #else
@@ -252,7 +254,7 @@ static inline void *kmem_cache_alloc_nod
  * allocator where we care about the real place the memory allocation
  * request comes from.
  */
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
 extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
 #define kmalloc_track_caller(size, flags) \
 	__kmalloc_track_caller(size, flags, _RET_IP_)
@@ -270,7 +272,7 @@ extern void *__kmalloc_track_caller(size
  * standard allocator where we care about the real place the memory
  * allocation request comes from.
  */
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
 extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long);
 #define kmalloc_node_track_caller(size, flags, node) \
 	__kmalloc_node_track_caller(size, flags, node, \
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
+obj-$(CONFIG_SLQB) += slqb.o
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
Index: linux-2.6/include/linux/rcu_types.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rcu_types.h
@@ -0,0 +1,18 @@
+#ifndef __LINUX_RCU_TYPES_H
+#define __LINUX_RCU_TYPES_H
+
+#ifdef __KERNEL__
+
+/**
+ * struct rcu_head - callback structure for use with RCU
+ * @next: next update requests in a list
+ * @func: actual update function to call after the grace period.
+ */
+struct rcu_head {
+	struct rcu_head *next;
+	void (*func)(struct rcu_head *head);
+};
+
+#endif
+
+#endif
Index: linux-2.6/arch/x86/include/asm/page.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/page.h
+++ linux-2.6/arch/x86/include/asm/page.h
@@ -194,6 +194,7 @@ static inline pteval_t native_pte_flags(
  * virt_addr_valid(kaddr) returns true.
  */
 #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
+#define virt_to_page_fast(kaddr) pfn_to_page(((unsigned long)(kaddr) - PAGE_OFFSET) >> PAGE_SHIFT)
 #define pfn_to_kaddr(pfn)      __va((pfn) << PAGE_SHIFT)
 extern bool __virt_addr_valid(unsigned long kaddr);
 #define virt_addr_valid(kaddr)	__virt_addr_valid((unsigned long) (kaddr))
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -305,7 +305,11 @@ static inline void get_page(struct page
 
 static inline struct page *virt_to_head_page(const void *x)
 {
+#ifdef virt_to_page_fast
+	struct page *page = virt_to_page_fast(x);
+#else
 	struct page *page = virt_to_page(x);
+#endif
 	return compound_head(page);
 }
 
Index: linux-2.6/Documentation/vm/slqbinfo.c
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/vm/slqbinfo.c
@@ -0,0 +1,1054 @@
+/*
+ * Slabinfo: Tool to get reports about slabs
+ *
+ * (C) 2007 sgi, Christoph Lameter
+ *
+ * Reworked by Lin Ming <ming.m.lin@intel.com> for SLQB
+ *
+ * Compile by:
+ *
+ * gcc -o slqbinfo slqbinfo.c
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <strings.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <getopt.h>
+#include <regex.h>
+#include <errno.h>
+
+#define MAX_SLABS 500
+#define MAX_ALIASES 500
+#define MAX_NODES 1024
+
+struct slabinfo {
+	char *name;
+	int align, cache_dma, destroy_by_rcu;
+	int hwcache_align, object_size, objs_per_slab;
+	int slab_size, store_user;
+	int order, poison, reclaim_account, red_zone;
+	int batch;
+	unsigned long objects, slabs, total_objects;
+	unsigned long alloc, alloc_slab_fill, alloc_slab_new;
+	unsigned long free, free_remote;
+	unsigned long claim_remote_list, claim_remote_list_objects;
+	unsigned long flush_free_list, flush_free_list_objects, flush_free_list_remote;
+	unsigned long flush_rfree_list, flush_rfree_list_objects;
+	unsigned long flush_slab_free, flush_slab_partial;
+	int numa[MAX_NODES];
+	int numa_partial[MAX_NODES];
+} slabinfo[MAX_SLABS];
+
+int slabs = 0;
+int actual_slabs = 0;
+int highest_node = 0;
+
+char buffer[4096];
+
+int show_empty = 0;
+int show_report = 0;
+int show_slab = 0;
+int skip_zero = 1;
+int show_numa = 0;
+int show_track = 0;
+int validate = 0;
+int shrink = 0;
+int show_inverted = 0;
+int show_totals = 0;
+int sort_size = 0;
+int sort_active = 0;
+int set_debug = 0;
+int show_ops = 0;
+int show_activity = 0;
+
+/* Debug options */
+int sanity = 0;
+int redzone = 0;
+int poison = 0;
+int tracking = 0;
+int tracing = 0;
+
+int page_size;
+
+regex_t pattern;
+
+void fatal(const char *x, ...)
+{
+	va_list ap;
+
+	va_start(ap, x);
+	vfprintf(stderr, x, ap);
+	va_end(ap);
+	exit(EXIT_FAILURE);
+}
+
+void usage(void)
+{
+	printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
+		"slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+		"-A|--activity          Most active slabs first\n"
+		"-d<options>|--debug=<options> Set/Clear Debug options\n"
+		"-D|--display-active    Switch line format to activity\n"
+		"-e|--empty             Show empty slabs\n"
+		"-h|--help              Show usage information\n"
+		"-i|--inverted          Inverted list\n"
+		"-l|--slabs             Show slabs\n"
+		"-n|--numa              Show NUMA information\n"
+		"-o|--ops		Show kmem_cache_ops\n"
+		"-s|--shrink            Shrink slabs\n"
+		"-r|--report		Detailed report on single slabs\n"
+		"-S|--Size              Sort by size\n"
+		"-t|--tracking          Show alloc/free information\n"
+		"-T|--Totals            Show summary information\n"
+		"-v|--validate          Validate slabs\n"
+		"-z|--zero              Include empty slabs\n"
+		"\nValid debug options (FZPUT may be combined)\n"
+		"a / A          Switch on all debug options (=FZUP)\n"
+		"-              Switch off all debug options\n"
+		"f / F          Sanity Checks (SLAB_DEBUG_FREE)\n"
+		"z / Z          Redzoning\n"
+		"p / P          Poisoning\n"
+		"u / U          Tracking\n"
+		"t / T          Tracing\n"
+	);
+}
+
+unsigned long read_obj(const char *name)
+{
+	FILE *f = fopen(name, "r");
+
+	if (!f)
+		buffer[0] = 0;
+	else {
+		if (!fgets(buffer, sizeof(buffer), f))
+			buffer[0] = 0;
+		fclose(f);
+		if (strlen(buffer) && buffer[strlen(buffer) - 1] == '\n')
+			buffer[strlen(buffer) - 1] = 0;
+	}
+	return strlen(buffer);
+}
+
+
+/*
+ * Get the contents of an attribute
+ */
+unsigned long get_obj(const char *name)
+{
+	if (!read_obj(name))
+		return 0;
+
+	return atol(buffer);
+}
+
+unsigned long get_obj_and_str(const char *name, char **x)
+{
+	unsigned long result = 0;
+	char *p;
+
+	*x = NULL;
+
+	if (!read_obj(name)) {
+		x = NULL;
+		return 0;
+	}
+	result = strtoul(buffer, &p, 10);
+	while (*p == ' ')
+		p++;
+	if (*p)
+		*x = strdup(p);
+	return result;
+}
+
+void set_obj(struct slabinfo *s, const char *name, int n)
+{
+	char x[100];
+	FILE *f;
+
+	snprintf(x, 100, "%s/%s", s->name, name);
+	f = fopen(x, "w");
+	if (!f)
+		fatal("Cannot write to %s\n", x);
+
+	fprintf(f, "%d\n", n);
+	fclose(f);
+}
+
+unsigned long read_slab_obj(struct slabinfo *s, const char *name)
+{
+	char x[100];
+	FILE *f;
+	size_t l;
+
+	snprintf(x, 100, "%s/%s", s->name, name);
+	f = fopen(x, "r");
+	if (!f) {
+		buffer[0] = 0;
+		l = 0;
+	} else {
+		l = fread(buffer, 1, sizeof(buffer), f);
+		buffer[l] = 0;
+		fclose(f);
+	}
+	return l;
+}
+
+
+/*
+ * Put a size string together
+ */
+int store_size(char *buffer, unsigned long value)
+{
+	unsigned long divisor = 1;
+	char trailer = 0;
+	int n;
+
+	if (value > 1000000000UL) {
+		divisor = 100000000UL;
+		trailer = 'G';
+	} else if (value > 1000000UL) {
+		divisor = 100000UL;
+		trailer = 'M';
+	} else if (value > 1000UL) {
+		divisor = 100;
+		trailer = 'K';
+	}
+
+	value /= divisor;
+	n = sprintf(buffer, "%ld",value);
+	if (trailer) {
+		buffer[n] = trailer;
+		n++;
+		buffer[n] = 0;
+	}
+	if (divisor != 1) {
+		memmove(buffer + n - 2, buffer + n - 3, 4);
+		buffer[n-2] = '.';
+		n++;
+	}
+	return n;
+}
+
+void decode_numa_list(int *numa, char *t)
+{
+	int node;
+	int nr;
+
+	memset(numa, 0, MAX_NODES * sizeof(int));
+
+	if (!t)
+		return;
+
+	while (*t == 'N') {
+		t++;
+		node = strtoul(t, &t, 10);
+		if (*t == '=') {
+			t++;
+			nr = strtoul(t, &t, 10);
+			numa[node] = nr;
+			if (node > highest_node)
+				highest_node = node;
+		}
+		while (*t == ' ')
+			t++;
+	}
+}
+
+void slab_validate(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	set_obj(s, "validate", 1);
+}
+
+void slab_shrink(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	set_obj(s, "shrink", 1);
+}
+
+int line = 0;
+
+void first_line(void)
+{
+	if (show_activity)
+		printf("Name                   Objects      Alloc       Free   %%Fill %%New  "
+			"FlushR %%FlushR FlushR_Objs O\n");
+	else
+		printf("Name                   Objects Objsize    Space "
+			" O/S O %%Ef Batch Flg\n");
+}
+
+unsigned long slab_size(struct slabinfo *s)
+{
+	return 	s->slabs * (page_size << s->order);
+}
+
+unsigned long slab_activity(struct slabinfo *s)
+{
+	return 	s->alloc + s->free;
+}
+
+void slab_numa(struct slabinfo *s, int mode)
+{
+	int node;
+
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (!highest_node) {
+		printf("\n%s: No NUMA information available.\n", s->name);
+		return;
+	}
+
+	if (skip_zero && !s->slabs)
+		return;
+
+	if (!line) {
+		printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+		for(node = 0; node <= highest_node; node++)
+			printf(" %4d", node);
+		printf("\n----------------------");
+		for(node = 0; node <= highest_node; node++)
+			printf("-----");
+		printf("\n");
+	}
+	printf("%-21s ", mode ? "All slabs" : s->name);
+	for(node = 0; node <= highest_node; node++) {
+		char b[20];
+
+		store_size(b, s->numa[node]);
+		printf(" %4s", b);
+	}
+	printf("\n");
+	if (mode) {
+		printf("%-21s ", "Partial slabs");
+		for(node = 0; node <= highest_node; node++) {
+			char b[20];
+
+			store_size(b, s->numa_partial[node]);
+			printf(" %4s", b);
+		}
+		printf("\n");
+	}
+	line++;
+}
+
+void show_tracking(struct slabinfo *s)
+{
+	printf("\n%s: Kernel object allocation\n", s->name);
+	printf("-----------------------------------------------------------------------\n");
+	if (read_slab_obj(s, "alloc_calls"))
+		printf("%s", buffer);
+	else
+		printf("No Data\n");
+
+	printf("\n%s: Kernel object freeing\n", s->name);
+	printf("------------------------------------------------------------------------\n");
+	if (read_slab_obj(s, "free_calls"))
+		printf("%s", buffer);
+	else
+		printf("No Data\n");
+
+}
+
+void ops(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (read_slab_obj(s, "ops")) {
+		printf("\n%s: kmem_cache operations\n", s->name);
+		printf("--------------------------------------------\n");
+		printf("%s", buffer);
+	} else
+		printf("\n%s has no kmem_cache operations\n", s->name);
+}
+
+const char *onoff(int x)
+{
+	if (x)
+		return "On ";
+	return "Off";
+}
+
+void slab_stats(struct slabinfo *s)
+{
+	unsigned long total_alloc;
+	unsigned long total_free;
+	unsigned long total;
+
+	total_alloc = s->alloc;
+	total_free = s->free;
+
+	if (!total_alloc)
+		return;
+
+	printf("\n");
+	printf("Slab Perf Counter\n");
+	printf("------------------------------------------------------------------------\n");
+	printf("Alloc: %8lu, partial %8lu, page allocator %8lu\n",
+		total_alloc,
+		s->alloc_slab_fill, s->alloc_slab_new);
+	printf("Free:  %8lu, partial %8lu, page allocator %8lu, remote %5lu\n",
+		total_free,
+		s->flush_slab_partial,
+		s->flush_slab_free,
+		s->free_remote);
+	printf("Claim: %8lu, objects %8lu\n",
+		s->claim_remote_list,
+		s->claim_remote_list_objects);
+	printf("Flush: %8lu, objects %8lu, remote: %8lu\n",
+		s->flush_free_list,
+		s->flush_free_list_objects,
+		s->flush_free_list_remote);
+	printf("FlushR:%8lu, objects %8lu\n",
+		s->flush_rfree_list,
+		s->flush_rfree_list_objects);
+}
+
+void report(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	printf("\nSlabcache: %-20s  Order : %2d Objects: %lu\n",
+		s->name, s->order, s->objects);
+	if (s->hwcache_align)
+		printf("** Hardware cacheline aligned\n");
+	if (s->cache_dma)
+		printf("** Memory is allocated in a special DMA zone\n");
+	if (s->destroy_by_rcu)
+		printf("** Slabs are destroyed via RCU\n");
+	if (s->reclaim_account)
+		printf("** Reclaim accounting active\n");
+
+	printf("\nSizes (bytes)     Slabs              Debug                Memory\n");
+	printf("------------------------------------------------------------------------\n");
+	printf("Object : %7d  Total  : %7ld   Sanity Checks : %s  Total: %7ld\n",
+			s->object_size, s->slabs, "N/A",
+			s->slabs * (page_size << s->order));
+	printf("SlabObj: %7d  Full   : %7s   Redzoning     : %s  Used : %7ld\n",
+			s->slab_size, "N/A",
+			onoff(s->red_zone), s->objects * s->object_size);
+	printf("SlabSiz: %7d  Partial: %7s   Poisoning     : %s  Loss : %7ld\n",
+			page_size << s->order, "N/A", onoff(s->poison),
+			s->slabs * (page_size << s->order) - s->objects * s->object_size);
+	printf("Loss   : %7d  CpuSlab: %7s   Tracking      : %s  Lalig: %7ld\n",
+			s->slab_size - s->object_size, "N/A", onoff(s->store_user),
+			(s->slab_size - s->object_size) * s->objects);
+	printf("Align  : %7d  Objects: %7d   Tracing       : %s  Lpadd: %7ld\n",
+			s->align, s->objs_per_slab, "N/A",
+			((page_size << s->order) - s->objs_per_slab * s->slab_size) *
+			s->slabs);
+
+	ops(s);
+	show_tracking(s);
+	slab_numa(s, 1);
+	slab_stats(s);
+}
+
+void slabcache(struct slabinfo *s)
+{
+	char size_str[20];
+	char flags[20];
+	char *p = flags;
+
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (actual_slabs == 1) {
+		report(s);
+		return;
+	}
+
+	if (skip_zero && !show_empty && !s->slabs)
+		return;
+
+	if (show_empty && s->slabs)
+		return;
+
+	store_size(size_str, slab_size(s));
+
+	if (!line++)
+		first_line();
+
+	if (s->cache_dma)
+		*p++ = 'd';
+	if (s->hwcache_align)
+		*p++ = 'A';
+	if (s->poison)
+		*p++ = 'P';
+	if (s->reclaim_account)
+		*p++ = 'a';
+	if (s->red_zone)
+		*p++ = 'Z';
+	if (s->store_user)
+		*p++ = 'U';
+
+	*p = 0;
+	if (show_activity) {
+		unsigned long total_alloc;
+		unsigned long total_free;
+
+		total_alloc = s->alloc;
+		total_free = s->free;
+
+		printf("%-21s %8lu %10lu %10lu %5lu %5lu %7lu %5lu %7lu %8d\n",
+			s->name, s->objects,
+			total_alloc, total_free,
+			total_alloc ? (s->alloc_slab_fill * 100 / total_alloc) : 0,
+			total_alloc ? (s->alloc_slab_new * 100 / total_alloc) : 0,
+			s->flush_rfree_list, (total_alloc + total_free) ?
+			(s->flush_rfree_list * 100 / (total_alloc + total_free)) : 0,
+			s->flush_rfree_list_objects,
+			s->order);
+	}
+	else
+		printf("%-21s %8lu %7d %8s %4d %1d %3lu %4d %s\n",
+			s->name, s->objects, s->object_size, size_str,
+			s->objs_per_slab, s->order,
+			s->slabs ? (s->objects * s->object_size * 100) /
+				(s->slabs * (page_size << s->order)) : 100,
+			s->batch, flags);
+}
+
+/*
+ * Analyze debug options. Return false if something is amiss.
+ */
+int debug_opt_scan(char *opt)
+{
+	if (!opt || !opt[0] || strcmp(opt, "-") == 0)
+		return 1;
+
+	if (strcasecmp(opt, "a") == 0) {
+		sanity = 1;
+		poison = 1;
+		redzone = 1;
+		tracking = 1;
+		return 1;
+	}
+
+	for ( ; *opt; opt++)
+	 	switch (*opt) {
+		case 'F' : case 'f':
+			if (sanity)
+				return 0;
+			sanity = 1;
+			break;
+		case 'P' : case 'p':
+			if (poison)
+				return 0;
+			poison = 1;
+			break;
+
+		case 'Z' : case 'z':
+			if (redzone)
+				return 0;
+			redzone = 1;
+			break;
+
+		case 'U' : case 'u':
+			if (tracking)
+				return 0;
+			tracking = 1;
+			break;
+
+		case 'T' : case 't':
+			if (tracing)
+				return 0;
+			tracing = 1;
+			break;
+		default:
+			return 0;
+		}
+	return 1;
+}
+
+int slab_empty(struct slabinfo *s)
+{
+	if (s->objects > 0)
+		return 0;
+
+	/*
+	 * We may still have slabs even if there are no objects. Shrinking will
+	 * remove them.
+	 */
+	if (s->slabs != 0)
+		set_obj(s, "shrink", 1);
+
+	return 1;
+}
+
+void slab_debug(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (redzone && !s->red_zone) {
+		if (slab_empty(s))
+			set_obj(s, "red_zone", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
+	}
+	if (!redzone && s->red_zone) {
+		if (slab_empty(s))
+			set_obj(s, "red_zone", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
+	}
+	if (poison && !s->poison) {
+		if (slab_empty(s))
+			set_obj(s, "poison", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
+	}
+	if (!poison && s->poison) {
+		if (slab_empty(s))
+			set_obj(s, "poison", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
+	}
+	if (tracking && !s->store_user) {
+		if (slab_empty(s))
+			set_obj(s, "store_user", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
+	}
+	if (!tracking && s->store_user) {
+		if (slab_empty(s))
+			set_obj(s, "store_user", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
+	}
+}
+
+void totals(void)
+{
+	struct slabinfo *s;
+
+	int used_slabs = 0;
+	char b1[20], b2[20], b3[20], b4[20];
+	unsigned long long max = 1ULL << 63;
+
+	/* Object size */
+	unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
+
+	/* Number of partial slabs in a slabcache */
+	unsigned long long min_partial = max, max_partial = 0,
+				avg_partial, total_partial = 0;
+
+	/* Number of slabs in a slab cache */
+	unsigned long long min_slabs = max, max_slabs = 0,
+				avg_slabs, total_slabs = 0;
+
+	/* Size of the whole slab */
+	unsigned long long min_size = max, max_size = 0,
+				avg_size, total_size = 0;
+
+	/* Bytes used for object storage in a slab */
+	unsigned long long min_used = max, max_used = 0,
+				avg_used, total_used = 0;
+
+	/* Waste: Bytes used for alignment and padding */
+	unsigned long long min_waste = max, max_waste = 0,
+				avg_waste, total_waste = 0;
+	/* Number of objects in a slab */
+	unsigned long long min_objects = max, max_objects = 0,
+				avg_objects, total_objects = 0;
+	/* Waste per object */
+	unsigned long long min_objwaste = max,
+				max_objwaste = 0, avg_objwaste,
+				total_objwaste = 0;
+
+	/* Memory per object */
+	unsigned long long min_memobj = max,
+				max_memobj = 0, avg_memobj,
+				total_objsize = 0;
+
+	for (s = slabinfo; s < slabinfo + slabs; s++) {
+		unsigned long long size;
+		unsigned long used;
+		unsigned long long wasted;
+		unsigned long long objwaste;
+
+		if (!s->slabs || !s->objects)
+			continue;
+
+		used_slabs++;
+
+		size = slab_size(s);
+		used = s->objects * s->object_size;
+		wasted = size - used;
+		objwaste = s->slab_size - s->object_size;
+
+		if (s->object_size < min_objsize)
+			min_objsize = s->object_size;
+		if (s->slabs < min_slabs)
+			min_slabs = s->slabs;
+		if (size < min_size)
+			min_size = size;
+		if (wasted < min_waste)
+			min_waste = wasted;
+		if (objwaste < min_objwaste)
+			min_objwaste = objwaste;
+		if (s->objects < min_objects)
+			min_objects = s->objects;
+		if (used < min_used)
+			min_used = used;
+		if (s->slab_size < min_memobj)
+			min_memobj = s->slab_size;
+
+		if (s->object_size > max_objsize)
+			max_objsize = s->object_size;
+		if (s->slabs > max_slabs)
+			max_slabs = s->slabs;
+		if (size > max_size)
+			max_size = size;
+		if (wasted > max_waste)
+			max_waste = wasted;
+		if (objwaste > max_objwaste)
+			max_objwaste = objwaste;
+		if (s->objects > max_objects)
+			max_objects = s->objects;
+		if (used > max_used)
+			max_used = used;
+		if (s->slab_size > max_memobj)
+			max_memobj = s->slab_size;
+
+		total_slabs += s->slabs;
+		total_size += size;
+		total_waste += wasted;
+
+		total_objects += s->objects;
+		total_used += used;
+
+		total_objwaste += s->objects * objwaste;
+		total_objsize += s->objects * s->slab_size;
+	}
+
+	if (!total_objects) {
+		printf("No objects\n");
+		return;
+	}
+	if (!used_slabs) {
+		printf("No slabs\n");
+		return;
+	}
+
+	/* Per slab averages */
+	avg_slabs = total_slabs / used_slabs;
+	avg_size = total_size / used_slabs;
+	avg_waste = total_waste / used_slabs;
+
+	avg_objects = total_objects / used_slabs;
+	avg_used = total_used / used_slabs;
+
+	/* Per object object sizes */
+	avg_objsize = total_used / total_objects;
+	avg_objwaste = total_objwaste / total_objects;
+	avg_memobj = total_objsize / total_objects;
+
+	printf("Slabcache Totals\n");
+	printf("----------------\n");
+	printf("Slabcaches : %3d      Active: %3d\n",
+			slabs, used_slabs);
+
+	store_size(b1, total_size);store_size(b2, total_waste);
+	store_size(b3, total_waste * 100 / total_used);
+	printf("Memory used: %6s   # Loss   : %6s   MRatio:%6s%%\n", b1, b2, b3);
+
+	store_size(b1, total_objects);
+	printf("# Objects  : %6s\n", b1);
+
+	printf("\n");
+	printf("Per Cache    Average         Min         Max       Total\n");
+	printf("---------------------------------------------------------\n");
+
+	store_size(b1, avg_objects);store_size(b2, min_objects);
+	store_size(b3, max_objects);store_size(b4, total_objects);
+	printf("#Objects  %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_slabs);store_size(b2, min_slabs);
+	store_size(b3, max_slabs);store_size(b4, total_slabs);
+	printf("#Slabs    %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_size);store_size(b2, min_size);
+	store_size(b3, max_size);store_size(b4, total_size);
+	printf("Memory    %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_used);store_size(b2, min_used);
+	store_size(b3, max_used);store_size(b4, total_used);
+	printf("Used      %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_waste);store_size(b2, min_waste);
+	store_size(b3, max_waste);store_size(b4, total_waste);
+	printf("Loss      %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	printf("\n");
+	printf("Per Object   Average         Min         Max\n");
+	printf("---------------------------------------------\n");
+
+	store_size(b1, avg_memobj);store_size(b2, min_memobj);
+	store_size(b3, max_memobj);
+	printf("Memory    %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+	store_size(b1, avg_objsize);store_size(b2, min_objsize);
+	store_size(b3, max_objsize);
+	printf("User      %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+
+	store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
+	store_size(b3, max_objwaste);
+	printf("Loss      %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+}
+
+void sort_slabs(void)
+{
+	struct slabinfo *s1,*s2;
+
+	for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
+		for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
+			int result;
+
+			if (sort_size)
+				result = slab_size(s1) < slab_size(s2);
+			else if (sort_active)
+				result = slab_activity(s1) < slab_activity(s2);
+			else
+				result = strcasecmp(s1->name, s2->name);
+
+			if (show_inverted)
+				result = -result;
+
+			if (result > 0) {
+				struct slabinfo t;
+
+				memcpy(&t, s1, sizeof(struct slabinfo));
+				memcpy(s1, s2, sizeof(struct slabinfo));
+				memcpy(s2, &t, sizeof(struct slabinfo));
+			}
+		}
+	}
+}
+
+int slab_mismatch(char *slab)
+{
+	return regexec(&pattern, slab, 0, NULL, 0);
+}
+
+void read_slab_dir(void)
+{
+	DIR *dir;
+	struct dirent *de;
+	struct slabinfo *slab = slabinfo;
+	char *p;
+	char *t;
+	int count;
+
+	if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
+		fatal("SYSFS support for SLQB not active\n");
+
+	dir = opendir(".");
+	while ((de = readdir(dir))) {
+		if (de->d_name[0] == '.' ||
+			(de->d_name[0] != ':' && slab_mismatch(de->d_name)))
+				continue;
+		switch (de->d_type) {
+		   case DT_DIR:
+			if (chdir(de->d_name))
+				fatal("Unable to access slab %s\n", de->d_name);
+		   	slab->name = strdup(de->d_name);
+			slab->align = get_obj("align");
+			slab->cache_dma = get_obj("cache_dma");
+			slab->destroy_by_rcu = get_obj("destroy_by_rcu");
+			slab->hwcache_align = get_obj("hwcache_align");
+			slab->object_size = get_obj("object_size");
+			slab->objects = get_obj("objects");
+			slab->total_objects = get_obj("total_objects");
+			slab->objs_per_slab = get_obj("objs_per_slab");
+			slab->order = get_obj("order");
+			slab->poison = get_obj("poison");
+			slab->reclaim_account = get_obj("reclaim_account");
+			slab->red_zone = get_obj("red_zone");
+			slab->slab_size = get_obj("slab_size");
+			slab->slabs = get_obj_and_str("slabs", &t);
+			decode_numa_list(slab->numa, t);
+			free(t);
+			slab->store_user = get_obj("store_user");
+			slab->batch = get_obj("batch");
+			slab->alloc = get_obj("alloc");
+			slab->alloc_slab_fill = get_obj("alloc_slab_fill");
+			slab->alloc_slab_new = get_obj("alloc_slab_new");
+			slab->free = get_obj("free");
+			slab->free_remote = get_obj("free_remote");
+			slab->claim_remote_list = get_obj("claim_remote_list");
+			slab->claim_remote_list_objects = get_obj("claim_remote_list_objects");
+			slab->flush_free_list = get_obj("flush_free_list");
+			slab->flush_free_list_objects = get_obj("flush_free_list_objects");
+			slab->flush_free_list_remote = get_obj("flush_free_list_remote");
+			slab->flush_rfree_list = get_obj("flush_rfree_list");
+			slab->flush_rfree_list_objects = get_obj("flush_rfree_list_objects");
+			slab->flush_slab_free = get_obj("flush_slab_free");
+			slab->flush_slab_partial = get_obj("flush_slab_partial");
+
+			chdir("..");
+			slab++;
+			break;
+		   default :
+			fatal("Unknown file type %x\n", de->d_type);
+		}
+	}
+	closedir(dir);
+	slabs = slab - slabinfo;
+	actual_slabs = slabs;
+	if (slabs > MAX_SLABS)
+		fatal("Too many slabs\n");
+}
+
+void output_slabs(void)
+{
+	struct slabinfo *slab;
+
+	for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
+
+		if (show_numa)
+			slab_numa(slab, 0);
+		else if (show_track)
+			show_tracking(slab);
+		else if (validate)
+			slab_validate(slab);
+		else if (shrink)
+			slab_shrink(slab);
+		else if (set_debug)
+			slab_debug(slab);
+		else if (show_ops)
+			ops(slab);
+		else if (show_slab)
+			slabcache(slab);
+		else if (show_report)
+			report(slab);
+	}
+}
+
+struct option opts[] = {
+	{ "activity", 0, NULL, 'A' },
+	{ "debug", 2, NULL, 'd' },
+	{ "display-activity", 0, NULL, 'D' },
+	{ "empty", 0, NULL, 'e' },
+	{ "help", 0, NULL, 'h' },
+	{ "inverted", 0, NULL, 'i'},
+	{ "numa", 0, NULL, 'n' },
+	{ "ops", 0, NULL, 'o' },
+	{ "report", 0, NULL, 'r' },
+	{ "shrink", 0, NULL, 's' },
+	{ "slabs", 0, NULL, 'l' },
+	{ "track", 0, NULL, 't'},
+	{ "validate", 0, NULL, 'v' },
+	{ "zero", 0, NULL, 'z' },
+	{ "1ref", 0, NULL, '1'},
+	{ NULL, 0, NULL, 0 }
+};
+
+int main(int argc, char *argv[])
+{
+	int c;
+	int err;
+	char *pattern_source;
+
+	page_size = getpagesize();
+
+	while ((c = getopt_long(argc, argv, "Ad::Dehil1noprstvzTS",
+						opts, NULL)) != -1)
+		switch (c) {
+		case 'A':
+			sort_active = 1;
+			break;
+		case 'd':
+			set_debug = 1;
+			if (!debug_opt_scan(optarg))
+				fatal("Invalid debug option '%s'\n", optarg);
+			break;
+		case 'D':
+			show_activity = 1;
+			break;
+		case 'e':
+			show_empty = 1;
+			break;
+		case 'h':
+			usage();
+			return 0;
+		case 'i':
+			show_inverted = 1;
+			break;
+		case 'n':
+			show_numa = 1;
+			break;
+		case 'o':
+			show_ops = 1;
+			break;
+		case 'r':
+			show_report = 1;
+			break;
+		case 's':
+			shrink = 1;
+			break;
+		case 'l':
+			show_slab = 1;
+			break;
+		case 't':
+			show_track = 1;
+			break;
+		case 'v':
+			validate = 1;
+			break;
+		case 'z':
+			skip_zero = 0;
+			break;
+		case 'T':
+			show_totals = 1;
+			break;
+		case 'S':
+			sort_size = 1;
+			break;
+
+		default:
+			fatal("%s: Invalid option '%c'\n", argv[0], optopt);
+
+	}
+
+	if (!show_slab && !show_track && !show_report
+		&& !validate && !shrink && !set_debug && !show_ops)
+			show_slab = 1;
+
+	if (argc > optind)
+		pattern_source = argv[optind];
+	else
+		pattern_source = ".*";
+
+	err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
+	if (err)
+		fatal("%s: Invalid pattern '%s' code %d\n",
+			argv[0], pattern_source, err);
+	read_slab_dir();
+	if (show_totals)
+		totals();
+	else {
+		sort_slabs();
+		output_slabs();
+	}
+	return 0;
+}

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
@ 2009-01-23  9:00     ` Nick Piggin
  0 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23  9:00 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

On Thu, Jan 22, 2009 at 04:45:33PM +0800, Zhang, Yanmin wrote:
> On Wed, 2009-01-21 at 15:30 +0100, Nick Piggin wrote:
> > Hi,
> > 
> > Since last posted, I've cleaned up a few bits and pieces, (hopefully)
> > fixed a known bug where it wouldn't boot on memoryless nodes (I don't
> > have a system to test with), 
> Panic again on my Montvale Itanium NUMA machine if I start the kernel with
> parameter mem=2G.
> 
> The call chain is mnt_init => sysfs_init. kmem_cache_create fails, so later on
> when mnt_init uses the kmem_cache sysfs_dir_cache, the kernel panics
> at __slab_alloc => get_cpu_slab because parameter s is equal to NULL.
> 
> Function __remote_slab_alloc returns NULL when s->node[node]==NULL. That causes
> sysfs_init => kmem_cache_create to fail.

Booting with mem= is a good trick to create memoryless nodes easily.
Unfortunately it didn't trigger any bugs on my system, so I couldn't
actually verify that the fallback code solves your problem. Would
you be able to test with this updated patch (which also includes
Hugh's fix and some code style changes)?
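
The guts of the fix are in __remote_slab_alloc{,_node}() below: a memoryless
node never gets a kmem_cache_node allocated for it, so a NULL s->node[node] is
now treated as "nothing on this node" and we fall back over the zonelist
rather than dereferencing it. Simplified excerpt (see the patch for the real
thing):

	/* in __remote_slab_alloc_node(): */
	n = s->node[node];
	if (unlikely(!n))	/* node has no memory */
		return NULL;

	/* and in __remote_slab_alloc(), fall back to nearby nodes: */
	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
		if (!cpuset_zone_allowed_hardwall(zone, gfpflags))
			continue;
		object = __remote_slab_alloc_node(s, gfpflags, zone_to_nid(zone));
		if (object)
			return object;
	}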

The other thing is that this bug has uncovered a little buglet in the
sysfs setup code: if it is unable to continue in a degraded mode after
the allocation failure, it should be using SLAB_PANIC.
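
Concretely I mean something like this in sysfs_init() (untested sketch; I
haven't re-checked the exact call site in fs/sysfs/, so the struct name and
size used there may differ):

	sysfs_dir_cachep = kmem_cache_create("sysfs_dir_cache",
				sizeof(struct sysfs_dirent), 0,
				SLAB_PANIC, NULL);

That way a failed cache creation blows up right at the kmem_cache_create()
call, instead of showing up later as a NULL kmem_cache dereference in
mnt_init().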

Thanks,
Nick
---
Index: linux-2.6/include/linux/rcupdate.h
===================================================================
--- linux-2.6.orig/include/linux/rcupdate.h
+++ linux-2.6/include/linux/rcupdate.h
@@ -33,6 +33,7 @@
 #ifndef __LINUX_RCUPDATE_H
 #define __LINUX_RCUPDATE_H
 
+#include <linux/rcu_types.h>
 #include <linux/cache.h>
 #include <linux/spinlock.h>
 #include <linux/threads.h>
@@ -42,16 +43,6 @@
 #include <linux/lockdep.h>
 #include <linux/completion.h>
 
-/**
- * struct rcu_head - callback structure for use with RCU
- * @next: next update requests in a list
- * @func: actual update function to call after the grace period.
- */
-struct rcu_head {
-	struct rcu_head *next;
-	void (*func)(struct rcu_head *head);
-};
-
 #if defined(CONFIG_CLASSIC_RCU)
 #include <linux/rcuclassic.h>
 #elif defined(CONFIG_TREE_RCU)
Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/slqb_def.h
@@ -0,0 +1,289 @@
+#ifndef _LINUX_SLQB_DEF_H
+#define _LINUX_SLQB_DEF_H
+
+/*
+ * SLQB : A slab allocator with object queues.
+ *
+ * (C) 2008 Nick Piggin <npiggin@suse.de>
+ */
+#include <linux/types.h>
+#include <linux/gfp.h>
+#include <linux/workqueue.h>
+#include <linux/kobject.h>
+#include <linux/rcu_types.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+
+enum stat_item {
+	ALLOC,			/* Allocation count */
+	ALLOC_SLAB_FILL,	/* Fill freelist from page list */
+	ALLOC_SLAB_NEW,		/* New slab acquired from page allocator */
+	FREE,			/* Free count */
+	FREE_REMOTE,		/* NUMA: freeing to remote list */
+	FLUSH_FREE_LIST,	/* Freelist flushed */
+	FLUSH_FREE_LIST_OBJECTS, /* Objects flushed from freelist */
+	FLUSH_FREE_LIST_REMOTE,	/* Objects flushed from freelist to remote */
+	FLUSH_SLAB_PARTIAL,	/* Freeing moves slab to partial list */
+	FLUSH_SLAB_FREE,	/* Slab freed to the page allocator */
+	FLUSH_RFREE_LIST,	/* Rfree list flushed */
+	FLUSH_RFREE_LIST_OBJECTS, /* Rfree objects flushed */
+	CLAIM_REMOTE_LIST,	/* Remote freed list claimed */
+	CLAIM_REMOTE_LIST_OBJECTS, /* Remote freed objects claimed */
+	NR_SLQB_STAT_ITEMS
+};
+
+/*
+ * Singly-linked list with head, tail, and nr
+ */
+struct kmlist {
+	unsigned long	nr;
+	void 		**head;
+	void		**tail;
+};
+
+/*
+ * Every kmem_cache_list has a kmem_cache_remote_free structure, by which
+ * objects can be returned to the kmem_cache_list from remote CPUs.
+ */
+struct kmem_cache_remote_free {
+	spinlock_t	lock;
+	struct kmlist	list;
+} ____cacheline_aligned;
+
+/*
+ * A kmem_cache_list manages all the slabs and objects allocated from a given
+ * source. Per-cpu kmem_cache_lists allow node-local allocations. Per-node
+ * kmem_cache_lists allow off-node allocations (but require locking).
+ */
+struct kmem_cache_list {
+				/* Fastpath LIFO freelist of objects */
+	struct kmlist		freelist;
+#ifdef CONFIG_SMP
+				/* remote_free has reached a watermark */
+	int			remote_free_check;
+#endif
+				/* kmem_cache corresponding to this list */
+	struct kmem_cache	*cache;
+
+				/* Number of partial slabs (pages) */
+	unsigned long		nr_partial;
+
+				/* Slabs which have some free objects */
+	struct list_head	partial;
+
+				/* Total number of slabs allocated */
+	unsigned long		nr_slabs;
+
+#ifdef CONFIG_SMP
+	/*
+	 * In the case of per-cpu lists, remote_free is for objects freed by
+	 * non-owner CPU back to its home list. For per-node lists, remote_free
+	 * is always used to free objects.
+	 */
+	struct kmem_cache_remote_free remote_free;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+	unsigned long		stats[NR_SLQB_STAT_ITEMS];
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Primary per-cpu, per-kmem_cache structure.
+ */
+struct kmem_cache_cpu {
+	struct kmem_cache_list	list;		/* List for node-local slabs */
+	unsigned int		colour_next;	/* Next colour offset to use */
+
+#ifdef CONFIG_SMP
+	/*
+	 * rlist is a list of objects that don't fit on list.freelist (ie.
+	 * wrong node). The objects all correspond to a given kmem_cache_list,
+	 * remote_cache_list. To free objects to another list, we must first
+	 * flush the existing objects, then switch remote_cache_list.
+	 *
+	 * An NR_CPUS or MAX_NUMNODES array would be nice here, but then we
+	 * get to O(NR_CPUS^2) memory consumption situation.
+	 */
+	struct kmlist		rlist;
+	struct kmem_cache_list	*remote_cache_list;
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Per-node, per-kmem_cache structure. Used for node-specific allocations.
+ */
+struct kmem_cache_node {
+	struct kmem_cache_list	list;
+	spinlock_t		list_lock;	/* protects access to list */
+} ____cacheline_aligned;
+
+/*
+ * Management object for a slab cache.
+ */
+struct kmem_cache {
+	unsigned long	flags;
+	int		hiwater;	/* LIFO list high watermark */
+	int		freebatch;	/* LIFO freelist batch flush size */
+	int		objsize;	/* Size of object without meta data */
+	int		offset;		/* Free pointer offset. */
+	int		objects;	/* Number of objects in slab */
+
+	int		size;		/* Size of object including meta data */
+	int		order;		/* Allocation order */
+	gfp_t		allocflags;	/* gfp flags to use on allocation */
+	unsigned int	colour_range;	/* range of colour counter */
+	unsigned int	colour_off;	/* offset per colour */
+	void		(*ctor)(void *);
+
+	const char	*name;		/* Name (only for display!) */
+	struct list_head list;		/* List of slab caches */
+
+	int		align;		/* Alignment */
+	int		inuse;		/* Offset to metadata */
+
+#ifdef CONFIG_SLQB_SYSFS
+	struct kobject	kobj;		/* For sysfs */
+#endif
+#ifdef CONFIG_NUMA
+	struct kmem_cache_node	*node[MAX_NUMNODES];
+#endif
+#ifdef CONFIG_SMP
+	struct kmem_cache_cpu	*cpu_slab[NR_CPUS];
+#else
+	struct kmem_cache_cpu	cpu_slab;
+#endif
+};
+
+/*
+ * Kmalloc subsystem.
+ */
+#if defined(ARCH_KMALLOC_MINALIGN) && ARCH_KMALLOC_MINALIGN > 8
+#define KMALLOC_MIN_SIZE ARCH_KMALLOC_MINALIGN
+#else
+#define KMALLOC_MIN_SIZE 8
+#endif
+
+#define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
+#define KMALLOC_SHIFT_SLQB_HIGH (PAGE_SHIFT + 9)
+
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1];
+extern struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1];
+
+/*
+ * Constant size allocations use this path to find index into kmalloc caches
+ * arrays. get_slab() function is used for non-constant sizes.
+ */
+static __always_inline int kmalloc_index(size_t size)
+{
+	if (unlikely(!size))
+		return 0;
+	if (unlikely(size > 1UL << KMALLOC_SHIFT_SLQB_HIGH))
+		return 0;
+
+	if (unlikely(size <= KMALLOC_MIN_SIZE))
+		return KMALLOC_SHIFT_LOW;
+
+#if L1_CACHE_BYTES < 64
+	if (size > 64 && size <= 96)
+		return 1;
+#endif
+#if L1_CACHE_BYTES < 128
+	if (size > 128 && size <= 192)
+		return 2;
+#endif
+	if (size <=	  8) return 3;
+	if (size <=	 16) return 4;
+	if (size <=	 32) return 5;
+	if (size <=	 64) return 6;
+	if (size <=	128) return 7;
+	if (size <=	256) return 8;
+	if (size <=	512) return 9;
+	if (size <=       1024) return 10;
+	if (size <=   2 * 1024) return 11;
+	if (size <=   4 * 1024) return 12;
+	if (size <=   8 * 1024) return 13;
+	if (size <=  16 * 1024) return 14;
+	if (size <=  32 * 1024) return 15;
+	if (size <=  64 * 1024) return 16;
+	if (size <= 128 * 1024) return 17;
+	if (size <= 256 * 1024) return 18;
+	if (size <= 512 * 1024) return 19;
+	if (size <= 1024 * 1024) return 20;
+	if (size <=  2 * 1024 * 1024) return 21;
+	return -1;
+}
+
+#ifdef CONFIG_ZONE_DMA
+#define SLQB_DMA __GFP_DMA
+#else
+/* Disable "DMA slabs" */
+#define SLQB_DMA (__force gfp_t)0
+#endif
+
+/*
+ * Find the kmalloc slab cache for a given combination of allocation flags and
+ * size.
+ */
+static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
+{
+	int index = kmalloc_index(size);
+
+	if (unlikely(index == 0))
+		return NULL;
+
+	if (likely(!(flags & SLQB_DMA)))
+		return &kmalloc_caches[index];
+	else
+		return &kmalloc_caches_dma[index];
+}
+
+void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
+void *__kmalloc(size_t size, gfp_t flags);
+
+#ifndef ARCH_KMALLOC_MINALIGN
+#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#ifndef ARCH_SLAB_MINALIGN
+#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ?	\
+				sizeof(void *) : ARCH_KMALLOC_MINALIGN)
+
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
+{
+	if (__builtin_constant_p(size)) {
+		struct kmem_cache *s;
+
+		s = kmalloc_slab(size, flags);
+		if (unlikely(ZERO_OR_NULL_PTR(s)))
+			return s;
+
+		return kmem_cache_alloc(s, flags);
+	}
+	return __kmalloc(size, flags);
+}
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node);
+void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
+
+static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+{
+	if (__builtin_constant_p(size)) {
+		struct kmem_cache *s;
+
+		s = kmalloc_slab(size, flags);
+		if (unlikely(ZERO_OR_NULL_PTR(s)))
+			return s;
+
+		return kmem_cache_alloc_node(s, flags, node);
+	}
+	return __kmalloc_node(size, flags, node);
+}
+#endif
+
+#endif /* _LINUX_SLQB_DEF_H */
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -806,7 +806,7 @@ config SLUB_DEBUG
 
 choice
 	prompt "Choose SLAB allocator"
-	default SLUB
+	default SLQB
 	help
 	   This option allows to select a slab allocator.
 
@@ -827,6 +827,11 @@ config SLUB
 	   and has enhanced diagnostics. SLUB is the default choice for
 	   a slab allocator.
 
+config SLQB
+	bool "SLQB (Queued allocator)"
+	help
+	  SLQB is a proposed new slab allocator.
+
 config SLOB
 	depends on EMBEDDED
 	bool "SLOB (Simple Allocator)"
@@ -868,7 +873,7 @@ config HAVE_GENERIC_DMA_COHERENT
 config SLABINFO
 	bool
 	depends on PROC_FS
-	depends on SLAB || SLUB_DEBUG
+	depends on SLAB || SLUB_DEBUG || SLQB
 	default y
 
 config RT_MUTEXES
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug
+++ linux-2.6/lib/Kconfig.debug
@@ -298,6 +298,26 @@ config SLUB_STATS
 	  out which slabs are relevant to a particular load.
 	  Try running: slabinfo -DA
 
+config SLQB_DEBUG
+	default y
+	bool "Enable SLQB debugging support"
+	depends on SLQB
+
+config SLQB_DEBUG_ON
+	default n
+	bool "SLQB debugging on by default"
+	depends on SLQB_DEBUG
+
+config SLQB_SYSFS
+	bool "Create SYSFS entries for slab caches"
+	default n
+	depends on SLQB
+
+config SLQB_STATS
+	bool "Enable SLQB performance statistics"
+	default n
+	depends on SLQB_SYSFS
+
 config DEBUG_PREEMPT
 	bool "Debug preemptible kernel"
 	depends on DEBUG_KERNEL && PREEMPT && (TRACE_IRQFLAGS_SUPPORT || PPC64)
Index: linux-2.6/mm/slqb.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/slqb.c
@@ -0,0 +1,3509 @@
+/*
+ * SLQB: A slab allocator that focuses on per-CPU scaling, and good performance
+ * with order-0 allocations. Fastpath emphasis is placed on local allocation
+ * and freeing, but with a secondary goal of good remote freeing (freeing on
+ * another CPU from that which allocated).
+ *
+ * Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/interrupt.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/ctype.h>
+#include <linux/kallsyms.h>
+#include <linux/memory.h>
+
+/*
+ * TODO
+ * - fix up releasing of offlined data structures. Not a big deal because
+ *   they don't get cumulatively leaked with successive online/offline cycles
+ * - improve fallback paths, allow OOM conditions to flush back per-CPU pages
+ *   to common lists to be reused by other CPUs.
+ * - investigate performance with memoryless nodes. Perhaps CPUs can be given
+ *   a default closest home node via which they can use fastpath functions.
+ *   Perhaps it is not a big problem.
+ */
+
+/*
+ * slqb_page overloads struct page, and is used to manage some slab allocation
+ * aspects; however, to avoid the horrible mess in include/linux/mm_types.h,
+ * we'll just define our own struct slqb_page type variant here.
+ */
+struct slqb_page {
+	union {
+		struct {
+			unsigned long	flags;		/* mandatory */
+			atomic_t	_count;		/* mandatory */
+			unsigned int	inuse;		/* Nr of objects */
+			struct kmem_cache_list *list;	/* Pointer to list */
+			void		 **freelist;	/* LIFO freelist */
+			union {
+				struct list_head lru;	/* misc. list */
+				struct rcu_head rcu_head; /* for rcu freeing */
+			};
+		};
+		struct page page;
+	};
+};
+static inline void struct_slqb_page_wrong_size(void)
+{ BUILD_BUG_ON(sizeof(struct slqb_page) != sizeof(struct page)); }
+
+#define PG_SLQB_BIT (1 << PG_slab)
+
+static int kmem_size __read_mostly;
+#ifdef CONFIG_NUMA
+static int numa_platform __read_mostly;
+#else
+static const int numa_platform = 0;
+#endif
+
+static inline int slab_hiwater(struct kmem_cache *s)
+{
+	return s->hiwater;
+}
+
+static inline int slab_freebatch(struct kmem_cache *s)
+{
+	return s->freebatch;
+}
+
+/*
+ * Lock order:
+ * kmem_cache_node->list_lock
+ *   kmem_cache_remote_free->lock
+ *
+ * Data structures:
+ * SLQB is primarily per-cpu. For each kmem_cache, each CPU has:
+ *
+ * - A LIFO list of node-local objects. Allocation and freeing of node local
+ *   objects goes first to this list.
+ *
+ * - 2 Lists of slab pages, free and partial pages. If an allocation misses
+ *   the object list, it tries from the partial list, then the free list.
+ *   After freeing an object to the object list, if it is over a watermark,
+ *   some objects are freed back to pages. If an allocation misses these lists,
+ *   a new slab page is allocated from the page allocator. If the free list
+ *   reaches a watermark, some of its pages are returned to the page allocator.
+ *
+ * - A remote free queue, where objects freed that did not come from the local
+ *   node are queued to. When this reaches a watermark, the objects are
+ *   flushed.
+ *
+ * - A remotely freed queue, where objects allocated from this CPU are flushed
+ *   to from other CPUs' remote free queues. kmem_cache_remote_free->lock is
+ *   used to protect access to this queue.
+ *
+ *   When the remotely freed queue reaches a watermark, a flag is set to tell
+ *   the owner CPU to check it. The owner CPU will then check the queue on the
+ *   next allocation that misses the object list. It will move all objects from
+ *   this list onto the object list and then allocate one.
+ *
+ *   This system of remote queueing is intended to reduce lock and remote
+ *   cacheline acquisitions, and give a cooling off period for remotely freed
+ *   objects before they are re-allocated.
+ *
+ * node specific allocations from somewhere other than the local node are
+ * handled by a per-node list which is the same as the above per-CPU data
+ * structures except for the following differences:
+ *
+ * - kmem_cache_node->list_lock is used to protect access for multiple CPUs to
+ *   allocate from a given node.
+ *
+ * - There is no remote free queue. Nodes don't free objects, CPUs do.
+ */
+
+static inline void slqb_stat_inc(struct kmem_cache_list *list,
+				enum stat_item si)
+{
+#ifdef CONFIG_SLQB_STATS
+	list->stats[si]++;
+#endif
+}
+
+static inline void slqb_stat_add(struct kmem_cache_list *list,
+				enum stat_item si, unsigned long nr)
+{
+#ifdef CONFIG_SLQB_STATS
+	list->stats[si] += nr;
+#endif
+}
+
+static inline int slqb_page_to_nid(struct slqb_page *page)
+{
+	return page_to_nid(&page->page);
+}
+
+static inline void *slqb_page_address(struct slqb_page *page)
+{
+	return page_address(&page->page);
+}
+
+static inline struct zone *slqb_page_zone(struct slqb_page *page)
+{
+	return page_zone(&page->page);
+}
+
+static inline int virt_to_nid(const void *addr)
+{
+#ifdef virt_to_page_fast
+	return page_to_nid(virt_to_page_fast(addr));
+#else
+	return page_to_nid(virt_to_page(addr));
+#endif
+}
+
+static inline struct slqb_page *virt_to_head_slqb_page(const void *addr)
+{
+	struct page *p;
+
+	p = virt_to_head_page(addr);
+	return (struct slqb_page *)p;
+}
+
+static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
+						unsigned int order)
+{
+	struct page *p;
+
+	if (nid == -1)
+		p = alloc_pages(flags, order);
+	else
+		p = alloc_pages_node(nid, flags, order);
+
+	return (struct slqb_page *)p;
+}
+
+static inline void __free_slqb_pages(struct slqb_page *page, unsigned int order)
+{
+	struct page *p = &page->page;
+
+	reset_page_mapcount(p);
+	p->mapping = NULL;
+	VM_BUG_ON(!(p->flags & PG_SLQB_BIT));
+	p->flags &= ~PG_SLQB_BIT;
+
+	__free_pages(p, order);
+}
+
+#ifdef CONFIG_SLQB_DEBUG
+static inline int slab_debug(struct kmem_cache *s)
+{
+	return (s->flags &
+			(SLAB_DEBUG_FREE |
+			 SLAB_RED_ZONE |
+			 SLAB_POISON |
+			 SLAB_STORE_USER |
+			 SLAB_TRACE));
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+	return s->flags & SLAB_POISON;
+}
+#else
+static inline int slab_debug(struct kmem_cache *s)
+{
+	return 0;
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+	return 0;
+}
+#endif
+
+#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
+				SLAB_POISON | SLAB_STORE_USER)
+
+/* Internal SLQB flags */
+#define __OBJECT_POISON		0x80000000 /* Poison object */
+
+/* Not all arches define cache_line_size */
+#ifndef cache_line_size
+#define cache_line_size()	L1_CACHE_BYTES
+#endif
+
+#ifdef CONFIG_SMP
+static struct notifier_block slab_notifier;
+#endif
+
+/* A list of all slab caches on the system */
+static DECLARE_RWSEM(slqb_lock);
+static LIST_HEAD(slab_caches);
+
+/*
+ * Tracking user of a slab.
+ */
+struct track {
+	void *addr;		/* Called from address */
+	int cpu;		/* Was running on cpu */
+	int pid;		/* Pid context */
+	unsigned long when;	/* When did the operation occur */
+};
+
+enum track_item { TRACK_ALLOC, TRACK_FREE };
+
+static struct kmem_cache kmem_cache_cache;
+
+#ifdef CONFIG_SLQB_SYSFS
+static int sysfs_slab_add(struct kmem_cache *s);
+static void sysfs_slab_remove(struct kmem_cache *s);
+#else
+static inline int sysfs_slab_add(struct kmem_cache *s)
+{
+	return 0;
+}
+static inline void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kmem_cache_free(&kmem_cache_cache, s);
+}
+#endif
+
+/********************************************************************
+ * 			Core slab cache functions
+ *******************************************************************/
+
+static int __slab_is_available __read_mostly;
+int slab_is_available(void)
+{
+	return __slab_is_available;
+}
+
+static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
+{
+#ifdef CONFIG_SMP
+	VM_BUG_ON(!s->cpu_slab[cpu]);
+	return s->cpu_slab[cpu];
+#else
+	return &s->cpu_slab;
+#endif
+}
+
+static inline int check_valid_pointer(struct kmem_cache *s,
+				struct slqb_page *page, const void *object)
+{
+	void *base;
+
+	base = slqb_page_address(page);
+	if (object < base || object >= base + s->objects * s->size ||
+		(object - base) % s->size) {
+		return 0;
+	}
+
+	return 1;
+}
+
+static inline void *get_freepointer(struct kmem_cache *s, void *object)
+{
+	return *(void **)(object + s->offset);
+}
+
+static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
+{
+	*(void **)(object + s->offset) = fp;
+}
+
+/* Loop over all objects in a slab */
+#define for_each_object(__p, __s, __addr) \
+	for (__p = (__addr); __p < (__addr) + (__s)->objects * (__s)->size;\
+			__p += (__s)->size)
+
+/* Scan freelist */
+#define for_each_free_object(__p, __s, __free) \
+	for (__p = (__free); (__p) != NULL; __p = get_freepointer((__s),\
+		__p))
+
+#ifdef CONFIG_SLQB_DEBUG
+/*
+ * Debug settings:
+ */
+#ifdef CONFIG_SLQB_DEBUG_ON
+static int slqb_debug __read_mostly = DEBUG_DEFAULT_FLAGS;
+#else
+static int slqb_debug __read_mostly;
+#endif
+
+static char *slqb_debug_slabs;
+
+/*
+ * Object debugging
+ */
+static void print_section(char *text, u8 *addr, unsigned int length)
+{
+	int i, offset;
+	int newline = 1;
+	char ascii[17];
+
+	ascii[16] = 0;
+
+	for (i = 0; i < length; i++) {
+		if (newline) {
+			printk(KERN_ERR "%8s 0x%p: ", text, addr + i);
+			newline = 0;
+		}
+		printk(KERN_CONT " %02x", addr[i]);
+		offset = i % 16;
+		ascii[offset] = isgraph(addr[i]) ? addr[i] : '.';
+		if (offset == 15) {
+			printk(KERN_CONT " %s\n", ascii);
+			newline = 1;
+		}
+	}
+	if (!newline) {
+		i %= 16;
+		while (i < 16) {
+			printk(KERN_CONT "   ");
+			ascii[i] = ' ';
+			i++;
+		}
+		printk(KERN_CONT " %s\n", ascii);
+	}
+}
+
+static struct track *get_track(struct kmem_cache *s, void *object,
+	enum track_item alloc)
+{
+	struct track *p;
+
+	if (s->offset)
+		p = object + s->offset + sizeof(void *);
+	else
+		p = object + s->inuse;
+
+	return p + alloc;
+}
+
+static void set_track(struct kmem_cache *s, void *object,
+				enum track_item alloc, void *addr)
+{
+	struct track *p;
+
+	if (s->offset)
+		p = object + s->offset + sizeof(void *);
+	else
+		p = object + s->inuse;
+
+	p += alloc;
+	if (addr) {
+		p->addr = addr;
+		p->cpu = raw_smp_processor_id();
+		p->pid = current ? current->pid : -1;
+		p->when = jiffies;
+	} else
+		memset(p, 0, sizeof(struct track));
+}
+
+static void init_tracking(struct kmem_cache *s, void *object)
+{
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
+	set_track(s, object, TRACK_FREE, NULL);
+	set_track(s, object, TRACK_ALLOC, NULL);
+}
+
+static void print_track(const char *s, struct track *t)
+{
+	if (!t->addr)
+		return;
+
+	printk(KERN_ERR "INFO: %s in ", s);
+	__print_symbol("%s", (unsigned long)t->addr);
+	printk(" age=%lu cpu=%u pid=%d\n", jiffies - t->when, t->cpu, t->pid);
+}
+
+static void print_tracking(struct kmem_cache *s, void *object)
+{
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
+	print_track("Allocated", get_track(s, object, TRACK_ALLOC));
+	print_track("Freed", get_track(s, object, TRACK_FREE));
+}
+
+static void print_page_info(struct slqb_page *page)
+{
+	printk(KERN_ERR "INFO: Slab 0x%p used=%u fp=0x%p flags=0x%04lx\n",
+		page, page->inuse, page->freelist, page->flags);
+
+}
+
+#define MAX_ERR_STR 100
+static void slab_bug(struct kmem_cache *s, char *fmt, ...)
+{
+	va_list args;
+	char buf[MAX_ERR_STR];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printk(KERN_ERR "========================================"
+			"=====================================\n");
+	printk(KERN_ERR "BUG %s: %s\n", s->name, buf);
+	printk(KERN_ERR "----------------------------------------"
+			"-------------------------------------\n\n");
+}
+
+static void slab_fix(struct kmem_cache *s, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printk(KERN_ERR "FIX %s: %s\n", s->name, buf);
+}
+
+static void print_trailer(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+	unsigned int off;	/* Offset of last byte */
+	u8 *addr = slqb_page_address(page);
+
+	print_tracking(s, p);
+
+	print_page_info(page);
+
+	printk(KERN_ERR "INFO: Object 0x%p @offset=%tu fp=0x%p\n\n",
+			p, p - addr, get_freepointer(s, p));
+
+	if (p > addr + 16)
+		print_section("Bytes b4", p - 16, 16);
+
+	print_section("Object", p, min(s->objsize, 128));
+
+	if (s->flags & SLAB_RED_ZONE)
+		print_section("Redzone", p + s->objsize, s->inuse - s->objsize);
+
+	if (s->offset)
+		off = s->offset + sizeof(void *);
+	else
+		off = s->inuse;
+
+	if (s->flags & SLAB_STORE_USER)
+		off += 2 * sizeof(struct track);
+
+	if (off != s->size) {
+		/* Beginning of the filler is the free pointer */
+		print_section("Padding", p + off, s->size - off);
+	}
+
+	dump_stack();
+}
+
+static void object_err(struct kmem_cache *s, struct slqb_page *page,
+			u8 *object, char *reason)
+{
+	slab_bug(s, reason);
+	print_trailer(s, page, object);
+}
+
+static void slab_err(struct kmem_cache *s, struct slqb_page *page,
+			char *fmt, ...)
+{
+	va_list args;
+	char buf[MAX_ERR_STR];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	slab_bug(s, "%s", buf);
+	print_page_info(page);
+	dump_stack();
+}
+
+static void init_object(struct kmem_cache *s, void *object, int active)
+{
+	u8 *p = object;
+
+	if (s->flags & __OBJECT_POISON) {
+		memset(p, POISON_FREE, s->objsize - 1);
+		p[s->objsize - 1] = POISON_END;
+	}
+
+	if (s->flags & SLAB_RED_ZONE) {
+		memset(p + s->objsize,
+			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
+			s->inuse - s->objsize);
+	}
+}
+
+static u8 *check_bytes(u8 *start, unsigned int value, unsigned int bytes)
+{
+	while (bytes) {
+		if (*start != (u8)value)
+			return start;
+		start++;
+		bytes--;
+	}
+	return NULL;
+}
+
+static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
+				void *from, void *to)
+{
+	slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
+	memset(from, data, to - from);
+}
+
+static int check_bytes_and_report(struct kmem_cache *s, struct slqb_page *page,
+			u8 *object, char *what,
+			u8 *start, unsigned int value, unsigned int bytes)
+{
+	u8 *fault;
+	u8 *end;
+
+	fault = check_bytes(start, value, bytes);
+	if (!fault)
+		return 1;
+
+	end = start + bytes;
+	while (end > fault && end[-1] == value)
+		end--;
+
+	slab_bug(s, "%s overwritten", what);
+	printk(KERN_ERR "INFO: 0x%p-0x%p. First byte 0x%x instead of 0x%x\n",
+					fault, end - 1, fault[0], value);
+	print_trailer(s, page, object);
+
+	restore_bytes(s, what, value, fault, end);
+	return 0;
+}
+
+/*
+ * Object layout:
+ *
+ * object address
+ * 	Bytes of the object to be managed.
+ * 	If the freepointer may overlay the object then the free
+ * 	pointer is the first word of the object.
+ *
+ * 	Poisoning uses 0x6b (POISON_FREE) and the last byte is
+ * 	0xa5 (POISON_END)
+ *
+ * object + s->objsize
+ * 	Padding to reach word boundary. This is also used for Redzoning.
+ * 	Padding is extended by another word if Redzoning is enabled and
+ * 	objsize == inuse.
+ *
+ * 	We fill with 0xbb (RED_INACTIVE) for inactive objects and with
+ * 	0xcc (RED_ACTIVE) for objects in use.
+ *
+ * object + s->inuse
+ * 	Meta data starts here.
+ *
+ * 	A. Free pointer (if we cannot overwrite object on free)
+ * 	B. Tracking data for SLAB_STORE_USER
+ * 	C. Padding to reach required alignment boundary, or at minimum
+ * 		one word if debugging is on, to be able to detect writes
+ * 		before the word boundary.
+ *
+ *	Padding is done using 0x5a (POISON_INUSE)
+ *
+ * object + s->size
+ * 	Nothing is used beyond s->size.
+ */
+
+static int check_pad_bytes(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+	unsigned long off = s->inuse;	/* The end of info */
+
+	if (s->offset) {
+		/* Freepointer is placed after the object. */
+		off += sizeof(void *);
+	}
+
+	if (s->flags & SLAB_STORE_USER) {
+		/* We also have user information there */
+		off += 2 * sizeof(struct track);
+	}
+
+	if (s->size == off)
+		return 1;
+
+	return check_bytes_and_report(s, page, p, "Object padding",
+				p + off, POISON_INUSE, s->size - off);
+}
+
+static int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+{
+	u8 *start;
+	u8 *fault;
+	u8 *end;
+	int length;
+	int remainder;
+
+	if (!(s->flags & SLAB_POISON))
+		return 1;
+
+	start = slqb_page_address(page);
+	end = start + (PAGE_SIZE << s->order);
+	length = s->objects * s->size;
+	remainder = end - (start + length);
+	if (!remainder)
+		return 1;
+
+	fault = check_bytes(start + length, POISON_INUSE, remainder);
+	if (!fault)
+		return 1;
+
+	while (end > fault && end[-1] == POISON_INUSE)
+		end--;
+
+	slab_err(s, page, "Padding overwritten. 0x%p-0x%p", fault, end - 1);
+	print_section("Padding", start, length);
+
+	restore_bytes(s, "slab padding", POISON_INUSE, start, end);
+	return 0;
+}
+
+static int check_object(struct kmem_cache *s, struct slqb_page *page,
+					void *object, int active)
+{
+	u8 *p = object;
+	u8 *endobject = object + s->objsize;
+
+	if (s->flags & SLAB_RED_ZONE) {
+		unsigned int red =
+			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE;
+
+		if (!check_bytes_and_report(s, page, object, "Redzone",
+			endobject, red, s->inuse - s->objsize))
+			return 0;
+	} else {
+		if ((s->flags & SLAB_POISON) && s->objsize < s->inuse) {
+			check_bytes_and_report(s, page, p, "Alignment padding",
+				endobject, POISON_INUSE, s->inuse - s->objsize);
+		}
+	}
+
+	if (s->flags & SLAB_POISON) {
+		if (!active && (s->flags & __OBJECT_POISON)) {
+			if (!check_bytes_and_report(s, page, p, "Poison", p,
+					POISON_FREE, s->objsize - 1))
+				return 0;
+
+			if (!check_bytes_and_report(s, page, p, "Poison",
+					p + s->objsize - 1, POISON_END, 1))
+				return 0;
+		}
+
+		/*
+		 * check_pad_bytes cleans up on its own.
+		 */
+		check_pad_bytes(s, page, p);
+	}
+
+	return 1;
+}
+
+static int check_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	if (!(page->flags & PG_SLQB_BIT)) {
+		slab_err(s, page, "Not a valid slab page");
+		return 0;
+	}
+	if (page->inuse == 0) {
+		slab_err(s, page, "inuse before free / after alloc");
+		return 0;
+	}
+	if (page->inuse > s->objects) {
+		slab_err(s, page, "inuse %u > max %u",
+			page->inuse, s->objects);
+		return 0;
+	}
+	/* Slab_pad_check fixes things up after itself */
+	slab_pad_check(s, page);
+	return 1;
+}
+
+static void trace(struct kmem_cache *s, struct slqb_page *page,
+			void *object, int alloc)
+{
+	if (s->flags & SLAB_TRACE) {
+		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+			s->name,
+			alloc ? "alloc" : "free",
+			object, page->inuse,
+			page->freelist);
+
+		if (!alloc)
+			print_section("Object", (void *)object, s->objsize);
+
+		dump_stack();
+	}
+}
+
+static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
+				void *object)
+{
+	if (!slab_debug(s))
+		return;
+
+	if (!(s->flags & (SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON)))
+		return;
+
+	init_object(s, object, 0);
+	init_tracking(s, object);
+}
+
+static int alloc_debug_processing(struct kmem_cache *s,
+					void *object, void *addr)
+{
+	struct slqb_page *page;
+	page = virt_to_head_slqb_page(object);
+
+	if (!check_slab(s, page))
+		goto bad;
+
+	if (!check_valid_pointer(s, page, object)) {
+		object_err(s, page, object, "Freelist Pointer check fails");
+		goto bad;
+	}
+
+	if (object && !check_object(s, page, object, 0))
+		goto bad;
+
+	/* Success perform special debug activities for allocs */
+	if (s->flags & SLAB_STORE_USER)
+		set_track(s, object, TRACK_ALLOC, addr);
+	trace(s, page, object, 1);
+	init_object(s, object, 1);
+	return 1;
+
+bad:
+	return 0;
+}
+
+static int free_debug_processing(struct kmem_cache *s,
+					void *object, void *addr)
+{
+	struct slqb_page *page;
+	page = virt_to_head_slqb_page(object);
+
+	if (!check_slab(s, page))
+		goto fail;
+
+	if (!check_valid_pointer(s, page, object)) {
+		slab_err(s, page, "Invalid object pointer 0x%p", object);
+		goto fail;
+	}
+
+	if (!check_object(s, page, object, 1))
+		return 0;
+
+	/* Special debug activities for freeing objects */
+	if (s->flags & SLAB_STORE_USER)
+		set_track(s, object, TRACK_FREE, addr);
+	trace(s, page, object, 0);
+	init_object(s, object, 0);
+	return 1;
+
+fail:
+	slab_fix(s, "Object at 0x%p not freed", object);
+	return 0;
+}
+
+static int __init setup_slqb_debug(char *str)
+{
+	slqb_debug = DEBUG_DEFAULT_FLAGS;
+	if (*str++ != '=' || !*str) {
+		/*
+		 * No options specified. Switch on full debugging.
+		 */
+		goto out;
+	}
+
+	if (*str == ',') {
+		/*
+		 * No options but restriction on slabs. This means full
+		 * debugging for slabs matching a pattern.
+		 */
+		goto check_slabs;
+	}
+
+	slqb_debug = 0;
+	if (*str == '-') {
+		/*
+		 * Switch off all debugging measures.
+		 */
+		goto out;
+	}
+
+	/*
+	 * Determine which debug features should be switched on
+	 */
+	for (; *str && *str != ','; str++) {
+		switch (tolower(*str)) {
+		case 'f':
+			slqb_debug |= SLAB_DEBUG_FREE;
+			break;
+		case 'z':
+			slqb_debug |= SLAB_RED_ZONE;
+			break;
+		case 'p':
+			slqb_debug |= SLAB_POISON;
+			break;
+		case 'u':
+			slqb_debug |= SLAB_STORE_USER;
+			break;
+		case 't':
+			slqb_debug |= SLAB_TRACE;
+			break;
+		default:
+			printk(KERN_ERR "slqb_debug option '%c' "
+				"unknown. skipped\n", *str);
+		}
+	}
+
+check_slabs:
+	if (*str == ',')
+		slqb_debug_slabs = str + 1;
+out:
+	return 1;
+}
+
+__setup("slqb_debug", setup_slqb_debug);
+
+static unsigned long kmem_cache_flags(unsigned long objsize,
+				unsigned long flags, const char *name,
+				void (*ctor)(void *))
+{
+	/*
+	 * Enable debugging if selected on the kernel commandline.
+	 */
+	if (slqb_debug && (!slqb_debug_slabs ||
+	    strncmp(slqb_debug_slabs, name,
+		strlen(slqb_debug_slabs)) == 0))
+			flags |= slqb_debug;
+
+	return flags;
+}
+#else
+static inline void setup_object_debug(struct kmem_cache *s,
+			struct slqb_page *page, void *object)
+{
+}
+
+static inline int alloc_debug_processing(struct kmem_cache *s,
+			void *object, void *addr)
+{
+	return 0;
+}
+
+static inline int free_debug_processing(struct kmem_cache *s,
+			void *object, void *addr)
+{
+	return 0;
+}
+
+static inline int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+{
+	return 1;
+}
+
+static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
+			void *object, int active)
+{
+	return 1;
+}
+
+static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page)
+{
+}
+
+static inline unsigned long kmem_cache_flags(unsigned long objsize,
+	unsigned long flags, const char *name, void (*ctor)(void *))
+{
+	return flags;
+}
+
+static const int slqb_debug = 0;
+#endif
+
+/*
+ * allocate a new slab (return its corresponding struct slqb_page)
+ */
+static struct slqb_page *allocate_slab(struct kmem_cache *s,
+					gfp_t flags, int node)
+{
+	struct slqb_page *page;
+	int pages = 1 << s->order;
+
+	flags |= s->allocflags;
+
+	page = alloc_slqb_pages_node(node, flags, s->order);
+	if (!page)
+		return NULL;
+
+	mod_zone_page_state(slqb_page_zone(page),
+		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+		pages);
+
+	return page;
+}
+
+/*
+ * Called once for each object on a new slab page
+ */
+static void setup_object(struct kmem_cache *s,
+				struct slqb_page *page, void *object)
+{
+	setup_object_debug(s, page, object);
+	if (unlikely(s->ctor))
+		s->ctor(object);
+}
+
+/*
+ * Allocate a new slab, set up its object list.
+ */
+static struct slqb_page *new_slab_page(struct kmem_cache *s,
+				gfp_t flags, int node, unsigned int colour)
+{
+	struct slqb_page *page;
+	void *start;
+	void *last;
+	void *p;
+
+	BUG_ON(flags & GFP_SLAB_BUG_MASK);
+
+	page = allocate_slab(s,
+		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
+	if (!page)
+		goto out;
+
+	page->flags |= PG_SLQB_BIT;
+
+	start = page_address(&page->page);
+
+	if (unlikely(slab_poison(s)))
+		memset(start, POISON_INUSE, PAGE_SIZE << s->order);
+
+	start += colour;
+
+	last = start;
+	for_each_object(p, s, start) {
+		setup_object(s, page, p);
+		set_freepointer(s, last, p);
+		last = p;
+	}
+	set_freepointer(s, last, NULL);
+
+	page->freelist = start;
+	page->inuse = 0;
+out:
+	return page;
+}
+
+/*
+ * Free a slab page back to the page allocator
+ */
+static void __free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	int pages = 1 << s->order;
+
+	if (unlikely(slab_debug(s))) {
+		void *p;
+
+		slab_pad_check(s, page);
+		for_each_free_object(p, s, page->freelist)
+			check_object(s, page, p, 0);
+	}
+
+	mod_zone_page_state(slqb_page_zone(page),
+		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+		-pages);
+
+	__free_slqb_pages(page, s->order);
+}
+
+static void rcu_free_slab(struct rcu_head *h)
+{
+	struct slqb_page *page;
+
+	page = container_of((struct list_head *)h, struct slqb_page, lru);
+	__free_slab(page->list->cache, page);
+}
+
+static void free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	VM_BUG_ON(page->inuse);
+	if (unlikely(s->flags & SLAB_DESTROY_BY_RCU))
+		call_rcu(&page->rcu_head, rcu_free_slab);
+	else
+		__free_slab(s, page);
+}
+
+/*
+ * Return an object to its slab.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static int free_object_to_page(struct kmem_cache *s,
+			struct kmem_cache_list *l, struct slqb_page *page,
+			void *object)
+{
+	VM_BUG_ON(page->list != l);
+
+	set_freepointer(s, object, page->freelist);
+	page->freelist = object;
+	page->inuse--;
+
+	if (!page->inuse) {
+		if (likely(s->objects > 1)) {
+			l->nr_partial--;
+			list_del(&page->lru);
+		}
+		l->nr_slabs--;
+		free_slab(s, page);
+		slqb_stat_inc(l, FLUSH_SLAB_FREE);
+		return 1;
+
+	} else if (page->inuse + 1 == s->objects) {
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, FLUSH_SLAB_PARTIAL);
+		return 0;
+	}
+	return 0;
+}
+
+#ifdef CONFIG_SMP
+static void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page,
+				void *object, struct kmem_cache_cpu *c);
+#endif
+
+/*
+ * Flush the LIFO list of objects on a list. They are sent back to their pages
+ * in case the pages also belong to the list, or to our CPU's remote-free list
+ * in the case they do not.
+ *
+ * Doesn't flush the entire list. flush_free_list_all does.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void flush_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	struct kmem_cache_cpu *c;
+	void **head;
+	int nr;
+
+	nr = l->freelist.nr;
+	if (unlikely(!nr))
+		return;
+
+	nr = min(slab_freebatch(s), nr);
+
+	slqb_stat_inc(l, FLUSH_FREE_LIST);
+	slqb_stat_add(l, FLUSH_FREE_LIST_OBJECTS, nr);
+
+	c = get_cpu_slab(s, smp_processor_id());
+
+	l->freelist.nr -= nr;
+	head = l->freelist.head;
+
+	do {
+		struct slqb_page *page;
+		void **object;
+
+		object = head;
+		VM_BUG_ON(!object);
+		head = get_freepointer(s, object);
+		page = virt_to_head_slqb_page(object);
+
+#ifdef CONFIG_SMP
+		if (page->list != l) {
+			slab_free_to_remote(s, page, object, c);
+			slqb_stat_inc(l, FLUSH_FREE_LIST_REMOTE);
+		} else
+#endif
+			free_object_to_page(s, l, page, object);
+
+		nr--;
+	} while (nr);
+
+	l->freelist.head = head;
+	if (!l->freelist.nr)
+		l->freelist.tail = NULL;
+}
+
+static void flush_free_list_all(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	while (l->freelist.nr)
+		flush_free_list(s, l);
+}
+
+#ifdef CONFIG_SMP
+/*
+ * If enough objects have been remotely freed back to this list,
+ * remote_free_check will be set. In which case, we'll eventually come here
+ * to take those objects off our remote_free list and onto our LIFO freelist.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void claim_remote_free_list(struct kmem_cache *s,
+					struct kmem_cache_list *l)
+{
+	void **head, **tail;
+	int nr;
+
+	VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
+
+	if (!l->remote_free.list.nr)
+		return;
+
+	spin_lock(&l->remote_free.lock);
+
+	l->remote_free_check = 0;
+	head = l->remote_free.list.head;
+	l->remote_free.list.head = NULL;
+	tail = l->remote_free.list.tail;
+	l->remote_free.list.tail = NULL;
+	nr = l->remote_free.list.nr;
+	l->remote_free.list.nr = 0;
+
+	spin_unlock(&l->remote_free.lock);
+
+	VM_BUG_ON(!nr);
+
+	if (!l->freelist.nr) {
+		/* Get head hot for likely subsequent allocation or flush */
+		prefetchw(head);
+		l->freelist.head = head;
+	} else
+		set_freepointer(s, l->freelist.tail, head);
+	l->freelist.tail = tail;
+
+	l->freelist.nr += nr;
+
+	slqb_stat_inc(l, CLAIM_REMOTE_LIST);
+	slqb_stat_add(l, CLAIM_REMOTE_LIST_OBJECTS, nr);
+}
+#endif
+
+/*
+ * Allocation fastpath. Get an object from the list's LIFO freelist, or
+ * return NULL if it is empty.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static __always_inline void *__cache_list_get_object(struct kmem_cache *s,
+						struct kmem_cache_list *l)
+{
+	void *object;
+
+	object = l->freelist.head;
+	if (likely(object)) {
+		void *next = get_freepointer(s, object);
+
+		VM_BUG_ON(!l->freelist.nr);
+		l->freelist.nr--;
+		l->freelist.head = next;
+
+		return object;
+	}
+	VM_BUG_ON(l->freelist.nr);
+
+#ifdef CONFIG_SMP
+	if (unlikely(l->remote_free_check)) {
+		claim_remote_free_list(s, l);
+
+		if (l->freelist.nr > slab_hiwater(s))
+			flush_free_list(s, l);
+
+		/* repetition here helps gcc :( */
+		object = l->freelist.head;
+		if (likely(object)) {
+			void *next = get_freepointer(s, object);
+
+			VM_BUG_ON(!l->freelist.nr);
+			l->freelist.nr--;
+			l->freelist.head = next;
+
+			return object;
+		}
+		VM_BUG_ON(l->freelist.nr);
+	}
+#endif
+
+	return NULL;
+}
+
+/*
+ * Slow(er) path. Get a page from this list's existing pages. Will be a
+ * new empty page in the case that __slab_alloc_page has just been called
+ * (empty pages otherwise never get queued up on the lists), or a partial page
+ * already on the list.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static noinline void *__cache_list_get_page(struct kmem_cache *s,
+				struct kmem_cache_list *l)
+{
+	struct slqb_page *page;
+	void *object;
+
+	if (unlikely(!l->nr_partial))
+		return NULL;
+
+	page = list_first_entry(&l->partial, struct slqb_page, lru);
+	VM_BUG_ON(page->inuse == s->objects);
+	if (page->inuse + 1 == s->objects) {
+		l->nr_partial--;
+		list_del(&page->lru);
+	}
+
+	VM_BUG_ON(!page->freelist);
+
+	page->inuse++;
+
+	object = page->freelist;
+	page->freelist = get_freepointer(s, object);
+	if (page->freelist)
+		prefetchw(page->freelist);
+	VM_BUG_ON((page->inuse == s->objects) != (page->freelist == NULL));
+	slqb_stat_inc(l, ALLOC_SLAB_FILL);
+
+	return object;
+}
+
+/*
+ * Allocation slowpath. Allocate a new slab page from the page allocator, and
+ * put it on the list's partial list. Must be followed by an allocation so
+ * that we don't have dangling empty pages on the partial list.
+ *
+ * Returns 0 on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void *__slab_alloc_page(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	struct slqb_page *page;
+	struct kmem_cache_list *l;
+	struct kmem_cache_cpu *c;
+	unsigned int colour;
+	void *object;
+
+	c = get_cpu_slab(s, smp_processor_id());
+	colour = c->colour_next;
+	c->colour_next += s->colour_off;
+	if (c->colour_next >= s->colour_range)
+		c->colour_next = 0;
+
+	/* XXX: load any partial? */
+
+	/* Caller handles __GFP_ZERO */
+	gfpflags &= ~__GFP_ZERO;
+
+	if (gfpflags & __GFP_WAIT)
+		local_irq_enable();
+	page = new_slab_page(s, gfpflags, node, colour);
+	if (gfpflags & __GFP_WAIT)
+		local_irq_disable();
+	if (unlikely(!page))
+		return page;
+
+	if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == numa_node_id())) {
+		struct kmem_cache_cpu *c;
+		int cpu = smp_processor_id();
+
+		c = get_cpu_slab(s, cpu);
+		l = &c->list;
+		page->list = l;
+
+		l->nr_slabs++;
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, ALLOC);
+		slqb_stat_inc(l, ALLOC_SLAB_NEW);
+		object = __cache_list_get_page(s, l);
+	} else {
+#ifdef CONFIG_NUMA
+		struct kmem_cache_node *n;
+
+		n = s->node[slqb_page_to_nid(page)];
+		l = &n->list;
+		page->list = l;
+
+		spin_lock(&n->list_lock);
+		l->nr_slabs++;
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, ALLOC);
+		slqb_stat_inc(l, ALLOC_SLAB_NEW);
+		object = __cache_list_get_page(s, l);
+		spin_unlock(&n->list_lock);
+#endif
+	}
+	VM_BUG_ON(!object);
+	return object;
+}
+
+#ifdef CONFIG_NUMA
+static noinline int alternate_nid(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	if (in_interrupt() || (gfpflags & __GFP_THISNODE))
+		return node;
+	if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD))
+		return cpuset_mem_spread_node();
+	else if (current->mempolicy)
+		return slab_node(current->mempolicy);
+	return node;
+}
+
+/*
+ * Allocate an object from a remote node. Return NULL if none could be found
+ * (in which case, caller should allocate a new slab)
+ *
+ * Must be called with interrupts disabled.
+ */
+static void *__remote_slab_alloc_node(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	struct kmem_cache_node *n;
+	struct kmem_cache_list *l;
+	void *object;
+
+	n = s->node[node];
+	if (unlikely(!n)) /* node has no memory */
+		return NULL;
+	l = &n->list;
+
+	spin_lock(&n->list_lock);
+
+	object = __cache_list_get_object(s, l);
+	if (unlikely(!object)) {
+		object = __cache_list_get_page(s, l);
+		if (unlikely(!object)) {
+			spin_unlock(&n->list_lock);
+			return __slab_alloc_page(s, gfpflags, node);
+		}
+	}
+	if (likely(object))
+		slqb_stat_inc(l, ALLOC);
+	spin_unlock(&n->list_lock);
+	return object;
+}
+
+static noinline void *__remote_slab_alloc(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	void *object;
+	struct zonelist *zonelist;
+	struct zoneref *z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(gfpflags);
+
+	object = __remote_slab_alloc_node(s, gfpflags, node);
+	if (likely(object || (gfpflags & __GFP_THISNODE)))
+		return object;
+
+	zonelist = node_zonelist(slab_node(current->mempolicy), gfpflags);
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		if (!cpuset_zone_allowed_hardwall(zone, gfpflags))
+			continue;
+
+		node = zone_to_nid(zone);
+		object = __remote_slab_alloc_node(s, gfpflags, node);
+		if (likely(object))
+			return object;
+	}
+	return NULL;
+}
+#endif
+
+/*
+ * Main allocation path. Return an object, or NULL on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void *__slab_alloc(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	void *object;
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_list *l;
+
+#ifdef CONFIG_NUMA
+	if (unlikely(node != -1) && unlikely(node != numa_node_id())) {
+try_remote:
+		return __remote_slab_alloc(s, gfpflags, node);
+	}
+#endif
+
+	c = get_cpu_slab(s, smp_processor_id());
+	VM_BUG_ON(!c);
+	l = &c->list;
+	object = __cache_list_get_object(s, l);
+	if (unlikely(!object)) {
+		object = __cache_list_get_page(s, l);
+		if (unlikely(!object)) {
+			object = __slab_alloc_page(s, gfpflags, node);
+#ifdef CONFIG_NUMA
+			if (unlikely(!object))
+				goto try_remote;
+#endif
+			return object;
+		}
+	}
+	if (likely(object))
+		slqb_stat_inc(l, ALLOC);
+	return object;
+}
+
+/*
+ * Perform some interrupts-on processing around the main allocation path
+ * (debug checking and memset()ing).
+ */
+static __always_inline void *slab_alloc(struct kmem_cache *s,
+				gfp_t gfpflags, int node, void *addr)
+{
+	void *object;
+	unsigned long flags;
+
+again:
+	local_irq_save(flags);
+	object = __slab_alloc(s, gfpflags, node);
+	local_irq_restore(flags);
+
+	if (unlikely(slab_debug(s)) && likely(object)) {
+		if (unlikely(!alloc_debug_processing(s, object, addr)))
+			goto again;
+	}
+
+	if (unlikely(gfpflags & __GFP_ZERO) && likely(object))
+		memset(object, 0, s->objsize);
+
+	return object;
+}
+
+static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s,
+				gfp_t gfpflags, void *caller)
+{
+	int node = -1;
+#ifdef CONFIG_NUMA
+	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+		node = alternate_nid(s, gfpflags, node);
+#endif
+	return slab_alloc(s, gfpflags, node, caller);
+}
+
+void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+{
+	return __kmem_cache_alloc(s, gfpflags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc);
+
+#ifdef CONFIG_NUMA
+void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+	return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node);
+#endif
+
+#ifdef CONFIG_SMP
+/*
+ * Flush this CPU's remote free list of objects back to the list from where
+ * they originate. They end up on that list's remotely freed list, and
+ * eventually we set its remote_free_check if there are enough objects on it.
+ *
+ * This seems convoluted, but it keeps us from stomping on the target CPU's
+ * fastpath cachelines.
+ *
+ * Must be called with interrupts disabled.
+ */
+static void flush_remote_free_cache(struct kmem_cache *s,
+				struct kmem_cache_cpu *c)
+{
+	struct kmlist *src;
+	struct kmem_cache_list *dst;
+	unsigned int nr;
+	int set;
+
+	src = &c->rlist;
+	nr = src->nr;
+	if (unlikely(!nr))
+		return;
+
+#ifdef CONFIG_SLQB_STATS
+	{
+		struct kmem_cache_list *l = &c->list;
+
+		slqb_stat_inc(l, FLUSH_RFREE_LIST);
+		slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);
+	}
+#endif
+
+	dst = c->remote_cache_list;
+
+	spin_lock(&dst->remote_free.lock);
+
+	if (!dst->remote_free.list.head)
+		dst->remote_free.list.head = src->head;
+	else
+		set_freepointer(s, dst->remote_free.list.tail, src->head);
+	dst->remote_free.list.tail = src->tail;
+
+	src->head = NULL;
+	src->tail = NULL;
+	src->nr = 0;
+
+	if (dst->remote_free.list.nr < slab_freebatch(s))
+		set = 1;
+	else
+		set = 0;
+
+	dst->remote_free.list.nr += nr;
+
+	if (unlikely(dst->remote_free.list.nr >= slab_freebatch(s) && set))
+		dst->remote_free_check = 1;
+
+	spin_unlock(&dst->remote_free.lock);
+}
+
+/*
+ * Free an object to this CPU's remote free list.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void slab_free_to_remote(struct kmem_cache *s,
+				struct slqb_page *page, void *object,
+				struct kmem_cache_cpu *c)
+{
+	struct kmlist *r;
+
+	/*
+	 * Our remote free list corresponds to a different list. Must
+	 * flush it and switch.
+	 */
+	if (page->list != c->remote_cache_list) {
+		flush_remote_free_cache(s, c);
+		c->remote_cache_list = page->list;
+	}
+
+	r = &c->rlist;
+	if (!r->head)
+		r->head = object;
+	else
+		set_freepointer(s, r->tail, object);
+	set_freepointer(s, object, NULL);
+	r->tail = object;
+	r->nr++;
+
+	if (unlikely(r->nr > slab_freebatch(s)))
+		flush_remote_free_cache(s, c);
+}
+#endif
+
+/*
+ * Main freeing path. Free an object back to its home list, or queue it for
+ * remote freeing.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void __slab_free(struct kmem_cache *s,
+				struct slqb_page *page, void *object)
+{
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_list *l;
+	int thiscpu = smp_processor_id();
+
+	c = get_cpu_slab(s, thiscpu);
+	l = &c->list;
+
+	slqb_stat_inc(l, FREE);
+
+	if (!NUMA_BUILD || !numa_platform ||
+			likely(slqb_page_to_nid(page) == numa_node_id())) {
+		/*
+		 * Freeing fastpath. Collects all local-node objects, not
+		 * just those allocated from our per-CPU list. This allows
+		 * fast transfer of objects from one CPU to another within
+		 * a given node.
+		 */
+		set_freepointer(s, object, l->freelist.head);
+		l->freelist.head = object;
+		if (!l->freelist.nr)
+			l->freelist.tail = object;
+		l->freelist.nr++;
+
+		if (unlikely(l->freelist.nr > slab_hiwater(s)))
+			flush_free_list(s, l);
+
+	} else {
+#ifdef CONFIG_NUMA
+		/*
+		 * Freeing an object that was allocated on a remote node.
+		 */
+		slab_free_to_remote(s, page, object, c);
+		slqb_stat_inc(l, FREE_REMOTE);
+#endif
+	}
+}
+
+/*
+ * Perform some interrupts-on processing around the main freeing path
+ * (debug checking).
+ */
+static __always_inline void slab_free(struct kmem_cache *s,
+				struct slqb_page *page, void *object)
+{
+	unsigned long flags;
+
+	prefetchw(object);
+
+	debug_check_no_locks_freed(object, s->objsize);
+	if (likely(object) && unlikely(slab_debug(s))) {
+		if (unlikely(!free_debug_processing(s, object, __builtin_return_address(0))))
+			return;
+	}
+
+	local_irq_save(flags);
+	__slab_free(s, page, object);
+	local_irq_restore(flags);
+}
+
+void kmem_cache_free(struct kmem_cache *s, void *object)
+{
+	struct slqb_page *page = NULL;
+
+	if (numa_platform)
+		page = virt_to_head_slqb_page(object);
+	slab_free(s, page, object);
+}
+EXPORT_SYMBOL(kmem_cache_free);
+
+/*
+ * Calculate the order of allocation given a slab object size.
+ *
+ * Order 0 allocations are preferred since order 0 does not cause fragmentation
+ * in the page allocator, and they have fastpaths in the page allocator. But
+ * also minimise external fragmentation with large objects.
+ */
+static int slab_order(int size, int max_order, int frac)
+{
+	int order;
+
+	if (fls(size - 1) <= PAGE_SHIFT)
+		order = 0;
+	else
+		order = fls(size - 1) - PAGE_SHIFT;
+
+	while (order <= max_order) {
+		unsigned long slab_size = PAGE_SIZE << order;
+		unsigned long objects;
+		unsigned long waste;
+
+		objects = slab_size / size;
+		if (!objects)
+			continue;
+
+		waste = slab_size - (objects * size);
+
+		if (waste * frac <= slab_size)
+			break;
+
+		order++;
+	}
+
+	return order;
+}
+
+static int calculate_order(int size)
+{
+	int order;
+
+	/*
+	 * Attempt to find best configuration for a slab. This
+	 * works by first attempting to generate a layout with
+	 * the best configuration and backing off gradually.
+	 */
+	order = slab_order(size, 1, 4);
+	if (order <= 1)
+		return order;
+
+	/*
+	 * This size cannot fit in order-1. Allow bigger orders, but
+	 * forget about trying to save space.
+	 */
+	order = slab_order(size, MAX_ORDER, 0);
+	if (order <= MAX_ORDER)
+		return order;
+
+	return -ENOSYS;
+}
+
+/*
+ * Figure out what the alignment of the objects will be.
+ */
+static unsigned long calculate_alignment(unsigned long flags,
+				unsigned long align, unsigned long size)
+{
+	/*
+	 * If the user wants hardware cache aligned objects then follow that
+	 * suggestion if the object is sufficiently large.
+	 *
+	 * The hardware cache alignment cannot override the specified
+	 * alignment though. If that is greater, then use it.
+	 */
+	if (flags & SLAB_HWCACHE_ALIGN) {
+		unsigned long ralign = cache_line_size();
+
+		while (size <= ralign / 2)
+			ralign /= 2;
+		align = max(align, ralign);
+	}
+
+	if (align < ARCH_SLAB_MINALIGN)
+		align = ARCH_SLAB_MINALIGN;
+
+	return ALIGN(align, sizeof(void *));
+}
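For example, with 64-byte cache lines and SLAB_HWCACHE_ALIGN, a 20-byte object ends up 32-byte aligned (the loop halves ralign from 64 to 32, since 20 <= 32 but not 20 <= 16), while a 100-byte object keeps the full 64-byte alignment; the result is then raised to at least ARCH_SLAB_MINALIGN and rounded to a multiple of the pointer size.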
+
+static void init_kmem_cache_list(struct kmem_cache *s,
+				struct kmem_cache_list *l)
+{
+	l->cache		= s;
+	l->freelist.nr		= 0;
+	l->freelist.head	= NULL;
+	l->freelist.tail	= NULL;
+	l->nr_partial		= 0;
+	l->nr_slabs		= 0;
+	INIT_LIST_HEAD(&l->partial);
+
+#ifdef CONFIG_SMP
+	l->remote_free_check	= 0;
+	spin_lock_init(&l->remote_free.lock);
+	l->remote_free.list.nr	= 0;
+	l->remote_free.list.head = NULL;
+	l->remote_free.list.tail = NULL;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+	memset(l->stats, 0, sizeof(l->stats));
+#endif
+}
+
+static void init_kmem_cache_cpu(struct kmem_cache *s,
+				struct kmem_cache_cpu *c)
+{
+	init_kmem_cache_list(s, &c->list);
+
+	c->colour_next		= 0;
+#ifdef CONFIG_SMP
+	c->rlist.nr		= 0;
+	c->rlist.head		= NULL;
+	c->rlist.tail		= NULL;
+	c->remote_cache_list	= NULL;
+#endif
+}
+
+#ifdef CONFIG_NUMA
+static void init_kmem_cache_node(struct kmem_cache *s,
+				struct kmem_cache_node *n)
+{
+	spin_lock_init(&n->list_lock);
+	init_kmem_cache_list(s, &n->list);
+}
+#endif
+
+/* Initial slabs */
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
+#endif
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache kmem_cpu_cache;
+static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
+#endif
+#endif
+
+#ifdef CONFIG_NUMA
+static struct kmem_cache kmem_node_cache;
+static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
+static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
+				int cpu)
+{
+	struct kmem_cache_cpu *c;
+
+	c = kmem_cache_alloc_node(&kmem_cpu_cache, GFP_KERNEL, cpu_to_node(cpu));
+	if (!c)
+		return NULL;
+
+	init_kmem_cache_cpu(s, c);
+	return c;
+}
+
+static void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+
+		c = s->cpu_slab[cpu];
+		if (c) {
+			kmem_cache_free(&kmem_cpu_cache, c);
+			s->cpu_slab[cpu] = NULL;
+		}
+	}
+}
+
+static int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+
+		c = s->cpu_slab[cpu];
+		if (c)
+			continue;
+
+		c = alloc_kmem_cache_cpu(s, cpu);
+		if (!c) {
+			free_kmem_cache_cpus(s);
+			return 0;
+		}
+		s->cpu_slab[cpu] = c;
+	}
+	return 1;
+}
+
+#else
+static inline void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+	init_kmem_cache_cpu(s, &s->cpu_slab);
+	return 1;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = s->node[node];
+		if (n) {
+			kmem_cache_free(&kmem_node_cache, n);
+			s->node[node] = NULL;
+		}
+	}
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = kmem_cache_alloc_node(&kmem_node_cache, GFP_KERNEL, node);
+		if (!n) {
+			free_kmem_cache_nodes(s);
+			return 0;
+		}
+		init_kmem_cache_node(s, n);
+		s->node[node] = n;
+	}
+	return 1;
+}
+#else
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+	return 1;
+}
+#endif
+
+/*
+ * calculate_sizes() determines the order and the distribution of data within
+ * a slab object.
+ */
+static int calculate_sizes(struct kmem_cache *s)
+{
+	unsigned long flags = s->flags;
+	unsigned long size = s->objsize;
+	unsigned long align = s->align;
+
+	/*
+	 * Determine if we can poison the object itself. If the user of
+	 * the slab may touch the object after free or before allocation
+	 * then we should never poison the object itself.
+	 */
+	if (slab_poison(s) && !(flags & SLAB_DESTROY_BY_RCU) && !s->ctor)
+		s->flags |= __OBJECT_POISON;
+	else
+		s->flags &= ~__OBJECT_POISON;
+
+	/*
+	 * Round up object size to the next word boundary. We can only
+	 * place the free pointer at word boundaries and this determines
+	 * the possible location of the free pointer.
+	 */
+	size = ALIGN(size, sizeof(void *));
+
+#ifdef CONFIG_SLQB_DEBUG
+	/*
+	 * If we are Redzoning then check if there is some space between the
+	 * end of the object and the free pointer. If not then add an
+	 * additional word to have some bytes to store Redzone information.
+	 */
+	if ((flags & SLAB_RED_ZONE) && size == s->objsize)
+		size += sizeof(void *);
+#endif
+
+	/*
+	 * With that we have determined the number of bytes in actual use
+	 * by the object. This is the potential offset to the free pointer.
+	 */
+	s->inuse = size;
+
+	if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) || s->ctor)) {
+		/*
+		 * Relocate free pointer after the object if it is not
+		 * permitted to overwrite the first word of the object on
+		 * kmem_cache_free.
+		 *
+		 * This is the case if we do RCU, have a constructor or
+		 * destructor or are poisoning the objects.
+		 */
+		s->offset = size;
+		size += sizeof(void *);
+	}
+
+#ifdef CONFIG_SLQB_DEBUG
+	if (flags & SLAB_STORE_USER) {
+		/*
+		 * Need to store information about allocs and frees after
+		 * the object.
+		 */
+		size += 2 * sizeof(struct track);
+	}
+
+	if (flags & SLAB_RED_ZONE) {
+		/*
+		 * Add some empty padding so that we can catch
+		 * overwrites from earlier objects rather than let
+		 * tracking information or the free pointer be
+		 * corrupted if a user writes before the start
+		 * of the object.
+		 */
+		size += sizeof(void *);
+	}
+#endif
+
+	/*
+	 * Determine the alignment based on various parameters that the
+	 * user specified and the dynamic determination of cache line size
+	 * on bootup.
+	 */
+	align = calculate_alignment(flags, align, s->objsize);
+
+	/*
+	 * SLQB stores one object immediately after another beginning from
+	 * offset 0. In order to align the objects we have to simply size
+	 * each object to conform to the alignment.
+	 */
+	size = ALIGN(size, align);
+	s->size = size;
+	s->order = calculate_order(size);
+
+	if (s->order < 0)
+		return 0;
+
+	s->allocflags = 0;
+	if (s->order)
+		s->allocflags |= __GFP_COMP;
+
+	if (s->flags & SLAB_CACHE_DMA)
+		s->allocflags |= SLQB_DMA;
+
+	if (s->flags & SLAB_RECLAIM_ACCOUNT)
+		s->allocflags |= __GFP_RECLAIMABLE;
+
+	/*
+	 * Determine the number of objects per slab
+	 */
+	s->objects = (PAGE_SIZE << s->order) / size;
+
+	s->freebatch = max(4UL*PAGE_SIZE / size,
+				min(256UL, 64*PAGE_SIZE / size));
+	if (!s->freebatch)
+		s->freebatch = 1;
+	s->hiwater = s->freebatch << 2;
+
+	return !!s->objects;
+}
+
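To make the layout concrete: with CONFIG_SLQB_DEBUG and SLAB_POISON | SLAB_STORE_USER | SLAB_RED_ZONE on a 64-bit build, a 30-byte object is first rounded up to 32 bytes, poisoning forces the free pointer to be relocated past the object (offset 32, size grows to 40), two struct track records are appended for alloc/free tracking, one more word of red-zone padding is added, and the whole thing is finally rounded up to the computed alignment.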
+static int kmem_cache_open(struct kmem_cache *s,
+			const char *name, size_t size, size_t align,
+			unsigned long flags, void (*ctor)(void *), int alloc)
+{
+	unsigned int left_over;
+
+	memset(s, 0, kmem_size);
+	s->name = name;
+	s->ctor = ctor;
+	s->objsize = size;
+	s->align = align;
+	s->flags = kmem_cache_flags(size, flags, name, ctor);
+
+	if (!calculate_sizes(s))
+		goto error;
+
+	if (!slab_debug(s)) {
+		left_over = (PAGE_SIZE << s->order) - (s->objects * s->size);
+		s->colour_off = max(cache_line_size(), s->align);
+		s->colour_range = left_over;
+	} else {
+		s->colour_off = 0;
+		s->colour_range = 0;
+	}
+
+	if (likely(alloc)) {
+		if (!alloc_kmem_cache_nodes(s))
+			goto error;
+
+		if (!alloc_kmem_cache_cpus(s))
+			goto error_nodes;
+	}
+
+	down_write(&slqb_lock);
+	sysfs_slab_add(s);
+	list_add(&s->list, &slab_caches);
+	up_write(&slqb_lock);
+
+	return 1;
+
+error_nodes:
+	free_kmem_cache_nodes(s);
+error:
+	if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+	return 0;
+}
+
+/*
+ * Check if a given pointer is valid
+ */
+int kmem_ptr_validate(struct kmem_cache *s, const void *object)
+{
+	struct slqb_page *page = virt_to_head_slqb_page(object);
+
+	if (!(page->flags & PG_SLQB_BIT))
+		return 0;
+
+	/*
+	 * We could also check if the object is on the slabs freelist.
+	 * But this would be too expensive and it seems that the main
+	 * purpose of kmem_ptr_validate() is to check if the object belongs
+	 * to a certain slab.
+	 */
+	return 1;
+}
+EXPORT_SYMBOL(kmem_ptr_validate);
+
+/*
+ * Determine the size of a slab object
+ */
+unsigned int kmem_cache_size(struct kmem_cache *s)
+{
+	return s->objsize;
+}
+EXPORT_SYMBOL(kmem_cache_size);
+
+const char *kmem_cache_name(struct kmem_cache *s)
+{
+	return s->name;
+}
+EXPORT_SYMBOL(kmem_cache_name);
+
+/*
+ * Release all resources used by a slab cache. No more concurrency on the
+ * slab, so we can touch remote kmem_cache_cpu structures.
+ */
+void kmem_cache_destroy(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+	int cpu;
+
+	down_write(&slqb_lock);
+	list_del(&s->list);
+	up_write(&slqb_lock);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		flush_free_list_all(s, l);
+		flush_remote_free_cache(s, c);
+	}
+#endif
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+		claim_remote_free_list(s, l);
+#endif
+		flush_free_list_all(s, l);
+
+		WARN_ON(l->freelist.nr);
+		WARN_ON(l->nr_slabs);
+		WARN_ON(l->nr_partial);
+	}
+
+	free_kmem_cache_cpus(s);
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		claim_remote_free_list(s, l);
+		flush_free_list_all(s, l);
+
+		WARN_ON(l->freelist.nr);
+		WARN_ON(l->nr_slabs);
+		WARN_ON(l->nr_partial);
+	}
+
+	free_kmem_cache_nodes(s);
+#endif
+
+	sysfs_slab_remove(s);
+}
+EXPORT_SYMBOL(kmem_cache_destroy);
+
+/********************************************************************
+ *		Kmalloc subsystem
+ *******************************************************************/
+
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches);
+
+#ifdef CONFIG_ZONE_DMA
+struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches_dma);
+#endif
+
+#ifndef ARCH_KMALLOC_FLAGS
+#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
+#endif
+
+static struct kmem_cache *open_kmalloc_cache(struct kmem_cache *s,
+				const char *name, int size, gfp_t gfp_flags)
+{
+	unsigned int flags = ARCH_KMALLOC_FLAGS | SLAB_PANIC;
+
+	if (gfp_flags & SLQB_DMA)
+		flags |= SLAB_CACHE_DMA;
+
+	kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN, flags, NULL, 1);
+
+	return s;
+}
+
+/*
+ * Conversion table from small slab sizes (divided by 8) to the index in the
+ * kmalloc array. This is necessary for slabs < 192 since we have
+ * non-power-of-two cache sizes there. The kmalloc cache for larger sizes can
+ * be determined using fls.
+ */
+static s8 size_index[24] __cacheline_aligned = {
+	3,	/* 8 */
+	4,	/* 16 */
+	5,	/* 24 */
+	5,	/* 32 */
+	6,	/* 40 */
+	6,	/* 48 */
+	6,	/* 56 */
+	6,	/* 64 */
+#if L1_CACHE_BYTES < 64
+	1,	/* 72 */
+	1,	/* 80 */
+	1,	/* 88 */
+	1,	/* 96 */
+#else
+	7,
+	7,
+	7,
+	7,
+#endif
+	7,	/* 104 */
+	7,	/* 112 */
+	7,	/* 120 */
+	7,	/* 128 */
+#if L1_CACHE_BYTES < 128
+	2,	/* 136 */
+	2,	/* 144 */
+	2,	/* 152 */
+	2,	/* 160 */
+	2,	/* 168 */
+	2,	/* 176 */
+	2,	/* 184 */
+	2	/* 192 */
+#else
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1
+#endif
+};
+
+static struct kmem_cache *get_slab(size_t size, gfp_t flags)
+{
+	int index;
+
+#if L1_CACHE_BYTES >= 128
+	if (size <= 128) {
+#else
+	if (size <= 192) {
+#endif
+		if (unlikely(!size))
+			return ZERO_SIZE_PTR;
+
+		index = size_index[(size - 1) / 8];
+	} else
+		index = fls(size - 1);
+
+	if (unlikely((flags & SLQB_DMA)))
+		return &kmalloc_caches_dma[index];
+	else
+		return &kmalloc_caches[index];
+}
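For example, assuming 64-byte cache lines: kmalloc(100) looks up size_index[(100 - 1) / 8] = size_index[12] = 7 and gets the 128-byte cache, kmalloc(300) takes the fls() path (fls(299) = 9) and gets the 512-byte cache, and requests carrying the DMA flag are directed to the parallel kmalloc_caches_dma array instead.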
+
+void *__kmalloc(size_t size, gfp_t flags)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return __kmem_cache_alloc(s, flags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(__kmalloc);
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return kmem_cache_alloc_node(s, flags, node);
+}
+EXPORT_SYMBOL(__kmalloc_node);
+#endif
+
+size_t ksize(const void *object)
+{
+	struct slqb_page *page;
+	struct kmem_cache *s;
+
+	BUG_ON(!object);
+	if (unlikely(object == ZERO_SIZE_PTR))
+		return 0;
+
+	page = virt_to_head_slqb_page(object);
+	BUG_ON(!(page->flags & PG_SLQB_BIT));
+
+	s = page->list->cache;
+
+	/*
+	 * Debugging requires use of the padding between object
+	 * and whatever may come after it.
+	 */
+	if (s->flags & (SLAB_RED_ZONE | SLAB_POISON))
+		return s->objsize;
+
+	/*
+	 * If we have the need to store the freelist pointer
+	 * back there or track user information then we can
+	 * only use the space before that information.
+	 */
+	if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
+		return s->inuse;
+
+	/*
+	 * Else we can use all the padding etc for the allocation
+	 */
+	return s->size;
+}
+EXPORT_SYMBOL(ksize);
+
+void kfree(const void *object)
+{
+	struct kmem_cache *s;
+	struct slqb_page *page;
+
+	if (unlikely(ZERO_OR_NULL_PTR(object)))
+		return;
+
+	page = virt_to_head_slqb_page(object);
+	s = page->list->cache;
+
+	slab_free(s, page, (void *)object);
+}
+EXPORT_SYMBOL(kfree);
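For callers nothing changes relative to the other allocators; a minimal usage sketch of the kmalloc side wired up above:

	void *p = kmalloc(100, GFP_KERNEL);

	if (p) {
		size_t usable = ksize(p);	/* 128 with the non-debug caches set up below */

		memset(p, 0, usable);		/* the whole reported size is usable */
		kfree(p);
	}

kfree() recovers the owning cache from the slab's struct slqb_page, so the caller never passes a size back.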
+
+static void kmem_cache_trim_percpu(void *arg)
+{
+	int cpu = smp_processor_id();
+	struct kmem_cache *s = arg;
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+	claim_remote_free_list(s, l);
+#endif
+	flush_free_list(s, l);
+#ifdef CONFIG_SMP
+	flush_remote_free_cache(s, c);
+#endif
+}
+
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+
+	on_each_cpu(kmem_cache_trim_percpu, s, 1);
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		spin_lock_irq(&n->list_lock);
+		claim_remote_free_list(s, l);
+		flush_free_list(s, l);
+		spin_unlock_irq(&n->list_lock);
+	}
+#endif
+
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_shrink);
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void kmem_cache_reap_percpu(void *arg)
+{
+	int cpu = smp_processor_id();
+	struct kmem_cache *s;
+	long phase = (long)arg;
+
+	list_for_each_entry(s, &slab_caches, list) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		if (phase == 0) {
+			flush_free_list_all(s, l);
+			flush_remote_free_cache(s, c);
+		}
+
+		if (phase == 1) {
+			claim_remote_free_list(s, l);
+			flush_free_list_all(s, l);
+		}
+	}
+}
+
+static void kmem_cache_reap(void)
+{
+	struct kmem_cache *s;
+	int node;
+
+	down_read(&slqb_lock);
+	on_each_cpu(kmem_cache_reap_percpu, (void *)0, 1);
+	on_each_cpu(kmem_cache_reap_percpu, (void *)1, 1);
+
+	list_for_each_entry(s, &slab_caches, list) {
+		for_each_node_state(node, N_NORMAL_MEMORY) {
+			struct kmem_cache_node *n = s->node[node];
+			struct kmem_cache_list *l = &n->list;
+
+			spin_lock_irq(&n->list_lock);
+			claim_remote_free_list(s, l);
+			flush_free_list_all(s, l);
+			spin_unlock_irq(&n->list_lock);
+		}
+	}
+	up_read(&slqb_lock);
+}
+#endif
+
+static void cache_trim_worker(struct work_struct *w)
+{
+	struct delayed_work *work =
+		container_of(w, struct delayed_work, work);
+	struct kmem_cache *s;
+	int node;
+
+	if (!down_read_trylock(&slqb_lock))
+		goto out;
+
+	node = numa_node_id();
+	list_for_each_entry(s, &slab_caches, list) {
+#ifdef CONFIG_NUMA
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		spin_lock_irq(&n->list_lock);
+		claim_remote_free_list(s, l);
+		flush_free_list(s, l);
+		spin_unlock_irq(&n->list_lock);
+#endif
+
+		local_irq_disable();
+		kmem_cache_trim_percpu(s);
+		local_irq_enable();
+	}
+
+	up_read(&slqb_lock);
+out:
+	schedule_delayed_work(work, round_jiffies_relative(3*HZ));
+}
+
+static DEFINE_PER_CPU(struct delayed_work, cache_trim_work);
+
+static void __cpuinit start_cpu_timer(int cpu)
+{
+	struct delayed_work *cache_trim_work = &per_cpu(cache_trim_work, cpu);
+
+	/*
+	 * When this gets called from do_initcalls via cpucache_init(),
+	 * init_workqueues() has already run, so keventd will be setup
+	 * at that time.
+	 */
+	if (keventd_up() && cache_trim_work->work.func == NULL) {
+		INIT_DELAYED_WORK(cache_trim_work, cache_trim_worker);
+		schedule_delayed_work_on(cpu, cache_trim_work,
+					__round_jiffies_relative(HZ, cpu));
+	}
+}
+
+static int __init cpucache_init(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		start_cpu_timer(cpu);
+
+	return 0;
+}
+device_initcall(cpucache_init);
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void slab_mem_going_offline_callback(void *arg)
+{
+	kmem_cache_reap();
+}
+
+static void slab_mem_offline_callback(void *arg)
+{
+	/* XXX: should release structures, see CPU offline comment */
+}
+
+static int slab_mem_going_online_callback(void *arg)
+{
+	struct kmem_cache *s;
+	struct kmem_cache_node *n;
+	struct memory_notify *marg = arg;
+	int nid = marg->status_change_nid;
+	int ret = 0;
+
+	/*
+	 * If the node's memory is already available, then kmem_cache_node is
+	 * already created. Nothing to do.
+	 */
+	if (nid < 0)
+		return 0;
+
+	/*
+	 * We are bringing a node online. No memory is available yet. We must
+	 * allocate a kmem_cache_node structure in order to bring the node
+	 * online.
+	 */
+	down_read(&slqb_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		/*
+		 * XXX: kmem_cache_alloc_node will fall back to other nodes
+		 *      since memory is not yet available from the node that
+		 *      is brought up.
+		 */
+		if (s->node[nid]) /* could be leftover from last online */
+			continue;
+		n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
+		if (!n) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		init_kmem_cache_node(s, n);
+		s->node[nid] = n;
+	}
+out:
+	up_read(&slqb_lock);
+	return ret;
+}
+
+static int slab_memory_callback(struct notifier_block *self,
+				unsigned long action, void *arg)
+{
+	int ret = 0;
+
+	switch (action) {
+	case MEM_GOING_ONLINE:
+		ret = slab_mem_going_online_callback(arg);
+		break;
+	case MEM_GOING_OFFLINE:
+		slab_mem_going_offline_callback(arg);
+		break;
+	case MEM_OFFLINE:
+	case MEM_CANCEL_ONLINE:
+		slab_mem_offline_callback(arg);
+		break;
+	case MEM_ONLINE:
+	case MEM_CANCEL_OFFLINE:
+		break;
+	}
+
+	ret = notifier_from_errno(ret);
+	return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+/********************************************************************
+ *			Basic setup of slabs
+ *******************************************************************/
+
+void __init kmem_cache_init(void)
+{
+	int i;
+	unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
+
+	/*
+	 * All the ifdefs are rather ugly here, but it's just the setup code,
+	 * so it doesn't have to be too readable :)
+	 */
+#ifdef CONFIG_NUMA
+	if (num_possible_nodes() == 1)
+		numa_platform = 0;
+	else
+		numa_platform = 1;
+#endif
+
+#ifdef CONFIG_SMP
+	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
+				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#else
+	kmem_size = sizeof(struct kmem_cache);
+#endif
+
+	kmem_cache_open(&kmem_cache_cache, "kmem_cache",
+			kmem_size, 0, flags, NULL, 0);
+#ifdef CONFIG_SMP
+	kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu",
+			sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_cache_open(&kmem_node_cache, "kmem_cache_node",
+			sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
+#endif
+
+#ifdef CONFIG_SMP
+	for_each_possible_cpu(i) {
+		init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cpus[i]);
+		kmem_cache_cache.cpu_slab[i] = &kmem_cache_cpus[i];
+
+		init_kmem_cache_cpu(&kmem_cpu_cache, &kmem_cpu_cpus[i]);
+		kmem_cpu_cache.cpu_slab[i] = &kmem_cpu_cpus[i];
+
+#ifdef CONFIG_NUMA
+		init_kmem_cache_cpu(&kmem_node_cache, &kmem_node_cpus[i]);
+		kmem_node_cache.cpu_slab[i] = &kmem_node_cpus[i];
+#endif
+	}
+#else
+	init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cache.cpu_slab);
+#endif
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(i, N_NORMAL_MEMORY) {
+		init_kmem_cache_node(&kmem_cache_cache, &kmem_cache_nodes[i]);
+		kmem_cache_cache.node[i] = &kmem_cache_nodes[i];
+
+		init_kmem_cache_node(&kmem_cpu_cache, &kmem_cpu_nodes[i]);
+		kmem_cpu_cache.node[i] = &kmem_cpu_nodes[i];
+
+		init_kmem_cache_node(&kmem_node_cache, &kmem_node_nodes[i]);
+		kmem_node_cache.node[i] = &kmem_node_nodes[i];
+	}
+#endif
+
+	/* Caches that are not of the two-to-the-power-of size */
+	if (L1_CACHE_BYTES < 64 && KMALLOC_MIN_SIZE <= 64) {
+		open_kmalloc_cache(&kmalloc_caches[1],
+				"kmalloc-96", 96, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[1],
+				"kmalloc_dma-96", 96, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+	if (L1_CACHE_BYTES < 128 && KMALLOC_MIN_SIZE <= 128) {
+		open_kmalloc_cache(&kmalloc_caches[2],
+				"kmalloc-192", 192, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[2],
+				"kmalloc_dma-192", 192, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+		open_kmalloc_cache(&kmalloc_caches[i],
+				"kmalloc", 1 << i, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[i],
+				"kmalloc_dma", 1 << i, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+
+	/*
+	 * Patch up the size_index table if we have strange large alignment
+	 * requirements for the kmalloc array. This seems to be the case only
+	 * for MIPS. The standard arches will not generate any code here.
+	 *
+	 * Largest permitted alignment is 256 bytes due to the way we
+	 * handle the index determination for the smaller caches.
+	 *
+	 * Make sure that nothing crazy happens if someone starts tinkering
+	 * around with ARCH_KMALLOC_MINALIGN
+	 */
+	BUILD_BUG_ON(KMALLOC_MIN_SIZE > 256 ||
+		(KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
+
+	for (i = 8; i < KMALLOC_MIN_SIZE; i += 8)
+		size_index[(i - 1) / 8] = KMALLOC_SHIFT_LOW;
+
+	/* Provide the correct kmalloc names now that the caches are up */
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+		kmalloc_caches[i].name =
+			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
+#ifdef CONFIG_ZONE_DMA
+		kmalloc_caches_dma[i].name =
+			kasprintf(GFP_KERNEL, "kmalloc_dma-%d", 1 << i);
+#endif
+	}
+
+#ifdef CONFIG_SMP
+	register_cpu_notifier(&slab_notifier);
+#endif
+#ifdef CONFIG_NUMA
+	hotplug_memory_notifier(slab_memory_callback, 1);
+#endif
+	/*
+	 * smp_init() has not yet been called, so no worries about memory
+	 * ordering here (eg. slab_is_available vs numa_platform)
+	 */
+	__slab_is_available = 1;
+}
+
+/*
+ * Some basic slab creation sanity checks
+ */
+static int kmem_cache_create_ok(const char *name, size_t size,
+		size_t align, unsigned long flags)
+{
+	struct kmem_cache *tmp;
+
+	/*
+	 * Sanity checks... these are all serious usage bugs.
+	 */
+	if (!name || in_interrupt() || (size < sizeof(void *))) {
+		printk(KERN_ERR "kmem_cache_create(): early error in slab %s\n",
+				name);
+		dump_stack();
+
+		return 0;
+	}
+
+	down_read(&slqb_lock);
+
+	list_for_each_entry(tmp, &slab_caches, list) {
+		char x;
+		int res;
+
+		/*
+		 * This happens when the module gets unloaded and doesn't
+		 * destroy its slab cache and no-one else reuses the vmalloc
+		 * area of the module.  Print a warning.
+		 */
+		res = probe_kernel_address(tmp->name, x);
+		if (res) {
+			printk(KERN_ERR
+			       "SLAB: cache with size %d has lost its name\n",
+			       tmp->size);
+			continue;
+		}
+
+		if (!strcmp(tmp->name, name)) {
+			printk(KERN_ERR
+			       "kmem_cache_create(): duplicate cache %s\n", name);
+			dump_stack();
+			up_read(&slqb_lock);
+
+			return 0;
+		}
+	}
+
+	up_read(&slqb_lock);
+
+	WARN_ON(strchr(name, ' '));	/* It confuses parsers */
+	if (flags & SLAB_DESTROY_BY_RCU)
+		WARN_ON(flags & SLAB_POISON);
+
+	return 1;
+}
+
+struct kmem_cache *kmem_cache_create(const char *name, size_t size,
+		size_t align, unsigned long flags, void (*ctor)(void *))
+{
+	struct kmem_cache *s;
+
+	if (!kmem_cache_create_ok(name, size, align, flags))
+		goto err;
+
+	s = kmem_cache_alloc(&kmem_cache_cache, GFP_KERNEL);
+	if (!s)
+		goto err;
+
+	if (kmem_cache_open(s, name, size, align, flags, ctor, 1))
+		return s;
+
+	kmem_cache_free(&kmem_cache_cache, s);
+
+err:
+	if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+
+	return NULL;
+}
+EXPORT_SYMBOL(kmem_cache_create);
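For completeness, a cache-API usage sketch (struct foo and foo_cache are made-up names for illustration; the signature matches the one above):

	static struct kmem_cache *foo_cache;

	foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0,
					SLAB_HWCACHE_ALIGN, NULL);
	if (foo_cache) {
		struct foo *obj = kmem_cache_alloc(foo_cache, GFP_KERNEL);

		if (obj)
			kmem_cache_free(foo_cache, obj);
		kmem_cache_destroy(foo_cache);
	}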
+
+#ifdef CONFIG_SMP
+/*
+ * Use the cpu notifier to ensure that the cpu slabs are flushed when
+ * necessary.
+ */
+static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
+				unsigned long action, void *hcpu)
+{
+	long cpu = (long)hcpu;
+	struct kmem_cache *s;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		down_read(&slqb_lock);
+		list_for_each_entry(s, &slab_caches, list) {
+			if (s->cpu_slab[cpu]) /* could be leftover from last online */
+				continue;
+			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu);
+			if (!s->cpu_slab[cpu]) {
+				up_read(&slqb_lock);
+				return NOTIFY_BAD;
+			}
+		}
+		up_read(&slqb_lock);
+		break;
+
+	case CPU_ONLINE:
+	case CPU_ONLINE_FROZEN:
+	case CPU_DOWN_FAILED:
+	case CPU_DOWN_FAILED_FROZEN:
+		start_cpu_timer(cpu);
+		break;
+
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		cancel_rearming_delayed_work(&per_cpu(cache_trim_work, cpu));
+		per_cpu(cache_trim_work, cpu).work.func = NULL;
+		break;
+
+	case CPU_UP_CANCELED:
+	case CPU_UP_CANCELED_FROZEN:
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+		/*
+		 * XXX: Freeing here doesn't work because objects can still be
+		 * on this CPU's list. The periodic timer needs to check if a
+		 * CPU is offline and then try to clean up from there. Same
+		 * for node offline.
+		 */
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata slab_notifier = {
+	.notifier_call = slab_cpuup_callback
+};
+
+#endif
+
+#ifdef CONFIG_SLQB_DEBUG
+void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
+{
+	struct kmem_cache *s;
+	int node = -1;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+#ifdef CONFIG_NUMA
+	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+		node = alternate_nid(s, flags, node);
+#endif
+	return slab_alloc(s, flags, node, (void *)caller);
+}
+
+void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
+				unsigned long caller)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return slab_alloc(s, flags, node, (void *)caller);
+}
+#endif
+
+#if defined(CONFIG_SLQB_SYSFS) || defined(CONFIG_SLABINFO)
+struct stats_gather {
+	struct kmem_cache *s;
+	spinlock_t lock;
+	unsigned long nr_slabs;
+	unsigned long nr_partial;
+	unsigned long nr_inuse;
+	unsigned long nr_objects;
+
+#ifdef CONFIG_SLQB_STATS
+	unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+};
+
+static void __gather_stats(void *arg)
+{
+	unsigned long nr_slabs;
+	unsigned long nr_partial;
+	unsigned long nr_inuse;
+	struct stats_gather *gather = arg;
+	int cpu = smp_processor_id();
+	struct kmem_cache *s = gather->s;
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_list *l = &c->list;
+	struct slqb_page *page;
+#ifdef CONFIG_SLQB_STATS
+	int i;
+#endif
+
+	nr_slabs = l->nr_slabs;
+	nr_partial = l->nr_partial;
+	nr_inuse = (nr_slabs - nr_partial) * s->objects;
+
+	list_for_each_entry(page, &l->partial, lru) {
+		nr_inuse += page->inuse;
+	}
+
+	spin_lock(&gather->lock);
+	gather->nr_slabs += nr_slabs;
+	gather->nr_partial += nr_partial;
+	gather->nr_inuse += nr_inuse;
+#ifdef CONFIG_SLQB_STATS
+	for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
+		gather->stats[i] += l->stats[i];
+#endif
+	spin_unlock(&gather->lock);
+}
+
+static void gather_stats(struct kmem_cache *s, struct stats_gather *stats)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+
+	memset(stats, 0, sizeof(struct stats_gather));
+	stats->s = s;
+	spin_lock_init(&stats->lock);
+
+	on_each_cpu(__gather_stats, stats, 1);
+
+#ifdef CONFIG_NUMA
+	for_each_online_node(node) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+		struct slqb_page *page;
+		unsigned long flags;
+#ifdef CONFIG_SLQB_STATS
+		int i;
+#endif
+
+		spin_lock_irqsave(&n->list_lock, flags);
+#ifdef CONFIG_SLQB_STATS
+		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
+			stats->stats[i] += l->stats[i];
+#endif
+		stats->nr_slabs += l->nr_slabs;
+		stats->nr_partial += l->nr_partial;
+		stats->nr_inuse += (l->nr_slabs - l->nr_partial) * s->objects;
+
+		list_for_each_entry(page, &l->partial, lru) {
+			stats->nr_inuse += page->inuse;
+		}
+		spin_unlock_irqrestore(&n->list_lock, flags);
+	}
+#endif
+
+	stats->nr_objects = stats->nr_slabs * s->objects;
+}
+#endif
+
+/*
+ * The /proc/slabinfo ABI
+ */
+#ifdef CONFIG_SLABINFO
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+ssize_t slabinfo_write(struct file *file, const char __user * buffer,
+		       size_t count, loff_t *ppos)
+{
+	return -EINVAL;
+}
+
+static void print_slabinfo_header(struct seq_file *m)
+{
+	seq_puts(m, "slabinfo - version: 2.1\n");
+	seq_puts(m, "# name	    <active_objs> <num_objs> <objsize> "
+		 "<objperslab> <pagesperslab>");
+	seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
+	seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
+	seq_putc(m, '\n');
+}
+
+static void *s_start(struct seq_file *m, loff_t *pos)
+{
+	loff_t n = *pos;
+
+	down_read(&slqb_lock);
+	if (!n)
+		print_slabinfo_header(m);
+
+	return seq_list_start(&slab_caches, *pos);
+}
+
+static void *s_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	return seq_list_next(p, &slab_caches, pos);
+}
+
+static void s_stop(struct seq_file *m, void *p)
+{
+	up_read(&slqb_lock);
+}
+
+static int s_show(struct seq_file *m, void *p)
+{
+	struct stats_gather stats;
+	struct kmem_cache *s;
+
+	s = list_entry(p, struct kmem_cache, list);
+
+	gather_stats(s, &stats);
+
+	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, stats.nr_inuse,
+			stats.nr_objects, s->size, s->objects, (1 << s->order));
+	seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s),
+			slab_freebatch(s), 0);
+	seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs,
+			stats.nr_slabs, 0UL);
+	seq_putc(m, '\n');
+	return 0;
+}
+
+static const struct seq_operations slabinfo_op = {
+	.start = s_start,
+	.next = s_next,
+	.stop = s_stop,
+	.show = s_show,
+};
+
+static int slabinfo_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &slabinfo_op);
+}
+
+static const struct file_operations proc_slabinfo_operations = {
+	.open		= slabinfo_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static int __init slab_proc_init(void)
+{
+	proc_create("slabinfo", S_IWUSR|S_IRUGO, NULL,
+			&proc_slabinfo_operations);
+	return 0;
+}
+module_init(slab_proc_init);
+#endif /* CONFIG_SLABINFO */
+
+#ifdef CONFIG_SLQB_SYSFS
+/*
+ * sysfs API
+ */
+#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
+#define to_slab(n) container_of(n, struct kmem_cache, kobj)
+
+struct slab_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct kmem_cache *s, char *buf);
+	ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
+};
+
+#define SLAB_ATTR_RO(_name) \
+	static struct slab_attribute _name##_attr = __ATTR_RO(_name)
+
+#define SLAB_ATTR(_name) \
+	static struct slab_attribute _name##_attr =  \
+	__ATTR(_name, 0644, _name##_show, _name##_store)
+
+static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->size);
+}
+SLAB_ATTR_RO(slab_size);
+
+static ssize_t align_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->align);
+}
+SLAB_ATTR_RO(align);
+
+static ssize_t object_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->objsize);
+}
+SLAB_ATTR_RO(object_size);
+
+static ssize_t objs_per_slab_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->objects);
+}
+SLAB_ATTR_RO(objs_per_slab);
+
+static ssize_t order_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->order);
+}
+SLAB_ATTR_RO(order);
+
+static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+{
+	if (s->ctor) {
+		int n = sprint_symbol(buf, (unsigned long)s->ctor);
+
+		return n + sprintf(buf + n, "\n");
+	}
+	return 0;
+}
+SLAB_ATTR_RO(ctor);
+
+static ssize_t slabs_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+
+	gather_stats(s, &stats);
+
+	return sprintf(buf, "%lu\n", stats.nr_slabs);
+}
+SLAB_ATTR_RO(slabs);
+
+static ssize_t objects_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+
+	gather_stats(s, &stats);
+
+	return sprintf(buf, "%lu\n", stats.nr_inuse);
+}
+SLAB_ATTR_RO(objects);
+
+static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+
+	gather_stats(s, &stats);
+
+	return sprintf(buf, "%lu\n", stats.nr_objects);
+}
+SLAB_ATTR_RO(total_objects);
+
+static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
+}
+SLAB_ATTR_RO(reclaim_account);
+
+static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_HWCACHE_ALIGN));
+}
+SLAB_ATTR_RO(hwcache_align);
+
+#ifdef CONFIG_ZONE_DMA
+static ssize_t cache_dma_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA));
+}
+SLAB_ATTR_RO(cache_dma);
+#endif
+
+static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
+}
+SLAB_ATTR_RO(destroy_by_rcu);
+
+static ssize_t red_zone_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE));
+}
+SLAB_ATTR_RO(red_zone);
+
+static ssize_t poison_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_POISON));
+}
+SLAB_ATTR_RO(poison);
+
+static ssize_t store_user_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_STORE_USER));
+}
+SLAB_ATTR_RO(store_user);
+
+static ssize_t hiwater_store(struct kmem_cache *s,
+				const char *buf, size_t length)
+{
+	long hiwater;
+	int err;
+
+	err = strict_strtol(buf, 10, &hiwater);
+	if (err)
+		return err;
+
+	if (hiwater < 0)
+		return -EINVAL;
+
+	s->hiwater = hiwater;
+
+	return length;
+}
+
+static ssize_t hiwater_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", slab_hiwater(s));
+}
+SLAB_ATTR(hiwater);
+
+static ssize_t freebatch_store(struct kmem_cache *s,
+				const char *buf, size_t length)
+{
+	long freebatch;
+	int err;
+
+	err = strict_strtol(buf, 10, &freebatch);
+	if (err)
+		return err;
+
+	if (freebatch <= 0 || freebatch - 1 > s->hiwater)
+		return -EINVAL;
+
+	s->freebatch = freebatch;
+
+	return length;
+}
+
+static ssize_t freebatch_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", slab_freebatch(s));
+}
+SLAB_ATTR(freebatch);
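These two attributes are the only writable tunables; once the "slab" kset is registered below they should show up as /sys/kernel/slab/<cache>/hiwater and .../freebatch, so e.g. "echo 1024 > /sys/kernel/slab/dentry/hiwater" (dentry being just an example cache) raises that cache's per-list watermark, with freebatch constrained to at most hiwater + 1 as enforced above.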
+
+#ifdef CONFIG_SLQB_STATS
+static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
+{
+	struct stats_gather stats;
+	int len;
+#ifdef CONFIG_SMP
+	int cpu;
+#endif
+
+	gather_stats(s, &stats);
+
+	len = sprintf(buf, "%lu", stats.stats[si]);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		if (len < PAGE_SIZE - 20)
+			len += sprintf(buf+len, " C%d=%lu", cpu, l->stats[si]);
+	}
+#endif
+	return len + sprintf(buf + len, "\n");
+}
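Each stats file then reads as a grand total followed by per-CPU contributions, e.g. something like "1500 C0=800 C1=700" for the alloc attribute; on NUMA the total also includes the per-node lists, so it can exceed the sum of the C<n> values.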
+
+#define STAT_ATTR(si, text) 					\
+static ssize_t text##_show(struct kmem_cache *s, char *buf)	\
+{								\
+	return show_stat(s, buf, si);				\
+}								\
+SLAB_ATTR_RO(text);
+
+STAT_ATTR(ALLOC, alloc);
+STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
+STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
+STAT_ATTR(FREE, free);
+STAT_ATTR(FREE_REMOTE, free_remote);
+STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
+STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
+STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
+STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
+STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
+STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
+STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
+STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
+STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
+#endif
+
+static struct attribute *slab_attrs[] = {
+	&slab_size_attr.attr,
+	&object_size_attr.attr,
+	&objs_per_slab_attr.attr,
+	&order_attr.attr,
+	&objects_attr.attr,
+	&total_objects_attr.attr,
+	&slabs_attr.attr,
+	&ctor_attr.attr,
+	&align_attr.attr,
+	&hwcache_align_attr.attr,
+	&reclaim_account_attr.attr,
+	&destroy_by_rcu_attr.attr,
+	&red_zone_attr.attr,
+	&poison_attr.attr,
+	&store_user_attr.attr,
+	&hiwater_attr.attr,
+	&freebatch_attr.attr,
+#ifdef CONFIG_ZONE_DMA
+	&cache_dma_attr.attr,
+#endif
+#ifdef CONFIG_SLQB_STATS
+	&alloc_attr.attr,
+	&alloc_slab_fill_attr.attr,
+	&alloc_slab_new_attr.attr,
+	&free_attr.attr,
+	&free_remote_attr.attr,
+	&flush_free_list_attr.attr,
+	&flush_free_list_objects_attr.attr,
+	&flush_free_list_remote_attr.attr,
+	&flush_slab_partial_attr.attr,
+	&flush_slab_free_attr.attr,
+	&flush_rfree_list_attr.attr,
+	&flush_rfree_list_objects_attr.attr,
+	&claim_remote_list_attr.attr,
+	&claim_remote_list_objects_attr.attr,
+#endif
+	NULL
+};
+
+static struct attribute_group slab_attr_group = {
+	.attrs = slab_attrs,
+};
+
+static ssize_t slab_attr_show(struct kobject *kobj,
+				struct attribute *attr, char *buf)
+{
+	struct slab_attribute *attribute;
+	struct kmem_cache *s;
+	int err;
+
+	attribute = to_slab_attr(attr);
+	s = to_slab(kobj);
+
+	if (!attribute->show)
+		return -EIO;
+
+	err = attribute->show(s, buf);
+
+	return err;
+}
+
+static ssize_t slab_attr_store(struct kobject *kobj,
+			struct attribute *attr, const char *buf, size_t len)
+{
+	struct slab_attribute *attribute;
+	struct kmem_cache *s;
+	int err;
+
+	attribute = to_slab_attr(attr);
+	s = to_slab(kobj);
+
+	if (!attribute->store)
+		return -EIO;
+
+	err = attribute->store(s, buf, len);
+
+	return err;
+}
+
+static void kmem_cache_release(struct kobject *kobj)
+{
+	struct kmem_cache *s = to_slab(kobj);
+
+	kmem_cache_free(&kmem_cache_cache, s);
+}
+
+static struct sysfs_ops slab_sysfs_ops = {
+	.show = slab_attr_show,
+	.store = slab_attr_store,
+};
+
+static struct kobj_type slab_ktype = {
+	.sysfs_ops = &slab_sysfs_ops,
+	.release = kmem_cache_release
+};
+
+static int uevent_filter(struct kset *kset, struct kobject *kobj)
+{
+	struct kobj_type *ktype = get_ktype(kobj);
+
+	if (ktype == &slab_ktype)
+		return 1;
+	return 0;
+}
+
+static struct kset_uevent_ops slab_uevent_ops = {
+	.filter = uevent_filter,
+};
+
+static struct kset *slab_kset;
+
+static int sysfs_available __read_mostly = 0;
+
+static int sysfs_slab_add(struct kmem_cache *s)
+{
+	int err;
+
+	if (!sysfs_available)
+		return 0;
+
+	s->kobj.kset = slab_kset;
+	err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, s->name);
+	if (err) {
+		kobject_put(&s->kobj);
+		return err;
+	}
+
+	err = sysfs_create_group(&s->kobj, &slab_attr_group);
+	if (err)
+		return err;
+
+	kobject_uevent(&s->kobj, KOBJ_ADD);
+
+	return 0;
+}
+
+static void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kobject_uevent(&s->kobj, KOBJ_REMOVE);
+	kobject_del(&s->kobj);
+	kobject_put(&s->kobj);
+}
+
+static int __init slab_sysfs_init(void)
+{
+	struct kmem_cache *s;
+	int err;
+
+	slab_kset = kset_create_and_add("slab", &slab_uevent_ops, kernel_kobj);
+	if (!slab_kset) {
+		printk(KERN_ERR "Cannot register slab subsystem.\n");
+		return -ENOSYS;
+	}
+
+	down_write(&slqb_lock);
+
+	sysfs_available = 1;
+
+	list_for_each_entry(s, &slab_caches, list) {
+		err = sysfs_slab_add(s);
+		if (err)
+			printk(KERN_ERR "SLQB: Unable to add boot slab %s"
+						" to sysfs\n", s->name);
+	}
+
+	up_write(&slqb_lock);
+
+	return 0;
+}
+device_initcall(slab_sysfs_init);
+
+#endif
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -150,6 +150,8 @@ size_t ksize(const void *);
  */
 #ifdef CONFIG_SLUB
 #include <linux/slub_def.h>
+#elif defined(CONFIG_SLQB)
+#include <linux/slqb_def.h>
 #elif defined(CONFIG_SLOB)
 #include <linux/slob_def.h>
 #else
@@ -252,7 +254,7 @@ static inline void *kmem_cache_alloc_nod
  * allocator where we care about the real place the memory allocation
  * request comes from.
  */
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
 extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
 #define kmalloc_track_caller(size, flags) \
 	__kmalloc_track_caller(size, flags, _RET_IP_)
@@ -270,7 +272,7 @@ extern void *__kmalloc_track_caller(size
  * standard allocator where we care about the real place the memory
  * allocation request comes from.
  */
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
 extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long);
 #define kmalloc_node_track_caller(size, flags, node) \
 	__kmalloc_node_track_caller(size, flags, node, \
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
+obj-$(CONFIG_SLQB) += slqb.o
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
Index: linux-2.6/include/linux/rcu_types.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rcu_types.h
@@ -0,0 +1,18 @@
+#ifndef __LINUX_RCU_TYPES_H
+#define __LINUX_RCU_TYPES_H
+
+#ifdef __KERNEL__
+
+/**
+ * struct rcu_head - callback structure for use with RCU
+ * @next: next update requests in a list
+ * @func: actual update function to call after the grace period.
+ */
+struct rcu_head {
+	struct rcu_head *next;
+	void (*func)(struct rcu_head *head);
+};
+
+#endif
+
+#endif
Index: linux-2.6/arch/x86/include/asm/page.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/page.h
+++ linux-2.6/arch/x86/include/asm/page.h
@@ -194,6 +194,7 @@ static inline pteval_t native_pte_flags(
  * virt_addr_valid(kaddr) returns true.
  */
 #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
+#define virt_to_page_fast(kaddr) pfn_to_page(((unsigned long)(kaddr) - PAGE_OFFSET) >> PAGE_SHIFT)
 #define pfn_to_kaddr(pfn)      __va((pfn) << PAGE_SHIFT)
 extern bool __virt_addr_valid(unsigned long kaddr);
 #define virt_addr_valid(kaddr)	__virt_addr_valid((unsigned long) (kaddr))
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -305,7 +305,11 @@ static inline void get_page(struct page
 
 static inline struct page *virt_to_head_page(const void *x)
 {
+#ifdef virt_to_page_fast
+	struct page *page = virt_to_page_fast(x);
+#else
 	struct page *page = virt_to_page(x);
+#endif
 	return compound_head(page);
 }
 
Index: linux-2.6/Documentation/vm/slqbinfo.c
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/vm/slqbinfo.c
@@ -0,0 +1,1054 @@
+/*
+ * Slabinfo: Tool to get reports about slabs
+ *
+ * (C) 2007 sgi, Christoph Lameter
+ *
+ * Reworked by Lin Ming <ming.m.lin@intel.com> for SLQB
+ *
+ * Compile by:
+ *
+ * gcc -o slqbinfo slqbinfo.c
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <strings.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <getopt.h>
+#include <regex.h>
+#include <errno.h>
+
+#define MAX_SLABS 500
+#define MAX_ALIASES 500
+#define MAX_NODES 1024
+
+struct slabinfo {
+	char *name;
+	int align, cache_dma, destroy_by_rcu;
+	int hwcache_align, object_size, objs_per_slab;
+	int slab_size, store_user;
+	int order, poison, reclaim_account, red_zone;
+	int batch;
+	unsigned long objects, slabs, total_objects;
+	unsigned long alloc, alloc_slab_fill, alloc_slab_new;
+	unsigned long free, free_remote;
+	unsigned long claim_remote_list, claim_remote_list_objects;
+	unsigned long flush_free_list, flush_free_list_objects, flush_free_list_remote;
+	unsigned long flush_rfree_list, flush_rfree_list_objects;
+	unsigned long flush_slab_free, flush_slab_partial;
+	int numa[MAX_NODES];
+	int numa_partial[MAX_NODES];
+} slabinfo[MAX_SLABS];
+
+int slabs = 0;
+int actual_slabs = 0;
+int highest_node = 0;
+
+char buffer[4096];
+
+int show_empty = 0;
+int show_report = 0;
+int show_slab = 0;
+int skip_zero = 1;
+int show_numa = 0;
+int show_track = 0;
+int validate = 0;
+int shrink = 0;
+int show_inverted = 0;
+int show_totals = 0;
+int sort_size = 0;
+int sort_active = 0;
+int set_debug = 0;
+int show_ops = 0;
+int show_activity = 0;
+
+/* Debug options */
+int sanity = 0;
+int redzone = 0;
+int poison = 0;
+int tracking = 0;
+int tracing = 0;
+
+int page_size;
+
+regex_t pattern;
+
+void fatal(const char *x, ...)
+{
+	va_list ap;
+
+	va_start(ap, x);
+	vfprintf(stderr, x, ap);
+	va_end(ap);
+	exit(EXIT_FAILURE);
+}
+
+void usage(void)
+{
+	printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
+		"slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+		"-A|--activity          Most active slabs first\n"
+		"-d<options>|--debug=<options> Set/Clear Debug options\n"
+		"-D|--display-active    Switch line format to activity\n"
+		"-e|--empty             Show empty slabs\n"
+		"-h|--help              Show usage information\n"
+		"-i|--inverted          Inverted list\n"
+		"-l|--slabs             Show slabs\n"
+		"-n|--numa              Show NUMA information\n"
+		"-o|--ops		Show kmem_cache_ops\n"
+		"-s|--shrink            Shrink slabs\n"
+		"-r|--report		Detailed report on single slabs\n"
+		"-S|--Size              Sort by size\n"
+		"-t|--tracking          Show alloc/free information\n"
+		"-T|--Totals            Show summary information\n"
+		"-v|--validate          Validate slabs\n"
+		"-z|--zero              Include empty slabs\n"
+		"\nValid debug options (FZPUT may be combined)\n"
+		"a / A          Switch on all debug options (=FZUP)\n"
+		"-              Switch off all debug options\n"
+		"f / F          Sanity Checks (SLAB_DEBUG_FREE)\n"
+		"z / Z          Redzoning\n"
+		"p / P          Poisoning\n"
+		"u / U          Tracking\n"
+		"t / T          Tracing\n"
+	);
+}
+
+unsigned long read_obj(const char *name)
+{
+	FILE *f = fopen(name, "r");
+
+	if (!f)
+		buffer[0] = 0;
+	else {
+		if (!fgets(buffer, sizeof(buffer), f))
+			buffer[0] = 0;
+		fclose(f);
+		/* strip the trailing newline, if any */
+		if (buffer[0] && buffer[strlen(buffer) - 1] == '\n')
+			buffer[strlen(buffer) - 1] = 0;
+	}
+	return strlen(buffer);
+}
+
+
+/*
+ * Get the contents of an attribute
+ */
+unsigned long get_obj(const char *name)
+{
+	if (!read_obj(name))
+		return 0;
+
+	return atol(buffer);
+}
+
+unsigned long get_obj_and_str(const char *name, char **x)
+{
+	unsigned long result = 0;
+	char *p;
+
+	*x = NULL;
+
+	if (!read_obj(name))
+		return 0;
+	result = strtoul(buffer, &p, 10);
+	while (*p == ' ')
+		p++;
+	if (*p)
+		*x = strdup(p);
+	return result;
+}
+
+void set_obj(struct slabinfo *s, const char *name, int n)
+{
+	char x[100];
+	FILE *f;
+
+	snprintf(x, 100, "%s/%s", s->name, name);
+	f = fopen(x, "w");
+	if (!f)
+		fatal("Cannot write to %s\n", x);
+
+	fprintf(f, "%d\n", n);
+	fclose(f);
+}
+
+unsigned long read_slab_obj(struct slabinfo *s, const char *name)
+{
+	char x[100];
+	FILE *f;
+	size_t l;
+
+	snprintf(x, 100, "%s/%s", s->name, name);
+	f = fopen(x, "r");
+	if (!f) {
+		buffer[0] = 0;
+		l = 0;
+	} else {
+		l = fread(buffer, 1, sizeof(buffer) - 1, f);
+		buffer[l] = 0;
+		fclose(f);
+	}
+	return l;
+}
+
+
+/*
+ * Put a size string together
+ */
+int store_size(char *buffer, unsigned long value)
+{
+	unsigned long divisor = 1;
+	char trailer = 0;
+	int n;
+
+	if (value > 1000000000UL) {
+		divisor = 100000000UL;
+		trailer = 'G';
+	} else if (value > 1000000UL) {
+		divisor = 100000UL;
+		trailer = 'M';
+	} else if (value > 1000UL) {
+		divisor = 100;
+		trailer = 'K';
+	}
+
+	value /= divisor;
+	n = sprintf(buffer, "%ld",value);
+	if (trailer) {
+		buffer[n] = trailer;
+		n++;
+		buffer[n] = 0;
+	}
+	if (divisor != 1) {
+		memmove(buffer + n - 2, buffer + n - 3, 4);
+		buffer[n-2] = '.';
+		n++;
+	}
+	return n;
+}
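As a worked example, store_size() renders 1500000 as "1.5M": the value is scaled down to 15, the 'M' trailer is appended, and the memmove() splices in the decimal point; values of 1000 or less are printed unscaled.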
+
+void decode_numa_list(int *numa, char *t)
+{
+	int node;
+	int nr;
+
+	memset(numa, 0, MAX_NODES * sizeof(int));
+
+	if (!t)
+		return;
+
+	while (*t == 'N') {
+		t++;
+		node = strtoul(t, &t, 10);
+		if (*t == '=') {
+			t++;
+			nr = strtoul(t, &t, 10);
+			numa[node] = nr;
+			if (node > highest_node)
+				highest_node = node;
+		}
+		while (*t == ' ')
+			t++;
+	}
+}
+
+void slab_validate(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	set_obj(s, "validate", 1);
+}
+
+void slab_shrink(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	set_obj(s, "shrink", 1);
+}
+
+int line = 0;
+
+void first_line(void)
+{
+	if (show_activity)
+		printf("Name                   Objects      Alloc       Free   %%Fill %%New  "
+			"FlushR %%FlushR FlushR_Objs O\n");
+	else
+		printf("Name                   Objects Objsize    Space "
+			" O/S O %%Ef Batch Flg\n");
+}
+
+unsigned long slab_size(struct slabinfo *s)
+{
+	return 	s->slabs * (page_size << s->order);
+}
+
+unsigned long slab_activity(struct slabinfo *s)
+{
+	return 	s->alloc + s->free;
+}
+
+void slab_numa(struct slabinfo *s, int mode)
+{
+	int node;
+
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (!highest_node) {
+		printf("\n%s: No NUMA information available.\n", s->name);
+		return;
+	}
+
+	if (skip_zero && !s->slabs)
+		return;
+
+	if (!line) {
+		printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+		for(node = 0; node <= highest_node; node++)
+			printf(" %4d", node);
+		printf("\n----------------------");
+		for(node = 0; node <= highest_node; node++)
+			printf("-----");
+		printf("\n");
+	}
+	printf("%-21s ", mode ? "All slabs" : s->name);
+	for(node = 0; node <= highest_node; node++) {
+		char b[20];
+
+		store_size(b, s->numa[node]);
+		printf(" %4s", b);
+	}
+	printf("\n");
+	if (mode) {
+		printf("%-21s ", "Partial slabs");
+		for(node = 0; node <= highest_node; node++) {
+			char b[20];
+
+			store_size(b, s->numa_partial[node]);
+			printf(" %4s", b);
+		}
+		printf("\n");
+	}
+	line++;
+}
+
+void show_tracking(struct slabinfo *s)
+{
+	printf("\n%s: Kernel object allocation\n", s->name);
+	printf("-----------------------------------------------------------------------\n");
+	if (read_slab_obj(s, "alloc_calls"))
+		printf("%s", buffer);
+	else
+		printf("No Data\n");
+
+	printf("\n%s: Kernel object freeing\n", s->name);
+	printf("------------------------------------------------------------------------\n");
+	if (read_slab_obj(s, "free_calls"))
+		printf("%s", buffer);
+	else
+		printf("No Data\n");
+
+}
+
+void ops(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (read_slab_obj(s, "ops")) {
+		printf("\n%s: kmem_cache operations\n", s->name);
+		printf("--------------------------------------------\n");
+		printf("%s", buffer);
+	} else
+		printf("\n%s has no kmem_cache operations\n", s->name);
+}
+
+const char *onoff(int x)
+{
+	if (x)
+		return "On ";
+	return "Off";
+}
+
+void slab_stats(struct slabinfo *s)
+{
+	unsigned long total_alloc;
+	unsigned long total_free;
+	unsigned long total;
+
+	total_alloc = s->alloc;
+	total_free = s->free;
+
+	if (!total_alloc)
+		return;
+
+	printf("\n");
+	printf("Slab Perf Counter\n");
+	printf("------------------------------------------------------------------------\n");
+	printf("Alloc: %8lu, partial %8lu, page allocator %8lu\n",
+		total_alloc,
+		s->alloc_slab_fill, s->alloc_slab_new);
+	printf("Free:  %8lu, partial %8lu, page allocator %8lu, remote %5lu\n",
+		total_free,
+		s->flush_slab_partial,
+		s->flush_slab_free,
+		s->free_remote);
+	printf("Claim: %8lu, objects %8lu\n",
+		s->claim_remote_list,
+		s->claim_remote_list_objects);
+	printf("Flush: %8lu, objects %8lu, remote: %8lu\n",
+		s->flush_free_list,
+		s->flush_free_list_objects,
+		s->flush_free_list_remote);
+	printf("FlushR:%8lu, objects %8lu\n",
+		s->flush_rfree_list,
+		s->flush_rfree_list_objects);
+}
+
+void report(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	printf("\nSlabcache: %-20s  Order : %2d Objects: %lu\n",
+		s->name, s->order, s->objects);
+	if (s->hwcache_align)
+		printf("** Hardware cacheline aligned\n");
+	if (s->cache_dma)
+		printf("** Memory is allocated in a special DMA zone\n");
+	if (s->destroy_by_rcu)
+		printf("** Slabs are destroyed via RCU\n");
+	if (s->reclaim_account)
+		printf("** Reclaim accounting active\n");
+
+	printf("\nSizes (bytes)     Slabs              Debug                Memory\n");
+	printf("------------------------------------------------------------------------\n");
+	printf("Object : %7d  Total  : %7ld   Sanity Checks : %s  Total: %7ld\n",
+			s->object_size, s->slabs, "N/A",
+			s->slabs * (page_size << s->order));
+	printf("SlabObj: %7d  Full   : %7s   Redzoning     : %s  Used : %7ld\n",
+			s->slab_size, "N/A",
+			onoff(s->red_zone), s->objects * s->object_size);
+	printf("SlabSiz: %7d  Partial: %7s   Poisoning     : %s  Loss : %7ld\n",
+			page_size << s->order, "N/A", onoff(s->poison),
+			s->slabs * (page_size << s->order) - s->objects * s->object_size);
+	printf("Loss   : %7d  CpuSlab: %7s   Tracking      : %s  Lalig: %7ld\n",
+			s->slab_size - s->object_size, "N/A", onoff(s->store_user),
+			(s->slab_size - s->object_size) * s->objects);
+	printf("Align  : %7d  Objects: %7d   Tracing       : %s  Lpadd: %7ld\n",
+			s->align, s->objs_per_slab, "N/A",
+			((page_size << s->order) - s->objs_per_slab * s->slab_size) *
+			s->slabs);
+
+	ops(s);
+	show_tracking(s);
+	slab_numa(s, 1);
+	slab_stats(s);
+}
+
+void slabcache(struct slabinfo *s)
+{
+	char size_str[20];
+	char flags[20];
+	char *p = flags;
+
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (actual_slabs == 1) {
+		report(s);
+		return;
+	}
+
+	if (skip_zero && !show_empty && !s->slabs)
+		return;
+
+	if (show_empty && s->slabs)
+		return;
+
+	store_size(size_str, slab_size(s));
+
+	if (!line++)
+		first_line();
+
+	if (s->cache_dma)
+		*p++ = 'd';
+	if (s->hwcache_align)
+		*p++ = 'A';
+	if (s->poison)
+		*p++ = 'P';
+	if (s->reclaim_account)
+		*p++ = 'a';
+	if (s->red_zone)
+		*p++ = 'Z';
+	if (s->store_user)
+		*p++ = 'U';
+
+	*p = 0;
+	if (show_activity) {
+		unsigned long total_alloc;
+		unsigned long total_free;
+
+		total_alloc = s->alloc;
+		total_free = s->free;
+
+		printf("%-21s %8ld %10ld %10ld %5ld %5ld %7ld %5d %7ld %8d\n",
+			s->name, s->objects,
+			total_alloc, total_free,
+			total_alloc ? (s->alloc_slab_fill * 100 / total_alloc) : 0,
+			total_alloc ? (s->alloc_slab_new * 100 / total_alloc) : 0,
+			s->flush_rfree_list,
+			(total_alloc + total_free) ? s->flush_rfree_list * 100 /
+				(total_alloc + total_free) : 0,
+			s->flush_rfree_list_objects,
+			s->order);
+	}
+	else
+		printf("%-21s %8ld %7d %8s %4d %1d %3ld %4ld %s\n",
+			s->name, s->objects, s->object_size, size_str,
+			s->objs_per_slab, s->order,
+			s->slabs ? (s->objects * s->object_size * 100) /
+				(s->slabs * (page_size << s->order)) : 100,
+			s->batch, flags);
+}
+
+/*
+ * Analyze debug options. Return false if something is amiss.
+ */
+int debug_opt_scan(char *opt)
+{
+	if (!opt || !opt[0] || strcmp(opt, "-") == 0)
+		return 1;
+
+	if (strcasecmp(opt, "a") == 0) {
+		sanity = 1;
+		poison = 1;
+		redzone = 1;
+		tracking = 1;
+		return 1;
+	}
+
+	for ( ; *opt; opt++)
+	 	switch (*opt) {
+		case 'F' : case 'f':
+			if (sanity)
+				return 0;
+			sanity = 1;
+			break;
+		case 'P' : case 'p':
+			if (poison)
+				return 0;
+			poison = 1;
+			break;
+
+		case 'Z' : case 'z':
+			if (redzone)
+				return 0;
+			redzone = 1;
+			break;
+
+		case 'U' : case 'u':
+			if (tracking)
+				return 0;
+			tracking = 1;
+			break;
+
+		case 'T' : case 't':
+			if (tracing)
+				return 0;
+			tracing = 1;
+			break;
+		default:
+			return 0;
+		}
+	return 1;
+}
+
+int slab_empty(struct slabinfo *s)
+{
+	if (s->objects > 0)
+		return 0;
+
+	/*
+	 * We may still have slabs even if there are no objects. Shrinking will
+	 * remove them.
+	 */
+	if (s->slabs != 0)
+		set_obj(s, "shrink", 1);
+
+	return 1;
+}
+
+void slab_debug(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (redzone && !s->red_zone) {
+		if (slab_empty(s))
+			set_obj(s, "red_zone", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
+	}
+	if (!redzone && s->red_zone) {
+		if (slab_empty(s))
+			set_obj(s, "red_zone", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
+	}
+	if (poison && !s->poison) {
+		if (slab_empty(s))
+			set_obj(s, "poison", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
+	}
+	if (!poison && s->poison) {
+		if (slab_empty(s))
+			set_obj(s, "poison", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
+	}
+	if (tracking && !s->store_user) {
+		if (slab_empty(s))
+			set_obj(s, "store_user", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
+	}
+	if (!tracking && s->store_user) {
+		if (slab_empty(s))
+			set_obj(s, "store_user", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
+	}
+}
+
+void totals(void)
+{
+	struct slabinfo *s;
+
+	int used_slabs = 0;
+	char b1[20], b2[20], b3[20], b4[20];
+	unsigned long long max = 1ULL << 63;
+
+	/* Object size */
+	unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
+
+	/* Number of partial slabs in a slabcache */
+	unsigned long long min_partial = max, max_partial = 0,
+				avg_partial, total_partial = 0;
+
+	/* Number of slabs in a slab cache */
+	unsigned long long min_slabs = max, max_slabs = 0,
+				avg_slabs, total_slabs = 0;
+
+	/* Size of the whole slab */
+	unsigned long long min_size = max, max_size = 0,
+				avg_size, total_size = 0;
+
+	/* Bytes used for object storage in a slab */
+	unsigned long long min_used = max, max_used = 0,
+				avg_used, total_used = 0;
+
+	/* Waste: Bytes used for alignment and padding */
+	unsigned long long min_waste = max, max_waste = 0,
+				avg_waste, total_waste = 0;
+	/* Number of objects in a slab */
+	unsigned long long min_objects = max, max_objects = 0,
+				avg_objects, total_objects = 0;
+	/* Waste per object */
+	unsigned long long min_objwaste = max,
+				max_objwaste = 0, avg_objwaste,
+				total_objwaste = 0;
+
+	/* Memory per object */
+	unsigned long long min_memobj = max,
+				max_memobj = 0, avg_memobj,
+				total_objsize = 0;
+
+	for (s = slabinfo; s < slabinfo + slabs; s++) {
+		unsigned long long size;
+		unsigned long used;
+		unsigned long long wasted;
+		unsigned long long objwaste;
+
+		if (!s->slabs || !s->objects)
+			continue;
+
+		used_slabs++;
+
+		size = slab_size(s);
+		used = s->objects * s->object_size;
+		wasted = size - used;
+		objwaste = s->slab_size - s->object_size;
+
+		if (s->object_size < min_objsize)
+			min_objsize = s->object_size;
+		if (s->slabs < min_slabs)
+			min_slabs = s->slabs;
+		if (size < min_size)
+			min_size = size;
+		if (wasted < min_waste)
+			min_waste = wasted;
+		if (objwaste < min_objwaste)
+			min_objwaste = objwaste;
+		if (s->objects < min_objects)
+			min_objects = s->objects;
+		if (used < min_used)
+			min_used = used;
+		if (s->slab_size < min_memobj)
+			min_memobj = s->slab_size;
+
+		if (s->object_size > max_objsize)
+			max_objsize = s->object_size;
+		if (s->slabs > max_slabs)
+			max_slabs = s->slabs;
+		if (size > max_size)
+			max_size = size;
+		if (wasted > max_waste)
+			max_waste = wasted;
+		if (objwaste > max_objwaste)
+			max_objwaste = objwaste;
+		if (s->objects > max_objects)
+			max_objects = s->objects;
+		if (used > max_used)
+			max_used = used;
+		if (s->slab_size > max_memobj)
+			max_memobj = s->slab_size;
+
+		total_slabs += s->slabs;
+		total_size += size;
+		total_waste += wasted;
+
+		total_objects += s->objects;
+		total_used += used;
+
+		total_objwaste += s->objects * objwaste;
+		total_objsize += s->objects * s->slab_size;
+	}
+
+	if (!total_objects) {
+		printf("No objects\n");
+		return;
+	}
+	if (!used_slabs) {
+		printf("No slabs\n");
+		return;
+	}
+
+	/* Per slab averages */
+	avg_slabs = total_slabs / used_slabs;
+	avg_size = total_size / used_slabs;
+	avg_waste = total_waste / used_slabs;
+
+	avg_objects = total_objects / used_slabs;
+	avg_used = total_used / used_slabs;
+
+	/* Per object object sizes */
+	avg_objsize = total_used / total_objects;
+	avg_objwaste = total_objwaste / total_objects;
+	avg_memobj = total_objsize / total_objects;
+
+	printf("Slabcache Totals\n");
+	printf("----------------\n");
+	printf("Slabcaches : %3d      Active: %3d\n",
+			slabs, used_slabs);
+
+	store_size(b1, total_size);store_size(b2, total_waste);
+	store_size(b3, total_waste * 100 / total_used);
+	printf("Memory used: %6s   # Loss   : %6s   MRatio:%6s%%\n", b1, b2, b3);
+
+	store_size(b1, total_objects);
+	printf("# Objects  : %6s\n", b1);
+
+	printf("\n");
+	printf("Per Cache    Average         Min         Max       Total\n");
+	printf("---------------------------------------------------------\n");
+
+	store_size(b1, avg_objects);store_size(b2, min_objects);
+	store_size(b3, max_objects);store_size(b4, total_objects);
+	printf("#Objects  %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_slabs);store_size(b2, min_slabs);
+	store_size(b3, max_slabs);store_size(b4, total_slabs);
+	printf("#Slabs    %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_size);store_size(b2, min_size);
+	store_size(b3, max_size);store_size(b4, total_size);
+	printf("Memory    %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_used);store_size(b2, min_used);
+	store_size(b3, max_used);store_size(b4, total_used);
+	printf("Used      %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_waste);store_size(b2, min_waste);
+	store_size(b3, max_waste);store_size(b4, total_waste);
+	printf("Loss      %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	printf("\n");
+	printf("Per Object   Average         Min         Max\n");
+	printf("---------------------------------------------\n");
+
+	store_size(b1, avg_memobj);store_size(b2, min_memobj);
+	store_size(b3, max_memobj);
+	printf("Memory    %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+	store_size(b1, avg_objsize);store_size(b2, min_objsize);
+	store_size(b3, max_objsize);
+	printf("User      %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+
+	store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
+	store_size(b3, max_objwaste);
+	printf("Loss      %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+}
+
+void sort_slabs(void)
+{
+	struct slabinfo *s1,*s2;
+
+	for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
+		for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
+			int result;
+
+			if (sort_size)
+				result = slab_size(s1) < slab_size(s2);
+			else if (sort_active)
+				result = slab_activity(s1) < slab_activity(s2);
+			else
+				result = strcasecmp(s1->name, s2->name);
+
+			if (show_inverted)
+				result = -result;
+
+			if (result > 0) {
+				struct slabinfo t;
+
+				memcpy(&t, s1, sizeof(struct slabinfo));
+				memcpy(s1, s2, sizeof(struct slabinfo));
+				memcpy(s2, &t, sizeof(struct slabinfo));
+			}
+		}
+	}
+}
+
+int slab_mismatch(char *slab)
+{
+	return regexec(&pattern, slab, 0, NULL, 0);
+}
+
+void read_slab_dir(void)
+{
+	DIR *dir;
+	struct dirent *de;
+	struct slabinfo *slab = slabinfo;
+	char *p;
+	char *t;
+	int count;
+
+	if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
+		fatal("SYSFS support for SLQB not active\n");
+
+	dir = opendir(".");
+	while ((de = readdir(dir))) {
+		if (de->d_name[0] == '.' ||
+			(de->d_name[0] != ':' && slab_mismatch(de->d_name)))
+				continue;
+		switch (de->d_type) {
+		   case DT_DIR:
+			if (chdir(de->d_name))
+				fatal("Unable to access slab %s\n", slab->name);
+		   	slab->name = strdup(de->d_name);
+			slab->align = get_obj("align");
+			slab->cache_dma = get_obj("cache_dma");
+			slab->destroy_by_rcu = get_obj("destroy_by_rcu");
+			slab->hwcache_align = get_obj("hwcache_align");
+			slab->object_size = get_obj("object_size");
+			slab->objects = get_obj("objects");
+			slab->total_objects = get_obj("total_objects");
+			slab->objs_per_slab = get_obj("objs_per_slab");
+			slab->order = get_obj("order");
+			slab->poison = get_obj("poison");
+			slab->reclaim_account = get_obj("reclaim_account");
+			slab->red_zone = get_obj("red_zone");
+			slab->slab_size = get_obj("slab_size");
+			slab->slabs = get_obj_and_str("slabs", &t);
+			decode_numa_list(slab->numa, t);
+			free(t);
+			slab->store_user = get_obj("store_user");
+			slab->batch = get_obj("batch");
+			slab->alloc = get_obj("alloc");
+			slab->alloc_slab_fill = get_obj("alloc_slab_fill");
+			slab->alloc_slab_new = get_obj("alloc_slab_new");
+			slab->free = get_obj("free");
+			slab->free_remote = get_obj("free_remote");
+			slab->claim_remote_list = get_obj("claim_remote_list");
+			slab->claim_remote_list_objects = get_obj("claim_remote_list_objects");
+			slab->flush_free_list = get_obj("flush_free_list");
+			slab->flush_free_list_objects = get_obj("flush_free_list_objects");
+			slab->flush_free_list_remote = get_obj("flush_free_list_remote");
+			slab->flush_rfree_list = get_obj("flush_rfree_list");
+			slab->flush_rfree_list_objects = get_obj("flush_rfree_list_objects");
+			slab->flush_slab_free = get_obj("flush_slab_free");
+			slab->flush_slab_partial = get_obj("flush_slab_partial");
+
+			chdir("..");
+			slab++;
+			break;
+		   default :
+			fatal("Unknown file type %lx\n", de->d_type);
+		}
+	}
+	closedir(dir);
+	slabs = slab - slabinfo;
+	actual_slabs = slabs;
+	if (slabs > MAX_SLABS)
+		fatal("Too many slabs\n");
+}
+
+void output_slabs(void)
+{
+	struct slabinfo *slab;
+
+	for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
+
+		if (show_numa)
+			slab_numa(slab, 0);
+		else if (show_track)
+			show_tracking(slab);
+		else if (validate)
+			slab_validate(slab);
+		else if (shrink)
+			slab_shrink(slab);
+		else if (set_debug)
+			slab_debug(slab);
+		else if (show_ops)
+			ops(slab);
+		else if (show_slab)
+			slabcache(slab);
+		else if (show_report)
+			report(slab);
+	}
+}
+
+struct option opts[] = {
+	{ "activity", 0, NULL, 'A' },
+	{ "debug", 2, NULL, 'd' },
+	{ "display-activity", 0, NULL, 'D' },
+	{ "empty", 0, NULL, 'e' },
+	{ "help", 0, NULL, 'h' },
+	{ "inverted", 0, NULL, 'i'},
+	{ "numa", 0, NULL, 'n' },
+	{ "ops", 0, NULL, 'o' },
+	{ "report", 0, NULL, 'r' },
+	{ "shrink", 0, NULL, 's' },
+	{ "slabs", 0, NULL, 'l' },
+	{ "track", 0, NULL, 't'},
+	{ "validate", 0, NULL, 'v' },
+	{ "zero", 0, NULL, 'z' },
+	{ "1ref", 0, NULL, '1'},
+	{ NULL, 0, NULL, 0 }
+};
+
+int main(int argc, char *argv[])
+{
+	int c;
+	int err;
+	char *pattern_source;
+
+	page_size = getpagesize();
+
+	while ((c = getopt_long(argc, argv, "Ad::Dehil1noprstvzTS",
+						opts, NULL)) != -1)
+		switch (c) {
+		case 'A':
+			sort_active = 1;
+			break;
+		case 'd':
+			set_debug = 1;
+			if (!debug_opt_scan(optarg))
+				fatal("Invalid debug option '%s'\n", optarg);
+			break;
+		case 'D':
+			show_activity = 1;
+			break;
+		case 'e':
+			show_empty = 1;
+			break;
+		case 'h':
+			usage();
+			return 0;
+		case 'i':
+			show_inverted = 1;
+			break;
+		case 'n':
+			show_numa = 1;
+			break;
+		case 'o':
+			show_ops = 1;
+			break;
+		case 'r':
+			show_report = 1;
+			break;
+		case 's':
+			shrink = 1;
+			break;
+		case 'l':
+			show_slab = 1;
+			break;
+		case 't':
+			show_track = 1;
+			break;
+		case 'v':
+			validate = 1;
+			break;
+		case 'z':
+			skip_zero = 0;
+			break;
+		case 'T':
+			show_totals = 1;
+			break;
+		case 'S':
+			sort_size = 1;
+			break;
+
+		default:
+			fatal("%s: Invalid option '%c'\n", argv[0], optopt);
+
+	}
+
+	if (!show_slab && !show_track && !show_report
+		&& !validate && !shrink && !set_debug && !show_ops)
+			show_slab = 1;
+
+	if (argc > optind)
+		pattern_source = argv[optind];
+	else
+		pattern_source = ".*";
+
+	err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
+	if (err)
+		fatal("%s: Invalid pattern '%s' code %d\n",
+			argv[0], pattern_source, err);
+	read_slab_dir();
+	if (show_totals)
+		totals();
+	else {
+		sort_slabs();
+		output_slabs();
+	}
+	return 0;
+}

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-21 14:30 ` Nick Piggin
@ 2009-01-23  9:55   ` Andi Kleen
  -1 siblings, 0 replies; 197+ messages in thread
From: Andi Kleen @ 2009-01-23  9:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

Nick Piggin <npiggin@suse.de> writes:

Not a full review, just some things I noticed.

The code is very readable, thanks (that's IMHO the main reason slab.c
should go, BTW; it's really messy and hard to get through).

> Using lists rather than arrays can reduce the cacheline footprint. When moving
> objects around, SLQB can move a list of objects from one CPU to another by
> simply manipulating a head pointer, wheras SLAB needs to memcpy arrays. Some
> SLAB per-CPU arrays can be up to 1K in size, which is a lot of cachelines that
> can be touched during alloc/free. Newly freed objects tend to be cache hot,
> and newly allocated ones tend to soon be touched anyway, so often there is
> little cost to using metadata in the objects.

You're probably aware of this, but the obvious counter-argument
is that for manipulating a single object a doubly linked
list will always require touching three cache lines
(prev, current, next), while an array access touches only a single one.
A possible alternative would be a list of shorter arrays.

> +	int objsize;		/* The size of an object without meta data */
> +	int offset;		/* Free pointer offset. */
> +	int objects;		/* Number of objects in slab */
> +
> +	int size;		/* The size of an object including meta data */
> +	int order;		/* Allocation order */
> +	gfp_t allocflags;	/* gfp flags to use on allocation */
> +	unsigned int colour_range;	/* range of colour counter */
> +	unsigned int colour_off;		/* offset per colour */
> +	void (*ctor)(void *);
> +
> +	const char *name;	/* Name (only for display!) */
> +	struct list_head list;	/* List of slab caches */
> +
> +	int align;		/* Alignment */
> +	int inuse;		/* Offset to metadata */

I suspect some of these fields could be short or char (e.g. alignment),
possibly lowering the cache line impact.

> +
> +#ifdef CONFIG_SLQB_SYSFS
> +	struct kobject kobj;	/* For sysfs */
> +#endif
> +#ifdef CONFIG_NUMA
> +	struct kmem_cache_node *node[MAX_NUMNODES];
> +#endif
> +#ifdef CONFIG_SMP
> +	struct kmem_cache_cpu *cpu_slab[NR_CPUS];

Those both really need to be dynamically allocated, otherwise
they waste a lot of memory in the common case
(e.g. an NR_CPUS==128 kernel on a dual-core system). And of course
on the proposed NR_CPUS==4096 kernels it becomes prohibitive.

You could use alloc_percpu? There's no alloc_pernode
unfortunately; perhaps there should be one.

> +#if L1_CACHE_BYTES < 64
> +	if (size > 64 && size <= 96)
> +		return 1;
> +#endif
> +#if L1_CACHE_BYTES < 128
> +	if (size > 128 && size <= 192)
> +		return 2;
> +#endif
> +	if (size <=	  8) return 3;
> +	if (size <=	 16) return 4;
> +	if (size <=	 32) return 5;
> +	if (size <=	 64) return 6;
> +	if (size <=	128) return 7;
> +	if (size <=	256) return 8;
> +	if (size <=	512) return 9;
> +	if (size <=       1024) return 10;
> +	if (size <=   2 * 1024) return 11;
> +	if (size <=   4 * 1024) return 12;
> +	if (size <=   8 * 1024) return 13;
> +	if (size <=  16 * 1024) return 14;
> +	if (size <=  32 * 1024) return 15;
> +	if (size <=  64 * 1024) return 16;
> +	if (size <= 128 * 1024) return 17;
> +	if (size <= 256 * 1024) return 18;
> +	if (size <= 512 * 1024) return 19;
> +	if (size <= 1024 * 1024) return 20;
> +	if (size <=  2 * 1024 * 1024) return 21;

Have you looked into other bin sizes? IIRC the original slab paper
mentioned that power of two is usually not the best.

> +	return -1;

> +}
> +
> +#ifdef CONFIG_ZONE_DMA
> +#define SLQB_DMA __GFP_DMA
> +#else
> +/* Disable "DMA slabs" */
> +#define SLQB_DMA (__force gfp_t)0
> +#endif
> +
> +/*
> + * Find the kmalloc slab cache for a given combination of allocation flags and
> + * size.

You should mention that this would be a very bad idea to call for !__builtin_constant_p(size)

> + */
> +static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
> +{
> +	int index = kmalloc_index(size);
> +
> +	if (unlikely(index == 0))
> +		return NULL;
> +
> +	if (likely(!(flags & SLQB_DMA)))
> +		return &kmalloc_caches[index];
> +	else
> +		return &kmalloc_caches_dma[index];

BTW I had an old patchkit to kill all GFP_DMA slab users. Perhaps I should
warm that up again. That would lower the inline footprint.

> +#ifdef CONFIG_NUMA
> +void *__kmalloc_node(size_t size, gfp_t flags, int node);
> +void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
> +
> +static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)

kmalloc_node should be infrequent; I suspect it can safely be moved out of line.

> + * - investiage performance with memoryless nodes. Perhaps CPUs can be given
> + *   a default closest home node via which it can use fastpath functions.

FWIW that is what x86-64 always did. Perhaps you can just fix ia64 to do 
that too and be happy.

> + *   Perhaps it is not a big problem.
> + */
> +
> +/*
> + * slqb_page overloads struct page, and is used to manage some slob allocation
> + * aspects, however to avoid the horrible mess in include/linux/mm_types.h,
> + * we'll just define our own struct slqb_page type variant here.

Hopefully this works for the crash dumpers. Do they have a way to distinguish
slub/slqb/slab kernels with different struct page usage?

> +#define PG_SLQB_BIT (1 << PG_slab)
> +
> +static int kmem_size __read_mostly;
> +#ifdef CONFIG_NUMA
> +static int numa_platform __read_mostly;
> +#else
> +#define numa_platform 0
> +#endif

It would be cheaper to put that as a flag into the kmem_cache flags; that
way you avoid touching an additional cache line.
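
Just to illustrate (a sketch only: the flag bit and the helper are made up,
and it assumes there is a spare bit in a per-cache flags word that the fast
path already loads):

#define SLQB_NUMA_PLATFORM	0x40000000UL	/* hypothetical flag bit */

static inline int slab_numa_platform(struct kmem_cache *s)
{
	/* s->flags is (assumed to be) touched by the fast path anyway */
	return !!(s->flags & SLQB_NUMA_PLATFORM);
}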

> +static inline int slqb_page_to_nid(struct slqb_page *page)
> +{
> +	return page_to_nid(&page->page);
> +}

etc. You've got a lot of wrappers...

> +static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
> +						unsigned int order)
> +{
> +	struct page *p;
> +
> +	if (nid == -1)
> +		p = alloc_pages(flags, order);
> +	else
> +		p = alloc_pages_node(nid, flags, order);

alloc_pages_node does that check anyway.


> +/* Not all arches define cache_line_size */
> +#ifndef cache_line_size
> +#define cache_line_size()	L1_CACHE_BYTES
> +#endif
> +

They should. Better to fix them?


> +
> +	/*
> +	 * Determine which debug features should be switched on
> +	 */

It would be nicer if you could use long options. At least for me
that would increase the probability that I could remember them
without having to look them up.

> +/*
> + * Allocate a new slab, set up its object list.
> + */
> +static struct slqb_page *new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
> +{
> +	struct slqb_page *page;
> +	void *start;
> +	void *last;
> +	void *p;
> +
> +	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> +
> +	page = allocate_slab(s,
> +		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> +	if (!page)
> +		goto out;
> +
> +	page->flags |= PG_SLQB_BIT;
> +
> +	start = page_address(&page->page);
> +
> +	if (unlikely(slab_poison(s)))
> +		memset(start, POISON_INUSE, PAGE_SIZE << s->order);
> +
> +	start += colour;

One thing I was wondering: did you try disabling the colouring to see
if it makes much difference on modern systems? They tend to have either
larger caches or higher-associativity caches.

Or perhaps it could be made optional based on CPU type?


> +static noinline void *__slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)
> +{
> +	struct slqb_page *page;
> +	struct kmem_cache_list *l;
> +	struct kmem_cache_cpu *c;
> +	unsigned int colour;
> +	void *object;
> +
> +	c = get_cpu_slab(s, smp_processor_id());
> +	colour = c->colour_next;
> +	c->colour_next += s->colour_off;
> +	if (c->colour_next >= s->colour_range)
> +		c->colour_next = 0;
> +
> +	/* XXX: load any partial? */
> +
> +	/* Caller handles __GFP_ZERO */
> +	gfpflags &= ~__GFP_ZERO;
> +
> +	if (gfpflags & __GFP_WAIT)
> +		local_irq_enable();

At least on P4 you could get some win by avoiding the local_irq_save() up in the fast
path when __GFP_WAIT is set (because storing the eflags is very expensive there)

> +
> +again:
> +	local_irq_save(flags);
> +	object = __slab_alloc(s, gfpflags, node);
> +	local_irq_restore(flags);
> +
> +	if (unlikely(slab_debug(s)) && likely(object)) {

AFAIK gcc cannot handle multiple likely() hints in a single condition.

> +/* Initial slabs */
> +#ifdef CONFIG_SMP
> +static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
> +#endif
> +#ifdef CONFIG_NUMA
> +static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
> +#endif
> +
> +#ifdef CONFIG_SMP
> +static struct kmem_cache kmem_cpu_cache;
> +static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
> +#ifdef CONFIG_NUMA
> +static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
> +#endif
> +#endif
> +
> +#ifdef CONFIG_NUMA
> +static struct kmem_cache kmem_node_cache;
> +static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
> +static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
> +#endif

That all needs fixing too of course.

> +
> +#ifdef CONFIG_SMP
> +static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
> +{
> +	struct kmem_cache_cpu *c;
> +
> +	c = kmem_cache_alloc_node(&kmem_cpu_cache, GFP_KERNEL, cpu_to_node(cpu));
> +	if (!c)
> +		return NULL;
> +
> +	init_kmem_cache_cpu(s, c);
> +	return c;
> +}
> +
> +static void free_kmem_cache_cpus(struct kmem_cache *s)
> +{
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {

Is this protected against racing cpu hotplugs? Doesn't look like it. Multiple occurrences.
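
For example (just a sketch; it assumes the usual get/put_online_cpus() from
<linux/cpu.h> is the protection you want, and the loop body stands in for
whatever the function already does):

static void free_kmem_cache_cpus(struct kmem_cache *s)
{
	int cpu;

	get_online_cpus();	/* keep the online mask stable */
	for_each_online_cpu(cpu) {
		/* ... free s->cpu_slab[cpu] as the existing code does ... */
	}
	put_online_cpus();
}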

> +static void cache_trim_worker(struct work_struct *w)
> +{
> +	struct delayed_work *work =
> +		container_of(w, struct delayed_work, work);
> +	struct kmem_cache *s;
> +	int node;
> +
> +	if (!down_read_trylock(&slqb_lock))
> +		goto out;

No counter for this?

> +
> +	/*
> +	 * We are bringing a node online. No memory is availabe yet. We must
> +	 * allocate a kmem_cache_node structure in order to bring the node
> +	 * online.
> +	 */
> +	down_read(&slqb_lock);
> +	list_for_each_entry(s, &slab_caches, list) {
> +		/*
> +		 * XXX: kmem_cache_alloc_node will fallback to other nodes
> +		 *      since memory is not yet available from the node that
> +		 *      is brought up.
> +		 */
> +		if (s->node[nid]) /* could be lefover from last online */
> +			continue;
> +		n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
> +		if (!n) {
> +			ret = -ENOMEM;

Surely that should panic? I don't think a slab-less node will
be very useful later.

> +#ifdef CONFIG_SLQB_SYSFS
> +/*
> + * sysfs API
> + */
> +#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
> +#define to_slab(n) container_of(n, struct kmem_cache, kobj);
> +
> +struct slab_attribute {
> +	struct attribute attr;
> +	ssize_t (*show)(struct kmem_cache *s, char *buf);
> +	ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
> +};
> +
> +#define SLAB_ATTR_RO(_name) \
> +	static struct slab_attribute _name##_attr = __ATTR_RO(_name)
> +
> +#define SLAB_ATTR(_name) \
> +	static struct slab_attribute _name##_attr =  \
> +	__ATTR(_name, 0644, _name##_show, _name##_store)
> +
> +static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
> +{
> +	return sprintf(buf, "%d\n", s->size);
> +}
> +SLAB_ATTR_RO(slab_size);
> +
> +static ssize_t align_show(struct kmem_cache *s, char *buf)
> +{
> +	return sprintf(buf, "%d\n", s->align);
> +}
> +SLAB_ATTR_RO(align);
> +

When you map back to the attribute you could use an index into a table
for the field, saving that many functions?

> +#define STAT_ATTR(si, text) 					\
> +static ssize_t text##_show(struct kmem_cache *s, char *buf)	\
> +{								\
> +	return show_stat(s, buf, si);				\
> +}								\
> +SLAB_ATTR_RO(text);						\
> +
> +STAT_ATTR(ALLOC, alloc);
> +STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
> +STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
> +STAT_ATTR(FREE, free);
> +STAT_ATTR(FREE_REMOTE, free_remote);
> +STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
> +STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
> +STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
> +STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
> +STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
> +STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
> +STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
> +STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
> +STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);

This really should be table-driven, shouldn't it? That would give much
smaller code.
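
Roughly like this, I mean (a sketch only, not code from the patch, and it
ignores the sysfs_ops plumbing; ->show would have to receive the attribute
so the index can be computed from it):

/* One name table plus a loop instead of N macro expansions. */
static const char *const stat_names[] = {
	"alloc", "alloc_slab_fill", "alloc_slab_new", "free", "free_remote",
	"flush_free_list", "flush_free_list_objects", "flush_free_list_remote",
	"flush_slab_partial", "flush_slab_free", "flush_rfree_list",
	"flush_rfree_list_objects", "claim_remote_list",
	"claim_remote_list_objects",
};

static struct slab_attribute stat_attrs[ARRAY_SIZE(stat_names)];

static void __init init_stat_attrs(void)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(stat_names); i++) {
		stat_attrs[i].attr.name = stat_names[i];
		stat_attrs[i].attr.mode = 0444;
		/* stat index i recovered in ->show as attr - stat_attrs */
	}
}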

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23  9:55   ` Andi Kleen
@ 2009-01-23 10:13     ` Pekka Enberg
  -1 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-01-23 10:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

Hi Andi,

On Fri, 2009-01-23 at 10:55 +0100, Andi Kleen wrote:
> > +#if L1_CACHE_BYTES < 64
> > +	if (size > 64 && size <= 96)
> > +		return 1;
> > +#endif
> > +#if L1_CACHE_BYTES < 128
> > +	if (size > 128 && size <= 192)
> > +		return 2;
> > +#endif
> > +	if (size <=	  8) return 3;
> > +	if (size <=	 16) return 4;
> > +	if (size <=	 32) return 5;
> > +	if (size <=	 64) return 6;
> > +	if (size <=	128) return 7;
> > +	if (size <=	256) return 8;
> > +	if (size <=	512) return 9;
> > +	if (size <=       1024) return 10;
> > +	if (size <=   2 * 1024) return 11;
> > +	if (size <=   4 * 1024) return 12;
> > +	if (size <=   8 * 1024) return 13;
> > +	if (size <=  16 * 1024) return 14;
> > +	if (size <=  32 * 1024) return 15;
> > +	if (size <=  64 * 1024) return 16;
> > +	if (size <= 128 * 1024) return 17;
> > +	if (size <= 256 * 1024) return 18;
> > +	if (size <= 512 * 1024) return 19;
> > +	if (size <= 1024 * 1024) return 20;
> > +	if (size <=  2 * 1024 * 1024) return 21;
> 
> Have you looked into other binsizes?  iirc the original slab paper
> mentioned that power of two is usually not the best.

Judging by the limited boot-time testing I've done with kmemtrace, the
bulk of kmalloc() allocations are under 64 bytes or so and actually a
pretty OK fit with the current sizes. The badly fitting objects are
usually very big and of differing sizes (so they won't share a cache
easily), so I'm not expecting big gains from non-power-of-two sizes.
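
For what it's worth, here is a toy userspace illustration of the per-object
overhead from rounding up to power-of-two bins (the request sizes are
arbitrary examples, not measured data, and the real kmalloc_index() above
also has the 96- and 192-byte caches):

#include <stdio.h>
#include <stddef.h>

static size_t pow2_bin(size_t size)
{
	size_t bin = 8;			/* assumed smallest bin */

	while (bin < size)
		bin <<= 1;
	return bin;
}

int main(void)
{
	size_t sizes[] = { 24, 56, 200, 700, 3000 };	/* arbitrary */
	size_t i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		size_t bin = pow2_bin(sizes[i]);

		printf("request %4zu -> bin %4zu, waste %4zu (%zu%%)\n",
			sizes[i], bin, bin - sizes[i],
			(bin - sizes[i]) * 100 / bin);
	}
	return 0;
}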

			Pekka


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23  9:55   ` Andi Kleen
@ 2009-01-23 11:25     ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23 11:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Fri, Jan 23, 2009 at 10:55:26AM +0100, Andi Kleen wrote:
> Nick Piggin <npiggin@suse.de> writes:
> 
> Not a full review, just some things i noticed.
> 
> The code is very readable thanks (that's imho the main reason slab.c
> should go btw, it's really messy and hard to get through)

Thanks, appreciated. It is very helpful.

 
> > Using lists rather than arrays can reduce the cacheline footprint. When moving
> > objects around, SLQB can move a list of objects from one CPU to another by
> > simply manipulating a head pointer, wheras SLAB needs to memcpy arrays. Some
> > SLAB per-CPU arrays can be up to 1K in size, which is a lot of cachelines that
> > can be touched during alloc/free. Newly freed objects tend to be cache hot,
> > and newly allocated ones tend to soon be touched anyway, so often there is
> > little cost to using metadata in the objects.
> 
> You're probably aware of that, but the obvious counter argument
> is that for manipulating a single object a double linked
> list will always require touching three cache lines
> (prev, current, next), while an array access only a single one.
> A possible alternative would be a list of shorter arrays.

That's true, but SLQB doesn't use doubly linked lists, only singly linked ones.
An allocation needs to load a "head" pointer to the first object, then
load a "next" pointer from that object and assign it to "head". The
2nd load touches memory which should subsequently be touched by the
caller anyway. A free just has to assign a pointer in the to-be-freed
object to point to the old head, and then update the head to the new
object. So this 1st touch should usually be cache-hot memory.

But yes there are situations where SLAB scheme could result in
fewer cache misses. I haven't yet noticed it is a problem.
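
In rough pseudo-C, the pointer manipulation I mean looks like this (a sketch
only, not the SLQB code, with the "next" pointer stored in the object memory
itself as described above):

struct freelist {
	void *head;			/* first free object, or NULL */
};

/* free: store the old head inside the object, make it the new head */
static void freelist_push(struct freelist *l, void *object)
{
	*(void **)object = l->head;	/* the object is usually still cache hot */
	l->head = object;
}

/* alloc: take the head, then load the "next" pointer stored inside it */
static void *freelist_pop(struct freelist *l)
{
	void *object = l->head;

	if (object)
		l->head = *(void **)object; /* caller touches this line soon anyway */
	return object;
}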


> > +	const char *name;	/* Name (only for display!) */
> > +	struct list_head list;	/* List of slab caches */
> > +
> > +	int align;		/* Alignment */
> > +	int inuse;		/* Offset to metadata */
> 
> I suspect some of these fields could be short or char (E.g. alignment),
> possibly lowering cache line impact.

Good point. I'll have to do a pass through all the structures and
make sure sizes and alignments etc. are optimal. I have ordered them
somewhat, e.g. so that LIFO freelist allocations only have to
touch the first few fields in the structures, then partial page
list allocations touch the next few, then the page allocator path, etc.

But that might have gone out of date a little bit.
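
Something like this is the intent (purely illustrative, with made-up field
names and kernel types assumed; it is not the real layout):

struct cache_layout_sketch {
	/* fast path: LIFO freelist alloc/free should only touch these */
	void *freelist_head;
	unsigned int freelist_nr;

	/* next: refilling from / flushing to the partial-slab list */
	struct list_head partial;
	unsigned int nr_partial;

	/* slow path: going to the page allocator */
	unsigned int order;
	gfp_t allocflags;
};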


> > +#ifdef CONFIG_SLQB_SYSFS
> > +	struct kobject kobj;	/* For sysfs */
> > +#endif
> > +#ifdef CONFIG_NUMA
> > +	struct kmem_cache_node *node[MAX_NUMNODES];
> > +#endif
> > +#ifdef CONFIG_SMP
> > +	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
> 
> Those both really need to be dynamically allocated, otherwise
> it wastes a lot of memory in the common case
> (e.g. NR_CPUS==128 kernel on dual core system). And of course
> on the proposed NR_CPUS==4096 kernels it becomes prohibitive.
> 
> You could use alloc_percpu? There's no alloc_pernode 
> unfortunately, perhaps there should be one. 

cpu_slab is dynamically allocated, by just changing the size of
the kmem_cache cache at boot time. Probably the best way would
be to have dynamic cpu and node allocs for them, I agree.

Any plans for an alloc_pernode?
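
Something like the following is roughly what I have in mind, I guess (a
sketch only: no such helper exists, the name is made up, and a real version
would need to handle node hotplug):

/* hypothetical alloc_pernode(): one object per node, allocated node-locally */
static void **alloc_pernode(size_t size, gfp_t flags)
{
	void **ptrs;
	int node;

	ptrs = kcalloc(nr_node_ids, sizeof(void *), flags);
	if (!ptrs)
		return NULL;

	for_each_node_state(node, N_NORMAL_MEMORY) {
		ptrs[node] = kmalloc_node(size, flags, node);
		if (!ptrs[node])
			goto fail;
	}
	return ptrs;

fail:
	for_each_node_state(node, N_NORMAL_MEMORY)
		kfree(ptrs[node]);	/* kfree(NULL) is fine */
	kfree(ptrs);
	return NULL;
}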


> > +	if (size <=  2 * 1024 * 1024) return 21;
> 
> Have you looked into other binsizes?  iirc the original slab paper
> mentioned that power of two is usually not the best.

No, I haven't. So far I have been spending most of the effort just on
improving SLQB versus the other allocators without
changing things like this. But it would be fine to investigate
when SLQB is more mature, or for somebody else to look at.

> > +/*
> > + * Find the kmalloc slab cache for a given combination of allocation flags and
> > + * size.
> 
> You should mention that this would be a very bad idea to call for !__builtin_constant_p(size)

OK. It's not meant to be used outside slqb_def.h, however.


> > +static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
> > +{
> > +	int index = kmalloc_index(size);
> > +
> > +	if (unlikely(index == 0))
> > +		return NULL;
> > +
> > +	if (likely(!(flags & SLQB_DMA)))
> > +		return &kmalloc_caches[index];
> > +	else
> > +		return &kmalloc_caches_dma[index];
> 
> BTW i had an old patchkit to kill all GFP_DMA slab users. Perhaps should
> warm that up again. That would lower the inline footprint.

That would be excellent. It would also reduce constant data overheads
for SLAB and SLQB, and remove some nasty code from SLUB.


> > +#ifdef CONFIG_NUMA
> > +void *__kmalloc_node(size_t size, gfp_t flags, int node);
> > +void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
> > +
> > +static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
> 
> kmalloc_node should be infrequent, i suspect it can be safely out of lined.

Hmm... I wonder how much it increases code size...


> > + * - investiage performance with memoryless nodes. Perhaps CPUs can be given
> > + *   a default closest home node via which it can use fastpath functions.
> 
> FWIW that is what x86-64 always did. Perhaps you can just fix ia64 to do 
> that too and be happy.

What if the node is possible but not currently online?

 
> > + * aspects, however to avoid the horrible mess in include/linux/mm_types.h,
> > + * we'll just define our own struct slqb_page type variant here.
> 
> Hopefully this works for the crash dumpers. Do they have a way to distingush
> slub/slqb/slab kernels with different struct page usage?

Beyond looking at configs or hacks like looking at symbols, I don't
think so... It probably should go into vermagic I guess.


> > +#define PG_SLQB_BIT (1 << PG_slab)
> > +
> > +static int kmem_size __read_mostly;
> > +#ifdef CONFIG_NUMA
> > +static int numa_platform __read_mostly;
> > +#else
> > +#define numa_platform 0
> > +#endif
> 
> It would be cheaper if you put that as a flag into the kmem_caches flags, this
> way you avoid an additional cache line touched.

Ok, that works.

 
> > +static inline int slqb_page_to_nid(struct slqb_page *page)
> > +{
> > +	return page_to_nid(&page->page);
> > +}
> 
> etc. you got a lot of wrappers...

I think they're not too bad though.

 
> > +static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
> > +						unsigned int order)
> > +{
> > +	struct page *p;
> > +
> > +	if (nid == -1)
> > +		p = alloc_pages(flags, order);
> > +	else
> > +		p = alloc_pages_node(nid, flags, order);
> 
> alloc_pages_nodes does that check anyways.

OK, I'll rip that wrapper out completely.


> > +/* Not all arches define cache_line_size */
> > +#ifndef cache_line_size
> > +#define cache_line_size()	L1_CACHE_BYTES
> > +#endif
> > +
> 
> They should. better fix them?

git grep -l -e cache_line_size arch/ | egrep '\.h$'

Only ia64, mips, powerpc, sparc, x86...

> > +	/*
> > +	 * Determine which debug features should be switched on
> > +	 */
> 
> It would be nicer if you could use long options. At least for me
> that would increase the probability that I could remember them
> without having to look them up.

I haven't looked closely at the debug code, which is mostly straight
out of SLUB with minimal changes to get it working. Of course it is
very important, but useless if the core allocator isn't good. I
also don't want to diverge from SLUB in these areas if possible until
we reduce the number of allocators in the tree...

Long options are probably not a bad idea, though.


> > +	if (unlikely(slab_poison(s)))
> > +		memset(start, POISON_INUSE, PAGE_SIZE << s->order);
> > +
> > +	start += colour;
> 
> One thing i was wondering. Did you try to disable the colouring and see
> if it makes much difference on modern systems? They tend to have either
> larger caches or higher associativity caches.

I have tried, but I don't think I found a test where it made a
statistically significant difference. It is not very costly to
implement, though.

 
> Or perhaps it could be made optional based on CPU type?

It could easily be changed, yes.


 
> > +
> > +again:
> > +	local_irq_save(flags);
> > +	object = __slab_alloc(s, gfpflags, node);
> > +	local_irq_restore(flags);
> 
> At least on P4 you could get some win by avoiding the local_irq_save() up in the fast
> path when __GFP_WAIT is set (because storing the eflags is very expensive there)

That's a good point, although also something trivially applicable to
all allocators and as such I prefer not to add such differences to
the SLQB patch if we are going into an evaluation phase.


> > +/* Initial slabs */
> > +#ifdef CONFIG_SMP
> > +static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
> > +#endif
> > +#ifdef CONFIG_NUMA
> > +static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
> > +#endif
> > +
> > +#ifdef CONFIG_SMP
> > +static struct kmem_cache kmem_cpu_cache;
> > +static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
> > +#ifdef CONFIG_NUMA
> > +static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
> > +#endif
> > +#endif
> > +
> > +#ifdef CONFIG_NUMA
> > +static struct kmem_cache kmem_node_cache;
> > +static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
> > +static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
> > +#endif
> 
> That all needs fixing too of course.

Hmm. I was hoping it could stay simple, as it is just a static constant
(for a given NR_CPUS) overhead. I wonder if bootmem is still up here?
How fine-grained is it these days?

I could bite the bullet and do a multi-stage bootstrap like SLUB, but I
want to try avoiding that (although init code is of course much less
important than core code and total overheads).


> > +static void free_kmem_cache_cpus(struct kmem_cache *s)
> > +{
> > +	int cpu;
> > +
> > +	for_each_online_cpu(cpu) {
> 
> Is this protected against racing cpu hotplugs? Doesn't look like it. Multiple occurrences.

I think you're right.

 
> > +static void cache_trim_worker(struct work_struct *w)
> > +{
> > +	struct delayed_work *work =
> > +		container_of(w, struct delayed_work, work);
> > +	struct kmem_cache *s;
> > +	int node;
> > +
> > +	if (!down_read_trylock(&slqb_lock))
> > +		goto out;
> 
> No counter for this?

It's quite unimportant. It will only race with creating or destroying
actual kmem caches, and cache trimming is infrequent too.


> > +	down_read(&slqb_lock);
> > +	list_for_each_entry(s, &slab_caches, list) {
> > +		/*
> > +		 * XXX: kmem_cache_alloc_node will fallback to other nodes
> > +		 *      since memory is not yet available from the node that
> > +		 *      is brought up.
> > +		 */
> > +		if (s->node[nid]) /* could be lefover from last online */
> > +			continue;
> > +		n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
> > +		if (!n) {
> > +			ret = -ENOMEM;
> 
> Surely that should panic? I don't think a slab less node will
> be very useful later.

Returning an error here will, I think, just fail the online operation?
Better than a panic :)


> > +static ssize_t align_show(struct kmem_cache *s, char *buf)
> > +{
> > +	return sprintf(buf, "%d\n", s->align);
> > +}
> > +SLAB_ATTR_RO(align);
> > +
> 
> When you map back to the attribute you can use a index into a table
> for the field, saving that many functions?
> 
> > +STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
> > +STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
> 
> This really should be table driven, shouldn't it? That would give much
> smaller code.

Tables probably would help. I will keep it close to SLUB for now,
though.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
@ 2009-01-23 11:25     ` Nick Piggin
  0 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23 11:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Fri, Jan 23, 2009 at 10:55:26AM +0100, Andi Kleen wrote:
> Nick Piggin <npiggin@suse.de> writes:
> 
> Not a full review, just some things i noticed.
> 
> The code is very readable thanks (that's imho the main reason slab.c
> should go btw, it's really messy and hard to get through)

Thanks, appreciated. It is very helpful.

 
> > Using lists rather than arrays can reduce the cacheline footprint. When moving
> > objects around, SLQB can move a list of objects from one CPU to another by
> > simply manipulating a head pointer, wheras SLAB needs to memcpy arrays. Some
> > SLAB per-CPU arrays can be up to 1K in size, which is a lot of cachelines that
> > can be touched during alloc/free. Newly freed objects tend to be cache hot,
> > and newly allocated ones tend to soon be touched anyway, so often there is
> > little cost to using metadata in the objects.
> 
> You're probably aware of that, but the obvious counter argument
> is that for manipulating a single object a double linked
> list will always require touching three cache lines
> (prev, current, next), while an array access only a single one.
> A possible alternative would be a list of shorter arrays.

That's true, but SLQB doesn't use doubly linked lists, only singly linked ones.
An allocation needs to load a "head" pointer to the first object, then
load a "next" pointer from that object and assign it to "head". The
2nd load touches memory which should subsequently be touched by the
caller anyway. A free just has to assign a pointer in the to-be-freed
object to point to the old head, and then update the head to the new
object. So this 1st touch should usually be cache-hot memory.

But yes there are situations where SLAB scheme could result in
fewer cache misses. I haven't yet noticed it is a problem.


> > +	const char *name;	/* Name (only for display!) */
> > +	struct list_head list;	/* List of slab caches */
> > +
> > +	int align;		/* Alignment */
> > +	int inuse;		/* Offset to metadata */
> 
> I suspect some of these fields could be short or char (E.g. alignment),
> possibly lowering cache line impact.

Good point. I'll have to do a pass through all structures and
make sure sizes and alignments etc are optimal. I have somewhat
ordered it eg. so that LIFO freelist allocations only have to
touch the first few fields in structures, then partial page
list allocations touch the next few, then page allocator etc.

But that might have gone out of date a little bit.


> > +#ifdef CONFIG_SLQB_SYSFS
> > +	struct kobject kobj;	/* For sysfs */
> > +#endif
> > +#ifdef CONFIG_NUMA
> > +	struct kmem_cache_node *node[MAX_NUMNODES];
> > +#endif
> > +#ifdef CONFIG_SMP
> > +	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
> 
> Those both really need to be dynamically allocated, otherwise
> it wastes a lot of memory in the common case
> (e.g. NR_CPUS==128 kernel on dual core system). And of course
> on the proposed NR_CPUS==4096 kernels it becomes prohibitive.
> 
> You could use alloc_percpu? There's no alloc_pernode 
> unfortunately, perhaps there should be one. 

cpu_slab is dynamically allocated, by just changing the size of
the kmem_cache cache at boot time. Probably the best way would
be to have dynamic cpu and node allocs for them, I agree.

Any plans for an alloc_pernode?


> > +	if (size <=  2 * 1024 * 1024) return 21;
> 
> Have you looked into other binsizes?  iirc the original slab paper
> mentioned that power of two is usually not the best.

No I haven't. Although I have been spending most effort at this
point just to improve SLQB versus the other allocators without
changing things like this. But it would be fine to investigate
when SLQB is more mature or for somebody else to look at it.

> > +/*
> > + * Find the kmalloc slab cache for a given combination of allocation flags and
> > + * size.
> 
> You should mention that this would be a very bad idea to call for !__builtin_constant_p(size)

OK. It's not meant to be used outside slqb_def.h, however.


> > +static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
> > +{
> > +	int index = kmalloc_index(size);
> > +
> > +	if (unlikely(index == 0))
> > +		return NULL;
> > +
> > +	if (likely(!(flags & SLQB_DMA)))
> > +		return &kmalloc_caches[index];
> > +	else
> > +		return &kmalloc_caches_dma[index];
> 
> BTW i had an old patchkit to kill all GFP_DMA slab users. Perhaps should
> warm that up again. That would lower the inline footprint.

That would be excellent. It would also reduce constant data overheads
for SLAB and SLQB, and some nasty code from SLUB.


> > +#ifdef CONFIG_NUMA
> > +void *__kmalloc_node(size_t size, gfp_t flags, int node);
> > +void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
> > +
> > +static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
> 
> kmalloc_node should be infrequent, i suspect it can be safely out of lined.

Hmm... I wonder how much it increases code size...


> > + * - investiage performance with memoryless nodes. Perhaps CPUs can be given
> > + *   a default closest home node via which it can use fastpath functions.
> 
> FWIW that is what x86-64 always did. Perhaps you can just fix ia64 to do 
> that too and be happy.

What if the node is possible but not currently online?

 
> > + * aspects, however to avoid the horrible mess in include/linux/mm_types.h,
> > + * we'll just define our own struct slqb_page type variant here.
> 
> Hopefully this works for the crash dumpers. Do they have a way to distingush
> slub/slqb/slab kernels with different struct page usage?

Beyond looking at configs or hacks like looking at symbols, I don't
think so... It probably should go into vermagic I guess.


> > +#define PG_SLQB_BIT (1 << PG_slab)
> > +
> > +static int kmem_size __read_mostly;
> > +#ifdef CONFIG_NUMA
> > +static int numa_platform __read_mostly;
> > +#else
> > +#define numa_platform 0
> > +#endif
> 
> It would be cheaper if you put that as a flag into the kmem_caches flags, this
> way you avoid an additional cache line touched.

Ok, that works.

 
> > +static inline int slqb_page_to_nid(struct slqb_page *page)
> > +{
> > +	return page_to_nid(&page->page);
> > +}
> 
> etc. you got a lot of wrappers...

I think they're not too bad though.

 
> > +static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
> > +						unsigned int order)
> > +{
> > +	struct page *p;
> > +
> > +	if (nid == -1)
> > +		p = alloc_pages(flags, order);
> > +	else
> > +		p = alloc_pages_node(nid, flags, order);
> 
> alloc_pages_nodes does that check anyways.

OK, I'll rip that wrapper out completely.


> > +/* Not all arches define cache_line_size */
> > +#ifndef cache_line_size
> > +#define cache_line_size()	L1_CACHE_BYTES
> > +#endif
> > +
> 
> They should. better fix them?

git grep -l -e cache_line_size arch/ | egrep '\.h$'

Only ia64, mips, powerpc, sparc, x86...

> > +	/*
> > +	 * Determine which debug features should be switched on
> > +	 */
> 
> It would be nicer if you could use long options. At least for me
> that would increase the probability that I could remember them
> without having to look them up.

I haven't looked closely at the debug code, which is mostly straight
out of SLUB with minimal changes to get it working. Of course it is
very important, but useless if the core allocator isn't good. I
also don't want to diverge from SLUB in these areas if possible until
we reduce the number of allocators in the tree...

Long options are probably not a bad idea, though.


> > +	if (unlikely(slab_poison(s)))
> > +		memset(start, POISON_INUSE, PAGE_SIZE << s->order);
> > +
> > +	start += colour;
> 
> One thing i was wondering. Did you try to disable the colouring and see
> if it makes much difference on modern systems? They tend to have either
> larger caches or higher associativity caches.

I have tried, but I don't think I found a test where it made a
statistically significant difference. It is not very costly to
implement, though.

 
> Or perhaps it could be made optional based on CPU type?

It could easily be changed, yes.


 
> > +
> > +again:
> > +	local_irq_save(flags);
> > +	object = __slab_alloc(s, gfpflags, node);
> > +	local_irq_restore(flags);
> 
> At least on P4 you could get some win by avoiding the local_irq_save() up in the fast
> path when __GFP_WAIT is set (because storing the eflags is very expensive there)

That's a good point, although also something trivially applicable to
all allocators and as such I prefer not to add such differences to
the SLQB patch if we are going into an evaluation phase.
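
For reference, here is roughly what that suggestion could look like --
a sketch only: the enclosing function name is made up, the call and
argument names follow the quoted snippet, it assumes __GFP_WAIT callers
always run with interrupts enabled (so a plain disable/enable can
replace the eflags save/restore), and it omits the retry and __GFP_ZERO
handling of the real function:

static __always_inline void *slab_alloc(struct kmem_cache *s,
					gfp_t gfpflags, int node)
{
	void *object;
	unsigned long flags;

	if (gfpflags & __GFP_WAIT) {
		/* may sleep, so irqs must already be on: skip saving eflags */
		local_irq_disable();
		object = __slab_alloc(s, gfpflags, node);
		local_irq_enable();
	} else {
		local_irq_save(flags);
		object = __slab_alloc(s, gfpflags, node);
		local_irq_restore(flags);
	}
	return object;
}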


> > +/* Initial slabs */
> > +#ifdef CONFIG_SMP
> > +static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
> > +#endif
> > +#ifdef CONFIG_NUMA
> > +static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
> > +#endif
> > +
> > +#ifdef CONFIG_SMP
> > +static struct kmem_cache kmem_cpu_cache;
> > +static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
> > +#ifdef CONFIG_NUMA
> > +static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
> > +#endif
> > +#endif
> > +
> > +#ifdef CONFIG_NUMA
> > +static struct kmem_cache kmem_node_cache;
> > +static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
> > +static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
> > +#endif
> 
> That all needs fixing too of course.

Hmm. I was hoping it could stay simple as it is just a static constant
(for a given NR_CPUS) overhead. I wonder if bootmem is still up here?
How fine grained is it these days? 

Could bite the bullet and do a multi-stage bootstrap like SLUB, but I
want to try avoiding that (but init code is also of course much less
important than core code and total overheads). 


> > +static void free_kmem_cache_cpus(struct kmem_cache *s)
> > +{
> > +	int cpu;
> > +
> > +	for_each_online_cpu(cpu) {
> 
> Is this protected against racing cpu hotplugs? Doesn't look like it. Multiple occurrences.

I think you're right.
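
A conventional way to close that race would be to pin hotplug around
the walk -- a sketch only, using the generic get_online_cpus() /
put_online_cpus() helpers and abbreviating the loop body:

static void free_kmem_cache_cpus(struct kmem_cache *s)
{
	int cpu;

	get_online_cpus();	/* hold off CPU hotplug */
	for_each_online_cpu(cpu) {
		/* ... free s->cpu_slab[cpu] as in the original ... */
	}
	put_online_cpus();
}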

 
> > +static void cache_trim_worker(struct work_struct *w)
> > +{
> > +	struct delayed_work *work =
> > +		container_of(w, struct delayed_work, work);
> > +	struct kmem_cache *s;
> > +	int node;
> > +
> > +	if (!down_read_trylock(&slqb_lock))
> > +		goto out;
> 
> No counter for this?

It's quite unimportant. It will only race with creating or destroying
actual kmem caches, and cache trimming is infrequent too.


> > +	down_read(&slqb_lock);
> > +	list_for_each_entry(s, &slab_caches, list) {
> > +		/*
> > +		 * XXX: kmem_cache_alloc_node will fallback to other nodes
> > +		 *      since memory is not yet available from the node that
> > +		 *      is brought up.
> > +		 */
> > +		if (s->node[nid]) /* could be lefover from last online */
> > +			continue;
> > +		n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
> > +		if (!n) {
> > +			ret = -ENOMEM;
> 
> Surely that should panic? I don't think a slab less node will
> be very useful later.

Returning error here I think will just fail the online operation?
Better than a panic :)


> > +static ssize_t align_show(struct kmem_cache *s, char *buf)
> > +{
> > +	return sprintf(buf, "%d\n", s->align);
> > +}
> > +SLAB_ATTR_RO(align);
> > +
> 
> When you map back to the attribute you can use a index into a table
> for the field, saving that many functions?
> 
> > +STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
> > +STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
> 
> This really should be table driven, shouldn't it? That would give much
> smaller code.

Tables probably would help. I will keep it close to SLUB for now,
though.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 11:25     ` Nick Piggin
@ 2009-01-23 11:57       ` Andi Kleen
  -1 siblings, 0 replies; 197+ messages in thread
From: Andi Kleen @ 2009-01-23 11:57 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Fri, Jan 23, 2009 at 12:25:55PM +0100, Nick Piggin wrote:
> > > +#ifdef CONFIG_SLQB_SYSFS
> > > +	struct kobject kobj;	/* For sysfs */
> > > +#endif
> > > +#ifdef CONFIG_NUMA
> > > +	struct kmem_cache_node *node[MAX_NUMNODES];
> > > +#endif
> > > +#ifdef CONFIG_SMP
> > > +	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
> > 
> > Those both really need to be dynamically allocated, otherwise
> > it wastes a lot of memory in the common case
> > (e.g. NR_CPUS==128 kernel on dual core system). And of course
> > on the proposed NR_CPUS==4096 kernels it becomes prohibitive.
> > 
> > You could use alloc_percpu? There's no alloc_pernode 
> > unfortunately, perhaps there should be one. 
> 
> cpu_slab is dynamically allocated, by just changing the size of
> the kmem_cache cache at boot time. 

You'll always have at least the MAX_NUMNODES waste because
you cannot tell the compiler that the cpu_slab field has 
moved.
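
To make that concrete (illustration only, not the real definition):
only the trailing array benefits from the boot-time sizing trick,
because every earlier field keeps its compile-time offset:

struct kmem_cache {
	/* ... other fields ... */
#ifdef CONFIG_NUMA
	struct kmem_cache_node *node[MAX_NUMNODES];	/* full size, always */
#endif
#ifdef CONFIG_SMP
	struct kmem_cache_cpu *cpu_slab[NR_CPUS];	/* only this tail can be trimmed */
#endif
};

/*
 * The boot-time trick sizes the cache backing struct kmem_cache as
 *	offsetof(struct kmem_cache, cpu_slab) +
 *		nr_cpu_ids * sizeof(struct kmem_cache_cpu *),
 * so cpu_slab[] shrinks to nr_cpu_ids entries, but node[] (and anything
 * else before cpu_slab) stays at its full compile-time size.
 */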

> Probably the best way would
> be to have dynamic cpu and node allocs for them, I agree.

It's really needed.

> Any plans for an alloc_pernode?

It shouldn't be very hard to implement. Or do you ask if I'm volunteering? @)

> > > + * - investiage performance with memoryless nodes. Perhaps CPUs can be given
> > > + *   a default closest home node via which it can use fastpath functions.
> > 
> > FWIW that is what x86-64 always did. Perhaps you can just fix ia64 to do 
> > that too and be happy.
> 
> What if the node is possible but not currently online?

Nobody should allocate on it then.

> > > +/* Not all arches define cache_line_size */
> > > +#ifndef cache_line_size
> > > +#define cache_line_size()	L1_CACHE_BYTES
> > > +#endif
> > > +
> > 
> > They should. better fix them?
> 
> git grep -l -e cache_line_size arch/ | egrep '\.h$'
> 
> Only ia64, mips, powerpc, sparc, x86...

It's straightforward to add that define everywhere.

> 
> > > +	if (unlikely(slab_poison(s)))
> > > +		memset(start, POISON_INUSE, PAGE_SIZE << s->order);
> > > +
> > > +	start += colour;
> > 
> > One thing i was wondering. Did you try to disable the colouring and see
> > if it makes much difference on modern systems? They tend to have either
> > larger caches or higher associativity caches.
> 
> I have tried, but I don't think I found a test where it made a
> statistically significant difference. It is not very costly to
> implement, though.

how about the memory usage?

also this is all so complicated already that every simplification helps.

> > > +#endif
> > > +
> > > +#ifdef CONFIG_NUMA
> > > +static struct kmem_cache kmem_node_cache;
> > > +static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
> > > +static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
> > > +#endif
> > 
> > That all needs fixing too of course.
> 
> Hmm. I was hoping it could stay simple as it is just a static constant
> (for a given NR_CPUS) overhead. 

The issue is that distro kernels typically run with NR_CPUS >>> num_possible_cpus().
And we'll likely see even higher NR_CPUS (and MAX_NUMNODES) in the future,
but we also still want to run the same kernels on really small systems (e.g.
Atom based) without wasting their memory.

So for anything sized by NR_CPUS you should use per_cpu data -- that is
correctly sized automatically.

For MAX_NUMNODES we don't have anything equivalent currently, so 
you would also need alloc_pernode() I guess.

Ok you can just use per cpu for them too and only use the first
entry in each node. That's cheating, but not too bad.
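
Roughly, the "cheat" would look like this -- sketch only, the helper
name is made up, and it assumes every node id in use is also a valid
per-CPU index:

/* per-CPU storage reused for per-node bootstrap data */
static DEFINE_PER_CPU(struct kmem_cache_node, kmem_cache_nodes);

static void __init bootstrap_node(int nid)
{
	/* index per-CPU storage by node id: the "first entry" cheat */
	struct kmem_cache_node *n = &per_cpu(kmem_cache_nodes, nid);

	init_kmem_cache_node(&kmem_cache_cache, n);
	kmem_cache_cache.node[nid] = n;
}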


> I wonder if bootmem is still up here?

bootmem is finished when slab comes up.
> 
> Could bite the bullet and do a multi-stage bootstap like SLUB, but I
> want to try avoiding that (but init code is also of course much less
> important than core code and total overheads). 

For DEFINE_PER_CPU you don't need special allocation.

Probably want a DEFINE_PER_NODE() for this or see above.

> 
> > > +static ssize_t align_show(struct kmem_cache *s, char *buf)
> > > +{
> > > +	return sprintf(buf, "%d\n", s->align);
> > > +}
> > > +SLAB_ATTR_RO(align);
> > > +
> > 
> > When you map back to the attribute you can use a index into a table
> > for the field, saving that many functions?
> > 
> > > +STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
> > > +STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
> > 
> > This really should be table driven, shouldn't it? That would give much
> > smaller code.
> 
> Tables probably would help. I will keep it close to SLUB for now,
> though.

Hmm, then fix slub? 

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23  9:55   ` Andi Kleen
@ 2009-01-23 12:55     ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23 12:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Fri, Jan 23, 2009 at 10:55:26AM +0100, Andi Kleen wrote:
> Nick Piggin <npiggin@suse.de> writes:
> > +#ifdef CONFIG_NUMA
> > +void *__kmalloc_node(size_t size, gfp_t flags, int node);
> > +void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
> > +
> > +static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
> 
> kmalloc_node should be infrequent, i suspect it can be safely out of lined.

Hmm, it only takes up another couple of hundred bytes for a full
NUMA kernel. Completely out-of-lining it would take a slightly slower
path and make the code slightly different from the kmalloc case.
So I'll leave this change for now.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23  6:14         ` Nick Piggin
@ 2009-01-23 12:56           ` Ingo Molnar
  -1 siblings, 0 replies; 197+ messages in thread
From: Ingo Molnar @ 2009-01-23 12:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter


* Nick Piggin <npiggin@suse.de> wrote:

> On Wed, Jan 21, 2009 at 06:40:10PM +0100, Ingo Molnar wrote:
> > -static inline void slqb_stat_inc(struct kmem_cache_list *list,
> > -				enum stat_item si)
> > +static inline void
> > +slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
> >  {
> 
> Hmm, I'm not entirely fond of this style. [...]

well, it's a borderline situation and a nuance, and i think we agree on 
the two (much more common) boundary conditions:

 1) line fits into 80 cols - in that case we keep it all on a single line
    (this is the ideal case)

 2) line does not fit on two lines either - in that case we do the style
    that you used above.

On the boundary there's a special case though, and i tend to prefer:

 +static inline void
 +slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)

over:

 -static inline void slqb_stat_inc(struct kmem_cache_list *list,
 -				enum stat_item si)

for a few reasons:

 1) the line break is not just arbitrarily in the middle of the 
    enumeration of arguments - it is right after function return type.

 2) the arguments fit on a single line - and often one wants to know that 
    signature. (return values are usually a separate thought)

 3) the return type stands out much better.

But again ... this is a nuance.

> [...] The former scales to longer lines with just a single style change 
> (putting args into new lines), wheras the latter first moves its 
> prefixes to a newline, then moves args as the line grows even longer.

the moment this 'boundary style' "overflows", it falls back to the 'lots 
of lines' case, where we generally put the function return type and the 
function name on the first line.

> I guess it is a matter of taste, not wrong either way... but I think 
> most of the mm code I'm used to looking at uses the former. Do you feel 
> strongly?

there are a handful of cases where the return type (and the function 
attributes) are _really_ long - in this case it really helps to have them 
decoupled from the arguments.

	Ingo

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 11:57       ` Andi Kleen
@ 2009-01-23 13:18         ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23 13:18 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Fri, Jan 23, 2009 at 12:57:31PM +0100, Andi Kleen wrote:
> On Fri, Jan 23, 2009 at 12:25:55PM +0100, Nick Piggin wrote:
> > > > +#ifdef CONFIG_SLQB_SYSFS
> > > > +	struct kobject kobj;	/* For sysfs */
> > > > +#endif
> > > > +#ifdef CONFIG_NUMA
> > > > +	struct kmem_cache_node *node[MAX_NUMNODES];
> > > > +#endif
> > > > +#ifdef CONFIG_SMP
> > > > +	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
> > > 
> > > Those both really need to be dynamically allocated, otherwise
> > > it wastes a lot of memory in the common case
> > > (e.g. NR_CPUS==128 kernel on dual core system). And of course
> > > on the proposed NR_CPUS==4096 kernels it becomes prohibitive.
> > > 
> > > You could use alloc_percpu? There's no alloc_pernode 
> > > unfortunately, perhaps there should be one. 
> > 
> > cpu_slab is dynamically allocated, by just changing the size of
> > the kmem_cache cache at boot time. 
> 
> You'll always have at least the MAX_NUMNODES waste because
> you cannot tell the compiler that the cpu_slab field has 
> moved.

Right. It could go into a completely different per-cpu structure
if needed to work around that (using node is a relatively rare
operation). But an alloc_pernode would be nicer.

 
> > Probably the best way would
> > be to have dynamic cpu and node allocs for them, I agree.
> 
> It's really needed.
> 
> > Any plans for an alloc_pernode?
> 
> It shouldn't be very hard to implement. Or do you ask if I'm volunteering? @)

I was just asking whether you knew of any plans. I won't get much time
to work on it next week, so I hope to have something in the slab tree
in the meantime. I think it is OK to leave this as-is for now, with a
mind to improving it before a possible mainline merge (there will
possibly be more serious issues discovered anyway).


> > > > + * - investiage performance with memoryless nodes. Perhaps CPUs can be given
> > > > + *   a default closest home node via which it can use fastpath functions.
> > > 
> > > FWIW that is what x86-64 always did. Perhaps you can just fix ia64 to do 
> > > that too and be happy.
> > 
> > What if the node is possible but not currently online?
> 
> Nobody should allocate on it then.

But then it goes online and what happens? Your numa_node_id() changes?
How does that work? Or you mean x86-64 does not do that same trick for
possible but offline nodes?


> > git grep -l -e cache_line_size arch/ | egrep '\.h$'
> > 
> > Only ia64, mips, powerpc, sparc, x86...
> 
> It's straight forward to that define everywhere.

OK, but this code is just copied straight from SLAB... I don't want
to add such a dependency at this point, while I'm trying to get
something reasonable to merge. But it would be a fine cleanup.


> > > One thing i was wondering. Did you try to disable the colouring and see
> > > if it makes much difference on modern systems? They tend to have either
> > > larger caches or higher associativity caches.
> > 
> > I have tried, but I don't think I found a test where it made a
> > statistically significant difference. It is not very costly to
> > implement, though.
> 
> how about the memory usage?
> 
> also this is all so complicated already that every simplification helps.

Oh, it only uses slack space in the slabs as such, so it should be
almost zero cost. I tried testing extra colour at the cost of space, but
no obvious difference there either. But I think I'll leave in the code
because it might be a win for some embedded or unusual CPUs.


> > Could bite the bullet and do a multi-stage bootstap like SLUB, but I
> > want to try avoiding that (but init code is also of course much less
> > important than core code and total overheads). 
> 
> For DEFINE_PER_CPU you don't need special allocation.
> 
> Probably want a DEFINE_PER_NODE() for this or see above.

Ah yes DEFINE_PER_CPU of course. Not quite correct for per-node data,
but it should be good enough for wider testing in linux-next.


> > Tables probably would help. I will keep it close to SLUB for now,
> > though.
> 
> Hmm, then fix slub? 

That's my plan, but I go about it a different way ;) I don't want to
spend too much time on other allocators or on cleanup code right now
(except cleanups in SLQB, which of course are required).

Here is an incremental patch for your review points. Thanks very much,
it's a big improvement. Getting rid of those static arrays vastly
decreases memory consumption with bigger NR_CPUS, so that's a good
start; I still need to investigate alloc_percpu / pernode etc., but
that may have to wait until next week.

---
 include/linux/slab.h     |    4 +
 include/linux/slqb_def.h |   10 +++
 mm/slqb.c                |  125 ++++++++++++++++++++++++++---------------------
 3 files changed, 82 insertions(+), 57 deletions(-)

Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -65,6 +65,10 @@
 /* The following flags affect the page allocator grouping pages by mobility */
 #define SLAB_RECLAIM_ACCOUNT	0x00020000UL		/* Objects are reclaimable */
 #define SLAB_TEMPORARY		SLAB_RECLAIM_ACCOUNT	/* Objects are short-lived */
+
+/* Following flags should only be used by allocator specific flags */
+#define SLAB_ALLOC_PRIVATE	0x000000ffUL
+
 /*
  * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
  *
Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- linux-2.6.orig/include/linux/slqb_def.h
+++ linux-2.6/include/linux/slqb_def.h
@@ -15,6 +15,8 @@
 #include <linux/kernel.h>
 #include <linux/kobject.h>
 
+#define SLAB_NUMA		0x00000001UL    /* shortcut */
+
 enum stat_item {
 	ALLOC,			/* Allocation count */
 	ALLOC_SLAB_FILL,	/* Fill freelist from page list */
@@ -224,12 +226,16 @@ static __always_inline int kmalloc_index
 
 /*
  * Find the kmalloc slab cache for a given combination of allocation flags and
- * size.
+ * size. Should really only be used for constant 'size' arguments, due to
+ * bloat.
  */
 static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
 {
-	int index = kmalloc_index(size);
+	int index;
+
+	BUILD_BUG_ON(!__builtin_constant_p(size));
 
+	index = kmalloc_index(size);
 	if (unlikely(index == 0))
 		return NULL;
 
Index: linux-2.6/mm/slqb.c
===================================================================
--- linux-2.6.orig/mm/slqb.c
+++ linux-2.6/mm/slqb.c
@@ -58,9 +58,15 @@ static inline void struct_slqb_page_wron
 
 static int kmem_size __read_mostly;
 #ifdef CONFIG_NUMA
-static int numa_platform __read_mostly;
+static inline int slab_numa(struct kmem_cache *s)
+{
+	return s->flags & SLAB_NUMA;
+}
 #else
-static const int numa_platform = 0;
+static inline int slab_numa(struct kmem_cache *s)
+{
+	return 0;
+}
 #endif
 
 static inline int slab_hiwater(struct kmem_cache *s)
@@ -166,19 +172,6 @@ static inline struct slqb_page *virt_to_
 	return (struct slqb_page *)p;
 }
 
-static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
-						unsigned int order)
-{
-	struct page *p;
-
-	if (nid == -1)
-		p = alloc_pages(flags, order);
-	else
-		p = alloc_pages_node(nid, flags, order);
-
-	return (struct slqb_page *)p;
-}
-
 static inline void __free_slqb_pages(struct slqb_page *page, unsigned int order)
 {
 	struct page *p = &page->page;
@@ -231,8 +224,16 @@ static inline int slab_poison(struct kme
 static struct notifier_block slab_notifier;
 #endif
 
-/* A list of all slab caches on the system */
+/*
+ * slqb_lock protects slab_caches list and serialises hotplug operations.
+ * hotplug operations take lock for write, other operations can hold off
+ * hotplug by taking it for read (or write).
+ */
 static DECLARE_RWSEM(slqb_lock);
+
+/*
+ * A list of all slab caches on the system
+ */
 static LIST_HEAD(slab_caches);
 
 /*
@@ -875,6 +876,9 @@ static unsigned long kmem_cache_flags(un
 		strlen(slqb_debug_slabs)) == 0))
 			flags |= slqb_debug;
 
+	if (num_possible_nodes() > 1)
+		flags |= SLAB_NUMA;
+
 	return flags;
 }
 #else
@@ -913,6 +917,8 @@ static inline void add_full(struct kmem_
 static inline unsigned long kmem_cache_flags(unsigned long objsize,
 	unsigned long flags, const char *name, void (*ctor)(void *))
 {
+	if (num_possible_nodes() > 1)
+		flags |= SLAB_NUMA;
 	return flags;
 }
 
@@ -930,7 +936,7 @@ static struct slqb_page *allocate_slab(s
 
 	flags |= s->allocflags;
 
-	page = alloc_slqb_pages_node(node, flags, s->order);
+	page = (struct slqb_page *)alloc_pages_node(node, flags, s->order);
 	if (!page)
 		return NULL;
 
@@ -1296,8 +1302,6 @@ static noinline void *__slab_alloc_page(
 	if (c->colour_next >= s->colour_range)
 		c->colour_next = 0;
 
-	/* XXX: load any partial? */
-
 	/* Caller handles __GFP_ZERO */
 	gfpflags &= ~__GFP_ZERO;
 
@@ -1622,7 +1626,7 @@ static __always_inline void __slab_free(
 
 	slqb_stat_inc(l, FREE);
 
-	if (!NUMA_BUILD || !numa_platform ||
+	if (!NUMA_BUILD || !slab_numa(s) ||
 			likely(slqb_page_to_nid(page) == numa_node_id())) {
 		/*
 		 * Freeing fastpath. Collects all local-node objects, not
@@ -1676,7 +1680,7 @@ void kmem_cache_free(struct kmem_cache *
 {
 	struct slqb_page *page = NULL;
 
-	if (numa_platform)
+	if (slab_numa(s))
 		page = virt_to_head_slqb_page(object);
 	slab_free(s, page, object);
 }
@@ -1816,26 +1820,28 @@ static void init_kmem_cache_node(struct
 }
 #endif
 
-/* Initial slabs */
+/* Initial slabs. XXX: allocate dynamically (with bootmem maybe) */
 #ifdef CONFIG_SMP
-static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_cache_cpus);
 #endif
 #ifdef CONFIG_NUMA
-static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
+/* XXX: really need a DEFINE_PER_NODE for per-node data, but this is better than
+ * a static array */
+static DEFINE_PER_CPU(struct kmem_cache_node, kmem_cache_nodes);
 #endif
 
 #ifdef CONFIG_SMP
 static struct kmem_cache kmem_cpu_cache;
-static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_cpu_cpus);
 #ifdef CONFIG_NUMA
-static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
+static DEFINE_PER_CPU(struct kmem_cache_node, kmem_cpu_nodes); /* XXX per-nid */
 #endif
 #endif
 
 #ifdef CONFIG_NUMA
 static struct kmem_cache kmem_node_cache;
-static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
-static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_node_cpus);
+static DEFINE_PER_CPU(struct kmem_cache_node, kmem_node_nodes); /*XXX per-nid */
 #endif
 
 #ifdef CONFIG_SMP
@@ -2090,15 +2096,15 @@ static int kmem_cache_open(struct kmem_c
 		s->colour_range = 0;
 	}
 
+	down_write(&slqb_lock);
 	if (likely(alloc)) {
 		if (!alloc_kmem_cache_nodes(s))
-			goto error;
+			goto error_lock;
 
 		if (!alloc_kmem_cache_cpus(s))
 			goto error_nodes;
 	}
 
-	down_write(&slqb_lock);
 	sysfs_slab_add(s);
 	list_add(&s->list, &slab_caches);
 	up_write(&slqb_lock);
@@ -2107,6 +2113,8 @@ static int kmem_cache_open(struct kmem_c
 
 error_nodes:
 	free_kmem_cache_nodes(s);
+error_lock:
+	up_write(&slqb_lock);
 error:
 	if (flags & SLAB_PANIC)
 		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
@@ -2180,7 +2188,6 @@ void kmem_cache_destroy(struct kmem_cach
 
 	down_write(&slqb_lock);
 	list_del(&s->list);
-	up_write(&slqb_lock);
 
 #ifdef CONFIG_SMP
 	for_each_online_cpu(cpu) {
@@ -2230,6 +2237,7 @@ void kmem_cache_destroy(struct kmem_cach
 #endif
 
 	sysfs_slab_remove(s);
+	up_write(&slqb_lock);
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
@@ -2603,7 +2611,7 @@ static int slab_mem_going_online_callbac
 	 * allocate a kmem_cache_node structure in order to bring the node
 	 * online.
 	 */
-	down_read(&slqb_lock);
+	down_write(&slqb_lock);
 	list_for_each_entry(s, &slab_caches, list) {
 		/*
 		 * XXX: kmem_cache_alloc_node will fallback to other nodes
@@ -2621,7 +2629,7 @@ static int slab_mem_going_online_callbac
 		s->node[nid] = n;
 	}
 out:
-	up_read(&slqb_lock);
+	up_write(&slqb_lock);
 	return ret;
 }
 
@@ -2665,13 +2673,6 @@ void __init kmem_cache_init(void)
 	 * All the ifdefs are rather ugly here, but it's just the setup code,
 	 * so it doesn't have to be too readable :)
 	 */
-#ifdef CONFIG_NUMA
-	if (num_possible_nodes() == 1)
-		numa_platform = 0;
-	else
-		numa_platform = 1;
-#endif
-
 #ifdef CONFIG_SMP
 	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
 				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
@@ -2692,15 +2693,20 @@ void __init kmem_cache_init(void)
 
 #ifdef CONFIG_SMP
 	for_each_possible_cpu(i) {
-		init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cpus[i]);
-		kmem_cache_cache.cpu_slab[i] = &kmem_cache_cpus[i];
+		struct kmem_cache_cpu *c;
 
-		init_kmem_cache_cpu(&kmem_cpu_cache, &kmem_cpu_cpus[i]);
-		kmem_cpu_cache.cpu_slab[i] = &kmem_cpu_cpus[i];
+		c = &per_cpu(kmem_cache_cpus, i);
+		init_kmem_cache_cpu(&kmem_cache_cache, c);
+		kmem_cache_cache.cpu_slab[i] = c;
+
+		c = &per_cpu(kmem_cpu_cpus, i);
+		init_kmem_cache_cpu(&kmem_cpu_cache, c);
+		kmem_cpu_cache.cpu_slab[i] = c;
 
 #ifdef CONFIG_NUMA
-		init_kmem_cache_cpu(&kmem_node_cache, &kmem_node_cpus[i]);
-		kmem_node_cache.cpu_slab[i] = &kmem_node_cpus[i];
+		c = &per_cpu(kmem_node_cpus, i);
+		init_kmem_cache_cpu(&kmem_node_cache, c);
+		kmem_node_cache.cpu_slab[i] = c;
 #endif
 	}
 #else
@@ -2709,14 +2715,19 @@ void __init kmem_cache_init(void)
 
 #ifdef CONFIG_NUMA
 	for_each_node_state(i, N_NORMAL_MEMORY) {
-		init_kmem_cache_node(&kmem_cache_cache, &kmem_cache_nodes[i]);
-		kmem_cache_cache.node[i] = &kmem_cache_nodes[i];
-
-		init_kmem_cache_node(&kmem_cpu_cache, &kmem_cpu_nodes[i]);
-		kmem_cpu_cache.node[i] = &kmem_cpu_nodes[i];
+		struct kmem_cache_node *n;
 
-		init_kmem_cache_node(&kmem_node_cache, &kmem_node_nodes[i]);
-		kmem_node_cache.node[i] = &kmem_node_nodes[i];
+		n = &per_cpu(kmem_cache_nodes, i);
+		init_kmem_cache_node(&kmem_cache_cache, n);
+		kmem_cache_cache.node[i] = n;
+
+		n = &per_cpu(kmem_cpu_nodes, i);
+		init_kmem_cache_node(&kmem_cpu_cache, n);
+		kmem_cpu_cache.node[i] = n;
+
+		n = &per_cpu(kmem_node_nodes, i);
+		init_kmem_cache_node(&kmem_node_cache, n);
+		kmem_node_cache.node[i] = n;
 	}
 #endif
 
@@ -2883,7 +2894,7 @@ static int __cpuinit slab_cpuup_callback
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		down_read(&slqb_lock);
+		down_write(&slqb_lock);
 		list_for_each_entry(s, &slab_caches, list) {
 			if (s->cpu_slab[cpu]) /* could be lefover last online */
 				continue;
@@ -2893,7 +2904,7 @@ static int __cpuinit slab_cpuup_callback
 				return NOTIFY_BAD;
 			}
 		}
-		up_read(&slqb_lock);
+		up_write(&slqb_lock);
 		break;
 
 	case CPU_ONLINE:
@@ -3019,6 +3030,8 @@ static void gather_stats(struct kmem_cac
 	stats->s = s;
 	spin_lock_init(&stats->lock);
 
+	down_read(&slqb_lock); /* hold off hotplug */
+
 	on_each_cpu(__gather_stats, stats, 1);
 
 #ifdef CONFIG_NUMA
@@ -3047,6 +3060,8 @@ static void gather_stats(struct kmem_cac
 	}
 #endif
 
+	up_read(&slqb_lock);
+
 	stats->nr_objects = stats->nr_slabs * s->objects;
 }
 #endif

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
@ 2009-01-23 13:18         ` Nick Piggin
  0 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23 13:18 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Fri, Jan 23, 2009 at 12:57:31PM +0100, Andi Kleen wrote:
> On Fri, Jan 23, 2009 at 12:25:55PM +0100, Nick Piggin wrote:
> > > > +#ifdef CONFIG_SLQB_SYSFS
> > > > +	struct kobject kobj;	/* For sysfs */
> > > > +#endif
> > > > +#ifdef CONFIG_NUMA
> > > > +	struct kmem_cache_node *node[MAX_NUMNODES];
> > > > +#endif
> > > > +#ifdef CONFIG_SMP
> > > > +	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
> > > 
> > > Those both really need to be dynamically allocated, otherwise
> > > it wastes a lot of memory in the common case
> > > (e.g. NR_CPUS==128 kernel on dual core system). And of course
> > > on the proposed NR_CPUS==4096 kernels it becomes prohibitive.
> > > 
> > > You could use alloc_percpu? There's no alloc_pernode 
> > > unfortunately, perhaps there should be one. 
> > 
> > cpu_slab is dynamically allocated, by just changing the size of
> > the kmem_cache cache at boot time. 
> 
> You'll always have at least the MAX_NUMNODES waste because
> you cannot tell the compiler that the cpu_slab field has 
> moved.

Right. It could go into a completely different per-cpu structure
if needed to work around that (using node is a relatively rare
operation). But an alloc_pernode would be nicer.

 
> > Probably the best way would
> > be to have dynamic cpu and node allocs for them, I agree.
> 
> It's really needed.
> 
> > Any plans for an alloc_pernode?
> 
> It shouldn't be very hard to implement. Or do you ask if I'm volunteering? @)

Just if you knew about plans. I won't get too much time to work on
it next week, so I hope to have something in slab tree in the
meantime. I think it is OK to leave now, with a mind to improving
it before a possible mainline merge (there will possibly be more
serious issues discovered anyway).


> > > > + * - investiage performance with memoryless nodes. Perhaps CPUs can be given
> > > > + *   a default closest home node via which it can use fastpath functions.
> > > 
> > > FWIW that is what x86-64 always did. Perhaps you can just fix ia64 to do 
> > > that too and be happy.
> > 
> > What if the node is possible but not currently online?
> 
> Nobody should allocate on it then.

But then it goes online and what happens? Your numa_node_id() changes?
How does that work? Or you mean x86-64 does not do that same trick for
possible but offline nodes?


> > git grep -l -e cache_line_size arch/ | egrep '\.h$'
> > 
> > Only ia64, mips, powerpc, sparc, x86...
> 
> It's straight forward to that define everywhere.

OK, but this code is just copied straight from SLAB... I don't want
to add such dependency at this point I'm trying to get something
reasonable to merge. But it would be a fine cleanup.


> > > One thing i was wondering. Did you try to disable the colouring and see
> > > if it makes much difference on modern systems? They tend to have either
> > > larger caches or higher associativity caches.
> > 
> > I have tried, but I don't think I found a test where it made a
> > statistically significant difference. It is not very costly to
> > implement, though.
> 
> how about the memory usage?
> 
> also this is all so complicated already that every simplification helps.

Oh, it only uses slack space in the slabs as such, so it should be
almost zero cost. I tried testing extra colour at the cost of space, but
no obvious difference there either. But I think I'll leave in the code
because it might be a win for some embedded or unusual CPUs.


> > Could bite the bullet and do a multi-stage bootstap like SLUB, but I
> > want to try avoiding that (but init code is also of course much less
> > important than core code and total overheads). 
> 
> For DEFINE_PER_CPU you don't need special allocation.
> 
> Probably want a DEFINE_PER_NODE() for this or see above.

Ah yes DEFINE_PER_CPU of course. Not quite correct for per-node data,
but it should be good enough for wider testing in linux-next.


> > Tables probably would help. I will keep it close to SLUB for now,
> > though.
> 
> Hmm, then fix slub? 

That's my plan, but I go about it a different way ;) I don't want to
spend too much time on other allocators or cleanup etc code too much
right now (except cleanups in SLQB, which of course is required).

Here is an incremental patch for your review points. Thanks very much,
it's a big improvement (getting rid of those static arrays vastly
decreases memory consumption with bigger NR_CPUs, so that's a good
start; will need to investigate alloc_percpu / pernode etc, but that
may have to wait until next week.

---
 include/linux/slab.h     |    4 +
 include/linux/slqb_def.h |   10 +++
 mm/slqb.c                |  125 ++++++++++++++++++++++++++---------------------
 3 files changed, 82 insertions(+), 57 deletions(-)

Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -65,6 +65,10 @@
 /* The following flags affect the page allocator grouping pages by mobility */
 #define SLAB_RECLAIM_ACCOUNT	0x00020000UL		/* Objects are reclaimable */
 #define SLAB_TEMPORARY		SLAB_RECLAIM_ACCOUNT	/* Objects are short-lived */
+
+/* Following flags should only be used by allocator specific flags */
+#define SLAB_ALLOC_PRIVATE	0x000000ffUL
+
 /*
  * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
  *
Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- linux-2.6.orig/include/linux/slqb_def.h
+++ linux-2.6/include/linux/slqb_def.h
@@ -15,6 +15,8 @@
 #include <linux/kernel.h>
 #include <linux/kobject.h>
 
+#define SLAB_NUMA		0x00000001UL    /* shortcut */
+
 enum stat_item {
 	ALLOC,			/* Allocation count */
 	ALLOC_SLAB_FILL,	/* Fill freelist from page list */
@@ -224,12 +226,16 @@ static __always_inline int kmalloc_index
 
 /*
  * Find the kmalloc slab cache for a given combination of allocation flags and
- * size.
+ * size. Should really only be used for constant 'size' arguments, due to
+ * bloat.
  */
 static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
 {
-	int index = kmalloc_index(size);
+	int index;
+
+	BUILD_BUG_ON(!__builtin_constant_p(size));
 
+	index = kmalloc_index(size);
 	if (unlikely(index == 0))
 		return NULL;
 
Index: linux-2.6/mm/slqb.c
===================================================================
--- linux-2.6.orig/mm/slqb.c
+++ linux-2.6/mm/slqb.c
@@ -58,9 +58,15 @@ static inline void struct_slqb_page_wron
 
 static int kmem_size __read_mostly;
 #ifdef CONFIG_NUMA
-static int numa_platform __read_mostly;
+static inline int slab_numa(struct kmem_cache *s)
+{
+	return s->flags & SLAB_NUMA;
+}
 #else
-static const int numa_platform = 0;
+static inline int slab_numa(struct kmem_cache *s)
+{
+	return 0;
+}
 #endif
 
 static inline int slab_hiwater(struct kmem_cache *s)
@@ -166,19 +172,6 @@ static inline struct slqb_page *virt_to_
 	return (struct slqb_page *)p;
 }
 
-static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
-						unsigned int order)
-{
-	struct page *p;
-
-	if (nid == -1)
-		p = alloc_pages(flags, order);
-	else
-		p = alloc_pages_node(nid, flags, order);
-
-	return (struct slqb_page *)p;
-}
-
 static inline void __free_slqb_pages(struct slqb_page *page, unsigned int order)
 {
 	struct page *p = &page->page;
@@ -231,8 +224,16 @@ static inline int slab_poison(struct kme
 static struct notifier_block slab_notifier;
 #endif
 
-/* A list of all slab caches on the system */
+/*
+ * slqb_lock protects slab_caches list and serialises hotplug operations.
+ * hotplug operations take lock for write, other operations can hold off
+ * hotplug by taking it for read (or write).
+ */
 static DECLARE_RWSEM(slqb_lock);
+
+/*
+ * A list of all slab caches on the system
+ */
 static LIST_HEAD(slab_caches);
 
 /*
@@ -875,6 +876,9 @@ static unsigned long kmem_cache_flags(un
 		strlen(slqb_debug_slabs)) == 0))
 			flags |= slqb_debug;
 
+	if (num_possible_nodes() > 1)
+		flags |= SLAB_NUMA;
+
 	return flags;
 }
 #else
@@ -913,6 +917,8 @@ static inline void add_full(struct kmem_
 static inline unsigned long kmem_cache_flags(unsigned long objsize,
 	unsigned long flags, const char *name, void (*ctor)(void *))
 {
+	if (num_possible_nodes() > 1)
+		flags |= SLAB_NUMA;
 	return flags;
 }
 
@@ -930,7 +936,7 @@ static struct slqb_page *allocate_slab(s
 
 	flags |= s->allocflags;
 
-	page = alloc_slqb_pages_node(node, flags, s->order);
+	page = (struct slqb_page *)alloc_pages_node(node, flags, s->order);
 	if (!page)
 		return NULL;
 
@@ -1296,8 +1302,6 @@ static noinline void *__slab_alloc_page(
 	if (c->colour_next >= s->colour_range)
 		c->colour_next = 0;
 
-	/* XXX: load any partial? */
-
 	/* Caller handles __GFP_ZERO */
 	gfpflags &= ~__GFP_ZERO;
 
@@ -1622,7 +1626,7 @@ static __always_inline void __slab_free(
 
 	slqb_stat_inc(l, FREE);
 
-	if (!NUMA_BUILD || !numa_platform ||
+	if (!NUMA_BUILD || !slab_numa(s) ||
 			likely(slqb_page_to_nid(page) == numa_node_id())) {
 		/*
 		 * Freeing fastpath. Collects all local-node objects, not
@@ -1676,7 +1680,7 @@ void kmem_cache_free(struct kmem_cache *
 {
 	struct slqb_page *page = NULL;
 
-	if (numa_platform)
+	if (slab_numa(s))
 		page = virt_to_head_slqb_page(object);
 	slab_free(s, page, object);
 }
@@ -1816,26 +1820,28 @@ static void init_kmem_cache_node(struct
 }
 #endif
 
-/* Initial slabs */
+/* Initial slabs. XXX: allocate dynamically (with bootmem maybe) */
 #ifdef CONFIG_SMP
-static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_cache_cpus);
 #endif
 #ifdef CONFIG_NUMA
-static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
+/* XXX: really need a DEFINE_PER_NODE for per-node data, but this is better than
+ * a static array */
+static DEFINE_PER_CPU(struct kmem_cache_node, kmem_cache_nodes);
 #endif
 
 #ifdef CONFIG_SMP
 static struct kmem_cache kmem_cpu_cache;
-static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_cpu_cpus);
 #ifdef CONFIG_NUMA
-static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
+static DEFINE_PER_CPU(struct kmem_cache_node, kmem_cpu_nodes); /* XXX per-nid */
 #endif
 #endif
 
 #ifdef CONFIG_NUMA
 static struct kmem_cache kmem_node_cache;
-static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
-static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_node_cpus);
+static DEFINE_PER_CPU(struct kmem_cache_node, kmem_node_nodes); /*XXX per-nid */
 #endif
 
 #ifdef CONFIG_SMP
@@ -2090,15 +2096,15 @@ static int kmem_cache_open(struct kmem_c
 		s->colour_range = 0;
 	}
 
+	down_write(&slqb_lock);
 	if (likely(alloc)) {
 		if (!alloc_kmem_cache_nodes(s))
-			goto error;
+			goto error_lock;
 
 		if (!alloc_kmem_cache_cpus(s))
 			goto error_nodes;
 	}
 
-	down_write(&slqb_lock);
 	sysfs_slab_add(s);
 	list_add(&s->list, &slab_caches);
 	up_write(&slqb_lock);
@@ -2107,6 +2113,8 @@ static int kmem_cache_open(struct kmem_c
 
 error_nodes:
 	free_kmem_cache_nodes(s);
+error_lock:
+	up_write(&slqb_lock);
 error:
 	if (flags & SLAB_PANIC)
 		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
@@ -2180,7 +2188,6 @@ void kmem_cache_destroy(struct kmem_cach
 
 	down_write(&slqb_lock);
 	list_del(&s->list);
-	up_write(&slqb_lock);
 
 #ifdef CONFIG_SMP
 	for_each_online_cpu(cpu) {
@@ -2230,6 +2237,7 @@ void kmem_cache_destroy(struct kmem_cach
 #endif
 
 	sysfs_slab_remove(s);
+	up_write(&slqb_lock);
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
@@ -2603,7 +2611,7 @@ static int slab_mem_going_online_callbac
 	 * allocate a kmem_cache_node structure in order to bring the node
 	 * online.
 	 */
-	down_read(&slqb_lock);
+	down_write(&slqb_lock);
 	list_for_each_entry(s, &slab_caches, list) {
 		/*
 		 * XXX: kmem_cache_alloc_node will fallback to other nodes
@@ -2621,7 +2629,7 @@ static int slab_mem_going_online_callbac
 		s->node[nid] = n;
 	}
 out:
-	up_read(&slqb_lock);
+	up_write(&slqb_lock);
 	return ret;
 }
 
@@ -2665,13 +2673,6 @@ void __init kmem_cache_init(void)
 	 * All the ifdefs are rather ugly here, but it's just the setup code,
 	 * so it doesn't have to be too readable :)
 	 */
-#ifdef CONFIG_NUMA
-	if (num_possible_nodes() == 1)
-		numa_platform = 0;
-	else
-		numa_platform = 1;
-#endif
-
 #ifdef CONFIG_SMP
 	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
 				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
@@ -2692,15 +2693,20 @@ void __init kmem_cache_init(void)
 
 #ifdef CONFIG_SMP
 	for_each_possible_cpu(i) {
-		init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cpus[i]);
-		kmem_cache_cache.cpu_slab[i] = &kmem_cache_cpus[i];
+		struct kmem_cache_cpu *c;
 
-		init_kmem_cache_cpu(&kmem_cpu_cache, &kmem_cpu_cpus[i]);
-		kmem_cpu_cache.cpu_slab[i] = &kmem_cpu_cpus[i];
+		c = &per_cpu(kmem_cache_cpus, i);
+		init_kmem_cache_cpu(&kmem_cache_cache, c);
+		kmem_cache_cache.cpu_slab[i] = c;
+
+		c = &per_cpu(kmem_cpu_cpus, i);
+		init_kmem_cache_cpu(&kmem_cpu_cache, c);
+		kmem_cpu_cache.cpu_slab[i] = c;
 
 #ifdef CONFIG_NUMA
-		init_kmem_cache_cpu(&kmem_node_cache, &kmem_node_cpus[i]);
-		kmem_node_cache.cpu_slab[i] = &kmem_node_cpus[i];
+		c = &per_cpu(kmem_node_cpus, i);
+		init_kmem_cache_cpu(&kmem_node_cache, c);
+		kmem_node_cache.cpu_slab[i] = c;
 #endif
 	}
 #else
@@ -2709,14 +2715,19 @@ void __init kmem_cache_init(void)
 
 #ifdef CONFIG_NUMA
 	for_each_node_state(i, N_NORMAL_MEMORY) {
-		init_kmem_cache_node(&kmem_cache_cache, &kmem_cache_nodes[i]);
-		kmem_cache_cache.node[i] = &kmem_cache_nodes[i];
-
-		init_kmem_cache_node(&kmem_cpu_cache, &kmem_cpu_nodes[i]);
-		kmem_cpu_cache.node[i] = &kmem_cpu_nodes[i];
+		struct kmem_cache_node *n;
 
-		init_kmem_cache_node(&kmem_node_cache, &kmem_node_nodes[i]);
-		kmem_node_cache.node[i] = &kmem_node_nodes[i];
+		n = &per_cpu(kmem_cache_nodes, i);
+		init_kmem_cache_node(&kmem_cache_cache, n);
+		kmem_cache_cache.node[i] = n;
+
+		n = &per_cpu(kmem_cpu_nodes, i);
+		init_kmem_cache_node(&kmem_cpu_cache, n);
+		kmem_cpu_cache.node[i] = n;
+
+		n = &per_cpu(kmem_node_nodes, i);
+		init_kmem_cache_node(&kmem_node_cache, n);
+		kmem_node_cache.node[i] = n;
 	}
 #endif
 
@@ -2883,7 +2894,7 @@ static int __cpuinit slab_cpuup_callback
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		down_read(&slqb_lock);
+		down_write(&slqb_lock);
 		list_for_each_entry(s, &slab_caches, list) {
 			if (s->cpu_slab[cpu]) /* could be lefover last online */
 				continue;
@@ -2893,7 +2904,7 @@ static int __cpuinit slab_cpuup_callback
 				return NOTIFY_BAD;
 			}
 		}
-		up_read(&slqb_lock);
+		up_write(&slqb_lock);
 		break;
 
 	case CPU_ONLINE:
@@ -3019,6 +3030,8 @@ static void gather_stats(struct kmem_cac
 	stats->s = s;
 	spin_lock_init(&stats->lock);
 
+	down_read(&slqb_lock); /* hold off hotplug */
+
 	on_each_cpu(__gather_stats, stats, 1);
 
 #ifdef CONFIG_NUMA
@@ -3047,6 +3060,8 @@ static void gather_stats(struct kmem_cac
 	}
 #endif
 
+	up_read(&slqb_lock);
+
 	stats->nr_objects = stats->nr_slabs * s->objects;
 }
 #endif
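
[The hunks above replace NR_CPUS/MAX_NUMNODES-sized static arrays with
per-CPU data initialised from for_each_possible_cpu().  A minimal,
self-contained sketch of that pattern follows; the names are made up
for illustration and are not taken from the patch:]

#include <linux/init.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>

struct example_cpu_data {
	unsigned long nr_allocs;
};

/* Before: static struct example_cpu_data example_cpus[NR_CPUS]; */
static DEFINE_PER_CPU(struct example_cpu_data, example_cpus);

static void __init example_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct example_cpu_data *c = &per_cpu(example_cpus, cpu);

		c->nr_allocs = 0;
	}
}

[With DEFINE_PER_CPU the storage is laid out by the per-cpu allocator
rather than sized by the compile-time maximum, which is the point of
the conversion.]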


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23  9:00     ` Nick Piggin
@ 2009-01-23 13:34       ` Hugh Dickins
  -1 siblings, 0 replies; 197+ messages in thread
From: Hugh Dickins @ 2009-01-23 13:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Zhang, Yanmin, Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

On Fri, 23 Jan 2009, Nick Piggin wrote:
> 
> ... Would you be able to test with this updated patch
> (which also includes Hugh's fix ...

In fact not: claim_remote_free_list() still has the offending unlocked
+	VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);

Hugh

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 13:34       ` Hugh Dickins
@ 2009-01-23 13:44         ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23 13:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Zhang, Yanmin, Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

On Fri, Jan 23, 2009 at 01:34:49PM +0000, Hugh Dickins wrote:
> On Fri, 23 Jan 2009, Nick Piggin wrote:
> > 
> > ... Would you be able to test with this updated patch
> > (which also includes Hugh's fix ...
> 
> In fact not: claim_remote_free_list() still has the offending unlocked
> +	VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);

Doh, thanks. It turned out I was still missing a few cases where it
wasn't checking for memoryless nodes (Andi explained why I didn't see
it on x86-64: it handles the case differently and assigns the default
node to the nearest one with memory, I think).

Working on a new version, so I've definitely got your bug covered
now :)
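
[The problem with the quoted assertion is that head and tail can only
be compared consistently while the remote free list cannot change, i.e.
with its lock held.  A self-contained sketch of that idea, using made-up
structure and function names rather than the actual SLQB fix:]

#include <linux/spinlock.h>
#include <linux/mm.h>

struct example_remote_free {
	spinlock_t lock;
	void **head;
	void **tail;
	int nr;
};

/*
 * Claim the whole remote list for local use.  The head/tail consistency
 * check is only meaningful under the lock, where no concurrent remote
 * free can be halfway through updating the list.
 */
static int example_claim(struct example_remote_free *rf,
			 void ***headp, void ***tailp)
{
	int nr;

	spin_lock(&rf->lock);
	VM_BUG_ON(!rf->head != !rf->tail);
	*headp = rf->head;
	*tailp = rf->tail;
	nr = rf->nr;
	rf->head = NULL;
	rf->tail = NULL;
	rf->nr = 0;
	spin_unlock(&rf->lock);

	return nr;
}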


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23  3:55     ` Nick Piggin
@ 2009-01-23 13:57       ` Hugh Dickins
  -1 siblings, 0 replies; 197+ messages in thread
From: Hugh Dickins @ 2009-01-23 13:57 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Fri, 23 Jan 2009, Nick Piggin wrote:
> On Wed, Jan 21, 2009 at 06:10:12PM +0000, Hugh Dickins wrote:
> > 
> > That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> > more than SLAB) with swapping loads on most of my machines.  Though
> > oddly one seems immune, and another takes four times as long: guess
> > it depends on how close to thrashing, but probably more to investigate
> > there.  I think my original SLUB versus SLAB comparisons were done on
> > the immune one: as I remember, SLUB and SLAB were equivalent on those
> > loads when SLUB came in, but even with boot option slub_max_order=1,
> > SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> > FWIW - swapping loads are not what anybody should tune for.
> 
> Yeah, that's to be expected with higher order allocations I think. Does
> your immune machine simply have fewer CPUs and thus doesn't use such
> high order allocations?

No, it's just one of the quads.  Whereas the worst affected (laptop)
is a duo.  I should probably be worrying more about that one: it may
be that I'm thrashing it and its results are meaningless, though still
curious that slab and slqb and slob all do so markedly better on it.

It's behaving much better with slub_max_order=1 slub_min_objects=4,
but to get competitive I've had to switch off most of the debugging
options I usually have on that one - and I've not yet tried slab,
slqb and slob with those off too.  Hmm, it looks like it's getting
progressively slower.

I'll continue to investigate at leisure,
but can't give it too much attention.

Hugh

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 13:18         ` Nick Piggin
@ 2009-01-23 14:04           ` Andi Kleen
  -1 siblings, 0 replies; 197+ messages in thread
From: Andi Kleen @ 2009-01-23 14:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

[dropping Lameter's outdated address]

On Fri, Jan 23, 2009 at 02:18:00PM +0100, Nick Piggin wrote:
>  
> > > Probably the best way would
> > > be to have dynamic cpu and node allocs for them, I agree.
> > 
> > It's really needed.
> > 
> > > Any plans for an alloc_pernode?
> > 
> > It shouldn't be very hard to implement. Or do you ask if I'm volunteering? @)
> 
> Just if you knew about plans. I won't get too much time to work on

Not aware of anyone working on it.

> it next week, so I hope to have something in slab tree in the
> meantime. I think it is OK to leave now, with a mind to improving

Sorry, the NR_CPUS/MAX_NUMNODES arrays are a merge blocker imho
because they explode with CONFIG_MAXSMP.

> it before a possible mainline merge (there will possibly be more
> serious issues discovered anyway).

I see you fixed the static arrays.

Doing the same for the kmem_cache arrays with making them a pointer
and then using num_possible_{cpus,nodes}() would seem straight forward,
wouldn't it?

Although I think I would prefer alloc_percpu, possibly with
per_cpu_ptr(first_cpu(node_to_cpumask(node)), ...)

> > > > > + * - investiage performance with memoryless nodes. Perhaps CPUs can be given
> > > > > + *   a default closest home node via which it can use fastpath functions.
> > > > 
> > > > FWIW that is what x86-64 always did. Perhaps you can just fix ia64 to do 
> > > > that too and be happy.
> > > 
> > > What if the node is possible but not currently online?
> > 
> > Nobody should allocate on it then.
> 
> But then it goes online and what happens? 

You already have a node online notifier that should handle that then, don't you?

x86-64 btw currently doesn't support node hotplug (but I expect it will
be added at some point), but it should be ok even on architectures
that do.

> Your numa_node_id() changes?

What do you mean?

> How does that work? Or you mean x86-64 does not do that same trick for
> possible but offline nodes?

All I'm saying is that when x86-64 finds a memoryless node it assigns
its CPUs to other nodes. Hmm, ok, perhaps there's a backdoor when someone
sets it with kmalloc_node(), but that should normally not happen I think.

> 
> > > git grep -l -e cache_line_size arch/ | egrep '\.h$'
> > > 
> > > Only ia64, mips, powerpc, sparc, x86...
> > 
It's straightforward to add that define everywhere.
> 
> OK, but this code is just copied straight from SLAB... I don't want
> to add such dependency at this point I'm trying to get something

I'm sure such a straightforward change could still be put into .29

> reasonable to merge. But it would be a fine cleanup.

Hmm to be honest it's a little weird to post so much code and then
say you can't change large parts of it.

Could you perhaps mark all the code you don't want to change?

I'm not sure I follow the rationale for not changing code that has been
copied from elsewhere. If you copied it why can't you change it?
 
> > 
> > Hmm, then fix slub? 
> 
> That's my plan, but I go about it a different way ;) I don't want to
> spend too much time on other allocators or cleanup etc code too much
> right now (except cleanups in SLQB, which of course is required).

But still if you copy code from slub you can improve it, can't you?
The sysfs code definitely could be done much nicer (ok for small values
of "nice"; sysfs is always ugly of course @). But at least it can be
done in a way that doesn't bloat the text so much.

Thanks for the patch.

One thing I'm not sure about is using a private lock to hold off hotplug.
I don't have a concrete scenario, but it makes me uneasy considering
deadlocks when someone sleeps etc. Safer is get/put_online_cpus() 

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-22 12:47       ` Hugh Dickins
@ 2009-01-23 14:23         ` Hugh Dickins
  -1 siblings, 0 replies; 197+ messages in thread
From: Hugh Dickins @ 2009-01-23 14:23 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Thu, 22 Jan 2009, Hugh Dickins wrote:
> On Thu, 22 Jan 2009, Pekka Enberg wrote:
> > On Wed, Jan 21, 2009 at 8:10 PM, Hugh Dickins <hugh@veritas.com> wrote:
> > >
> > > That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> > > more than SLAB) with swapping loads on most of my machines.  Though
> > > oddly one seems immune, and another takes four times as long: guess
> > > it depends on how close to thrashing, but probably more to investigate
> > > there.  I think my original SLUB versus SLAB comparisons were done on
> > > the immune one: as I remember, SLUB and SLAB were equivalent on those
> > > loads when SLUB came in, but even with boot option slub_max_order=1,
> > > SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> > > FWIW - swapping loads are not what anybody should tune for.
> > 
> > What kind of machine are you seeing this on? It sounds like it could
> > be a side-effect from commit 9b2cd506e5f2117f94c28a0040bf5da058105316
> > ("slub: Calculate min_objects based on number of processors").
> 
> Thanks, yes, that could well account for the residual difference: the
> machines in question have 2 or 4 cpus, so the old slub_min_objects=4
> has effectively become slub_min_objects=12 or slub_min_objects=16.
> 
> I'm now trying with slub_max_order=1 slub_min_objects=4 on the boot
> lines (though I'll need to curtail tests on a couple of machines),
> and will report back later.

Yes, slub_max_order=1 with slub_min_objects=4 certainly helps this
swapping load.  I've not tried slub_max_order=0, but I'm running
with 8kB stacks, so order 1 seems a reasonable choice.

I can't say where I pulled that "e.g. 2% slower" from: on different
machines slub was 5% or 10% or 20% slower than slab and slqb even with
slub_max_order=1 (but not significantly slower on the "immune" machine).
How much slub_min_objects=4 helps again varies widely, between halving
and eliminating the difference.

But I think it's more important that I focus on the worst case machine,
try to understand what's going on there.

Hugh

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 14:04           ` Andi Kleen
@ 2009-01-23 14:27             ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23 14:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

On Fri, Jan 23, 2009 at 03:04:06PM +0100, Andi Kleen wrote:
> [dropping lameters' outdated address]
> 
> On Fri, Jan 23, 2009 at 02:18:00PM +0100, Nick Piggin wrote:
> >  
> > > > Probably the best way would
> > > > be to have dynamic cpu and node allocs for them, I agree.
> > > 
> > > It's really needed.
> > > 
> > > > Any plans for an alloc_pernode?
> > > 
> > > It shouldn't be very hard to implement. Or do you ask if I'm volunteering? @)
> > 
> > Just if you knew about plans. I won't get too much time to work on
> 
> Not aware of anyone working on it.
> 
> > it next week, so I hope to have something in slab tree in the
> > meantime. I think it is OK to leave now, with a mind to improving
> 
> Sorry, the NR_CPUS/MAX_NUMNODE arrays are a merge blocker imho
> because they explode with CONFIG_MAXSMP.

This is a linux-next merge I'm talking about. The point is to get
some parallelism between testing and making slqb perfect (not that I
don't agree with the problem you point out).

 
> > it before a possible mainline merge (there will possibly be more
> > serious issues discovered anyway).
> 
> I see you fixed the static arrays.
> 
> Doing the same for the kmem_cache arrays with making them a pointer
> and then using num_possible_{cpus,nodes}() would seem straight forward,
> wouldn't it?

Hmm, yes, that might be the way to go. I'll do that with the node
array; the cpu array can stay where it is (this reduces cacheline
footprint for small NR_CPUS configs).
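
[A sketch of the dynamically sized node array being discussed: size it
by nr_node_ids at cache creation time instead of declaring a static
MAX_NUMNODES array.  The names are illustrative, and a real allocator
would of course have to bootstrap this allocation from its own caches:]

#include <linux/slab.h>
#include <linux/errno.h>
#include <linux/nodemask.h>

struct example_node_data {
	long nr_partial;
};

struct example_cache {
	struct example_node_data **node;	/* nr_node_ids entries */
};

static int example_alloc_node_array(struct example_cache *c)
{
	/* One pointer per possible node, filled in as nodes come online. */
	c->node = kcalloc(nr_node_ids, sizeof(*c->node), GFP_KERNEL);
	if (!c->node)
		return -ENOMEM;
	return 0;
}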

 
> Although I think I would prefer alloc_percpu, possibly with
> per_cpu_ptr(first_cpu(node_to_cpumask(node)), ...)

I don't think we have the NUMA information available early enough
to do that. But it would indeed be the best idea, because it would
take advantage of improvements in the percpu allocator.


> > But then it goes online and what happens? 
> 
> You already have a node online notifier that should handle that then, don't you?
> 
> x86-64 btw currently doesn't support node hotplug (but I expect it will
> be added at some point), but it should be ok even on architectures
> that do.
> 
> > Your numa_node_id() changes?
> 
> What do you mean?
> 
> > How does that work? Or you mean x86-64 does not do that same trick for
> > possible but offline nodes?
> 
> All I'm saying is that when x86-64 finds a memory less node it assigns
> its CPUs to other nodes. Hmm ok perhaps there's a backdoor when someone
> sets it with kmalloc_node() but that should normally not happen I think.

OK, but if it is _possible_ for the node to gain memory, then you
can't do that of course. If the node is always memoryless then yes
I think it is probably a good idea to just assign it to the closest node
with memory.


> > OK, but this code is just copied straight from SLAB... I don't want
> > to add such dependency at this point I'm trying to get something
> 
> I'm sure such a straight forward change could be still put into .29
> 
> > reasonable to merge. But it would be a fine cleanup.
> 
> Hmm to be honest it's a little weird to post so much code and then
> say you can't change large parts of it.

The cache_line_size() change wouldn't change slqb code significantly.
I have no problem with it, but I simply won't have time to do it,
test all architectures, get those changes merged, and hold off merging
SLQB until they all land.


> Could you perhaps mark all the code you don't want to change?

Primarily the debug code from SLUB.
 

> I'm not sure I follow the rationale for not changing code that has been
> copied from elsewhere. If you copied it why can't you change it?

I have, very extensively. Just diff mm/slqb.c mm/slub.c ;)

The point of not cleaning up peripheral (non-core) code that works and
exists upstream is that it will actually be less hassle for me to
maintain. By all means make improvements to the slub version, which I
can then pull into slqb.


> > That's my plan, but I go about it a different way ;) I don't want to
> > spend too much time on other allocators or cleanup etc code too much
> > right now (except cleanups in SLQB, which of course is required).
> 
> But still if you copy code from slub you can improve it, can't you?
> The sysfs code definitely could be done much nicer (ok for small values
> of "nice"; sysfs is always ugly of course @). But at least it can be
> done in a way that doesn't bloat the text so much.

I'm definitely not averse to cleanups at all, but I just want to try
to avoid duplicating work or diverging when it is not necessary, which
makes it harder to track fixes etc. Just at this point in development...


> Thanks for the patch.
> 
> One thing I'm not sure about is using a private lock to hold off hotplug.
> I don't have a concrete scenario, but it makes me uneasy considering
> deadlocks when someone sleeps etc. Safer is get/put_online_cpus() 

I think it is OK, considering those locks must usually be taken anyway
in the path; I've just tended to widen the coverage. But I'll think
about whether anything can be improved with the get/put API.
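
[A sketch of the get/put_online_cpus() variant Andi suggests above:
pin CPU hotplug for the duration of an on_each_cpu() walk rather than
relying only on a subsystem-private lock.  Function names here are
illustrative, not from the patch:]

#include <linux/cpu.h>
#include <linux/smp.h>

static void example_gather_one(void *info)
{
	/* Runs on each online CPU; hotplug is held off by the caller. */
}

static void example_gather_all(void *stats)
{
	get_online_cpus();
	on_each_cpu(example_gather_one, stats, 1);
	put_online_cpus();
}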


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 14:23         ` Hugh Dickins
@ 2009-01-23 14:30           ` Pekka Enberg
  -1 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-01-23 14:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

Hi Hugh,

On Wed, Jan 21, 2009 at 8:10 PM, Hugh Dickins <hugh@veritas.com> wrote:
> > > > That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> > > > more than SLAB) with swapping loads on most of my machines.  Though
> > > > oddly one seems immune, and another takes four times as long: guess
> > > > it depends on how close to thrashing, but probably more to investigate
> > > > there.  I think my original SLUB versus SLAB comparisons were done on
> > > > the immune one: as I remember, SLUB and SLAB were equivalent on those
> > > > loads when SLUB came in, but even with boot option slub_max_order=1,
> > > > SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> > > > FWIW - swapping loads are not what anybody should tune for.

On Thu, 22 Jan 2009, Pekka Enberg wrote:
> > > What kind of machine are you seeing this on? It sounds like it could
> > > be a side-effect from commit 9b2cd506e5f2117f94c28a0040bf5da058105316
> > > ("slub: Calculate min_objects based on number of processors").

On Thu, 22 Jan 2009, Hugh Dickins wrote:
> > Thanks, yes, that could well account for the residual difference: the
> > machines in question have 2 or 4 cpus, so the old slub_min_objects=4
> > has effectively become slub_min_objects=12 or slub_min_objects=16.
> > 
> > I'm now trying with slub_max_order=1 slub_min_objects=4 on the boot
> > lines (though I'll need to curtail tests on a couple of machines),
> > and will report back later.

On Fri, 2009-01-23 at 14:23 +0000, Hugh Dickins wrote:
> Yes, slub_max_order=1 with slub_min_objects=4 certainly helps this
> swapping load.  I've not tried slub_max_order=0, but I'm running
> with 8kB stacks, so order 1 seems a reasonable choice.

Yanmin/Christoph, maybe we should revisit the min objects logic in
calculate_order()?

On Fri, 2009-01-23 at 14:23 +0000, Hugh Dickins wrote:
> I can't say where I pulled that "e.g. 2% slower" from: on different
> machines slub was 5% or 10% or 20% slower than slab and slqb even with
> slub_max_order=1 (but not significantly slower on the "immune" machine).
> How much slub_min_objects=4 helps again varies widely, between halving
> or eliminating the difference.
> 
> But I think it's more important that I focus on the worst case machine,
> try to understand what's going on there.

Yeah. Oprofile and CONFIG_SLUB_STATS are usually quite helpful. You
might want to test the included patch, which targets one known SLAB vs.
SLUB regression discovered quite recently.

			Pekka

Subject: [PATCH] SLUB: revert direct page allocator pass through
From: Pekka Enberg <penberg@cs.helsinki.fi>

This patch reverts page allocator pass-through logic from the SLUB allocator.

Commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB: direct pass through of
page size or higher kmalloc requests") added page allocator pass-through to the
SLUB allocator for large sized allocations. This, however, results in a
performance regression compared to SLAB in the netperf UDP-U-4k test.

The regression comes from the kfree(skb->head) call in skb_release_data() that
is subject to page allocator pass-through as the size passed to __alloc_skb()
is larger than 4 KB in this test. With this patch, the performance regression
is almost closed:

  <insert numbers here>

Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 2f5c16b..3bd3662 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -124,7 +124,7 @@ struct kmem_cache {
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1];
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -135,6 +135,9 @@ static __always_inline int kmalloc_index(size_t size)
 	if (!size)
 		return 0;
 
+	if (size > KMALLOC_MAX_SIZE)
+		return -1;
+
 	if (size <= KMALLOC_MIN_SIZE)
 		return KMALLOC_SHIFT_LOW;
 
@@ -154,10 +157,6 @@ static __always_inline int kmalloc_index(size_t size)
 	if (size <=       1024) return 10;
 	if (size <=   2 * 1024) return 11;
 	if (size <=   4 * 1024) return 12;
-/*
- * The following is only needed to support architectures with a larger page
- * size than 4k.
- */
 	if (size <=   8 * 1024) return 13;
 	if (size <=  16 * 1024) return 14;
 	if (size <=  32 * 1024) return 15;
@@ -167,6 +166,10 @@ static __always_inline int kmalloc_index(size_t size)
 	if (size <= 512 * 1024) return 19;
 	if (size <= 1024 * 1024) return 20;
 	if (size <=  2 * 1024 * 1024) return 21;
+	if (size <=  4 * 1024 * 1024) return 22;
+	if (size <=  8 * 1024 * 1024) return 23;
+	if (size <= 16 * 1024 * 1024) return 24;
+	if (size <= 32 * 1024 * 1024) return 25;
 	return -1;
 
 /*
@@ -191,6 +194,19 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
 	if (index == 0)
 		return NULL;
 
+	/*
+	 * This function only gets expanded if __builtin_constant_p(size), so
+	 * testing it here shouldn't be needed.  But some versions of gcc need
+	 * help.
+	 */
+	if (__builtin_constant_p(size) && index < 0) {
+		/*
+		 * Generate a link failure. Would be great if we could
+		 * do something to stop the compile here.
+		 */
+		extern void __kmalloc_size_too_large(void);
+		__kmalloc_size_too_large();
+	}
 	return &kmalloc_caches[index];
 }
 
@@ -204,17 +220,9 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 void *__kmalloc(size_t size, gfp_t flags);
 
-static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
-{
-	return (void *)__get_free_pages(flags | __GFP_COMP, get_order(size));
-}
-
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
 	if (__builtin_constant_p(size)) {
-		if (size > PAGE_SIZE)
-			return kmalloc_large(size, flags);
-
 		if (!(flags & SLUB_DMA)) {
 			struct kmem_cache *s = kmalloc_slab(size);
 
diff --git a/mm/slub.c b/mm/slub.c
index 6392ae5..8fad23f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2475,7 +2475,7 @@ EXPORT_SYMBOL(kmem_cache_destroy);
  *		Kmalloc subsystem
  *******************************************************************/
 
-struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned;
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1] __cacheline_aligned;
 EXPORT_SYMBOL(kmalloc_caches);
 
 static int __init setup_slub_min_order(char *str)
@@ -2537,7 +2537,7 @@ panic:
 }
 
 #ifdef CONFIG_ZONE_DMA
-static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1];
+static struct kmem_cache *kmalloc_caches_dma[KMALLOC_SHIFT_HIGH + 1];
 
 static void sysfs_add_func(struct work_struct *w)
 {
@@ -2643,8 +2643,12 @@ static struct kmem_cache *get_slab(size_t size, gfp_t flags)
 			return ZERO_SIZE_PTR;
 
 		index = size_index[(size - 1) / 8];
-	} else
+	} else {
+		if (size > KMALLOC_MAX_SIZE)
+			return NULL;
+
 		index = fls(size - 1);
+	}
 
 #ifdef CONFIG_ZONE_DMA
 	if (unlikely((flags & SLUB_DMA)))
@@ -2658,9 +2662,6 @@ void *__kmalloc(size_t size, gfp_t flags)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large(size, flags);
-
 	s = get_slab(size, flags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -2670,25 +2671,11 @@ void *__kmalloc(size_t size, gfp_t flags)
 }
 EXPORT_SYMBOL(__kmalloc);
 
-static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
-{
-	struct page *page = alloc_pages_node(node, flags | __GFP_COMP,
-						get_order(size));
-
-	if (page)
-		return page_address(page);
-	else
-		return NULL;
-}
-
 #ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large_node(size, flags, node);
-
 	s = get_slab(size, flags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -2746,11 +2733,8 @@ void kfree(const void *x)
 		return;
 
 	page = virt_to_head_page(x);
-	if (unlikely(!PageSlab(page))) {
-		BUG_ON(!PageCompound(page));
-		put_page(page);
+	if (unlikely(WARN_ON(!PageSlab(page)))) /* XXX */
 		return;
-	}
 	slab_free(page->slab, page, object, _RET_IP_);
 }
 EXPORT_SYMBOL(kfree);
@@ -2985,7 +2969,7 @@ void __init kmem_cache_init(void)
 		caches++;
 	}
 
-	for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) {
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
 		create_kmalloc_cache(&kmalloc_caches[i],
 			"kmalloc", 1 << i, GFP_KERNEL);
 		caches++;
@@ -3022,7 +3006,7 @@ void __init kmem_cache_init(void)
 	slab_state = UP;
 
 	/* Provide the correct kmalloc names now that the caches are up */
-	for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++)
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
 		kmalloc_caches[i]. name =
 			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
 
@@ -3222,9 +3206,6 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large(size, gfpflags);
-
 	s = get_slab(size, gfpflags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -3238,9 +3219,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large_node(size, gfpflags, node);
-
 	s = get_slab(size, gfpflags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))



^ permalink raw reply related	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 14:27             ` Nick Piggin
@ 2009-01-23 15:06               ` Andi Kleen
  -1 siblings, 0 replies; 197+ messages in thread
From: Andi Kleen @ 2009-01-23 15:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

On Fri, Jan 23, 2009 at 03:27:53PM +0100, Nick Piggin wrote:
>  
> > Although I think I would prefer alloc_percpu, possibly with
> > per_cpu_ptr(first_cpu(node_to_cpumask(node)), ...)
> 
> I don't think we have the NUMA information available early enough
> to do that. 

How early? At mem_init time it should be there, because bootmem needed
it already. "It" meaning the architecture-level NUMA information.

> OK, but if it is _possible_ for the node to gain memory, then you
> can't do that of course. 

In theory it could gain memory through memory hotplug.

> > I'm sure such a straight forward change could be still put into .29
> > 
> > > reasonable to merge. But it would be a fine cleanup.
> > 
> > Hmm to be honest it's a little weird to post so much code and then
> > say you can't change large parts of it.
> 
> The cache_line_size() change wouldn't change slqb code significantly.
> I have no problem with it, but I simply won't have time to do it and
> test all architectures and get them merged and hold off merging
> SLQB until they all get merged.

I was mainly referring to the sysfs code here.
 
 
> > Could you perhaps mark all the code you don't want to change?
> 
> Primarily the debug code from SLUB.

Ok so you could fix the sysfs code? @)

Anyway, if you have such shared pieces, perhaps it would be better
to just pull them all out into a separate file. 

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 15:06               ` Andi Kleen
@ 2009-01-23 15:15                 ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23 15:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

On Fri, Jan 23, 2009 at 04:06:32PM +0100, Andi Kleen wrote:
> On Fri, Jan 23, 2009 at 03:27:53PM +0100, Nick Piggin wrote:
> >  
> > > Although I think I would prefer alloc_percpu, possibly with
> > > per_cpu_ptr(first_cpu(node_to_cpumask(node)), ...)
> > 
> > I don't think we have the NUMA information available early enough
> > to do that. 
> 
> How early? At mem_init time it should be there because bootmem needed
> it already. It meaning the architectural level NUMA information.

node_to_cpumask(0) returned 0 at kmem_cache_init time.

 
> > OK, but if it is _possible_ for the node to gain memory, then you
> > can't do that of course. 
> 
> In theory it could gain memory through memory hotplug.

Yes.

 
> > The cache_line_size() change wouldn't change slqb code significantly.
> > I have no problem with it, but I simply won't have time to do it and
> > test all architectures and get them merged and hold off merging
> > SLQB until they all get merged.
> 
> I was mainly refering to the sysfs code here.

OK.


> > > Could you perhaps mark all the code you don't want to change?
> > 
> > Primarily the debug code from SLUB.
> 
> Ok so you could fix the sysfs code? @)
> 
> Anyways, if you have such shared pieces perhaps it would be better
> if you just pull them all out into a separate file. 

I'll see. I do plan to try making improvements to this peripheral
code but it just has to wait a little bit for other improvements
first.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 14:23         ` Hugh Dickins
@ 2009-02-02  3:38           ` Zhang, Yanmin
  -1 siblings, 0 replies; 197+ messages in thread
From: Zhang, Yanmin @ 2009-02-02  3:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter, Christoph Lameter

Hi, Hugh,

On Fri, 2009-01-23 at 14:23 +0000, Hugh Dickins wrote:
> On Thu, 22 Jan 2009, Hugh Dickins wrote:
> > On Thu, 22 Jan 2009, Pekka Enberg wrote:
> > > On Wed, Jan 21, 2009 at 8:10 PM, Hugh Dickins <hugh@veritas.com> wrote:
> > > >
> > > > That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> > > > more than SLAB) with swapping loads on most of my machines.  Though
Would you like to share your tmpfs loop swap load with me, so that I can
reproduce it on my machines? Do your machines run in i386 mode or x86-64
mode? How much memory do they have?

> > > > oddly one seems immune, and another takes four times as long: guess
> > > > it depends on how close to thrashing, but probably more to investigate
> > > > there.  I think my original SLUB versus SLAB comparisons were done on
> > > > the immune one: as I remember, SLUB and SLAB were equivalent on those
> > > > loads when SLUB came in, but even with boot option slub_max_order=1,
> > > > SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> > > > FWIW - swapping loads are not what anybody should tune for.
> > > 
> > > What kind of machine are you seeing this on? It sounds like it could
> > > be a side-effect from commit 9b2cd506e5f2117f94c28a0040bf5da058105316
> > > ("slub: Calculate min_objects based on number of processors").
As I know little about your workload, I can only guess from 'loop swap load'
that it eats memory quickly, so the kernel starts reclaim/swap to maintain a
small amount of free memory.

Commit 9b2cd506e5f2117f94c28a0040bf5da058105316 is just a way to increase the
page order for slub so that more free objects are available per slab. That
improves performance on many benchmarks as long as there are enough __free__
pages. Memory is cheap, and compared with the growth in cpu counts it is
growing even faster, which is why we created commit
9b2cd506e5f2117f94c28a0040bf5da058105316. Without it, we would have ended up
with a similar commit that simply increases slub_min_objects and
slub_max_order.

However, our assumption about free memory seems inappropriate when memory is
hungry, as in your case. Function allocate_slab always tries the higher order
first; only if that fails to get a new slab does it try the minimum order. In
your case, I think the first try always fails, and it takes too much time.
Perhaps alloc_pages goes a long way before giving up, even with __GFP_NORETRY,
and that consumes the extra time?
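
(For reference, the two-step logic in 2.6.29's allocate_slab() is roughly the
following -- paraphrased from mm/slub.c rather than quoted exactly:)

	struct page *page;
	struct kmem_cache_order_objects oo = s->oo;

	/* first try: the preferred, higher order -- opportunistic only */
	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node, oo);
	if (unlikely(!page)) {
		/* second try: fall back to the minimum workable order */
		oo = s->min;
		page = alloc_slab_page(flags, node, oo);
	}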


Christoph and Pekka,

Can we add a check of the free page count/percentage in function
allocate_slab, so that we can bypass the first alloc_pages attempt when
memory is hungry?


> > 
> > Thanks, yes, that could well account for the residual difference: the
> > machines in question have 2 or 4 cpus, so the old slub_min_objects=4
> > has effectively become slub_min_objects=12 or slub_min_objects=16.
> > 
> > I'm now trying with slub_max_order=1 slub_min_objects=4 on the boot
> > lines (though I'll need to curtail tests on a couple of machines),
> > and will report back later.
> 
> Yes, slub_max_order=1 with slub_min_objects=4 certainly helps this
> swapping load.  I've not tried slub_max_order=0, but I'm running
> with 8kB stacks, so order 1 seems a reasonable choice.
> 
> I can't say where I pulled that "e.g. 2% slower" from: on different
> machines slub was 5% or 10% or 20% slower than slab and slqb even with
> slub_max_order=1 (but not significantly slower on the "immune" machine).
> How much slub_min_objects=4 helps again varies widely, between halving
> or eliminating the difference.
I guess your machines have different amounts of memory, while your workload
mostly consumes a fixed number of pages, so the resulting percentage
differs.

> 
> But I think it's more important that I focus on the worst case machine,
> try to understand what's going on there.
oprofile data and 'slabinfo -AD' output might help.

yanmin



^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-02  3:38           ` Zhang, Yanmin
@ 2009-02-02  9:00             ` Pekka Enberg
  -1 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-02-02  9:00 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Hugh Dickins, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

Hi Yanmin,

On Mon, 2009-02-02 at 11:38 +0800, Zhang, Yanmin wrote:
> Can we add a checking about free memory page number/percentage in function
> allocate_slab that we can bypass the first try of alloc_pages when memory
> is hungry?

If the check isn't too expensive, I don't see any reason not to. How would
you go about checking how many free pages there are, though? Is there
something in the page allocator that we can use for this?

		Pekka


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-02  3:38           ` Zhang, Yanmin
  (?)
  (?)
@ 2009-02-02 11:50           ` Hugh Dickins
  -1 siblings, 0 replies; 197+ messages in thread
From: Hugh Dickins @ 2009-02-02 11:50 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

[-- Attachment #1: Type: TEXT/PLAIN, Size: 7585 bytes --]

On Mon, 2 Feb 2009, Zhang, Yanmin wrote:
> On Fri, 2009-01-23 at 14:23 +0000, Hugh Dickins wrote:
> > On Thu, 22 Jan 2009, Hugh Dickins wrote:
> > > On Thu, 22 Jan 2009, Pekka Enberg wrote:
> > > > On Wed, Jan 21, 2009 at 8:10 PM, Hugh Dickins <hugh@veritas.com> wrote:
> > > > >
> > > > > That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> > > > > more than SLAB) with swapping loads on most of my machines.  Though
> Would you like to share your tmpfs loop swap load with me, so I could reproduce
> it on my machines?

A very reasonable request that I feared someone would make!
I'm sure we all have test scripts that we can run happily ourselves,
but as soon as someone else asks, we want to make this and that and
the other adjustment, if only to reduce the amount of setup description
required - this is one such.  I guess I can restrain myself a little if
I'm just sending it to you, separately.

> Do your machines run at i386 mode or x86-64 mode?

Both: one is a ppc64 (G5 Quad), one is i386 only (Atom SSD netbook),
three I can run either way (though my common habit is to run two as
i386 with 32bit userspace and one as x86_64 with 64bit userspace).

> How much memory do your machines have?

I use mem=700M when running such tests on all of them (but leave
the netbook with its 1GB mem): otherwise I'd have to ramp up the
test in different ways to get them all swapping enough - it is
tmpfs and swapping that I'm personally most concerned to test.

> > > > > oddly one seems immune, and another takes four times as long: guess
> > > > > it depends on how close to thrashing, but probably more to investigate
> > > > > there.  I think my original SLUB versus SLAB comparisons were done on
> > > > > the immune one: as I remember, SLUB and SLAB were equivalent on those
> > > > > loads when SLUB came in, but even with boot option slub_max_order=1,
> > > > > SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> > > > > FWIW - swapping loads are not what anybody should tune for.
> > > > 
> > > > What kind of machine are you seeing this on? It sounds like it could
> > > > be a side-effect from commit 9b2cd506e5f2117f94c28a0040bf5da058105316
> > > > ("slub: Calculate min_objects based on number of processors").
> As I know little about your workload, I just guess from 'loop swap load' that
> your load eats memory quickly and kernel/swap is started to keep a low free
> memory.
> 
> Commit 9b2cd506e5f2117f94c28a0040bf5da058105316 is just a method to increase
> the page order for slub so there more free objects available in a slab. That
> promotes performance for many benchmarks if there are enough __free__ pages.
> Because memory is cheaper and comparing with cpu number increasing, memory
> is increased more rapidly. So we create commit
> 9b2cd506e5f2117f94c28a0040bf5da058105316. In addition, if we have no this
> commit, we will have another similiar commit to just increase slub_min_objects
> and slub_max_order.
> 
> However, our assumption about free memory seems inappropriate when memory is
> hungry just like your case. Function allocate_slab always tries the higher
> order firstly. If it fails to get a new slab, it will tries the minimum order.
> As for your case, I think the first try always fails, and it takes too much
> time. Perhaps alloc_pages does far away from a checking even with flag
> __GFP_NORETRY to consume extra time?

I believe you're thinking there of how much system time is used.
I haven't been paying much attention to that, and don't have any
complaints about slub from that angle (what's most noticeable there
is that, as expected, slob uses more system time than slab or slqb
or slub).  Although I do record the system time reported for the
test, I very rarely think to add in kswapd0's and loop0's times,
which would be very significant missed contributions.

What I've been worried by is the total elapsed times, that's where
slub shows up badly.  That means, I think, that bad decisions are
being made about what to swap out when, so that altogether there's
too much swapping: which is understandable when slub is aiming for
higher order allocations.  One page of the high order is selected
according to vmscan's usual criteria, but the remaining pages will
be chosen according to their adjacence rather than their age (to
some extent: there is code in there to resist bad decisions too).
If we imagine that vmscan's usual criteria are perfect (ha ha),
then it's unsurprising that going for higher order allocations
leads it to make inferior decisions and swap out too much.

> 
> Christoph and Pekka,
> 
> Can we add a checking about free memory page number/percentage in function
> allocate_slab that we can bypass the first try of alloc_pages when memory
> is hungry?

Having lots of free memory is a temporary accident following process
exit (when lots of anonymous memory has suddenly been freed), before
it has been put to use for page cache.  The kernel tries to run with
a certain amount of free memory in reserve, and the rest of memory
put to (potentially) good use.  I don't think we have the number
you're looking for there, though perhaps some approximation could
be devised (or I'm looking at the problem the wrong way round).

Perhaps feedback from vmscan.c, on how much it's having to write back,
would provide a good clue.  There's plenty of stats maintained there.
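
(Purely as an illustration of the sort of thing I mean -- the counter and the
threshold here are plucked out of the air, not something I've tried or
measured, and the label is hypothetical:)

	/* hypothetical heuristic: skip slub's speculative high-order attempt
	 * when the VM is already writing back heavily, i.e. memory is hungry */
	if (global_page_state(NR_WRITEBACK) > totalram_pages >> 7)
		goto try_min_order;	/* hypothetical label for the s->min path */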

> > > 
> > > Thanks, yes, that could well account for the residual difference: the
> > > machines in question have 2 or 4 cpus, so the old slub_min_objects=4
> > > has effectively become slub_min_objects=12 or slub_min_objects=16.
> > > 
> > > I'm now trying with slub_max_order=1 slub_min_objects=4 on the boot
> > > lines (though I'll need to curtail tests on a couple of machines),
> > > and will report back later.
> > 
> > Yes, slub_max_order=1 with slub_min_objects=4 certainly helps this
> > swapping load.  I've not tried slub_max_order=0, but I'm running
> > with 8kB stacks, so order 1 seems a reasonable choice.
> > 
> > I can't say where I pulled that "e.g. 2% slower" from: on different
> > machines slub was 5% or 10% or 20% slower than slab and slqb even with
> > slub_max_order=1 (but not significantly slower on the "immune" machine).
> > How much slub_min_objects=4 helps again varies widely, between halving
> > or eliminating the difference.
> I guess your machines have different memory quantity, but your workload
> mostly consumes specified number of pages, so the result percent is
> different.

No, mem=700M in each case but the netbook.

> > 
> > But I think it's more important that I focus on the worst case machine,
> > try to understand what's going on there.
> oprofile data and 'slabinfo -AD' output might help.

oprofile I doubt here, since it's the total elapsed time that worries
me.  I had to look up 'slabinfo -AD', yes, thanks for that pointer, it
may help when I get around to investigating my totally unsubstantiated
suspicion ...

... on the laptop which suffers worst from slub, I am using an SD
card accessed as USB storage for swap (but no USB storage on the
others).  I'm suspecting there's something down that stack which
is slow to recover from allocation failures: when I tried a much
simplified test using just two "cp -a"s, they can hang on that box.
So my current guess is that slub makes something significantly worse
(some debug options make it significantly worse too), but the actual
bug is elsewhere.

Hugh

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-02  9:00             ` Pekka Enberg
@ 2009-02-02 15:00               ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-02-02 15:00 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Zhang, Yanmin, Hugh Dickins, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming

On Mon, 2 Feb 2009, Pekka Enberg wrote:

> Hi Yanmin,
>
> On Mon, 2009-02-02 at 11:38 +0800, Zhang, Yanmin wrote:
> > Can we add a checking about free memory page number/percentage in function
> > allocate_slab that we can bypass the first try of alloc_pages when memory
> > is hungry?
>
> If the check isn't too expensive, I don't any reason not to. How would
> you go about checking how much free pages there are, though? Is there
> something in the page allocator that we can use for this?

If the free memory is low then reclaim needs to be run to increase the
free memory. Falling back immediately incurs the overhead of going through
the order 0 queues.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-02 15:00               ` Christoph Lameter
@ 2009-02-03  1:34                 ` Zhang, Yanmin
  -1 siblings, 0 replies; 197+ messages in thread
From: Zhang, Yanmin @ 2009-02-03  1:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Hugh Dickins, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming

On Mon, 2009-02-02 at 10:00 -0500, Christoph Lameter wrote:
> On Mon, 2 Feb 2009, Pekka Enberg wrote:
> 
> > Hi Yanmin,
> >
> > On Mon, 2009-02-02 at 11:38 +0800, Zhang, Yanmin wrote:
> > > Can we add a checking about free memory page number/percentage in function
> > > allocate_slab that we can bypass the first try of alloc_pages when memory
> > > is hungry?
> >
> > If the check isn't too expensive, I don't any reason not to. How would
> > you go about checking how much free pages there are, though? Is there
> > something in the page allocator that we can use for this?
> 
> If the free memory is low then reclaim needs to be run to increase the
> free memory.
I think reclaim did run often in Hugh's case; otherwise there would be no swapping.

>  Falling back immediately incurs the overhead of going through
> the order 0 queues.
The fallback is temporary. Later on, when enough free pages are available,
new slab allocations go back to the higher order automatically. The point is to
skip the first high-order allocation attempt, because it usually fails when
memory is hungry.



^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-02  9:00             ` Pekka Enberg
@ 2009-02-03  7:29               ` Zhang, Yanmin
  -1 siblings, 0 replies; 197+ messages in thread
From: Zhang, Yanmin @ 2009-02-03  7:29 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Hugh Dickins, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

On Mon, 2009-02-02 at 11:00 +0200, Pekka Enberg wrote:
> Hi Yanmin,
> 
> On Mon, 2009-02-02 at 11:38 +0800, Zhang, Yanmin wrote:
> > Can we add a checking about free memory page number/percentage in function
> > allocate_slab that we can bypass the first try of alloc_pages when memory
> > is hungry?
> 
> If the check isn't too expensive, I don't any reason not to. How would
> you go about checking how much free pages there are, though? Is there
> something in the page allocator that we can use for this?

We can use nr_free_pages(), totalram_pages and hugetlb_total_pages(). The patch
below is a first attempt. I tested it with hackbench and tbench on my stoakley
(2 quad-core processors) and tigerton (4 quad-core processors). There is almost
no regression.

Besides this patch, I have another patch that tries to reduce the recalculation
of "totalram_pages - hugetlb_total_pages()", but it touches many files, so I am
just posting the first, simple patch here for review.


Hugh,

Would you like to test it on your machines?

Thanks,
Yanmin


---

--- linux-2.6.29-rc2/mm/slub.c	2009-01-20 14:20:45.000000000 +0800
+++ linux-2.6.29-rc2_slubfreecheck/mm/slub.c	2009-02-03 14:40:52.000000000 +0800
@@ -23,6 +23,8 @@
 #include <linux/debugobjects.h>
 #include <linux/kallsyms.h>
 #include <linux/memory.h>
+#include <linux/swap.h>
+#include <linux/hugetlb.h>
 #include <linux/math64.h>
 #include <linux/fault-inject.h>
 
@@ -1076,14 +1078,18 @@ static inline struct page *alloc_slab_pa
 
 static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
-	struct page *page;
+	struct page *page = NULL;
 	struct kmem_cache_order_objects oo = s->oo;
+	unsigned long free_pages = nr_free_pages();
+	unsigned long total_pages = totalram_pages - hugetlb_total_pages();
 
 	flags |= s->allocflags;
 
-	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node,
-									oo);
-	if (unlikely(!page)) {
+	if (free_pages > total_pages >> 3) {
+		page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY,
+				node, oo);
+	}
+	if (!page) {
 		oo = s->min;
 		/*
 		 * Allocation may have failed due to fragmentation.



^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-03  7:29               ` Zhang, Yanmin
@ 2009-02-03 12:18                 ` Hugh Dickins
  -1 siblings, 0 replies; 197+ messages in thread
From: Hugh Dickins @ 2009-02-03 12:18 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

On Tue, 3 Feb 2009, Zhang, Yanmin wrote:
> On Mon, 2009-02-02 at 11:00 +0200, Pekka Enberg wrote:
> > On Mon, 2009-02-02 at 11:38 +0800, Zhang, Yanmin wrote:
> > > Can we add a checking about free memory page number/percentage in function
> > > allocate_slab that we can bypass the first try of alloc_pages when memory
> > > is hungry?
> > 
> > If the check isn't too expensive, I don't any reason not to. How would
> > you go about checking how much free pages there are, though? Is there
> > something in the page allocator that we can use for this?
> 
> We can use nr_free_pages(), totalram_pages and hugetlb_total_pages(). Below
> patch is a try. I tested it with hackbench and tbench on my stoakley
> (2 qual-core processors) and tigerton (4 qual-core processors).
> There is almost no regression.

May I repeat what I said yesterday?  Certainly I'm oversimplifying,
but if I'm plain wrong, please correct me.

Having lots of free memory is a temporary accident following process
exit (when lots of anonymous memory has suddenly been freed), before
it has been put to use for page cache.  The kernel tries to run with
a certain amount of free memory in reserve, and the rest of memory
put to (potentially) good use.  I don't think we have the number
you're looking for there, though perhaps some approximation could
be devised (or I'm looking at the problem the wrong way round).

Perhaps feedback from vmscan.c, on how much it's having to write back,
would provide a good clue.  There's plenty of stats maintained there.

> 
> Besides this patch, I have another patch to try to reduce the calculation
> of "totalram_pages - hugetlb_total_pages()", but it touches many files.
> So just post the first simple patch here for review.
> 
> 
> Hugh,
> 
> Would you like to test it on your machines?

Indeed I shall, starting in a few hours when I've finished with trying
the script I promised yesterday to send you.  And I won't be at all
surprised if your patch eliminates my worst cases, because I don't
expect to have any significant amount of free memory during my testing,
and my swap testing suffers from slub's thirst for higher orders.

But I don't believe the kind of check you're making is appropriate,
and I do believe that when you try more extensive testing, you'll find
regressions in other tests which were relying on the higher orders.
If all of your testing happens to have lots of free memory around,
I'm surprised; but perhaps I'm naive about how things actually work,
especially on the larger machines.

Or maybe your tests are relying crucially on the slabs allocated at
system startup, when of course there should be plenty of free memory
around.

By the way, when I went to remind myself of what nr_free_pages()
actually does, my grep immediately hit this remark in mm/mmap.c:
		 * nr_free_pages() is very expensive on large systems,
I hope that's just a stale comment from before it was converted
to global_page_state(NR_FREE_PAGES)!
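
(For what it's worth, and going from memory rather than re-checking 2.6.29's
headers: nr_free_pages() should now be nothing more than a cheap counter read,
roughly

	#define nr_free_pages()	global_page_state(NR_FREE_PAGES)

so the expense that comment warns about ought to be long gone.)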

Hugh

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-03 12:18                 ` Hugh Dickins
@ 2009-02-04  2:21                   ` Zhang, Yanmin
  -1 siblings, 0 replies; 197+ messages in thread
From: Zhang, Yanmin @ 2009-02-04  2:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

On Tue, 2009-02-03 at 12:18 +0000, Hugh Dickins wrote:
> On Tue, 3 Feb 2009, Zhang, Yanmin wrote:
> > On Mon, 2009-02-02 at 11:00 +0200, Pekka Enberg wrote:
> > > On Mon, 2009-02-02 at 11:38 +0800, Zhang, Yanmin wrote:
> > > > Can we add a checking about free memory page number/percentage in function
> > > > allocate_slab that we can bypass the first try of alloc_pages when memory
> > > > is hungry?
> > > 
> > > If the check isn't too expensive, I don't any reason not to. How would
> > > you go about checking how much free pages there are, though? Is there
> > > something in the page allocator that we can use for this?
> > 
> > We can use nr_free_pages(), totalram_pages and hugetlb_total_pages(). Below
> > patch is a try. I tested it with hackbench and tbench on my stoakley
> > (2 qual-core processors) and tigerton (4 qual-core processors).
> > There is almost no regression.
> 
> May I repeat what I said yesterday?  Certainly I'm oversimplifying,
> but if I'm plain wrong, please correct me.
I did read your previous email carefully.

> 
> Having lots of free memory is a temporary accident following process
> exit (when lots of anonymous memory has suddenly been freed), before
> it has been put to use for page cache. 
Some workloads behave the way you describe, but lots of workloads don't use
up most of memory. Keeping a certain amount of memory free can speed the
system up.

> The kernel tries to run with
> a certain amount of free memory in reserve, and the rest of memory
> put to (potentially) good use.  I don't think we have the number
> you're looking for there, though perhaps some approximation could
> be devised (or I'm looking at the problem the wrong way round).
> 
> Perhaps feedback from vmscan.c, on how much it's having to write back,
> would provide a good clue.  There's plenty of stats maintained there.
My starting point is to look for a simple formula, and the patch is an RFC.
As for consulting writeback statistics (and so on): what I am worried about is
that the page allocation called from allocate_slab might cause too much
try_to_free_pages/page-reclaim/swap activity when memory is hungry. Originally,
I thought that was the root cause of your issue.

On the other hand, free memory shouldn't vary too much on a stably running
system. So the patch changes allocate_slab as follows: if high-order free pages
look available without reclaim/swap, go ahead with the first alloc_slab_page;
if not, jump straight to the second alloc_slab_page try. Of course, the check
is only a guess, not a guarantee.

> 
> > 
> > Besides this patch, I have another patch to try to reduce the calculation
> > of "totalram_pages - hugetlb_total_pages()", but it touches many files.
> > So just post the first simple patch here for review.
> > 
> > 
> > Hugh,
> > 
> > Would you like to test it on your machines?
> 
> Indeed I shall, starting in a few hours when I've finished with trying
> the script I promised yesterday to send you.  And I won't be at all
> surprised if your patch eliminates my worst cases, because I don't
> expect to have any significant amount of free memory during my testing,
> and my swap testing suffers from slub's thirst for higher orders.
> 
> But I don't believe the kind of check you're making is appropriate,
> and I do believe that when you try more extensive testing, you'll find
> regressions in other tests which were relying on the higher orders.
Yes, I agree. And we need to find tests which both use up memory and make
lots of higher-order allocations.

> If all of your testing happens to have lots of free memory around,
> I'm surprised; but perhaps I'm naive about how things actually work,
> especially on the larger machines.
I use dozens of benchmarks, but they are mostly microbenchmarks. In addition,
there is usually lots of free memory while they (fio and membench excepted)
are running.

I tried to enable some big benchmarks, such as specweb2005, but I failed because
Tomcat scalability doesn't look good on 4 quad-core-processor machines.

I will look for new benchmarks/tests that use up memory, or even cause lots of
swapping, and add them to my testing framework.

> 
> Or maybe your tests are relying crucially on the slabs allocated at
> system startup, when of course there should be plenty of free memory
> around.
> 
> By the way, when I went to remind myself of what nr_free_pages()
> actually does, my grep immediately hit this remark in mm/mmap.c:
> 		 * nr_free_pages() is very expensive on large systems,
> I hope that's just a stale comment from before it was converted
> to global_page_state(NR_FREE_PAGES)!
I think so.

I will try to reproduce your issue on my machines.

Thanks for your scripts and suggestions.



^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-04  2:21                   ` Zhang, Yanmin
@ 2009-02-05 19:04                     ` Hugh Dickins
  -1 siblings, 0 replies; 197+ messages in thread
From: Hugh Dickins @ 2009-02-05 19:04 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

On Wed, 4 Feb 2009, Zhang, Yanmin wrote:
> On Tue, 2009-02-03 at 12:18 +0000, Hugh Dickins wrote:
> > On Tue, 3 Feb 2009, Zhang, Yanmin wrote:
> > > 
> > > Would you like to test it on your machines?
> > 
> > Indeed I shall, starting in a few hours when I've finished with trying
> > the script I promised yesterday to send you.  And I won't be at all
> > surprised if your patch eliminates my worst cases, because I don't
> > expect to have any significant amount of free memory during my testing,
> > and my swap testing suffers from slub's thirst for higher orders.
> > 
> > But I don't believe the kind of check you're making is appropriate,
> > and I do believe that when you try more extensive testing, you'll find
> > regressions in other tests which were relying on the higher orders.
> 
> Yes, I agree. And we need find such tests which causes both memory used up
> and lots of higher-order allocations.

Sceptical though I am about your free_pages test in slub's allocate_slab(),
I can confirm that your patch does well on my swapping loads, performing
slightly (not necessarily significantly) better than slab on those loads
(though not quite as well on the "immune" machine where slub was already
keeping up with slab; and I haven't even bothered to try it on the machine
which behaves so very badly that no conclusions can yet be drawn).

I then tried a patch I thought obviously better than yours: just mask
off __GFP_WAIT in that __GFP_NOWARN|__GFP_NORETRY preliminary call to
alloc_slab_page(): so we're not trying to infer anything about high-
order availability from the number of free order-0 pages, but actually
going to look for it and taking it if it's free, forgetting it if not.
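
(Roughly this, against 2.6.29-rc2's allocate_slab() -- a sketch of the idea
rather than the exact hunk I ran:)

	/* opportunistic high-order attempt only: don't sleep, don't retry,
	 * don't warn; if nothing suitable is free right now, fall through
	 * to the order s->min path below */
	page = alloc_slab_page((flags & ~__GFP_WAIT) | __GFP_NOWARN | __GFP_NORETRY,
								node, oo);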

That didn't work well at all: almost as bad as the unmodified slub.c.
I decided that was due to __alloc_pages_internal()'s
wakeup_kswapd(zone, order): just expressing an interest in a high-
order page was enough to send it off trying to reclaim them, though
not directly.  Hacked in a condition to suppress that in this case:
worked a lot better, but not nearly as well as yours.  I supposed
that was somehow(?) due to the subsequent get_page_from_freelist()
calls with different watermarking: hacked in another __GFP flag to
break out to nopage just like the NUMA_BUILD GFP_THISNODE case does.
Much better, getting close, but still not as good as yours.  

I think I'd better turn back to things I understand better!

Hugh

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-05 19:04                     ` Hugh Dickins
@ 2009-02-06  0:47                       ` Zhang, Yanmin
  -1 siblings, 0 replies; 197+ messages in thread
From: Zhang, Yanmin @ 2009-02-06  0:47 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

On Thu, 2009-02-05 at 19:04 +0000, Hugh Dickins wrote:
> On Wed, 4 Feb 2009, Zhang, Yanmin wrote:
> > On Tue, 2009-02-03 at 12:18 +0000, Hugh Dickins wrote:
> > > On Tue, 3 Feb 2009, Zhang, Yanmin wrote:
> > > > 
> > > > Would you like to test it on your machines?
> > > 
> > > Indeed I shall, starting in a few hours when I've finished with trying
> > > the script I promised yesterday to send you.  And I won't be at all
> > > surprised if your patch eliminates my worst cases, because I don't
> > > expect to have any significant amount of free memory during my testing,
> > > and my swap testing suffers from slub's thirst for higher orders.
> > > 
> > > But I don't believe the kind of check you're making is appropriate,
> > > and I do believe that when you try more extensive testing, you'll find
> > > regressions in other tests which were relying on the higher orders.
> > 
> > Yes, I agree. And we need find such tests which causes both memory used up
> > and lots of higher-order allocations.
> 
> Sceptical though I am about your free_pages test in slub's allocate_slab(),
> I can confirm that your patch does well on my swapping loads, performing
> slightly (not necessarily significantly) better than slab on those loads
As a matter of fact, the patch has the same effect as slub_max_order=0 on
your workload, apart from the additional cost of checking free pages.

> (though not quite as well on the "immune" machine where slub was already
> keeping up with slab; and I haven't even bothered to try it on the machine
> which behaves so very badly that no conclusions can yet be drawn).
> 
> I then tried a patch I thought obviously better than yours: just mask
> off __GFP_WAIT in that __GFP_NOWARN|__GFP_NORETRY preliminary call to
> alloc_slab_page(): so we're not trying to infer anything about high-
> order availability from the number of free order-0 pages, but actually
> going to look for it and taking it if it's free, forgetting it if not.
> 
> That didn't work well at all: almost as bad as the unmodified slub.c.
> I decided that was due to __alloc_pages_internal()'s
> wakeup_kswapd(zone, order): just expressing an interest in a high-
> order page was enough to send it off trying to reclaim them, though
> not directly.  Hacked in a condition to suppress that in this case:
> worked a lot better, but not nearly as well as yours.  I supposed
> that was somehow(?) due to the subsequent get_page_from_freelist()
> calls with different watermarking: hacked in another __GFP flag to
> break out to nopage just like the NUMA_BUILD GFP_THISNODE case does.
> Much better, getting close, but still not as good as yours.  
> 
> I think I'd better turn back to things I understand better!
Your investigation is really detailed. I also did some testing.

I changed the script a little. As I don't have the laptop devices which
create the worst result difference, I tested on my stoakley machine, which has
2 quad-core processors and 8GB memory (kernel started with mem=1GB), with
a 35GB SCSI disk as the swap partition.

The test runs in a loop. It starts 2 tasks, build1 and build2, each running a
kbuild of 2.6.28. build1 runs on tmpfs directly; build2 runs on an ext2 loop
filesystem on tmpfs. Both builds untar the source tarball first, then compile
the kernel with the defconfig. The script does a sync between build1 and
build2, so they start at the same time in every iteration.

[root@lkp-st02-x8664 ~]# slabinfo -AD|head -n 15
Name                   Objects    Alloc     Free   %Fast
names_cache                 64 11734829 11734830  99  99 
filp                      1195  8484074  8482982  90   3 
vm_area_struct            3830  7688583  7684900  92  54 
buffer_head              33970  3832771  3798977  94   0 
bio-0                     5906  2383929  2378119  91  13 
journal_handle            1360  2182580  2182580  99  99 

As a matter of fact, I got similar cache statistics with kbuild on different machines.
names_cache's object size is 4096; filp's and vm_area_struct's are 192 and 168.
names_cache's default order is 3, while the other active kmem_caches use order 0.
names_cache is used by getname=>__getname from sys_open/execve/faccessat, etc.
Although the kernel allocates a whole page for every names_cache object, mostly
it only uses a dozen or so bytes of each names_cache object.
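
To make that concrete, the path looks roughly like this (a simplified sketch
of the getname() path in fs/namei.c and include/linux/fs.h, not the exact code):

	/* each names_cache object is a full PATH_MAX (4096 byte) buffer */
	#define __getname()	kmem_cache_alloc(names_cachep, GFP_KERNEL)
	#define __putname(name)	kmem_cache_free(names_cachep, (void *)(name))

	/* every open/exec/stat-style lookup allocates one such object, but
	 * usually copies only a short path string into it */
	char *getname(const char __user *filename)
	{
		char *tmp = __getname();
		int retval;

		if (!tmp)
			return ERR_PTR(-ENOMEM);
		retval = do_getname(filename, tmp);	/* strncpy_from_user into the 4KB buffer */
		if (retval < 0) {
			__putname(tmp);
			return ERR_PTR(retval);
		}
		return tmp;
	}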

With kernel 2.6.29-rc2-slqb0121 (slqb patch taken from Pekka's git tree):
Thu Feb  5 15:50:24 CST 2009 2.6.29-rc2slqb0121stat x86_64
[ymzhang@lkp-st02-x8664 Hugh]$ build1   144.15  91.70   32.99
build2  159.81  91.83   34.27
Thu Feb  5 15:53:09 CST 2009: 165 secs for 1 iters, 165 secs per iter
build1  123.02  90.29   33.08
build2  204.52  90.28   34.17
Thu Feb  5 15:56:39 CST 2009: 375 secs for 2 iters, 187 secs per iter
build1  132.74  90.60   33.45
build2  210.11  90.80   33.98
Thu Feb  5 16:00:15 CST 2009: 591 secs for 3 iters, 197 secs per iter
build1  135.34  90.71   32.95
build2  220.43  91.55   33.99
Thu Feb  5 16:04:00 CST 2009: 816 secs for 4 iters, 204 secs per iter
build1  121.68  91.09   33.26
build2  202.45  91.01   34.37
Thu Feb  5 16:07:30 CST 2009: 1026 secs for 5 iters, 205 secs per iter
build1  120.51  90.19   33.42
build2  217.56  90.38   34.18
Thu Feb  5 16:11:13 CST 2009: 1249 secs for 6 iters, 208 secs per iter
build1  137.14  90.33   34.54
build2  243.14  90.93   34.33
Thu Feb  5 16:15:22 CST 2009: 1498 secs for 7 iters, 214 secs per iter
build1  141.47  91.14   33.42
build2  249.78  91.57   34.10
Thu Feb  5 16:19:37 CST 2009: 1753 secs for 8 iters, 219 secs per iter
build1  147.72  90.42   34.04
build2  252.57  90.91   33.73
Thu Feb  5 16:23:58 CST 2009: 2014 secs for 9 iters, 223 secs per iter
build1  137.40  89.80   33.99
build2  248.67  91.18   34.03
Thu Feb  5 16:28:13 CST 2009: 2269 secs for 10 iters, 226 secs per iter


With kernel 2.6.29-rc2-slubstat (default slub_max_order):
[ymzhang@lkp-st02-x8664 Hugh]$ sh tmpfs_swap.sh
Thu Feb  5 13:21:37 CST 2009 2.6.29-rc2slubstat x86_64
[ymzhang@lkp-st02-x8664 Hugh]$ build1   155.54  91.90   33.56
build2  163.86  91.69   34.52
Thu Feb  5 13:24:30 CST 2009: 173 secs for 1 iters, 173 secs per iter
build1  135.63  90.42   33.88
build2  308.88  91.63   34.71
Thu Feb  5 13:29:57 CST 2009: 500 secs for 2 iters, 250 secs per iter
build1  127.49  90.79   33.24
ymzhang  28382  4079  0 13:29 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  414.77  91.58   34.01
Thu Feb  5 13:37:05 CST 2009: 928 secs for 3 iters, 309 secs per iter
build1  146.99  91.07   33.59
ymzhang  24569  4079  0 13:37 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  505.73  93.01   34.12
Thu Feb  5 13:45:46 CST 2009: 1449 secs for 4 iters, 362 secs per iter
build1  163.20  91.35   34.39
ymzhang  20830  4079  0 13:45 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2

The 'tar xfj' line is a sign of whether build2's untar has finished by the time build1 finishes compiling.
The above results show that from iteration 3 onwards, build2's untar isn't finished although build1
has already finished compiling. So the build1 result seems quite stable while the build2 result keeps growing.

Compared with slqb, the result is bad.


With kernel 2.6.29-rc2-slubstat (slub_max_order=1, so names_cache's order is 1):
[ymzhang@lkp-st02-x8664 Hugh]$ sh tmpfs_swap.sh
Thu Feb  5 14:42:35 CST 2009 2.6.29-rc2slubstat x86_64
[ymzhang@lkp-st02-x8664 Hugh]$ build1   161.61  92.09   34.14
build2  167.92  91.78   34.38
Thu Feb  5 14:45:30 CST 2009: 175 secs for 1 iters, 174 secs per iter
build1  128.22  91.02   33.39
build2  236.95  90.59   34.45
Thu Feb  5 14:49:37 CST 2009: 422 secs for 2 iters, 211 secs per iter
build1  134.34  90.56   33.94
ymzhang  28297  4069  0 14:49 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  338.49  91.10   34.33
Thu Feb  5 14:55:27 CST 2009: 772 secs for 3 iters, 257 secs per iter
build1  144.50  90.63   34.00
ymzhang  24398  4069  0 14:55 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  415.44  91.32   34.29
Thu Feb  5 15:02:33 CST 2009: 1198 secs for 4 iters, 299 secs per iter
build1  137.31  91.03   33.80
ymzhang  20580  4069  0 15:02 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  399.31  91.88   34.31
Thu Feb  5 15:09:24 CST 2009: 1609 secs for 5 iters, 321 secs per iter
build1  147.69  91.39   33.98
ymzhang  16743  4069  0 15:09 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  397.33  91.72   34.52
Thu Feb  5 15:16:12 CST 2009: 2017 secs for 6 iters, 336 secs per iter
build1  149.65  91.28   33.65
ymzhang  12864  4069  0 15:16 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  469.35  91.78   34.15
Thu Feb  5 15:24:12 CST 2009: 2497 secs for 7 iters, 356 secs per iter
build1  138.36  90.66   34.03
ymzhang   9077  4069  0 15:24 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  498.02  91.39   34.60
Thu Feb  5 15:32:38 CST 2009: 3003 secs for 8 iters, 375 secs per iter

We see some improvement, but the improvement isn't big. The result is still worse than slqb's.


With kernel 2.6.29-rc2-slubstat (slub_max_order=0, names_cache order is 0):
[ymzhang@lkp-st02-x8664 Hugh]$ sh tmpfs_swap.sh
Thu Feb  5 13:59:02 CST 2009 2.6.29-rc2slubstat x86_64
[ymzhang@lkp-st02-x8664 Hugh]$ build1   170.00  92.26   33.63
build2  176.22  91.18   35.16
Thu Feb  5 14:02:04 CST 2009: 182 secs for 1 iters, 182 secs per iter
build1  136.31  90.58   33.98
build2  201.79  91.32   34.92
Thu Feb  5 14:05:31 CST 2009: 389 secs for 2 iters, 194 secs per iter
build1  114.12  91.03   33.86
build2  205.86  90.70   34.27
Thu Feb  5 14:09:02 CST 2009: 600 secs for 3 iters, 200 secs per iter
build1  131.26  90.63   35.46
build2  227.58  91.36   34.97
Thu Feb  5 14:12:56 CST 2009: 834 secs for 4 iters, 208 secs per iter
build1  151.93  90.47   35.87
build2  259.79  91.01   35.35
Thu Feb  5 14:17:21 CST 2009: 1099 secs for 5 iters, 219 secs per iter
build1  106.57  92.21   35.75
ymzhang  16139  4052  0 14:17 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  233.17  90.77   35.05
Thu Feb  5 14:21:19 CST 2009: 1337 secs for 6 iters, 222 secs per iter
build1  139.56  90.82   33.61
build2  214.44  91.87   34.43
Thu Feb  5 14:25:02 CST 2009: 1560 secs for 7 iters, 222 secs per iter
build1  124.91  90.98   34.30
build2  214.43  91.79   34.35
Thu Feb  5 14:28:44 CST 2009: 1782 secs for 8 iters, 222 secs per iter
build1  134.76  90.80   33.59
build2  239.88  91.81   34.45
Thu Feb  5 14:32:48 CST 2009: 2026 secs for 9 iters, 225 secs per iter
build1  141.23  90.98   33.74
build2  250.96  91.72   34.20
Thu Feb  5 14:37:06 CST 2009: 2284 secs for 10 iters, 228 secs per iter


I repeated the testing and the results fluctuate. I would consider the result
of slub (slub_max_order=0) to be equal to slqb's.

Another test is to start 2 parallel build1 runs. slub (default order) seems to
have a 17% regression against slqb. With slub_max_order=1, slub is ok.




^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-05 19:04                     ` Hugh Dickins
@ 2009-02-06  8:57                       ` Pekka Enberg
  -1 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-02-06  8:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Zhang, Yanmin, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

Hi Hugh,

On Thu, 2009-02-05 at 19:04 +0000, Hugh Dickins wrote:
> I then tried a patch I thought obviously better than yours: just mask
> off __GFP_WAIT in that __GFP_NOWARN|__GFP_NORETRY preliminary call to
> alloc_slab_page(): so we're not trying to infer anything about high-
> order availability from the number of free order-0 pages, but actually
> going to look for it and taking it if it's free, forgetting it if not.
> 
> That didn't work well at all: almost as bad as the unmodified slub.c.
> I decided that was due to __alloc_pages_internal()'s
> wakeup_kswapd(zone, order): just expressing an interest in a high-
> order page was enough to send it off trying to reclaim them, though
> not directly.  Hacked in a condition to suppress that in this case:
> worked a lot better, but not nearly as well as yours.  I supposed
> that was somehow(?) due to the subsequent get_page_from_freelist()
> calls with different watermarking: hacked in another __GFP flag to
> break out to nopage just like the NUMA_BUILD GFP_THISNODE case does.
> Much better, getting close, but still not as good as yours.  

Did you look at it with oprofile? One thing to keep in mind is that if
there are 4K allocations going on, your approach will get double the
overhead of page allocations (which can be substantial performance hit
for slab).

			Pekka


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-06  8:57                       ` Pekka Enberg
@ 2009-02-06 12:33                         ` Hugh Dickins
  -1 siblings, 0 replies; 197+ messages in thread
From: Hugh Dickins @ 2009-02-06 12:33 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Zhang, Yanmin, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

On Fri, 6 Feb 2009, Pekka Enberg wrote:
> On Thu, 2009-02-05 at 19:04 +0000, Hugh Dickins wrote:
> > I then tried a patch I thought obviously better than yours: just mask
> > off __GFP_WAIT in that __GFP_NOWARN|__GFP_NORETRY preliminary call to
> > alloc_slab_page(): so we're not trying to infer anything about high-
> > order availability from the number of free order-0 pages, but actually
> > going to look for it and taking it if it's free, forgetting it if not.
> > 
> > That didn't work well at all: almost as bad as the unmodified slub.c.
> > I decided that was due to __alloc_pages_internal()'s
> > wakeup_kswapd(zone, order): just expressing an interest in a high-
> > order page was enough to send it off trying to reclaim them, though
> > not directly.  Hacked in a condition to suppress that in this case:
> > worked a lot better, but not nearly as well as yours.  I supposed
> > that was somehow(?) due to the subsequent get_page_from_freelist()
> > calls with different watermarking: hacked in another __GFP flag to
> > break out to nopage just like the NUMA_BUILD GFP_THISNODE case does.
> > Much better, getting close, but still not as good as yours.  
> 
> Did you look at it with oprofile?

No, I didn't.  I didn't say so, but again it was elapsed time that
I was focussing on, so I don't think oprofile would be relevant.
There are some differences in system time, of course, consistent
with your point; but they're generally an order of magnitude less,
so didn't excite my interest.

> One thing to keep in mind is that if
> there are 4K allocations going on, your approach will get double the
> overhead of page allocations (which can be substantial performance hit
> for slab).

Sure, and even the current allocate_slab() is inefficient in that
respect: I've followed it because I do for now have an interest in
the stats, but if stats are configured off then there's no point in
dividing it into two stages; and if they are really intended to be
ORDER_FALLBACK stats, then it shouldn't divide into two stages when
oo_order(s->oo) == oo_order(s->min).  On the other hand, I find it
interesting to see how often the __GFP_NORETRY fails, even when
the order is the same each time (and usually 0).
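
For the oo_order(s->oo) == oo_order(s->min) case, something like this is what
I mean (a sketch only, keeping the names used in the current slub.c):

	flags |= s->allocflags;

	if (oo_order(s->oo) == oo_order(s->min)) {
		/* nothing higher to fall back from: one attempt is enough */
		page = alloc_slab_page(flags, node, oo);
		if (!page)
			return NULL;
	} else {
		page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY,
				       node, oo);
		if (unlikely(!page)) {
			/* only this path is a real ORDER_FALLBACK */
			oo = s->min;
			page = alloc_slab_page(flags, node, oo);
			if (!page)
				return NULL;
			stat(get_cpu_slab(s, raw_smp_processor_id()),
			     ORDER_FALLBACK);
		}
	}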

Hugh

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-06 12:33                         ` Hugh Dickins
@ 2009-02-10  8:56                           ` Zhang, Yanmin
  -1 siblings, 0 replies; 197+ messages in thread
From: Zhang, Yanmin @ 2009-02-10  8:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

On Fri, 2009-02-06 at 12:33 +0000, Hugh Dickins wrote:
> On Fri, 6 Feb 2009, Pekka Enberg wrote:
> > On Thu, 2009-02-05 at 19:04 +0000, Hugh Dickins wrote:
> > > I then tried a patch I thought obviously better than yours: just mask
> > > off __GFP_WAIT in that __GFP_NOWARN|__GFP_NORETRY preliminary call to
> > > alloc_slab_page(): so we're not trying to infer anything about high-
> > > order availability from the number of free order-0 pages, but actually
> > > going to look for it and taking it if it's free, forgetting it if not.
> > > 
> > > That didn't work well at all: almost as bad as the unmodified slub.c.
> > > I decided that was due to __alloc_pages_internal()'s
> > > wakeup_kswapd(zone, order): just expressing an interest in a high-
> > > order page was enough to send it off trying to reclaim them, though
> > > not directly.  Hacked in a condition to suppress that in this case:
> > > worked a lot better, but not nearly as well as yours.  I supposed
> > > that was somehow(?) due to the subsequent get_page_from_freelist()
> > > calls with different watermarking: hacked in another __GFP flag to
> > > break out to nopage just like the NUMA_BUILD GFP_THISNODE case does.
> > > Much better, getting close, but still not as good as yours.  
I did a similar hack. get_page_from_freelist, wakeup_kswapd, try_to_free_pages,
and drain_all_pages consume time. If I disable them one by one, I see the result
improve gradually.

> > 
> > Did you look at it with oprofile?
> 
> No, I didn't.  I didn't say so, but again it was elapsed time that
> I was focussing on, so I don't think oprofile would be relevant.
The vmstat data varies a lot while the test runs. The original test case
consists of 2 kbuild tasks, and sometimes the 2 tasks almost run serially
because it takes a long time to untar the kernel source tarball on the loop ext2
fs. So it's not appropriate to collect oprofile data.

I changed the script to run 2 tasks on tmpfs without the loop ext2 device.
The result difference between slub_max_order=0 and the default order is about 25%.
When the kernel build is running, vmstat sys time is about 4%~10% on my
2 quad-core processor stoakley machine. io-wait is mostly 40%~80%. I collected the
oprofile data. Mostly, only free_pages_bulk seems a little abnormal: with the
default order, free_pages_bulk is more than 1%, while with slub_max_order=0 it is
0.23%. When I change the total memory quantity, the free_pages_bulk difference
between slub_max_order=0 and the default order stays about 1%.


> There are some differences in system time, of course, consistent
> with your point; but they're generally an order of magnitude less,
> so didn't excite my interest.
> 
> > One thing to keep in mind is that if
> > there are 4K allocations going on, your approach will get double the
> > overhead of page allocations (which can be substantial performance hit
> > for slab).
> 
> Sure, and even the current allocate_slab() is inefficient in that
> respect: I've followed it because I do for now have an interest in
> the stats, but if stats are configured off then there's no point in
> dividing it into two stages; and if they are really intended to be
> ORDER_FALLBACK stats, then it shouldn't divide into two stages when
> oo_order(s->oo) == oo_order(s->min).
You are right in theory. In a real environment, the order is mostly 0
when oo_order(s->oo) == oo_order(s->min), and order-0 page allocation almost
never fails even with the __GFP_NORETRY flag. When the default order isn't 0,
oo_order(s->oo) mostly isn't equal to oo_order(s->min).

>   On the other hand, I find it
> interesting to see how often the __GFP_NORETRY fails, even when
> the order is the same each time (and usually 0).



^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-04 20:09                                                     ` Christoph Lameter
@ 2009-02-05  3:18                                                       ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-02-05  3:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Nick Piggin, Zhang, Yanmin, Lin Ming, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Thursday 05 February 2009 07:09:15 Christoph Lameter wrote:
> On Wed, 4 Feb 2009, Nick Piggin wrote:
> > That's very true, and we touched on this earlier. It is I guess
> > you can say a downside of queueing. But an analogous situation
> > in SLUB would be that lots of pages on the partial list with
> > very few free objects, or freeing objects to pages with few
> > objects in them. Basically SLUB will have to do the extra work
> > in the fastpath.
>
> But these are pages with mostly allocated objects and just a few objects
> free. The SLAB case is far worse: You have N objects on a queue and they
> are keeping possibly N pages away from the page allocator and in those
> pages *nothing* is used.

Periodic queue trimming should prevent this from becoming a big problem.
It will trim away those objects, and so subsequent allocations will come
from new pages and be densely packed. I don't think I've seen a problem
in SLAB reported from this phenomenon, so I'm not too concerned about it
at the moment.



^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-04 20:10                                                 ` Christoph Lameter
@ 2009-02-05  3:14                                                   ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-02-05  3:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Nick Piggin, Zhang, Yanmin, Lin Ming, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Thursday 05 February 2009 07:10:31 Christoph Lameter wrote:
> On Tue, 3 Feb 2009, Pekka Enberg wrote:
> > Well, the slab_hiwater() check in __slab_free() of mm/slqb.c will cap
> > the size of the queue. But we do the same thing in SLAB with
> > alien->limit in cache_free_alien() and ac->limit in __cache_free(). So
> > I'm not sure what you mean when you say that the queues will "grow
> > unconstrained" (in either of the allocators). Hmm?
>
> Nick said he wanted to defer queue processing. If the water marks are
> checked and queue processing run then of course queue processing is not
> deferred and the queue does not build up further.

I don't think I ever said anything as ambiguous as "queue processing".
This subthread was started by your concern about periodic queue trimming,
and I was definitely talking about the possibility of deferring *that*.



^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-03 18:42                                               ` Pekka Enberg
@ 2009-02-04 20:10                                                 ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-02-04 20:10 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Nick Piggin, Zhang, Yanmin, Lin Ming, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Tue, 3 Feb 2009, Pekka Enberg wrote:

> Well, the slab_hiwater() check in __slab_free() of mm/slqb.c will cap
> the size of the queue. But we do the same thing in SLAB with
> alien->limit in cache_free_alien() and ac->limit in __cache_free(). So
> I'm not sure what you mean when you say that the queues will "grow
> unconstrained" (in either of the allocators). Hmm?

Nick said he wanted to defer queue processing. If the water marks are
checked and queue processing run then of course queue processing is not
deferred and the queue does not build up further.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-04  4:22                                                   ` Nick Piggin
@ 2009-02-04 20:09                                                     ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-02-04 20:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Nick Piggin, Zhang, Yanmin, Lin Ming, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Wed, 4 Feb 2009, Nick Piggin wrote:

> That's very true, and we touched on this earlier. It is I guess
> you can say a downside of queueing. But an analogous situation
> in SLUB would be that lots of pages on the partial list with
> very few free objects, or freeing objects to pages with few
> objects in them. Basically SLUB will have to do the extra work
> in the fastpath.

But these are pages with mostly allocated objects and just a few objects
free. The SLAB case is far worse: You have N objects on a queue and they
are keeping possibly N pages away from the page allocator and in those
pages *nothing* is used.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-03 18:47                                                 ` Pekka Enberg
@ 2009-02-04  4:22                                                   ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-02-04  4:22 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Nick Piggin, Zhang, Yanmin, Lin Ming,
	linux-mm, linux-kernel, Andrew Morton, Linus Torvalds

On Wednesday 04 February 2009 05:47:48 Pekka Enberg wrote:
> On Tue, Feb 3, 2009 at 8:42 PM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> >> It will grow unconstrained if you elect to defer queue processing. That
> >> was what we discussed.
> >
> > Well, the slab_hiwater() check in __slab_free() of mm/slqb.c will cap
> > the size of the queue. But we do the same thing in SLAB with
> > alien->limit in cache_free_alien() and ac->limit in __cache_free(). So
> > I'm not sure what you mean when you say that the queues will "grow
> > unconstrained" (in either of the allocators). Hmm?
>
> That said, I can imagine a worst-case scenario where a queue with N
> objects is pinning N mostly empty slabs. As soon as we hit the
> periodical flush, we might need to do tons of work. That's pretty hard
> to control with watermarks as well as the scenario is solely dependent
> on allocation/free patterns.

That's very true, and we touched on this earlier. It is I guess
you can say a downside of queueing. But an analogous situation
in SLUB would be that lots of pages on the partial list with
very few free objects, or freeing objects to pages with few
objects in them. Basically SLUB will have to do the extra work
in the fastpath.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-03 17:33                                             ` Christoph Lameter
@ 2009-02-04  4:07                                               ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-02-04  4:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Wednesday 04 February 2009 04:33:14 Christoph Lameter wrote:
> On Tue, 3 Feb 2009, Nick Piggin wrote:
> > Quite obviously it should. Behaviour of a slab allocation on behalf of
> > some task constrained within a given node should not depend on the task
> > which has previously run on this CPU and made some allocations. Surely
> > you can see this behaviour is not nice.
>
> If you want cache hot objects then it's better to use what a prior task
> has used. This opportunistic use is only done if the task is not asking
> for memory from a specific node. There is another tradeoff here.
>
> SLAB's method there is to ignore all caching advantages even if the task
> did not ask for memory from a specific node. So it gets cache cold objects
> and if the node to allocate from is remote then it always must use the slow
> path.

Yeah, but I don't think you actually demonstrated any real advantages
to it, and there are obvious failure modes where constraints aren't
obeyed, so I'm going to leave it as-is in SLQB.

Objects where cache hotness tends to be most important are the shorter
lived ones, and objects where constraints matter are longer lived ones,
so I think this is pretty reasonable.

Also, you've just been spending lots of time arguing that cache hotness
is not so important (because SLUB doesn't do LIFO like SLAB and SLQB).


> > > Which have similar issues since memory policy application is depending
> > > on a task policy and on memory migration that has been applied to an
> > > address range.
> >
> > What similar issues? If a task asks to have slab allocations constrained
> > to node 0 and SLUB hands out objects from other nodes, then that's bad.
>
> Of course. A task can ask to have allocations from node 0 and it will get
> the object from node 0. But if the task does not care to ask for data
> from a specific node then it can be satisfied from the cpu slab which
> contains cache hot objects.

But if it is using constrained allocations, then it is also asking for
allocations from node 0.


> > > > But that is wrong. The lists obviously have high water marks that
> > > > get trimmed down. Periodic trimming as I keep saying basically is
> > > > already so infrequent that it is irrelevant (millions of objects
> > > > per cpu can be allocated anyway between existing trimming interval)
> > >
> > > Trimming through water marks and allocating memory from the page
> > > allocator is going to be very frequent if you continually allocate on
> > > one processor and free on another.
> >
> > Um yes, that's the point. But you previously claimed that it would just
> > grow unconstrained. Which is obviously wrong. So I don't understand what
> > your point is.
>
> It will grow unconstrained if you elect to defer queue processing. That
> was what we discussed.

And I just keep pointing out that you are wrong (this must be the 4th time).

We were talking about deferring the periodic queue reaping. SLQB will still
constrain the queue sizes to the high watermarks.
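
To make the distinction concrete, here is a rough standalone C model of
what I mean (this is not mm/slqb.c; the struct, the names and the flush
helper are all made up for illustration):

struct qlist {
        void *head;             /* LIFO freelist threaded through the objects */
        unsigned int nr;        /* objects currently queued */
        unsigned int hiwater;   /* high watermark */
        unsigned int batch;     /* how much to flush when the watermark is hit */
};

/* stand-in for handing 'count' objects back to their slab pages;
 * here it just unlinks them from the queue */
static void flush_to_pages(struct qlist *q, unsigned int count)
{
        while (count-- && q->head) {
                void *obj = q->head;
                q->head = *(void **)obj;
                q->nr--;
        }
}

/* free path: the watermark is enforced right here, on every free */
static void q_free(struct qlist *q, void *object)
{
        if (q->nr >= q->hiwater)
                flush_to_pages(q, q->batch);
        *(void **)object = q->head;     /* push the object, LIFO */
        q->head = object;
        q->nr++;
}

/* periodic reap: only returns memory that has gone idle; deferring it
 * delays giving pages back, but it does not lift the cap that q_free()
 * enforces above */
static void q_reap(struct qlist *q)
{
        flush_to_pages(q, q->nr);
}

The hiwater check in q_free() is what bounds the queue; q_reap() only
affects how long idle memory is held on it.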


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-03 18:42                                               ` Pekka Enberg
@ 2009-02-03 18:47                                                 ` Pekka Enberg
  -1 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-02-03 18:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Nick Piggin, Zhang, Yanmin, Lin Ming, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Tue, Feb 3, 2009 at 8:42 PM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
>> It will grow unconstrained if you elect to defer queue processing. That
>> was what we discussed.
>
> Well, the slab_hiwater() check in __slab_free() of mm/slqb.c will cap
> the size of the queue. But we do the same thing in SLAB with
> alien->limit in cache_free_alien() and ac->limit in __cache_free(). So
> I'm not sure what you mean when you say that the queues will "grow
> unconstrained" (in either of the allocators). Hmm?

That said, I can imagine a worst-case scenario where a queue with N
objects is pinning N mostly empty slabs. As soon as we hit the
periodic flush, we might need to do tons of work. That's pretty hard
to control with watermarks as well, since the scenario depends solely
on allocation/free patterns.
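
To put some purely illustrative numbers on it (made-up figures, not
measurements): with 4 KB slab pages and a per-CPU high-water mark of,
say, 1024 objects, that pathological pattern can pin up to 1024 nearly
empty pages, i.e. about 4 MB per CPU, even though the queued objects
themselves (for small object sizes) only add up to a few tens of KB.
All of that then has to be handed back in one go when the flush runs.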

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-03 17:33                                             ` Christoph Lameter
@ 2009-02-03 18:42                                               ` Pekka Enberg
  -1 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-02-03 18:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Nick Piggin, Zhang, Yanmin, Lin Ming, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

Hi Christoph,

On Tue, Feb 3, 2009 at 7:33 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
>> > Trimming through water marks and allocating memory from the page allocator
>> > is going to be very frequent if you continually allocate on one processor
>> > and free on another.
>>
>> Um yes, that's the point. But you previously claimed that it would just
>> grow unconstrained. Which is obviously wrong. So I don't understand what
>> your point is.
>
> It will grow unconstrained if you elect to defer queue processing. That
> was what we discussed.

Well, the slab_hiwater() check in __slab_free() of mm/slqb.c will cap
the size of the queue. But we do the same thing in SLAB with
alien->limit in cache_free_alien() and ac->limit in __cache_free(). So
I'm not sure what you mean when you say that the queues will "grow
unconstrained" (in either of the allocators). Hmm?

                               Pekka

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-02-03  1:53                                           ` Nick Piggin
@ 2009-02-03 17:33                                             ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-02-03 17:33 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nick Piggin, Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Tue, 3 Feb 2009, Nick Piggin wrote:

> Quite obviously it should. Behaviour of a slab allocation on behalf of
> some task constrained within a given node should not depend on the task
> which has previously run on this CPU and made some allocations. Surely
> you can see this behaviour is not nice.

If you want cache hot objects then it's better to use what a prior task
has used. This opportunistic use is only done if the task is not asking
for memory from a specific node. There is another tradeoff here.

SLAB's method there is to ignore all caching advantages even if the task
did not ask for memory from a specific node. So it gets cache cold objects
and if the node to allocate from is remote then it always must use the slow
path.
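
Roughly, the tradeoff in C, purely as an illustration (this is not SLUB
source; every name below is a stand-in):

#define ANY_NODE (-1)   /* caller did not ask for a specific node */

struct cpu_slab {
        int node;       /* node of the page backing the current cpu slab */
        void *freelist; /* free objects in that page */
};

static void *pop_object(void **list)
{
        void *obj = *list;

        *list = *(void **)obj;
        return obj;
}

static void *slowpath_alloc(int node)
{
        (void)node;     /* stand-in: the real thing gets a page on 'node' */
        return 0;
}

static void *alloc_obj(struct cpu_slab *c, int node)
{
        /*
         * No node requested: opportunistically hand out an object from
         * the cpu slab even if its page is remote, since it is likely
         * cache hot.  Node requested: only use the cpu slab if it
         * matches, otherwise take the slow path that allocates on 'node'.
         */
        if (c->freelist && (node == ANY_NODE || node == c->node))
                return pop_object(&c->freelist);
        return slowpath_alloc(node);
}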

> > Which have similar issues since memory policy application is depending on
> > a task policy and on memory migration that has been applied to an address
> > range.
>
> What similar issues? If a task asks to have slab allocations constrained
> to node 0 and SLUB hands out objects from other nodes, then that's bad.

Of course. A task can ask to have allocations from node 0 and it will get
the object from node 0. But if the task does not care to ask for data
from a specific node then it can be satisfied from the cpu slab which
contains cache hot objects.

> > > But that is wrong. The lists obviously have high water marks that
> > > get trimmed down. Periodic trimming as I keep saying basically is
> > > already so infrequent that it is irrelevant (millions of objects
> > > per cpu can be allocated anyway between existing trimming interval)
> >
> > Trimming through water marks and allocating memory from the page allocator
> > is going to be very frequent if you continually allocate on one processor
> > and free on another.
>
> Um yes, that's the point. But you previously claimed that it would just
> grow unconstrained. Which is obviously wrong. So I don't understand what
> your point is.

It will grow unconstrained if you elect to defer queue processing. That
was what we discussed.

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-26 17:28                                         ` Christoph Lameter
@ 2009-02-03  1:53                                           ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-02-03  1:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Tuesday 27 January 2009 04:28:03 Christoph Lameter wrote:
> On Fri, 23 Jan 2009, Nick Piggin wrote:
> > According to memory policies, a task's memory policy is supposed to
> > apply to its slab allocations too.
>
> It does apply to slab allocations. The question is whether it has to apply
> to every object allocation or to every page allocation of the slab
> allocators.

Quite obviously it should. Behaviour of a slab allocation on behalf of
some task constrained within a given node should not depend on the task
which has previously run on this CPU and made some allocations. Surely
you can see this behaviour is not nice.


> > > Memory policies are applied in a fuzzy way anyways. A context switch
> > > can result in page allocation action that changes the expected
> > > interleave pattern. Page populations in an address space depend on the
> > > task policy. So the exact policy applied to a page depends on the task.
> > > This isn't an exact thing.
> >
> > There are other memory policies than just interleave though.
>
> Which have similar issues since memory policy application is depending on
> a task policy and on memory migration that has been applied to an address
> range.

What similar issues? If a task asks to have slab allocations constrained
to node 0 and SLUB hands out objects from other nodes, then that's bad.


> > But that is wrong. The lists obviously have high water marks that
> > get trimmed down. Periodic trimming as I keep saying basically is
> > already so infrequent that it is irrelevant (millions of objects
> > per cpu can be allocated anyway between existing trimming interval)
>
> Trimming through water marks and allocating memory from the page allocator
> is going to be very frequent if you continually allocate on one processor
> and free on another.

Um yes, that's the point. But you previously claimed that it would just
grow unconstrained. Which is obviously wrong. So I don't understand what
your point is.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-26 17:34                                   ` Christoph Lameter
@ 2009-02-03  1:48                                     ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-02-03  1:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Tuesday 27 January 2009 04:34:21 Christoph Lameter wrote:
> On Fri, 23 Jan 2009, Nick Piggin wrote:
> > > SLUB can directly free an object to any slab page. "Queuing" on free
> > > via the per cpu slab is only possible if the object came from that per
> > > cpu slab. This is typically only the case for objects that were
> > > recently allocated.
> >
> > Ah yes ok that's right. But then you don't get LIFO allocation
> > behaviour for those cases.
>
> But you get more TLB local allocations.

Not necessarily at all. Because when the "active" page runs out, you've
lost all the LIFO information about objects with active caches and TLBs.


> > Yes you can lose track of caching hot objects. That is one of the
> > > concerns with the SLUB approach. On the other hand: Caching
> > > architectures get more and more complex these days (especially in a
> > > NUMA system). The
> >
> > Because it is more important to get good cache behaviour.
>
> It's going to be quite difficult to realize an algorithm that guesstimates what
> information the processor keeps in its caches. The situation is quite
> complex in NUMA systems.

LIFO is fine.


> > So I think it is wrong to say it requires more metadata handling. SLUB
> > will have to switch pages more often or free objects to pages other than
> > the "fast" page (what do you call it?), so quite often I think you'll
> > find SLUB has just as much if not more metadata handling.
>
> Its the per cpu slab. SLUB does not switch pages often but frees objects
> not from the per cpu slab directly with minimal overhead compared to a per
> cpu slab free. The overhead is much less than the SLAB slowpath which has
> to be taken for alien caches etc.

But the slab allocator isn't just about allocating. It is also about
freeing. And you can be switching pages frequently in the freeing path.
And depending on allocation patterns, it can still be quite frequent
in the allocation path too (and even if you have gigantic pages, they
can still get mostly filled up, which reduces your queue size and
increases the rate of switching between them).


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-26 17:46                                     ` Christoph Lameter
@ 2009-02-03  1:42                                       ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-02-03  1:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Tuesday 27 January 2009 04:46:49 Christoph Lameter wrote:
> On Sat, 24 Jan 2009, Nick Piggin wrote:
> > > > SLUB can directly free an object to any slab page. "Queuing" on free
> > > > via the per cpu slab is only possible if the object came from that
> > > > per cpu slab. This is typically only the case for objects that were
> > > > recently allocated.
> > >
> > > Ah yes ok that's right. But then you don't get LIFO allocation
> > > behaviour for those cases.
> >
> > And actually really this all just stems from conceptually in fact you
> > _do_ switch to a different queue (from the one being allocated from)
> > to free the object if it is on a different page. Because you have a
> > set of queues (a queue per-page). So freeing to a different queue is
> > where you lose LIFO property.
>
> Yes you basically go for locality instead of LIFO if the free does not hit
> the per cpu slab. If the object is not in the per cpu slab then it is
> likely that it had a long lifetime and thus LIFOness does not matter
> too much. It is likely that many objects from that slab are going to be
> freed at the same time. So the first free warms up the "queue" of the page
> you are freeing to.

I don't really understand this. It is easy to lose cache hotness information.
Free two objects from different pages. The first one to be freed is likely
to be cache hot, but it will not be allocated again (any time soon).


> This is an increasingly important feature since memory chips prefer
> allocations next to each other. Same page accesses are faster
> in recent memory subsystems than random accesses across memory.

DRAM chips? How about avoiding the problem and keeping the objects in cache
so you don't have to go to RAM?


> LIFO used
> to be better but we are increasingly getting into locality of access being
> very important for access.

Locality of access includes temporal locality. Which is very important. Which
SLUB doesn't do as well at.


> Especially with the NUMA characteristics of the
> existing AMD and upcoming Nehalem processors this will become much more
> important.

Can you demonstrate that LIFO used to be better but no longer is? What
NUMA characteristics are you talking about?


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 17:09                                   ` Nick Piggin
@ 2009-01-26 17:46                                     ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-26 17:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nick Piggin, Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Sat, 24 Jan 2009, Nick Piggin wrote:

> > > SLUB can directly free an object to any slab page. "Queuing" on free via
> > > the per cpu slab is only possible if the object came from that per cpu
> > > slab. This is typically only the case for objects that were recently
> > > allocated.
> >
> > Ah yes ok that's right. But then you don't get LIFO allocation
> > behaviour for those cases.
>
> And actually really this all just stems from conceptually in fact you
> _do_ switch to a different queue (from the one being allocated from)
> to free the object if it is on a different page. Because you have a
> set of queues (a queue per-page). So freeing to a different queue is
> where you lose LIFO property.

Yes you basically go for locality instead of LIFO if the free does not hit
the per cpu slab. If the object is not in the per cpu slab then it is
likely that it had a long lifetime and thus LIFOness does not matter
too much. It is likely that many objects from that slab are going to be
freed at the same time. So the first free warms up the "queue" of the page
you are freeing to.

This is an increasingly important feature since memory chips prefer
allocations next to each other. Same page accesses are faster
in recent memory subsystems than random accesses across memory. LIFO used
to be better, but we are increasingly getting into a situation where locality
of access is very important. Especially with the NUMA characteristics of the
existing AMD and upcoming Nehalem processors this will become much more
important.

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 16:10                                 ` Nick Piggin
@ 2009-01-26 17:34                                   ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-26 17:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Fri, 23 Jan 2009, Nick Piggin wrote:

> > SLUB can directly free an object to any slab page. "Queuing" on free via
> > the per cpu slab is only possible if the object came from that per cpu
> > slab. This is typically only the case for objects that were recently
> > allocated.
>
> Ah yes ok that's right. But then you don't get LIFO allocation
> behaviour for those cases.

But you get more TLB local allocations.

> > > hot objects when you switch to different "fast" pages. I don't consider
> > > this to be "queueing done right".
> >
> Yes you can lose track of caching hot objects. That is one of the
> > concerns with the SLUB approach. On the other hand: Caching architectures
> > get more and more complex these days (especially in a NUMA system). The
>
> Because it is more important to get good cache behaviour.

It's going to be quite difficult to realize an algorithm that guesstimates what
information the processor keeps in its caches. The situation is quite
complex in NUMA systems.

> So I think it is wrong to say it requires more metadata handling. SLUB
> will have to switch pages more often or free objects to pages other than
> the "fast" page (what do you call it?), so quite often I think you'll
> find SLUB has just as much if not more metadata handling.

It's the per cpu slab. SLUB does not switch pages often; it directly frees
objects that are not from the per cpu slab, with minimal overhead compared
to a per cpu slab free. The overhead is much less than the SLAB slowpath
which has to be taken for alien caches etc.
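
As a rough illustration of the shape of that free path (not actual SLUB
code; the page type and the helpers below are stand-ins):

struct slab_page;       /* opaque here */

/* stand-ins, assumed to exist elsewhere */
extern struct slab_page *page_of(const void *obj);      /* virt_to_page()-style arithmetic */
extern struct slab_page *current_cpu_slab(void);
extern void push_cpu_freelist(void *obj);
extern void push_page_freelist(struct slab_page *page, void *obj);

static void free_obj(void *obj)
{
        struct slab_page *page = page_of(obj);

        if (page == current_cpu_slab())
                push_cpu_freelist(obj);         /* fast: object belongs to the cpu slab */
        else
                push_page_freelist(page, obj);  /* otherwise free straight to its own page;
                                                   no node lookup, no per-node queue */
}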


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 15:53                                       ` Nick Piggin
@ 2009-01-26 17:28                                         ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-26 17:28 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Fri, 23 Jan 2009, Nick Piggin wrote:

> According to memory policies, a task's memory policy is supposed to
> apply to its slab allocations too.

It does apply to slab allocations. The question is whether it has to apply
to every object allocation or to every page allocation of the slab
allocators.
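
Purely to illustrate the difference (this is not code from any of the
allocators; policy_pick_node() and the other helpers are assumed
stand-ins):

/* stand-ins for the real machinery */
extern int policy_pick_node(void);      /* node the task's mempolicy would choose */
extern int cpu_queue_has_objects(void);
extern void refill_cpu_queue_from_node(int node);
extern void *take_from_cpu_queue(void);
extern void *take_from_node(int node);

/* (a) policy applied per page allocation: the mempolicy is consulted
 *     only when a new slab page has to be obtained; objects are then
 *     handed out from the cpu queue/slab no matter which task asks */
static void *alloc_policy_per_page(void)
{
        if (!cpu_queue_has_objects())
                refill_cpu_queue_from_node(policy_pick_node());
        return take_from_cpu_queue();
}

/* (b) policy applied per object allocation: the mempolicy is consulted
 *     on every single allocation, so each object obeys the policy */
static void *alloc_policy_per_object(void)
{
        return take_from_node(policy_pick_node());
}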

> > Memory policies are applied in a fuzzy way anyways. A context switch can
> > result in page allocation action that changes the expected interleave
> > pattern. Page populations in an address space depend on the task policy.
> > So the exact policy applied to a page depends on the task. This isn't an
> > exact thing.
>
> There are other memory policies than just interleave though.

Which have similar issues, since memory policy application depends on
a task policy and on memory migration that has been applied to an address
range.

> > >  "the first cpu will consume more and more memory from the page allocator
> > >   whereas the second will build up huge per cpu lists"
> > >
> > > And this is wrong. There is another possible issue where every single
> > > object on the freelist might come from a different (and otherwise free)
> > > page, and thus eg 100 8 byte objects might consume 400K.
> > >
> > > That's not an invalid concern, but I think it will be quite rare, and
> > > the periodic queue trimming should naturally help this because it will
> > > cycle out those objects and if new allocations are needed, they will
> > > come from new pages which can be packed more densely.
> >
> > Well but you said that you would defer the trimming (due to latency
> > concerns). The longer you defer the larger the lists will get.
>
> But that is wrong. The lists obviously have high water marks that
> get trimmed down. Periodic trimming as I keep saying basically is
> already so infrequent that it is irrelevant (millions of objects
> per cpu can be allocated anyway between existing trimming interval)

Trimming through water marks and allocating memory from the page allocator
is going to be very frequent if you continually allocate on one processor
and free on another.

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 16:10                                 ` Nick Piggin
@ 2009-01-23 17:09                                   ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23 17:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Pekka Enberg, Zhang, Yanmin, Lin Ming,
	linux-mm, linux-kernel, Andrew Morton, Linus Torvalds

On Saturday 24 January 2009 03:10:17 Nick Piggin wrote:
> On Fri, Jan 23, 2009 at 10:52:43AM -0500, Christoph Lameter wrote:
> > On Fri, 23 Jan 2009, Nick Piggin wrote:
> > > > Typically we traverse lists of objects that are in the same slab
> > > > cache.
> > >
> > > Very often that is not the case. And the price you pay for that is that
> > > you have to drain and switch freelists whenever you encounter an object
> > > that is not on the same page.
> >
> > SLUB can directly free an object to any slab page. "Queuing" on free via
> > the per cpu slab is only possible if the object came from that per cpu
> > slab. This is typically only the case for objects that were recently
> > allocated.
>
> Ah yes ok that's right. But then you don't get LIFO allocation
> behaviour for those cases.

And really this all just stems from the fact that, conceptually, you
_do_ switch to a different queue (from the one being allocated from)
to free the object if it is on a different page, because you have a
set of queues (a queue per page). So freeing to a different queue is
where you lose the LIFO property.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 15:52                               ` Christoph Lameter
@ 2009-01-23 16:10                                 ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23 16:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Fri, Jan 23, 2009 at 10:52:43AM -0500, Christoph Lameter wrote:
> On Fri, 23 Jan 2009, Nick Piggin wrote:
> 
> > > Typically we traverse lists of objects that are in the same slab cache.
> >
> > Very often that is not the case. And the price you pay for that is that
> > you have to drain and switch freelists whenever you encounter an object
> > that is not on the same page.
> 
> SLUB can directly free an object to any slab page. "Queuing" on free via
> the per cpu slab is only possible if the object came from that per cpu
> slab. This is typically only the case for objects that were recently
> allocated.

Ah yes ok that's right. But then you don't get LIFO allocation
behaviour for those cases.

 
> > This gives your freelists a chaotic and unpredictable behaviour IMO in
> > a running system where pages succumb to fragmentation so your freelist
> > maximum sizes are limited. It also means you can lose track of cache
> > hot objects when you switch to different "fast" pages. I don't consider
> > this to be "queueing done right".
> 
> Yes you can lose track of caching hot objects. That is one of the
> concerns with the SLUB approach. On the other hand: Caching architectures
> get more and more complex these days (especially in a NUMA system). The

Because it is more important to get good cache behaviour. 


> SLAB approach is essentially trying to guess which objects are cache hot
> and queue them. Sometimes the queueing is advantageous (may be a reason
> that SLAB is better than SLUB in some cases). In other cases SLAB keeps
> objects on queues but the objects have become stale (context switch, slab
> unused for a while). Then it's no advantage anymore.

But those cases would be expected to be encountered if that slab
is not used as frequently, ergo it is less performance critical. And
the ones that are used frequently should be more likely to have recently
freed cache hot objects.


> > > If all objects are from the same page then you need not check
> > > the NUMA locality of any object on that queue.
> >
> > In SLAB and SLQB, all objects on the freelist are on the same node. So
> > tell me how same-page objects simplify NUMA handling?
> 
> F.e. On free you need to determine the node to find the right queue in
> SLAB. SLUB does not need to do that. It simply determines the page address
> and does not care about the node when freeing the object. It is irrelevant
> on which node the object sits.

OK, but how much does that help?

 
> Also on alloc: The per cpu slab can be from a foreign node. NUMA locality
> does only matter if the caller wants memory from a particular node. So
> cpus that have no local memory can still use the per cpu slabs to have
> fast allocations etc etc.

Yeah. In my experience I haven't needed to optimise this type of behaviour
yet, but other allocators could definitely do similar things to switch their
queues to different nodes.


> > > > And you found you have to increase the size of your pages because you
> > > > need bigger queues. (must we argue semantics? it is a list of free
> > > > objects)
> > >
> > > Right. That may be the case and its a similar tuning to what SLAB does.
> >
> > SLAB and SLQB doesn't need bigger pages to do that.
> 
> But they require more metadata handling because they need to manage lists
> of order-0 pages. metadata handling is reduced by orders of magnitude in
> SLUB.

SLQB's page lists typically get accessed e.g. 1% of the time (sometimes far
less, in other workloads more). So it is orders of magnitude removed
from the fastpath, which is handled by the freelist.
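
Schematically (an illustration of the split only, not the mm/slqb.c
code; refill_from_page_lists() is a stand-in for the slowpath):

struct cpu_list {
        void *freelist;         /* LIFO of free objects: this is the fastpath */
        unsigned int nr;
        /* partial-page lists, watermarks etc. live behind the slowpath */
};

extern void *refill_from_page_lists(struct cpu_list *l);        /* stand-in */

static void *alloc_fastpath(struct cpu_list *l)
{
        void *obj = l->freelist;

        if (obj) {                              /* the overwhelmingly common case */
                l->freelist = *(void **)obj;
                l->nr--;
                return obj;
        }
        return refill_from_page_lists(l);       /* rare: go touch the page lists */
}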

So I think it is wrong to say it requires more metadata handling. SLUB
will have to switch pages more often or free objects to pages other than
the "fast" page (what do you call it?), so quite often I think you'll
find SLUB has just as much if not more metadata handling.


^ permalink raw reply	[flat|nested] 197+ messages in thread

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 15:41                                     ` Christoph Lameter
@ 2009-01-23 15:53                                       ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23 15:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Fri, Jan 23, 2009 at 10:41:15AM -0500, Christoph Lameter wrote:
> On Fri, 23 Jan 2009, Nick Piggin wrote:
> 
> > > No it cannot because in SLUB objects must come from the same page.
> > > Multiple objects in a queue will only ever require a single page and not
> > > multiple like in SLAB.
> >
> > I don't know how that solves the problem. Task with memory policy A
> > allocates an object, which allocates the "fast" page with policy A
> > and allocates an object. Then context switch to task with memory
> > policy B which allocates another object, which is taken from the page
> > allocated with policy A. Right?
> 
> Correct. But this is only an issue if you think about policies applying to
> individual object allocations (like realized in SLAB). If policies only
> apply to pages (which is sufficient for balancing IMHO) then this is okay.

As memory policies are defined, a task's memory policy is supposed to
apply to its slab allocations too.

 
> > > (OK this doesn't give the wrong policy 100% of the time; I thought
> > there could have been a context switch race during page allocation
> > that would result in 100% incorrect, but anyway it could still be
> > significantly incorrect couldn't it?)
> 
> Memory policies are applied in a fuzzy way anyway. A context switch can
> result in page allocation action that changes the expected interleave
> pattern. Page populations in an address space depend on the task policy.
> So the exact policy applied to a page depends on the task. This isn't an
> exact thing.

There are other memory policies than just interleave though.

 
> >  "the first cpu will consume more and more memory from the page allocator
> >   whereas the second will build up huge per cpu lists"
> >
> > And this is wrong. There is another possible issue where every single
> > object on the freelist might come from a different (and otherwise free)
> > page, and thus eg 100 8 byte objects might consume 400K.
> >
> > That's not an invalid concern, but I think it will be quite rare, and
> > the periodic queue trimming should naturally help this because it will
> > cycle out those objects and if new allocations are needed, they will
> > come from new pages which can be packed more densely.
> 
> Well but you said that you would defer the trimming (due to latency
> concerns). The longer you defer the larger the lists will get.

But that is wrong. The lists obviously have high water marks that
get trimmed down. Periodic trimming, as I keep saying, is already
so infrequent that it is basically irrelevant (millions of objects
per CPU can be allocated anyway between existing trimming intervals).
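
In rough code, the bound works like this (a toy model with invented
names and numbers, not the real implementation):

struct object { struct object *next; };

struct cpu_list {
	struct object *freelist;
	unsigned int nr;		/* objects currently queued */
	unsigned int hiwater;		/* high water mark, e.g. a few hundred */
	unsigned int batch;		/* how many objects to return at once */
};

/* Stub: return 'count' objects from the list back to their slabs. */
static void flush_to_pages(struct cpu_list *l, unsigned int count)
{
	while (count-- && l->freelist) {
		l->freelist = l->freelist->next;
		l->nr--;
	}
}

static void queue_free(struct cpu_list *l, struct object *o)
{
	o->next = l->freelist;
	l->freelist = o;

	if (++l->nr > l->hiwater)	/* bounded even if the timer never fires */
		flush_to_pages(l, l->batch);
}

/* Periodic trimming only mops up lists that have gone idle; a busy CPU
 * hits the water mark long before the timer does. */
static void periodic_trim(struct cpu_list *l)
{
	if (l->nr)
		flush_to_pages(l, l->nr / 2);	/* how much to trim is a tunable detail */
}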

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23  4:17                             ` Nick Piggin
@ 2009-01-23 15:52                               ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-23 15:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Fri, 23 Jan 2009, Nick Piggin wrote:

> > > The thing IMO you forget with all these doomsday scenarios about SGI's peta
> > > scale systems is that no matter what you do, you can't avoid the fact that
> > > computing is about locality. Even if you totally take the TLB out of the
> > > equation, you still have the small detail of other caches. Code that jumps
> > > all over that 1024 TB of memory with no locality is going to suck regardless
> > > of what the kernel ever does, due to physical limitations of hardware.
> >
> > Typically we traverse lists of objects that are in the same slab cache.
>
> Very often that is not the case. And the price you pay for that is that
> you have to drain and switch freelists whenever you encounter an object
> that is not on the same page.

SLUB can directly free an object to any slab page. "Queuing" on free via
the per cpu slab is only possible if the object came from that per cpu
slab. This is typically only the case for objects that were recently
allocated.

There is no switching of queues because they do not exist in that form in
SLUB. We always determine the page address and put the object into the
freelist of that page. This also results in nice parallelism, since the lock is
not even CPU specific.
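
Schematically the free path is shaped like this (a userspace toy model,
not slub.c; page_of() here fakes the object-address-to-page lookup that
the real allocator does):

#include <pthread.h>
#include <stdint.h>

#define PAGE_SZ 4096UL			/* example page size */

struct slab_page {
	pthread_mutex_t lock;		/* per page, not per CPU */
	void *freelist;			/* head of this page's free objects */
};

/* Toy lookup: pretend the page header sits at the start of each aligned
 * page; the kernel derives struct page from the address instead. */
static struct slab_page *page_of(void *obj)
{
	return (struct slab_page *)((uintptr_t)obj & ~(PAGE_SZ - 1));
}

static void page_local_free(void *obj)
{
	struct slab_page *page = page_of(obj);	/* no node or queue lookup needed */

	pthread_mutex_lock(&page->lock);
	*(void **)obj = page->freelist;		/* link object into the page's list */
	page->freelist = obj;
	pthread_mutex_unlock(&page->lock);
}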

> This gives your freelists a chaotic and unpredictable behaviour IMO in
> a running system where pages succumb to fragmentation so your freelist
> maximum sizes are limited. It also means you can lose track of cache
> hot objects when you switch to different "fast" pages. I don't consider
> this to be "queueing done right".

Yes, you can lose track of cache-hot objects. That is one of the
concerns with the SLUB approach. On the other hand: Caching architectures
get more and more complex these days (especially in a NUMA system). The
SLAB approach is essentially trying to guess which objects are cache hot
and queue them. Sometimes the queueing is advantageous (may be a reason
that SLAB is better than SLUB in some cases). In other cases SLAB keeps
objects on queues but the objects have become stale (context switch, slab
unused for a while). Then it's no advantage anymore.

> > If all objects are from the same page then you need not check
> > the NUMA locality of any object on that queue.
>
> In SLAB and SLQB, all objects on the freelist are on the same node. So
> tell me how do same-page objects simplify NUMA handling?

F.e. On free you need to determine the node to find the right queue in
SLAB. SLUB does not need to do that. It simply determines the page address
and does not care about the node when freeing the object. It is irrelevant
on which node the object sits.
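
For contrast, a node-aware free in the SLAB/SLQB style has to look up the
node first before it can pick a queue. A toy sketch only, with made-up
names, just to show the extra step:

#define MAX_NODES 4

struct node_queue { void *freelist; };

static struct node_queue node_queues[MAX_NODES];

/* Stand-ins: the real code gets these from struct page and the scheduler. */
static int node_of(void *obj) { (void)obj; return 0; }
static int this_node(void) { return 0; }

static void remote_free(struct node_queue *q, void *obj)
{
	/* toy: a real allocator would take the remote node's lock here */
	*(void **)obj = q->freelist;
	q->freelist = obj;
}

static void node_aware_free(void *obj)
{
	int node = node_of(obj);	/* the lookup the per-page scheme skips */

	if (node == this_node()) {
		*(void **)obj = node_queues[node].freelist;	/* local queue */
		node_queues[node].freelist = obj;
	} else {
		remote_free(&node_queues[node], obj);		/* remote/alien path */
	}
}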

Also on alloc: The per cpu slab can be from a foreign node. NUMA locality
does only matter if the caller wants memory from a particular node. So
cpus that have no local memory can still use the per cpu slabs to have
fast allocations etc etc.

> > > And you found you have to increase the size of your pages because you
> > > need bigger queues. (must we argue semantics? it is a list of free
> > > objects)
> >
> > Right. That may be the case and it's a similar tuning to what SLAB does.
>
> SLAB and SLQB don't need bigger pages to do that.

But they require more metadata handling because they need to manage lists
of order-0 pages. metadata handling is reduced by orders of magnitude in
SLUB.

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 15:37                                           ` Pekka Enberg
@ 2009-01-23 15:42                                             ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-23 15:42 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Zhang, Yanmin, Nick Piggin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Fri, 23 Jan 2009, Pekka Enberg wrote:

> On Thu, 22 Jan 2009, Pekka Enberg wrote:
> > > That is, a list of pages that could be returned to the page allocator
> > > but are pooled in SLUB to avoid the page allocator overhead. Note that
> > > this will not help allocators that trigger page allocator pass-through.
>
> On Fri, 2009-01-23 at 10:32 -0500, Christoph Lameter wrote:
> > We use the partial list for that.
>
> Even if the slab is totally empty?

The MIN_PARTIAL thingy can keep pages around even if the slab becomes
totally empty in order to avoid page allocator trips.
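
Roughly the shape of it (illustration only; the constant, the names and
the list handling below are stand-ins, not the slub.c code):

#define KEEP_EMPTY_MIN 5	/* stand-in for the role MIN_PARTIAL plays */

struct slab_page;

struct kmem_node {
	unsigned long nr_partial;	/* slabs currently cached on this node */
};

static void keep_on_partial_list(struct kmem_node *n, struct slab_page *page)
{
	(void)n; (void)page;		/* toy: would list-add the page here */
}

static void give_back_to_page_allocator(struct slab_page *page)
{
	(void)page;			/* toy: would free the backing page(s) */
}

static void slab_became_empty(struct kmem_node *n, struct slab_page *page)
{
	if (n->nr_partial < KEEP_EMPTY_MIN) {
		n->nr_partial++;
		keep_on_partial_list(n, page);	/* skip the page allocator round trip */
	} else {
		give_back_to_page_allocator(page);
	}
}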


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23  4:09                                   ` Nick Piggin
@ 2009-01-23 15:41                                     ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-23 15:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Fri, 23 Jan 2009, Nick Piggin wrote:

> > No it cannot because in SLUB objects must come from the same page.
> > Multiple objects in a queue will only ever require a single page and not
> > multiple like in SLAB.
>
> I don't know how that solves the problem. Task with memory policy A
> allocates an object, which allocates the "fast" page with policy A
> and allocates an object. Then context switch to task with memory
> policy B which allocates another object, which is taken from the page
> allocated with policy A. Right?

Correct. But this is only an issue if you think about policies applying to
individual object allocations (like realized in SLAB). If policies only
apply to pages (which is sufficient for balancing IMHO) then this is okay.

> > (OK this doesn't give the wrong policy 100% of the time; I thought
> there could have been a context switch race during page allocation
> that would result in 100% incorrect, but anyway it could still be
> significantly incorrect couldn't it?)

Memory policies are applied in a fuzzy way anyway. A context switch can
result in page allocation action that changes the expected interleave
pattern. Page populations in an address space depend on the task policy.
So the exact policy applied to a page depends on the task. This isn't an
exact thing.

>  "the first cpu will consume more and more memory from the page allocator
>   whereas the second will build up huge per cpu lists"
>
> And this is wrong. There is another possible issue where every single
> object on the freelist might come from a different (and otherwise free)
> page, and thus eg 100 8 byte objects might consume 400K.
>
> That's not an invalid concern, but I think it will be quite rare, and
> the periodic queue trimming should naturally help this because it will
> cycle out those objects and if new allocations are needed, they will
> come from new pages which can be packed more densely.

Well but you said that you would defer the trimming (due to latency
concerns). The longer you defer the larger the lists will get.

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-23 15:32                                         ` Christoph Lameter
@ 2009-01-23 15:37                                           ` Pekka Enberg
  -1 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-01-23 15:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Zhang, Yanmin, Nick Piggin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Thu, 22 Jan 2009, Pekka Enberg wrote:
> > That is, a list of pages that could be returned to the page allocator
> > but are pooled in SLUB to avoid the page allocator overhead. Note that
> > this will not help allocators that trigger page allocator pass-through.

On Fri, 2009-01-23 at 10:32 -0500, Christoph Lameter wrote:
> We use the partial list for that.

Even if the slab is totally empty?


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-22  9:33                                       ` Pekka Enberg
@ 2009-01-23 15:32                                         ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-23 15:32 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Zhang, Yanmin, Nick Piggin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Thu, 22 Jan 2009, Pekka Enberg wrote:

> That is, a list of pages that could be returned to the page allocator
> but are pooled in SLUB to avoid the page allocator overhead. Note that
> this will not help allocators that trigger page allocator pass-through.

We use the partial list for that.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-22  9:27                                   ` Pekka Enberg
@ 2009-01-23 15:32                                     ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-23 15:32 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Thu, 22 Jan 2009, Pekka Enberg wrote:

> On Wed, 2009-01-21 at 19:13 -0500, Christoph Lameter wrote:
> > No it cannot because in SLUB objects must come from the same page.
> > Multiple objects in a queue will only ever require a single page and not
> > multiple like in SLAB.
>
> There's one potential problem with "per-page queues", though. The bigger
> the object, the smaller the "queue" (i.e. less objects per page). Also,
> partial lists are less likely to help for big objects because they get
> emptied so quickly and returned to the page allocator. Perhaps we should
> do a small "full list" for caches with large objects?

Right, that's why there is a need for higher order allocs, because that
increases the "queue" sizes. If the pages are larger, then the partial
lists also cover more ground. Much of the tuning in SLUB is the page size
setting (remember you can set the order for each slab in slub!). In
SLAB/SLQB the corresponding tuning is through the queue sizes.
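
The arithmetic behind that, with example numbers only: objects per slab
is (PAGE_SIZE << order) / object_size, so raising the order is what
grows the per-slab "queue".

#include <stdio.h>

int main(void)
{
	const unsigned long page_size = 4096;	/* example, not every arch */
	const unsigned long obj_size = 256;	/* example object size */
	unsigned int order;

	for (order = 0; order <= 3; order++)
		printf("order %u: %lu objects per slab\n",
		       order, (page_size << order) / obj_size);
	/* prints 16, 32, 64 and 128 objects for orders 0 to 3 */
	return 0;
}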


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-22  0:19                           ` Christoph Lameter
@ 2009-01-23  4:17                             ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23  4:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Wed, Jan 21, 2009 at 07:19:08PM -0500, Christoph Lameter wrote:
> On Mon, 19 Jan 2009, Nick Piggin wrote:
> 
> > The thing IMO you forget with all these doomsday scenarios about SGI's peta
> > scale systems is that no matter what you do, you can't avoid the fact that
> > computing is about locality. Even if you totally take the TLB out of the
> > equation, you still have the small detail of other caches. Code that jumps
> > all over that 1024 TB of memory with no locality is going to suck regardless
> > of what the kernel ever does, due to physical limitations of hardware.
> 
> Typically we traverse lists of objects that are in the same slab cache.

Very often that is not the case. And the price you pay for that is that
you have to drain and switch freelists whenever you encounter an object
that is not on the same page.

This gives your freelists a chaotic and unpredictable behaviour IMO in
a running system where pages succumb to fragmentation so your freelist
maximum sizes are limited. It also means you can lose track of cache
hot objects when you switch to different "fast" pages. I don't consider
this to be "queueing done right".

 
> > > Sorry not at all. SLAB and SLQB queue objects from different pages in the
> > > same queue.
> >
> > The last sentence is what I was replying to. Ie. "simplification of
> > numa handling" does not follow from the SLUB implementation of per-page
> > freelists.
> 
> If all objects are from the same page then you need not check
> the NUMA locality of any object on that queue.

In SLAB and SLQB, all objects on the freelist are on the same node. So
tell me how do same-page objects simplify NUMA handling?

 
> > > As I said it pins a single page in the per cpu page and uses that in a way
> > > that you call a queue and I call a freelist.
> >
> > And you found you have to increase the size of your pages because you
> > need bigger queues. (must we argue semantics? it is a list of free
> > objects)
> 
> Right. That may be the case and it's a similar tuning to what SLAB does.

SLAB and SLQB don't need bigger pages to do that.

 
> > > SLAB and SLUB can have large quantities of objects in their queues that
> > > each can keep a single page out of circulation if it's the last
> > > object in that page. This is a per queue thing and you have at least two
> >
> > And if that were a problem, SLQB can easily be runtime tuned to keep no
> > objects in its object lists. But as I said, queueing is good, so why
> > would anybody want to get rid of it?
> 
> Queueing is sometimes good....
> 
> > Again, this doesn't really go anywhere while we disagree on the
> > fundamental goodliness of queueing. This is just describing the
> > implementation.
> 
> I am not sure that you understand the fine points of queuing in slub. I am
> not a fundamentalist: Queues are good if used the right way and as you say
> SLUB has "queues" designed in a particular fashion that solves issus that
> we had with SLAB queues.
 
OK, and I just don't think they solved all the problems, and they added
other worse ones. And if you would tell me what the problems are and
how to reproduce them (or point to someone who might be able to help
with reproducing them), then I'm confident that I can solve those problems
in SLQB, which has fewer downsides than SLUB. At least I will try my best.

So can you please give a better idea of the problems? "latency sensitive
HPC applications" is about as much help to me solving that as telling
you that "OLTP applications slow down" helps solve one of the problems in
SLUB. 



^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-22  0:13                                 ` Christoph Lameter
@ 2009-01-23  4:09                                   ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-23  4:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Wed, Jan 21, 2009 at 07:13:44PM -0500, Christoph Lameter wrote:
> On Mon, 19 Jan 2009, Nick Piggin wrote:
> 
> > > The per cpu queue size in SLUB is limited by the queues only containing
> > > objects from the same page. If you have large queues like SLAB/SLQB(?)
> > > then this could be an issue.
> >
> > And it could be a problem in SLUB too. Chances are that several allocations
> > will be wrong after every policy switch. I could describe situations in which
> > SLUB will allocate with the _wrong_ policy literally 100% of the time.
> 
> No it cannot because in SLUB objects must come from the same page.
> Multiple objects in a queue will only ever require a single page and not
> multiple like in SLAB.

I don't know how that solves the problem. Task with memory policy A
allocates an object, which allocates the "fast" page with policy A
and allocates an object. Then context switch to task with memory
policy B which allocates another object, which is taken from the page
allocated with policy A. Right?

(OK this doesn't give the wrong policy 100% of the time; I thought
there could have been a context switch race during page allocation
that would result in 100% incorrect, but anyway it could still be
significantly incorrect couldn't it?)

 
> > > That means large amounts of memory are going to be caught in these queues.
> > > If its per cpu and one cpu does allocation and the other frees then the
> > > first cpu will consume more and more memory from the page allocator
> > > whereas the second will build up huge per cpu lists.
> >
> > Wrong. I said I would allow an option to turn off *periodic trimming*.
> > Or just modify the existing tunables or look at making the trimming
> > more fine grained etc etc. I won't know until I see a workload where it
> > hurts, and I will try to solve it then.
> 
> You are not responding to the issue. If you have queues that contain
> objects from multiple pages then every object pointer in these queues can
> pin a page although this actually is a free object.

I am trying to respond to what you raise. "The" issue I thought you
raised above was that SLQB would grow freelists unbounded:

 "the first cpu will consume more and more memory from the page allocator
  whereas the second will build up huge per cpu lists"

And this is wrong. There is another possible issue where every single
object on the freelist might come from a different (and otherwise free)
page, and thus eg 100 8 byte objects might consume 400K.

That's not an invalid concern, but I think it will be quite rare, and
the periodic queue trimming should naturally help this because it will
cycle out those objects and if new allocations are needed, they will
come from new pages which can be packed more densely.

 
> > > It seems that on SMP systems SLQB will actually increase the number of
> > > queues since it needs 2 queues per cpu instead of the 1 of SLAB.
> >
> > I don't know what you mean when you say queues, but SLQB has more
> > than 2 queues per CPU. Great. I like them ;)
> 
> This gets better and better.

So no response to my asking where the TLB improvement in SLUB helps,
or where queueing hurts? You complain about not being able to reproduce
Intel's OLTP problem, and yet you won't even _say_ what the problems
are for SLQB. Whereas Intel at least puts a lot of effort into running
tests and helping to analyse things.


> > > SLAB also
> > > has resizable queues.
> >
> > Not significantly because that would require large memory allocations for
> > large queues. And there is no code there to do runtime resizing.
> 
> Groan. Please have a look at do_tune_cpucache() in slab.c

Cool, I didn't realise it had hooks to do runtime resizing. The more
important issue of course is the one of extra cache footprint and
metadata in SLAB's scheme.
 

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-22  9:30                                     ` Zhang, Yanmin
@ 2009-01-22  9:33                                       ` Pekka Enberg
  -1 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-01-22  9:33 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Nick Piggin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Thu, 2009-01-22 at 17:30 +0800, Zhang, Yanmin wrote:
> On Thu, 2009-01-22 at 11:27 +0200, Pekka Enberg wrote:
> > Hi Christoph,
> > 
> > On Mon, 19 Jan 2009, Nick Piggin wrote:
> > > > > > You only go to the allocator when the percpu queue goes empty though, so
> > > > > > if memory policy changes (eg context switch or something), then subsequent
> > > > > > allocations will be of the wrong policy.
> > > > >
> > > > > The per cpu queue size in SLUB is limited by the queues only containing
> > > > > objects from the same page. If you have large queues like SLAB/SLQB(?)
> > > > > then this could be an issue.
> > > >
> > > > And it could be a problem in SLUB too. Chances are that several allocations
> > > > will be wrong after every policy switch. I could describe situations in which
> > > > SLUB will allocate with the _wrong_ policy literally 100% of the time.
> > 
> > On Wed, 2009-01-21 at 19:13 -0500, Christoph Lameter wrote:
> > > No it cannot because in SLUB objects must come from the same page.
> > > Multiple objects in a queue will only ever require a single page and not
> > > multiple like in SLAB.
> > 
> > There's one potential problem with "per-page queues", though. The bigger
> > the object, the smaller the "queue" (i.e. less objects per page). Also,
> > partial lists are less likely to help for big objects because they get
> > emptied so quickly and returned to the page allocator. Perhaps we should
> > do a small "full list" for caches with large objects?
> That helps definitely. We could use a batch to control the list size.

s/full list/empty list/g

That is, a list of pages that could be returned to the page allocator
but are pooled in SLUB to avoid the page allocator overhead. Note that
this will not help allocators that trigger page allocator pass-through.
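
The idea would be roughly this shape (sketch only; the names and the
pool limit are made up):

#include <stddef.h>

struct pooled_page { struct pooled_page *next; };

struct empty_pool {
	struct pooled_page *pages;	/* empty slab pages kept back */
	unsigned int nr;
	unsigned int max;		/* keep at most a handful of them */
};

/* Stand-ins for the real page allocator calls. */
static struct pooled_page *buddy_alloc(void) { return NULL; }
static void buddy_free(struct pooled_page *page) { (void)page; }

static struct pooled_page *get_slab_page(struct empty_pool *p)
{
	struct pooled_page *page = p->pages;

	if (page) {			/* reuse a pooled empty slab */
		p->pages = page->next;
		p->nr--;
		return page;
	}
	return buddy_alloc();		/* fall back to the page allocator */
}

static void put_slab_page(struct empty_pool *p, struct pooled_page *page)
{
	if (p->nr < p->max) {		/* pool it instead of freeing it */
		page->next = p->pages;
		p->pages = page;
		p->nr++;
	} else {
		buddy_free(page);
	}
}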

		Pekka


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-22  9:27                                   ` Pekka Enberg
@ 2009-01-22  9:30                                     ` Zhang, Yanmin
  -1 siblings, 0 replies; 197+ messages in thread
From: Zhang, Yanmin @ 2009-01-22  9:30 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Nick Piggin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Thu, 2009-01-22 at 11:27 +0200, Pekka Enberg wrote:
> Hi Christoph,
> 
> On Mon, 19 Jan 2009, Nick Piggin wrote:
> > > > > You only go to the allocator when the percpu queue goes empty though, so
> > > > > if memory policy changes (eg context switch or something), then subsequent
> > > > > allocations will be of the wrong policy.
> > > >
> > > > The per cpu queue size in SLUB is limited by the queues only containing
> > > > objects from the same page. If you have large queues like SLAB/SLQB(?)
> > > > then this could be an issue.
> > >
> > > And it could be a problem in SLUB too. Chances are that several allocations
> > > will be wrong after every policy switch. I could describe situations in which
> > > SLUB will allocate with the _wrong_ policy literally 100% of the time.
> 
> On Wed, 2009-01-21 at 19:13 -0500, Christoph Lameter wrote:
> > No it cannot because in SLUB objects must come from the same page.
> > Multiple objects in a queue will only ever require a single page and not
> > multiple like in SLAB.
> 
> There's one potential problem with "per-page queues", though. The bigger
> the object, the smaller the "queue" (i.e. less objects per page). Also,
> partial lists are less likely to help for big objects because they get
> emptied so quickly and returned to the page allocator. Perhaps we should
> do a small "full list" for caches with large objects?
That helps definitely. We could use a batch to control the list size.



^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-22  0:13                                 ` Christoph Lameter
@ 2009-01-22  9:27                                   ` Pekka Enberg
  -1 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-01-22  9:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

Hi Christoph,

On Mon, 19 Jan 2009, Nick Piggin wrote:
> > > > You only go to the allocator when the percpu queue goes empty though, so
> > > > if memory policy changes (eg context switch or something), then subsequent
> > > > allocations will be of the wrong policy.
> > >
> > > The per cpu queue size in SLUB is limited by the queues only containing
> > > objects from the same page. If you have large queues like SLAB/SLQB(?)
> > > then this could be an issue.
> >
> > And it could be a problem in SLUB too. Chances are that several allocations
> > will be wrong after every policy switch. I could describe situations in which
> > SLUB will allocate with the _wrong_ policy literally 100% of the time.

On Wed, 2009-01-21 at 19:13 -0500, Christoph Lameter wrote:
> No it cannot because in SLUB objects must come from the same page.
> Multiple objects in a queue will only ever require a single page and not
> multiple like in SLAB.

There's one potential problem with "per-page queues", though. The bigger
the object, the smaller the "queue" (i.e. fewer objects per page). Also,
partial lists are less likely to help for big objects because they get
emptied so quickly and returned to the page allocator. Perhaps we should
do a small "full list" for caches with large objects?

			Pekka


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-19  5:47                         ` Nick Piggin
@ 2009-01-22  0:19                           ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-22  0:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Mon, 19 Jan 2009, Nick Piggin wrote:

> The thing IMO you forget with all these doomsday scenarios about SGI's peta
> scale systems is that no matter what you do, you can't avoid the fact that
> computing is about locality. Even if you totally take the TLB out of the
> equation, you still have the small detail of other caches. Code that jumps
> all over that 1024 TB of memory with no locality is going to suck regardless
> of what the kernel ever does, due to physical limitations of hardware.

Typically we traverse lists of objects that are in the same slab cache.

> > Sorry not at all. SLAB and SLQB queue objects from different pages in the
> > same queue.
>
> The last sentence is what I was replying to. Ie. "simplification of
> numa handling" does not follow from the SLUB implementation of per-page
> freelists.

If all objects are from the same page then you need not check
the NUMA locality of any object on that queue.
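
To illustrate (a simplified sketch, not actual SLUB or SLQB code): with a
per-page freelist one page_to_nid() check answers the question for every
object on that page, whereas a queue that mixes pages would have to test
each object it hands out:

#include <linux/mm.h>

/* Per-page freelist: one check covers every object on the page. */
static inline bool slab_page_on_node(struct page *page, int node)
{
	return page_to_nid(page) == node;
}

/*
 * Mixed per-CPU queue: the free objects come from many pages, so a
 * node-constrained allocation would have to walk the list and test each
 * object (the next pointer is stored in the first word of a free object).
 */
static void *pop_object_for_node(void **head, int node)
{
	void **prev = head;
	void *obj;

	for (obj = *head; obj; prev = (void **)obj, obj = *(void **)obj) {
		if (page_to_nid(virt_to_page(obj)) == node) {
			*prev = *(void **)obj;	/* unlink it */
			return obj;
		}
	}
	return NULL;	/* nothing local: fall back to the page allocator */
}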

> > As I sad it pins a single page in the per cpu page and uses that in a way
> > that you call a queue and I call a freelist.
>
> And you found you have to increase the size of your pages because you
> need bigger queues. (must we argue semantics? it is a list of free
> objects)

Right. That may be the case, and it's a similar tuning to what SLAB does.

> > SLAB and SLUB can have large quantities of objects in their queues that
> > each can keep a single page out of circulation if its the last
> > object in that page. This is per queue thing and you have at least two
>
> And if that were a problem, SLQB can easily be runtime tuned to keep no
> objects in its object lists. But as I said, queueing is good, so why
> would anybody want to get rid of it?

Queueing is sometimes good...

> Again, this doesn't really go anywhere while we disagree on the
> fundamental goodliness of queueing. This is just describing the
> implementation.

I am not sure that you understand the fine points of queueing in SLUB. I am
not a fundamentalist: queues are good if used the right way, and as you say,
SLUB has "queues" designed in a particular fashion that solves issues that
we had with SLAB queues.



^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-19  6:18                               ` Nick Piggin
@ 2009-01-22  0:13                                 ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-22  0:13 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Mon, 19 Jan 2009, Nick Piggin wrote:

> > > You only go to the allocator when the percpu queue goes empty though, so
> > > if memory policy changes (eg context switch or something), then subsequent
> > > allocations will be of the wrong policy.
> >
> > The per cpu queue size in SLUB is limited by the queues only containing
> > objects from the same page. If you have large queues like SLAB/SLQB(?)
> > then this could be an issue.
>
> And it could be a problem in SLUB too. Chances are that several allocations
> will be wrong after every policy switch. I could describe situations in which
> SLUB will allocate with the _wrong_ policy literally 100% of the time.

No, it cannot, because in SLUB objects must come from the same page.
Multiple objects in a queue will only ever require a single page, not
multiple pages as in SLAB.

> > That means large amounts of memory are going to be caught in these queues.
> > If its per cpu and one cpu does allocation and the other frees then the
> > first cpu will consume more and more memory from the page allocator
> > whereas the second will build up huge per cpu lists.
>
> Wrong. I said I would allow an option to turn off *periodic trimming*.
> Or just modify the existing tunables or look at making the trimming
> more fine grained etc etc. I won't know until I see a workload where it
> hurts, and I will try to solve it then.

You are not responding to the issue. If you have queues that contain
objects from multiple pages, then every object pointer in these queues can
pin a page even though the object itself is free.
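
Put in numbers (illustrative arithmetic only): in the worst case every
queued pointer is the sole free object left in its page, so a queue pins
one page per entry:

#include <linux/mm.h>

/*
 * Worst case described above: each of the 'queue_len' free objects is the
 * last free object in an otherwise fully allocated page, so none of those
 * pages can be returned to the page allocator.
 */
static inline unsigned long worst_case_pinned_bytes(unsigned long queue_len)
{
	return queue_len * PAGE_SIZE;	/* e.g. 1024 entries * 4 KiB = 4 MiB */
}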

> > It seems that on SMP systems SLQB will actually increase the number of
> > queues since it needs 2 queues per cpu instead of the 1 of SLAB.
>
> I don't know what you mean when you say queues, but SLQB has more
> than 2 queues per CPU. Great. I like them ;)

This gets better and better.

> > SLAB also
> > has resizable queues.
>
> Not significantly because that would require large memory allocations for
> large queues. And there is no code there to do runtime resizing.

Groan. Please have a look at do_tune_cpucache() in slab.c
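
For readers following along, the idea behind do_tune_cpucache() is roughly
the following. This is a heavily simplified sketch with illustrative names
and no locking, not the real slab.c code: allocate new per-CPU arrays of
the requested size, move or flush the surviving objects, then free the old
arrays.

#include <linux/slab.h>

/* Illustrative stand-in for a SLAB-style per-CPU array of free objects. */
struct cpu_queue {
	unsigned int avail;
	unsigned int limit;
	void *objects[];
};

/* Hypothetical helper: hand an object back to its slab's freelist. */
extern void free_back_to_slab(void *obj);

static struct cpu_queue *alloc_cpu_queue(unsigned int limit)
{
	struct cpu_queue *q;

	q = kmalloc(sizeof(*q) + limit * sizeof(void *), GFP_KERNEL);
	if (q) {
		q->avail = 0;
		q->limit = limit;
	}
	return q;
}

static int resize_cpu_queue(struct cpu_queue **slot, unsigned int new_limit)
{
	struct cpu_queue *new = alloc_cpu_queue(new_limit);
	struct cpu_queue *old = *slot;
	unsigned int i;

	if (!new)
		return -ENOMEM;
	/* Keep what fits in the new array, flush the rest back to the slabs. */
	for (i = 0; i < old->avail; i++) {
		if (new->avail < new->limit)
			new->objects[new->avail++] = old->objects[i];
		else
			free_back_to_slab(old->objects[i]);
	}
	*slot = new;
	kfree(old);
	return 0;
}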

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-16 21:25                             ` Christoph Lameter
@ 2009-01-19  6:18                               ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-19  6:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Fri, Jan 16, 2009 at 03:25:05PM -0600, Christoph Lameter wrote:
> On Fri, 16 Jan 2009, Nick Piggin wrote:
> 
> > > The application that is interrupted has no control over when SLQB runs its
> > > expiration. The longer the queues the longer the holdoff. Look at the
> > > changelogs for various queue expiration things in the kernel. I fixed up a
> > > couple of those over the years for latency reasons.
> >
> > Interrupts and timers etc. as well as preemption by kernel threads happen
> > everywhere in the kernel. I have not seen any reason why slab queue reaping
> > in particular is a problem.
> 
> The slab queues are a particular problem since they are combined with
> timers. So the latency insensitive phase of an HPC app completes and then
> the latency critical part starts to run. SLAB will happily every 2 seconds
> expire a certain amount of objects from all its various queues.

Didn't you say earlier that HPC apps do all their memory allocations up
front, before the critical part of the run? Anyway, I have not seen any
particular problem from cache reaping every 2 seconds that would be worse
than hardware interrupts or other timers.

But as I said, if I do see evidence of that, I will tweak the queue
reaping. But I prefer to keep it simple and closer to SLAB behaviour
until that time.


> > Any slab allocator is going to have a whole lot of theoretical problems and
> > you simply won't be able to fix them all because some require an oracle or
> > others fundamentally conflict with another theoretical problem.
> 
> I agree there is no point in working on theoretical problems. We are
> talking about practical problems.

You are not saying what the practical problem is, though. I.e., what
exactly is the workload where cache reaping is causing a problem, and
what are the results of eliminating that reaping?


> > I concentrate on the main practical problems and the end result. If I see
> > evidence of some problem caused, then I will do my best to fix it.
> 
> You concentrate on the problems that are given to you I guess...

Of course. That would include any problems SGI or you give me; I would
try to fix them. But I don't only concentrate on those; I also work on
ones I seek out myself, e.g. the OLTP problem.

 
> > > Well yes with enterprise app you are likely not going to see it. Run HPC
> > > and other low latency tests (Infiniband based and such).
> >
> > So do you have any results or not?
> 
> Of course. I need to repost them? I am no longer employed by the company I
> did the work for. So the test data is no longer accessible to me. You have
> to rely on the material that was posted in the past.

Just links are fine. I could find nothing concrete in the mm/slub.c
changelogs.

 
> > > It still will have to move objects between queues? Or does it adapt the
> > > slub method of "queue" per page?
> >
> > It has several queues that objects can move between. You keep asserting
> > that this is a problem.
> 
> > > SLUB obeys memory policies. It just uses the page allocator for this by
> > > doing an allocation *without* specifying the node that memory has to come
> > > from. SLAB manages memory strictly per node. So it always has to ask for
> > > memory from a particular node. Hence the need to implement memory policies
> > > in the allocator.
> >
> > You only go to the allocator when the percpu queue goes empty though, so
> > if memory policy changes (eg context switch or something), then subsequent
> > allocations will be of the wrong policy.
> 
> The per cpu queue size in SLUB is limited by the queues only containing
> objects from the same page. If you have large queues like SLAB/SLQB(?)
> then this could be an issue.

And it could be a problem in SLUB too. Chances are that several allocations
will be wrong after every policy switch. I could describe situations in which
SLUB will allocate with the _wrong_ policy literally 100% of the time. 


> > That is what I call a hack, which is made in order to solve a percieved
> > performance problem. The SLAB/SLQB method of checking policy is simple,
> > obviously correct, and until there is a *demonstrated* performance problem
> > with that, then I'm not going to change it.
> 
> Well so far it seems that your tests never even exercise that part of the
> allocators.

It uses code and behaviour from SLAB, which I know has been extensively
tested in enterprise distros, on just about every serious production NUMA
installation, and by every hardware vendor including SGI.

That will satisfy me until somebody reports a problem. Again, I could
only find handwaving in the SLUB changelog, which I definitely have
looked at because I want to find and solve as many issues in this
subsystem as I can.


> > I don't think this is a problem. Anyway, rt systems that care about such
> > tiny latencies can easily prioritise this. And ones that don't care so
> > much have many other sources of interrupts and background processing by
> > the kernel or hardware interrupts.
> 
> How do they prioritize this?

In -rt, by prioritising interrupts. As I said, in the mainline kernel
I am skeptical that this is a problem. If it were a problem, I have never
seen a report, nor an obvious simple improvement such as moving the
reaping into a workqueue, which can be prioritised, as was done with the
multi-cpu scheduler when there were problems with it.


> > If this actually *is* a problem, I will allow an option to turn of periodic
> > trimming of queues, and allow objects to remain in queues (like the page
> > allocator does with its queues). And just provide hooks to reap them at
> > low memory time.
> 
> That means large amounts of memory are going to be caught in these queues.
> If its per cpu and one cpu does allocation and the other frees then the
> first cpu will consume more and more memory from the page allocator
> whereas the second will build up huge per cpu lists.

Wrong. I said I would allow an option to turn off *periodic trimming*.
Or just modify the existing tunables or look at making the trimming
more fine grained etc etc. I won't know until I see a workload where it
hurts, and I will try to solve it then.


> > It's strange. You percieve these theoretical problems with things that I
> > actually consider is a distinct *advantage* of SLAB/SLQB. order-0 allocations,
> > queueing, strictly obeying NUMA policies...
> 
> These are issues that we encountered in practice with large systems.
> Pointer chasing performance on many apps is bounded by TLB faults etc.

I would be surprised if SLUB somehow fixed such apps. Especially on
large systems, if you look at the maths.

But I don't dismiss the possibility, and SLQB, as I keep repeating, can
do higher-order allocations. The reason I bring it up is that SLUB will
get significantly slower for many workloads if higher-order allocations
*are not* done, which is the advantage of SLAB and SLQB here. This cannot
be turned into an advantage for SLUB because of this TLB issue.
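
One way to see the size dependency (illustrative arithmetic, ignoring
per-object metadata and alignment): when the "queue" is a single page's
freelist, its length shrinks with object size and only grows again with
the slab order.

#include <linux/mm.h>

/* Free objects available from one slab when the slab is a single
 * allocation of 2^order pages. */
static inline unsigned int objects_per_slab(unsigned int order,
					    unsigned int obj_size)
{
	return (unsigned int)((PAGE_SIZE << order) / obj_size);
	/* 256-byte objects: order 0 -> 16 objects, order 3 -> 128 objects */
}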


> Strictly obeying NUMA policies causes performance problems in SLAB. Try
> MPOL_INTERLEAVE vs a cpu local allocations.

I will try testing that. Note that I have of course been testing
NUMA things on our small Altix, but I just haven't found anything
interesting enough to post...

 
> > > I still dont see the problem that SLQB is addressing (aside from code
> > > cleanup of SLAB). Seems that you feel that the queueing behavior of SLAB
> > > is okay.
> >
> > It addresses O(NR_CPUS^2) memory consumption of kmem caches, and large
> > constant consumption of array caches of SLAB. It addresses scalability
> > eg in situations with lots of cores per node. It allows resizeable
> > queues. It addresses the code complexity and bootstap hoops of SLAB.
> >
> > It addresses performance and higher order allocation problems of SLUB.
> 
> It seems that on SMP systems SLQB will actually increase the number of
> queues since it needs 2 queues per cpu instead of the 1 of SLAB.

I don't know what you mean when you say queues, but SLQB has more
than 2 queues per CPU. Great. I like them ;)


> SLAB also
> has resizable queues.

Not significantly because that would require large memory allocations for
large queues. And there is no code there to do runtime resizing.
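
To put a number on that (illustrative arithmetic, assuming an array of
object pointers per queue as SLAB's array caches use): the allocation for
the array itself grows linearly with the queue limit, so very large queues
quickly need higher-order allocations of their own.

/* Rough size of an array-based queue of 'limit' free-object pointers,
 * ignoring the small header in front of it. */
static inline unsigned long queue_array_bytes(unsigned int limit)
{
	return (unsigned long)limit * sizeof(void *);
	/* e.g. 4096 entries * 8 bytes = 32 KiB on 64-bit, which is already
	 * an order-3 allocation with 4 KiB pages. */
}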


> Code simplification and bootstrap: Great work on
> that. Again good cleanup of SLAB.

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-16 21:07                       ` Christoph Lameter
@ 2009-01-19  5:47                         ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-19  5:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Fri, Jan 16, 2009 at 03:07:52PM -0600, Christoph Lameter wrote:
> On Fri, 16 Jan 2009, Nick Piggin wrote:
> 
> > > handled by the same 2M TLB covering a 32k page. If the 4k pages are
> > > dispersed then you may need 8 2M tlbs (which covers already a quarter of
> > > the available 2M TLBs on nehalem f.e.) for which the larger alloc just
> > > needs a single one.
> >
> > Yes I know that. But it's pretty theoretical IMO (and I could equally
> > describe a theoretical situation where increased fragmentation in higher
> > order slabs will result in worse TLB coverage).
> 
> Theoretical only for low sizes of memory. If you have terabytes of memory
> then this becomes significant in a pretty fast way.

I don't really buy that as a general statement with no other qualifiers.
If the huge system has a correspondingly increased number of slab objects,
then the potential win gets much smaller as system size increases.
Say you have a 1GB RAM system with 128 2MB TLB entries, and suppose you
have a slab that takes 25% of the RAM. If you optimally pack that slab,
then for random accesses to its objects you can get a 100% TLB hit rate,
versus 25% if the objects are spread randomly through memory.

In a 1TB RAM system, you have a ~0.1% chance of a TLB hit in the optimally
packed case and ~0.025% in the random case. So there the possible gain
from packing is much smaller (negligible).

And note that we're talking about the best possible packing scenario
(i.e. 2MB pages vs 4K pages). The standard SLUB tuning would not get
anywhere near that.
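
Spelling the arithmetic out (a small userspace calculation; the 128 x 2MB
TLB figure and the 25% slab share are the assumptions from the paragraphs
above):

#include <stdio.h>

int main(void)
{
	const double tlb_coverage_mb = 128 * 2.0;	/* 128 TLB entries of 2MB each */
	const double ram_mb[] = { 1024.0, 1024.0 * 1024.0 };	/* 1GB and 1TB */
	int i;

	for (i = 0; i < 2; i++) {
		double slab_mb = 0.25 * ram_mb[i];	/* slab takes 25% of RAM */
		/* optimally packed: random accesses land only in slab pages */
		double packed = tlb_coverage_mb >= slab_mb ?
				100.0 : 100.0 * tlb_coverage_mb / slab_mb;
		/* randomly placed: accesses spread over all of RAM */
		double random = 100.0 * tlb_coverage_mb / ram_mb[i];

		printf("RAM %9.0f MB: packed hit ~%.3f%%, random hit ~%.3f%%\n",
		       ram_mb[i], packed, random);
	}
	return 0;
}

For 1GB that prints roughly 100% vs 25%, and for 1TB roughly 0.1% vs
0.025%, matching the figures above.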

The thing IMO you forget with all these doomsday scenarios about SGI's peta
scale systems is that no matter what you do, you can't avoid the fact that
computing is about locality. Even if you totally take the TLB out of the
equation, you still have the small detail of other caches. Code that jumps
all over that 1024 TB of memory with no locality is going to suck regardless
of what the kernel ever does, due to physical limitations of hardware.

Anyway, it is trivial to tune SLQB for this on those systems that really
care, giving SLUB no advantage. So we needn't really get sidetracked by
this in the context of SLQB.

 
> > > It has lists of free objects that are bound to a particular page. That
> > > simplifies numa handling since all the objects in a "queue" (or page) have
> > > the same NUMA characteristics.
> >
> > The same can be said of SLQB and SLAB as well.
> 
> Sorry not at all. SLAB and SLQB queue objects from different pages in the
> same queue.

The last sentence is what I was replying to. I.e., "simplification of
numa handling" does not follow from the SLUB implementation of per-page
freelists.

 
> > > was assigned to a processor. Memory wastage may only occur because
> > > each processor needs to have a separate page from which to allocate. SLAB
> > > like designs needs to put a large number of objects in queues which may
> > > keep a number of pages in the allocated pages pool although all objects
> > > are unused. That does not occur with slub.
> >
> > That's wrong. SLUB keeps completely free pages on its partial lists, and
> > also IIRC can keep free pages pinned in the per-cpu page. I have actually
> > seen SLQB use less memory than SLUB in some situations for this reason.
> 
> As I sad it pins a single page in the per cpu page and uses that in a way
> that you call a queue and I call a freelist.

And you found you have to increase the size of your pages because you
need bigger queues. (must we argue semantics? it is a list of free
objects)


> SLUB keeps a few pages on the partial list right now because it tries to
> avoid trips to the page allocator (which is quite slow). These could be
> eliminated if the page allocator would work effectively. However that
> number is a per node limit.

This is the practical vs theoretical I'm talking about.


> SLAB and SLUB can have large quantities of objects in their queues that
> each can keep a single page out of circulation if its the last
> object in that page. This is per queue thing and you have at least two

And if that were a problem, SLQB can easily be runtime tuned to keep no
objects in its object lists. But as I said, queueing is good, so why
would anybody want to get rid of it?


> queues per cpu. SLAB has queues per cpu, per pair of cpu, per node and per
> alien node for each node. That can pin quite a number of pages on large
> systems. Note that SLAB has one per cpu whereas you have already 2 per
> cpu? In SMP configurations this may mean that SLQB has more queues than
> SLAB.

Again, this doesn't really go anywhere while we disagree on the
fundamental goodliness of queueing. This is just describing the
implementation.

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-16  3:43                           ` Nick Piggin
@ 2009-01-16 21:25                             ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-16 21:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Fri, 16 Jan 2009, Nick Piggin wrote:

> > The application that is interrupted has no control over when SLQB runs its
> > expiration. The longer the queues the longer the holdoff. Look at the
> > changelogs for various queue expiration things in the kernel. I fixed up a
> > couple of those over the years for latency reasons.
>
> Interrupts and timers etc. as well as preemption by kernel threads happen
> everywhere in the kernel. I have not seen any reason why slab queue reaping
> in particular is a problem.

The slab queues are a particular problem since they are combined with
timers. So the latency-insensitive phase of an HPC app completes and then
the latency-critical part starts to run. SLAB will happily expire a
certain number of objects from all its various queues every 2 seconds.
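
For context, the mechanism being criticised is, roughly, a periodic
per-CPU reap scheduled as delayed work. This is an illustrative sketch,
not the actual cache_reap() code, and drain_some_objects_from_each_queue()
is a hypothetical helper:

#include <linux/workqueue.h>
#include <linux/jiffies.h>

#define REAP_INTERVAL	(2 * HZ)	/* "every 2 seconds" */

/* Hypothetical helper: return a portion of each queue to the slab lists. */
extern void drain_some_objects_from_each_queue(void);

static void reap_queues(struct work_struct *w);
static DECLARE_DELAYED_WORK(reap_work, reap_queues);

static void reap_queues(struct work_struct *w)
{
	drain_some_objects_from_each_queue();
	schedule_delayed_work(&reap_work, REAP_INTERVAL);	/* re-arm the timer */
}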

> Any slab allocator is going to have a whole lot of theoretical problems and
> you simply won't be able to fix them all because some require an oracle or
> others fundamentally conflict with another theoretical problem.

I agree there is no point in working on theoretical problems. We are
talking about practical problems.

> I concentrate on the main practical problems and the end result. If I see
> evidence of some problem caused, then I will do my best to fix it.

You concentrate on the problems that are given to you, I guess...

> > Well yes with enterprise app you are likely not going to see it. Run HPC
> > and other low latency tests (Infiniband based and such).
>
> So do you have any results or not?

Of course. Do I need to repost them? I am no longer employed by the
company I did the work for, so the test data is no longer accessible to
me. You have to rely on the material that was posted in the past.

> > It still will have to move objects between queues? Or does it adapt the
> > slub method of "queue" per page?
>
> It has several queues that objects can move between. You keep asserting
> that this is a problem.

> > SLUB obeys memory policies. It just uses the page allocator for this by
> > doing an allocation *without* specifying the node that memory has to come
> > from. SLAB manages memory strictly per node. So it always has to ask for
> > memory from a particular node. Hence the need to implement memory policies
> > in the allocator.
>
> You only go to the allocator when the percpu queue goes empty though, so
> if memory policy changes (eg context switch or something), then subsequent
> allocations will be of the wrong policy.

The per cpu queue size in SLUB is limited by the queues only containing
objects from the same page. If you have large queues like SLAB/SLQB(?)
then this could be an issue.

> That is what I call a hack, which is made in order to solve a percieved
> performance problem. The SLAB/SLQB method of checking policy is simple,
> obviously correct, and until there is a *demonstrated* performance problem
> with that, then I'm not going to change it.

Well so far it seems that your tests never even exercise that part of the
allocators.

> I don't think this is a problem. Anyway, rt systems that care about such
> tiny latencies can easily prioritise this. And ones that don't care so
> much have many other sources of interrupts and background processing by
> the kernel or hardware interrupts.

How do they prioritize this?

> If this actually *is* a problem, I will allow an option to turn of periodic
> trimming of queues, and allow objects to remain in queues (like the page
> allocator does with its queues). And just provide hooks to reap them at
> low memory time.

That means large amounts of memory are going to be caught in these queues.
If it's per-CPU, and one CPU does the allocation and the other does the
freeing, then the first CPU will consume more and more memory from the
page allocator whereas the second will build up huge per-CPU lists.

> It's strange. You percieve these theoretical problems with things that I
> actually consider is a distinct *advantage* of SLAB/SLQB. order-0 allocations,
> queueing, strictly obeying NUMA policies...

These are issues that we encountered in practice with large systems.
Pointer-chasing performance in many apps is bounded by TLB faults etc.
Strictly obeying NUMA policies causes performance problems in SLAB. Try
MPOL_INTERLEAVE vs. CPU-local allocations.

> > I still dont see the problem that SLQB is addressing (aside from code
> > cleanup of SLAB). Seems that you feel that the queueing behavior of SLAB
> > is okay.
>
> It addresses O(NR_CPUS^2) memory consumption of kmem caches, and large
> constant consumption of array caches of SLAB. It addresses scalability
> eg in situations with lots of cores per node. It allows resizeable
> queues. It addresses the code complexity and bootstap hoops of SLAB.
>
> It addresses performance and higher order allocation problems of SLUB.

It seems that on SMP systems SLQB will actually increase the number of
queues, since it needs 2 queues per CPU instead of SLAB's 1. SLAB also
has resizable queues. Code simplification and bootstrap: great work on
that. Again, a good cleanup of SLAB.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-16  3:19                     ` Nick Piggin
@ 2009-01-16 21:07                       ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-16 21:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Fri, 16 Jan 2009, Nick Piggin wrote:

> > handled by the same 2M TLB covering a 32k page. If the 4k pages are
> > dispersed then you may need 8 2M tlbs (which covers already a quarter of
> > the available 2M TLBs on nehalem f.e.) for which the larger alloc just
> > needs a single one.
>
> Yes I know that. But it's pretty theoretical IMO (and I could equally
> describe a theoretical situation where increased fragmentation in higher
> order slabs will result in worse TLB coverage).

That is theoretical only for small memory sizes. If you have terabytes
of memory then this becomes significant pretty quickly.

> > It has lists of free objects that are bound to a particular page. That
> > simplifies numa handling since all the objects in a "queue" (or page) have
> > the same NUMA characteristics.
>
> The same can be said of SLQB and SLAB as well.

Sorry not at all. SLAB and SLQB queue objects from different pages in the
same queue.

> > was assigned to a processor. Memory wastage may only occur because
> > each processor needs to have a separate page from which to allocate. SLAB
> > like designs needs to put a large number of objects in queues which may
> > keep a number of pages in the allocated pages pool although all objects
> > are unused. That does not occur with slub.
>
> That's wrong. SLUB keeps completely free pages on its partial lists, and
> also IIRC can keep free pages pinned in the per-cpu page. I have actually
> seen SLQB use less memory than SLUB in some situations for this reason.

As I said, it pins a single page as the per-CPU page and uses that in a
way that you call a queue and I call a freelist.

SLUB keeps a few pages on the partial list right now because it tries to
avoid trips to the page allocator (which is quite slow). These could be
eliminated if the page allocator worked effectively. However, that number
is a per-node limit.

SLAB and SLQB can have large quantities of objects in their queues, and
each of those objects can keep a single page out of circulation if it is
the last object in that page. This is a per-queue thing, and you have at
least two queues per cpu. SLAB has queues per cpu, per pair of cpus, per
node and per alien node for each node. That can pin quite a number of pages
on large systems. Note that SLAB has one queue per cpu, whereas you already
have two per cpu. In SMP configurations this may mean that SLQB has more
queues than SLAB.
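
As a rough worked illustration of how those counts add up (bookkeeping
structures only, per kmem cache, using the queue types mentioned in this
thread; the real allocators differ in detail):

	SLAB, N nodes with C CPUs each:
	    per-CPU arrays:       N * C
	    shared per-node:      N
	    alien arrays:         N * (N - 1)     <- the O(N^2) term
	SLQB, as counted above:
	    per-CPU lists:        2 * N * C       (freelist + remote-free list)
	    plus per-node lists

	e.g. N = 4, C = 16:  SLAB ~ 64 + 4 + 12 = 80 queues per cache,
	                     SLQB ~ 128 plus the per-node lists;
	     N = 16, C = 4:  SLAB's alien term alone is 16 * 15 = 240.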



^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-15 20:47                         ` Christoph Lameter
@ 2009-01-16  3:43                           ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-16  3:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Thu, Jan 15, 2009 at 02:47:02PM -0600, Christoph Lameter wrote:
> On Thu, 15 Jan 2009, Nick Piggin wrote:
> 
> > Definitely it is not uncontrollable. And not unchangeable. It is
> > about the least sensitive part of the allocator because in a serious
> > workload, the queues will continually be bounded by watermarks rather
> > than timer reaping.
> 
> The application that is interrupted has no control over when SLQB runs its
> expiration. The longer the queues the longer the holdoff. Look at the
> changelogs for various queue expiration things in the kernel. I fixed up a
> couple of those over the years for latency reasons.

Interrupts and timers etc. as well as preemption by kernel threads happen
everywhere in the kernel. I have not seen any reason why slab queue reaping
in particular is a problem.

Any slab allocator is going to have a whole lot of theoretical problems and
you simply won't be able to fix them all because some require an oracle or
others fundamentally conflict with another theoretical problem.

I concentrate on the main practical problems and the end result. If I see
evidence of some problem caused, then I will do my best to fix it.


> > > Object dispersal
> > > in the kernel address space.
> >
> > You mean due to lower order allocations?
> > 1. I have not seen any results showing this gives a practical performance
> >    increase, let alone one that offsets the downsides of using higher
> >    order allocations.
> 
> Well yes with enterprise app you are likely not going to see it. Run HPC
> and other low latency tests (Infiniband based and such).

So do you have any results or not?

 
> > 2. Increased internal fragmentation may also have the opposite effect and
> >    result in worse packing.
> 
> Memory allocations in latency critical appls are generally done in
> contexts where high latencies are tolerable (f.e. at startup).
> 
> > 3. There is no reason why SLQB can't use higher order allocations if this
> >    is a significant win.
> 
> It still will have to move objects between queues? Or does it adapt the
> slub method of "queue" per page?

It has several queues that objects can move between. You keep asserting
that this is a problem.

 
> > > Memory policy handling in the slab
> > > allocator.
> >
> > I see no reason why this should be a problem. The SLUB merge just asserted
> > it would be a problem. But actually SLAB seems to handle it just fine, and
> > SLUB also doesn't always obey memory policies, so I consider that to be a
> > worse problem, at least until it is justified by performance numbers that
> > show otherwise.
> 
> Well I wrote the code in SLAB that does this. And AFAICT this was a very
> bad hack that I had to put in after all the original developers of the
> NUMA slab stuff vanished and things began to segfault.
> 
> SLUB obeys memory policies. It just uses the page allocator for this by
> doing an allocation *without* specifying the node that memory has to come
> from. SLAB manages memory strictly per node. So it always has to ask for
> memory from a particular node. Hence the need to implement memory policies
> in the allocator.

You only go to the allocator when the percpu queue goes empty though, so
if memory policy changes (e.g. across a context switch or the like), then
subsequent allocations will be of the wrong policy.

That is what I call a hack, which is made in order to solve a perceived
performance problem. The SLAB/SLQB method of checking policy is simple,
obviously correct, and until there is a *demonstrated* performance problem
with that, I'm not going to change it.


> > > Even seems to include periodic moving of objects between
> > > queues.
> >
> > The queues expire slowly. Same as SLAB's arrays. You are describing the
> > implementation, and not the problems it has.
> 
> Periodic movement again introduces processing spikes and pollution of the
> cpu caches.

I don't think this is a problem. Anyway, RT systems that care about such
tiny latencies can easily prioritise this, and ones that don't care so
much have many other sources of interrupts and background processing from
the kernel and from hardware.

If this actually *is* a problem, I will allow an option to turn off
periodic trimming of queues and allow objects to remain in queues (like the
page allocator does with its queues), and just provide hooks to reap them
at low memory time.
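
A minimal sketch of what such a knob could look like (hypothetical parameter
and helper names, not actual SLQB code; the initial schedule_delayed_work()
at init time is omitted):

	#include <linux/cache.h>
	#include <linux/jiffies.h>
	#include <linux/moduleparam.h>
	#include <linux/workqueue.h>

	/* Hypothetical tunable: 0 disables periodic trimming entirely. */
	static unsigned int slab_trim_interval_secs __read_mostly = 3;
	module_param(slab_trim_interval_secs, uint, 0644);

	static void trim_percpu_queues(void);	/* hypothetical reaping helper */

	static void cache_trim_fn(struct work_struct *work);
	static DECLARE_DELAYED_WORK(cache_trim_work, cache_trim_fn);

	static void cache_trim_fn(struct work_struct *work)
	{
		if (!slab_trim_interval_secs)
			return;	/* trimming off: queues drain on demand or
				   via a low-memory reaping hook instead */

		trim_percpu_queues();	/* return idle objects to their pages */
		schedule_delayed_work(&cache_trim_work,
				      slab_trim_interval_secs * HZ);
	}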

 
> > There needs to be some fallback cases added to slowpaths to handle
> > these things, but I don't see why it would take much work.
> 
> The need for that fallback comes from the SLAB methodology used....

The fallback will probably be adapted from SLUB.

 
> > > SLQB maybe a good cleanup for SLAB. Its good that it is based on the
> > > cleaned up code in SLUB but the fundamental design is SLAB (or rather the
> > > Solaris allocator from which we got the design for all the queuing stuff
> > > in the first place). It preserves many of the drawbacks of that code.
> >
> > It is _like_ slab. It avoids the major drawbacks of large footprint of
> > array caches, and O(N^2) memory consumption behaviour, and corner cases
> > where scalability is poor. The queueing behaviour of SLAB IMO is not
> > a drawback and it is a big reaon why SLAB is so good.
> 
> Queuing and the explosions of the number of queues with the alien caches
> resulted in the potential of portions of memory vanishing into these
> queues. Queueing means unused objects are in those queues stemming from
> pages that would otherwise (if the the free object would be "moved" back
> to the page) be available for other kernel uses.

It's strange. You perceive theoretical problems in things that I actually
consider a distinct *advantage* of SLAB/SLQB: order-0 allocations,
queueing, strictly obeying NUMA policies...


> > > If SLQB would replace SLAB then there would be a lot of shared code
> > > (debugging for example). Having a generic slab allocator framework may
> > > then be possible within which a variety of algorithms may be implemented.
> >
> > The goal is to replace SLAB and SLUB. Anything less would be a failure
> > on behalf of SLQB. Shared code is not a bad thing, but the major problem
> > is the actual core behaviour of the allocator because it affects almost
> > everywhere in the kernel and splitting userbase is not a good thing.
> 
> I still dont see the problem that SLQB is addressing (aside from code
> cleanup of SLAB). Seems that you feel that the queueing behavior of SLAB
> is okay.

It addresses the O(NR_CPUS^2) memory consumption of kmem caches, and the
large constant consumption of SLAB's array caches. It addresses scalability,
e.g. in situations with lots of cores per node. It allows resizable
queues. It addresses the code complexity and bootstrap hoops of SLAB.

It addresses performance and higher order allocation problems of SLUB.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-15 20:05                   ` Christoph Lameter
@ 2009-01-16  3:19                     ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-16  3:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Thu, Jan 15, 2009 at 02:05:55PM -0600, Christoph Lameter wrote:
> On Thu, 15 Jan 2009, Nick Piggin wrote:
> 
> > > The higher orders can fail and will then result in the allocator doing
> > > order 0 allocs. It is not a failure condition.
> >
> > But they increase pressure on the resource and reduce availability to
> > other higher order allocations. They accelerate the breakdown of the
> > anti-frag heuristics, and they make slab internal fragmentation worse.
> > They also simply cost more to allocate and free and reclaim.
> 
> The costs are less since there is no need to have metadata for each of
> the 4k sized pages. Instead there is one contiguous chunk that can be
> tracked as a whole.

In terms of the memory footprint of metadata, SLQB and SLUB both use the
struct page for this, so there is no advantage. In terms of performance,
it seems that SLUB is not faster with higher order allocations, but is
slower with order-0 allocations. So this seems to just be an argument
for using higher order allocations in SLUB, rather than an argument
against using order-0 in SLQB or SLAB.

 
> > > Higher orders are an
> > > advantage because they localize variables of the same type and therefore
> > > reduce TLB pressure.
> >
> > They are also a disadvantage. The disadvantages are very real. The
> > advantage is a bit theoretical (how much really is it going to help
> > going from 4K to 32K, if you still have hundreds or thousands of
> > slabs anyway?). Also, there is no reason why the other allocators
> > cannot use higher orer allocations, but their big advantage is that
> > they don't need to.
> 
> The benefit of going from 4k to 32k is that 8 times as many objects may be
> handled by the same 2M TLB covering a 32k page. If the 4k pages are
> dispersed then you may need 8 2M tlbs (which covers already a quarter of
> the available 2M TLBs on nehalem f.e.) for which the larger alloc just
> needs a single one.

Yes I know that. But it's pretty theoretical IMO (and I could equally
describe a theoretical situation where increased fragmentation in higher
order slabs will result in worse TLB coverage).

Has there been a demonstrated advantage of this? That outweighs the
costs of using higher order allocs? If so, then I could just add some
parameters to SLQB to allow higher order allocs as well.

 
> > I'd like to see any real numbers showing this is a problem. Queue
> > trimming in SLQB can easily be scaled or tweaked to change latency
> > characteristics. The fact is that it isn't a very critical or highly
> > tuned operation. It happens _very_ infrequently in the large scheme
> > of things, and could easily be changed if there is a problem.
> 
> Queue trimming can be configured in the same way in SLAB. But this means
> that you are forever tuning these things as loads vary. Thats one of the

As I said, this is not my experience. Queue trimming is a third order
problem and barely makes any difference in a running workload. It just
serves to clean up unused memory when slabs stop being used. There is
basically no performance implication to it (watermarks are far more
important to queue management of a busy slab).

Tens of millions of objects can be allocated and freed between queue
trimming intervals.
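
For context, a sketch of the watermark bounding being referred to
(illustrative names and numbers, not the actual SLQB code or tunables): the
free path flushes a batch back to the slab pages as soon as the per-CPU list
crosses a high watermark, so the periodic timer only ever sees whatever is
left idle below it.

	#define QUEUE_HIGH_WATERMARK	256
	#define QUEUE_FLUSH_BATCH	64

	struct object { struct object *next; };

	struct cpu_list {
		struct object *head;
		unsigned long nr;
	};

	static void release_to_slab_page(struct object *obj);	/* hypothetical */

	/* Return up to 'count' objects to the free lists of their slab pages. */
	static void flush_to_pages(struct cpu_list *l, unsigned long count)
	{
		while (count-- && l->head) {
			struct object *obj = l->head;

			l->head = obj->next;
			l->nr--;
			release_to_slab_page(obj);
		}
	}

	static void queue_free(struct cpu_list *l, struct object *obj)
	{
		obj->next = l->head;
		l->head = obj;
		if (++l->nr > QUEUE_HIGH_WATERMARK)
			flush_to_pages(l, QUEUE_FLUSH_BATCH);
	}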


> frustrations that led to the SLUB design. Also if the objects in queues
> are not bound to particular page (as in slub) then traversal of the queues
> can be very TLB fault intensive.

Yeah but it happens so infrequently it should be in the noise. I have
never seen any real problems caused by this. Have you?

 
> > What you have in SLUB IMO is not obviously better because it effectively
> > has sizeable queues in higher order partial and free pages and the
> > active page, which simply never get trimmed, AFAIKS. This can be harmful
> > for slab internal fragmentation as well in some situations.
> 
> It has lists of free objects that are bound to a particular page. That
> simplifies numa handling since all the objects in a "queue" (or page) have
> the same NUMA characteristics.

The same can be said of SLQB and SLAB as well.

> There is no moving between queues
> (there is one exception but in general that is true) because the page list can
> become the percpu list by just using the pointer to the head object.

Exactly the same way SLQB can move a queue of remotely freed objects back
to the allocating processor. It just manipulates a head pointer. It doesn't
walk every object in the list.
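
A minimal sketch of that constant-time hand-off (illustrative structures;
the real code also needs the appropriate locking for the cross-CPU case):

	struct object { struct object *next; };

	struct remote_free_list {
		struct object *head, *tail;
		unsigned long nr;
	};

	struct local_free_list {
		struct object *head;
		unsigned long nr;
	};

	/* Splice everything remote CPUs freed onto our local freelist:
	 * a few pointer updates, never a walk over individual objects. */
	static void claim_remote_frees(struct local_free_list *local,
				       struct remote_free_list *remote)
	{
		if (!remote->head)
			return;

		remote->tail->next = local->head;
		local->head = remote->head;
		local->nr += remote->nr;
		remote->head = remote->tail = NULL;
		remote->nr = 0;
	}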


> Slab internal fragmentation is already a problem in SLAB. The solution
> would be a targeted reclaim mechanism. Something like what I proposed in
> with the slab defrag patches.
 
It is still a much bigger problem with 16x larger pages. If you have
targeted reclaim and 32K slabs, then you will still have a bigger problem
than with targeted reclaim and 4K slabs (and the targeted reclaim will be
less efficient because it will have to free more objects).


> There is no need for trimming since there is no queue in the SLAB sense. A
> page is assigned to a processor and then that processor takes objects off
> the freelist and may free objects back to the freelist of that page that
> was assigned to a processor. Memory wastage may only occur because
> each processor needs to have a separate page from which to allocate. SLAB
> like designs needs to put a large number of objects in queues which may
> keep a number of pages in the allocated pages pool although all objects
> are unused. That does not occur with slub.

That's wrong. SLUB keeps completely free pages on its partial lists, and
also IIRC can keep free pages pinned in the per-cpu page. I have actually
seen SLQB use less memory than SLUB in some situations for this reason.



^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-15  6:19                       ` Nick Piggin
@ 2009-01-15 20:47                         ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-15 20:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Thu, 15 Jan 2009, Nick Piggin wrote:

> Definitely it is not uncontrollable. And not unchangeable. It is
> about the least sensitive part of the allocator because in a serious
> workload, the queues will continually be bounded by watermarks rather
> than timer reaping.

The application that is interrupted has no control over when SLQB runs its
expiration. The longer the queues the longer the holdoff. Look at the
changelogs for various queue expiration things in the kernel. I fixed up a
couple of those over the years for latency reasons.

> > Object dispersal
> > in the kernel address space.
>
> You mean due to lower order allocations?
> 1. I have not seen any results showing this gives a practical performance
>    increase, let alone one that offsets the downsides of using higher
>    order allocations.

Well, yes, with an enterprise app you are likely not going to see it. Run
HPC and other low-latency tests (InfiniBand-based and such).

> 2. Increased internal fragmentation may also have the opposite effect and
>    result in worse packing.

Memory allocations in latency-critical applications are generally done in
contexts where high latencies are tolerable (e.g. at startup).

> 3. There is no reason why SLQB can't use higher order allocations if this
>    is a significant win.

It will still have to move objects between queues? Or does it adopt the
SLUB method of a "queue" per page?

> > Memory policy handling in the slab
> > allocator.
>
> I see no reason why this should be a problem. The SLUB merge just asserted
> it would be a problem. But actually SLAB seems to handle it just fine, and
> SLUB also doesn't always obey memory policies, so I consider that to be a
> worse problem, at least until it is justified by performance numbers that
> show otherwise.

Well I wrote the code in SLAB that does this. And AFAICT this was a very
bad hack that I had to put in after all the original developers of the
NUMA slab stuff vanished and things began to segfault.

SLUB obeys memory policies. It just uses the page allocator for this by
doing an allocation *without* specifying the node that memory has to come
from. SLAB manages memory strictly per node. So it always has to ask for
memory from a particular node. Hence the need to implement memory policies
in the allocator.
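
In page-allocator terms the difference is roughly the following (simplified
sketch; the wrapper names are made up, but alloc_pages() and
alloc_pages_node() are the real page-allocator entry points):

	#include <linux/gfp.h>
	#include <linux/mm.h>

	/*
	 * SLUB-style: ask for a slab page without naming a node, so the
	 * current task's mempolicy/cpuset is applied by the page allocator --
	 * but only at the moment a new slab page is actually needed.
	 */
	static struct page *slab_page_policy_in_buddy(gfp_t flags,
						      unsigned int order)
	{
		return alloc_pages(flags, order);
	}

	/*
	 * SLAB-style: memory is managed strictly per node, so the slab
	 * allocator has to pick a node from the policy itself and ask for
	 * it explicitly.
	 */
	static struct page *slab_page_policy_in_slab(int nid, gfp_t flags,
						     unsigned int order)
	{
		return alloc_pages_node(nid, flags, order);
	}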

> > Even seems to include periodic moving of objects between
> > queues.
>
> The queues expire slowly. Same as SLAB's arrays. You are describing the
> implementation, and not the problems it has.

Periodic movement again introduces processing spikes and pollution of the
cpu caches.

> There needs to be some fallback cases added to slowpaths to handle
> these things, but I don't see why it would take much work.

The need for that fallback comes from the SLAB methodology used....

> > SLQB maybe a good cleanup for SLAB. Its good that it is based on the
> > cleaned up code in SLUB but the fundamental design is SLAB (or rather the
> > Solaris allocator from which we got the design for all the queuing stuff
> > in the first place). It preserves many of the drawbacks of that code.
>
> It is _like_ slab. It avoids the major drawbacks of large footprint of
> array caches, and O(N^2) memory consumption behaviour, and corner cases
> where scalability is poor. The queueing behaviour of SLAB IMO is not
> a drawback and it is a big reaon why SLAB is so good.

Queueing, and the explosion in the number of queues with the alien caches,
resulted in the potential for portions of memory to vanish into these
queues. Queueing means unused objects sit in those queues, stemming from
pages that would otherwise (if the free objects were "moved" back
to the page) be available for other kernel uses.

> > If SLQB would replace SLAB then there would be a lot of shared code
> > (debugging for example). Having a generic slab allocator framework may
> > then be possible within which a variety of algorithms may be implemented.
>
> The goal is to replace SLAB and SLUB. Anything less would be a failure
> on behalf of SLQB. Shared code is not a bad thing, but the major problem
> is the actual core behaviour of the allocator because it affects almost
> everywhere in the kernel and splitting userbase is not a good thing.

I still don't see the problem that SLQB is addressing (aside from the code
cleanup of SLAB). It seems that you feel that the queueing behavior of SLAB
is okay.

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-15  6:03                 ` Nick Piggin
@ 2009-01-15 20:05                   ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-15 20:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Thu, 15 Jan 2009, Nick Piggin wrote:

> > The higher orders can fail and will then result in the allocator doing
> > order 0 allocs. It is not a failure condition.
>
> But they increase pressure on the resource and reduce availability to
> other higher order allocations. They accelerate the breakdown of the
> anti-frag heuristics, and they make slab internal fragmentation worse.
> They also simply cost more to allocate and free and reclaim.

The costs are less since there is no need to have metadata for each of
the 4k sized pages. Instead there is one contiguous chunk that can be
tracked as a whole.

> > Higher orders are an
> > advantage because they localize variables of the same type and therefore
> > reduce TLB pressure.
>
> They are also a disadvantage. The disadvantages are very real. The
> advantage is a bit theoretical (how much really is it going to help
> going from 4K to 32K, if you still have hundreds or thousands of
> slabs anyway?). Also, there is no reason why the other allocators
> cannot use higher orer allocations, but their big advantage is that
> they don't need to.

The benefit of going from 4k to 32k is that 8 times as many objects may be
covered by the same 2M TLB entry when they are packed into one 32k page. If
the 4k pages are dispersed, then you may need eight 2M TLB entries (which
already covers a quarter of the available 2M TLB entries on Nehalem, for
example), whereas the larger allocation needs just a single one.
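
Spelling out the arithmetic (a back-of-the-envelope reading of the numbers
above, not a measurement):

	one 2M TLB entry maps 2M = 512 x 4k pages = 64 x 32k slabs

	to reach 8 x 4k worth of same-type objects:
	  - packed into one 32k slab: the (naturally aligned) slab lies
	    inside a single 2M region, so one 2M TLB entry covers it;
	  - dispersed over 8 separate 4k pages: worst case they fall into
	    8 different 2M regions, i.e. up to 8 TLB entries.

	taking the "quarter of the available 2M TLBs" figure at face value,
	that implies roughly 32 such entries on Nehalem.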

> > > The idea of removing queues doesn't seem so good to me. Queues are good.
> > > You amortize or avoid all sorts of things with queues. We have them
> > > everywhere in the kernel ;)
> >
> > Queues require maintenance which introduces variability because queue
> > cleaning has to be done periodically and the queues grow in number if NUMA
> > scenarios have to be handled effectively. This is a big problem for low
> > latency applications (like in HPC). Spending far too much time optimizing
> > queue cleaning in SLAB lead to the SLUB idea.
>
> I'd like to see any real numbers showing this is a problem. Queue
> trimming in SLQB can easily be scaled or tweaked to change latency
> characteristics. The fact is that it isn't a very critical or highly
> tuned operation. It happens _very_ infrequently in the large scheme
> of things, and could easily be changed if there is a problem.

Queue trimming can be configured in the same way in SLAB. But this means
that you are forever tuning these things as loads vary. That's one of the
frustrations that led to the SLUB design. Also, if the objects in queues
are not bound to a particular page (as they are in SLUB) then traversal of
the queues can be very TLB-fault intensive.

> What you have in SLUB IMO is not obviously better because it effectively
> has sizeable queues in higher order partial and free pages and the
> active page, which simply never get trimmed, AFAIKS. This can be harmful
> for slab internal fragmentation as well in some situations.

It has lists of free objects that are bound to a particular page. That
simplifies NUMA handling, since all the objects in a "queue" (or page) have
the same NUMA characteristics. There is no moving between queues
(there is one exception, but in general that is true) because the page's
list can become the percpu list just by using the pointer to the head object.
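
A simplified sketch of that hand-off (illustrative structures, not the
actual SLUB code): making a page the active per-CPU slab just takes over
its freelist pointer, with no per-object work.

	struct object { struct object *next; };

	struct slab_page {
		struct object *freelist;	/* free objects, all from this page */
		int node;
	};

	struct cpu_slab {
		struct slab_page *page;		/* the active slab for this CPU */
		struct object *freelist;	/* taken over from page->freelist */
	};

	static void activate_slab(struct cpu_slab *c, struct slab_page *page)
	{
		c->page = page;
		c->freelist = page->freelist;
		page->freelist = NULL;		/* objects now handed to this CPU */
	}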

Slab internal fragmentation is already a problem in SLAB. The solution
would be a targeted reclaim mechanism, something like what I proposed
with the slab defrag patches.

There is no need for trimming since there is no queue in the SLAB sense. A
page is assigned to a processor, and that processor then takes objects off
the page's freelist and may free objects back to the freelist of the page
that was assigned to it. Memory wastage may only occur because each
processor needs to have a separate page from which to allocate. SLAB-like
designs need to put a large number of objects in queues, which may keep a
number of pages in the allocated-pages pool although all their objects are
unused. That does not occur with SLUB.




^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-14 18:40                     ` Christoph Lameter
@ 2009-01-15  6:19                       ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-15  6:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Wed, Jan 14, 2009 at 12:40:12PM -0600, Christoph Lameter wrote:
> On Wed, 14 Jan 2009, Nick Piggin wrote:
> 
> > Well if you would like to consider SLQB as a fix for SLUB, that's
> > fine by me ;) Actually I guess it is a valid way to look at the problem:
> > SLQB solves the OLTP regression, so the only question is "what is the
> > downside of it?".
> 
> The downside is that it brings the SLAB stuff back that SLUB was
> designed to avoid. Queue expiration.

What does this mean? Something distinct from the periodic timer?

> The use of timers to expire at
> uncontrollable intervals for user space.

I am not convinced this is a problem. I would like to see evidence
that it is a problem, but I have only seen assertions.

Definitely it is not uncontrollable. And not unchangeable. It is
about the least sensitive part of the allocator because in a serious
workload, the queues will continually be bounded by watermarks rather
than timer reaping.


> Object dispersal
> in the kernel address space.

You mean due to lower order allocations?
1. I have not seen any results showing this gives a practical performance
   increase, let alone one that offsets the downsides of using higher
   order allocations.
2. Increased internal fragmentation may also have the opposite effect and
   result in worse packing.
3. There is no reason why SLQB can't use higher order allocations if this
   is a significant win.


> Memory policy handling in the slab
> allocator.

I see no reason why this should be a problem. The SLUB merge just asserted
it would be a problem. But actually SLAB seems to handle it just fine, and
SLUB also doesn't always obey memory policies, so I consider that to be a
worse problem, at least until it is justified by performance numbers that
show otherwise.


> Even seems to include periodic moving of objects between
> queues.

The queues expire slowly. Same as SLAB's arrays. You are describing the
implementation, and not the problems it has.


> The NUMA stuff is still a bit foggy to me since it seems to assume
> a mapping between cpus and nodes. There are cpuless nodes as well as
> memoryless cpus.

That needs a little bit of work, but my primary focus is to come up
with a design that has competitive performance in the most important
cases.

There needs to be some fallback cases added to slowpaths to handle
these things, but I don't see why it would take much work.

 
> SLQB maybe a good cleanup for SLAB. Its good that it is based on the
> cleaned up code in SLUB but the fundamental design is SLAB (or rather the
> Solaris allocator from which we got the design for all the queuing stuff
> in the first place). It preserves many of the drawbacks of that code.

It is _like_ slab. It avoids the major drawbacks of that design: the large
footprint of array caches, the O(N^2) memory consumption behaviour, and
corner cases where scalability is poor. The queueing behaviour of SLAB IMO
is not a drawback, and it is a big reason why SLAB is so good.

 
> If SLQB would replace SLAB then there would be a lot of shared code
> (debugging for example). Having a generic slab allocator framework may
> then be possible within which a variety of algorithms may be implemented.

The goal is to replace SLAB and SLUB. Anything less would be a failure
on behalf of SLQB. Shared code is not a bad thing, but the major problem
is the actual core behaviour of the allocator because it affects almost
everywhere in the kernel and splitting userbase is not a good thing.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
@ 2009-01-15  6:19                       ` Nick Piggin
  0 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-15  6:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Wed, Jan 14, 2009 at 12:40:12PM -0600, Christoph Lameter wrote:
> On Wed, 14 Jan 2009, Nick Piggin wrote:
> 
> > Well if you would like to consider SLQB as a fix for SLUB, that's
> > fine by me ;) Actually I guess it is a valid way to look at the problem:
> > SLQB solves the OLTP regression, so the only question is "what is the
> > downside of it?".
> 
> The downside is that it brings the SLAB stuff back that SLUB was
> designed to avoid. Queue expiration.

What's this mean? Something distinct from periodic timer?

> The use of timers to expire at
> uncontrollable intervals for user space.

I am not convinced this is a problem. I would like to see evidence
that it is a problem, but I have only seen assertions.

Definitely it is not uncontrollable. And not unchangeable. It is
about the least sensitive part of the allocator because in a serious
workload, the queues will continually be bounded by watermarks rather
than timer reaping.


> Object dispersal
> in the kernel address space.

You mean due to lower order allocations?
1. I have not seen any results showing this gives a practical performance
   increase, let alone one that offsets the downsides of using higher
   order allocations.
2. Increased internal fragmentation may also have the opposite effect and
   result in worse packing.
3. There is no reason why SLQB can't use higher order allocations if this
   is a significant win.


> Memory policy handling in the slab
> allocator.

I see no reason why this should be a problem. The SLUB merge just asserted
it would be a problem. But actually SLAB seems to handle it just fine, and
SLUB also doesn't always obey memory policies, so I consider that to be a
worse problem, at least until it is justified by performance numbers that
show otherwise.


> Even seems to include periodic moving of objects between
> queues.

The queues expire slowly. Same as SLAB's arrays. You are describing the
implementation, and not the problems it has.


> The NUMA stuff is still a bit foggy to me since it seems to assume
> a mapping between cpus and nodes. There are cpuless nodes as well as
> memoryless cpus.

That needs a little bit of work, but my primary focus is to come up
with a design that has competitive performance in the most important
cases.

There needs to be some fallback cases added to slowpaths to handle
these things, but I don't see why it would take much work.

 
> SLQB maybe a good cleanup for SLAB. Its good that it is based on the
> cleaned up code in SLUB but the fundamental design is SLAB (or rather the
> Solaris allocator from which we got the design for all the queuing stuff
> in the first place). It preserves many of the drawbacks of that code.

It is _like_ slab. It avoids the major drawbacks of large footprint of
array caches, and O(N^2) memory consumption behaviour, and corner cases
where scalability is poor. The queueing behaviour of SLAB IMO is not
a drawback and it is a big reaon why SLAB is so good.

 
> If SLQB would replace SLAB then there would be a lot of shared code
> (debugging for example). Having a generic slab allocator framework may
> then be possible within which a variety of algorithms may be implemented.

The goal is to replace SLAB and SLUB. Anything less would be a failure
on SLQB's part. Shared code is not a bad thing, but the real issue
is the actual core behaviour of the allocator, because it affects almost
everything in the kernel, and splitting the user base is not a good thing.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-14 18:01               ` Christoph Lameter
@ 2009-01-15  6:03                 ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-15  6:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Wed, Jan 14, 2009 at 12:01:32PM -0600, Christoph Lameter wrote:
> On Wed, 14 Jan 2009, Nick Piggin wrote:
> 
> > Right, but that regression isn't my only problem with SLUB. I think
> > higher order allocations could be much more damaging for a wider
> > class of users. It is less common to see higher order allocation failure
> > reports in places other than lkml, where people tend to have systems
> > stay up longer and/or do a wider range of things with them.
> 
> The higher orders can fail and will then result in the allocator doing
> order 0 allocs. It is not a failure condition.

But they increase pressure on the resource and reduce availability for
other higher order allocations. They accelerate the breakdown of the
anti-fragmentation heuristics, and they make slab internal fragmentation worse.
They also simply cost more to allocate, free, and reclaim.
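
One way to see the internal fragmentation point: a partially used slab
cannot be returned to the page allocator, so a single live object pins
the whole slab, and the pinned unit is 8x larger at order 3 than at
order 0. A throwaway sketch, with example sizes of my own choosing:

#include <stdio.h>

int main(void)
{
        unsigned int obj_size = 256;            /* example object size */
        unsigned int orders[2] = { 0, 3 };      /* order-0 (4K) vs order-3 (32K) slabs */

        for (int i = 0; i < 2; i++) {
                unsigned int slab_bytes = 4096u << orders[i];

                /* worst case: one object still live, everything else freed */
                printf("order %u: %u objects per slab, 1 live object pins %u bytes\n",
                       orders[i], slab_bytes / obj_size, slab_bytes);
        }
        return 0;
}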


> Higher orders are an
> advantage because they localize variables of the same type and therefore
> reduce TLB pressure.

They are also a disadvantage. The disadvantages are very real. The
advantage is a bit theoretical (how much is it really going to help
going from 4K to 32K, if you still have hundreds or thousands of
slabs anyway?). Also, there is no reason why the other allocators
cannot use higher order allocations, but their big advantage is that
they don't need to.

 
> > The idea of removing queues doesn't seem so good to me. Queues are good.
> > You amortize or avoid all sorts of things with queues. We have them
> > everywhere in the kernel ;)
> 
> Queues require maintenance which introduces variability because queue
> cleaning has to be done periodically and the queues grow in number if NUMA
> scenarios have to be handled effectively. This is a big problem for low
> latency applications (like in HPC). Spending far too much time optimizing
> queue cleaning in SLAB led to the SLUB idea.

I'd like to see any real numbers showing this is a problem. Queue
trimming in SLQB can easily be scaled or tweaked to change latency
characteristics. The fact is that it isn't a very critical or highly
tuned operation. It happens _very_ infrequently in the large scheme
of things, and could easily be changed if there is a problem.

What you have in SLUB IMO is not obviously better because it effectively
has sizeable queues in higher order partial and free pages and the
active page, which simply never get trimmed, AFAIKS. This can be harmful
for slab internal fragmentation as well in some situations.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-14 15:59                   ` Nick Piggin
@ 2009-01-14 18:40                     ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-14 18:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Wed, 14 Jan 2009, Nick Piggin wrote:

> Well if you would like to consider SLQB as a fix for SLUB, that's
> fine by me ;) Actually I guess it is a valid way to look at the problem:
> SLQB solves the OLTP regression, so the only question is "what is the
> downside of it?".

The downside is that it brings the SLAB stuff back that SLUB was
designed to avoid. Queue expiration. The use of timers to expire at
uncontrollable intervals for user space. Object dispersal
in the kernel address space. Memory policy handling in the slab
allocator. Even seems to include periodic moving of objects between
queues. The NUMA stuff is still a bit foggy to me since it seems to assume
a mapping between cpus and nodes. There are cpuless nodes as well as
memoryless cpus.

SLQB may be a good cleanup for SLAB. It's good that it is based on the
cleaned-up code in SLUB, but the fundamental design is SLAB (or rather the
Solaris allocator from which we got the design for all the queuing stuff
in the first place). It preserves many of the drawbacks of that code.

If SLQB would replace SLAB then there would be a lot of shared code
(debugging for example). Having a generic slab allocator framework may
then be possible within which a variety of algorithms may be implemented.

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-14 15:09             ` Nick Piggin
@ 2009-01-14 18:01               ` Christoph Lameter
  -1 siblings, 0 replies; 197+ messages in thread
From: Christoph Lameter @ 2009-01-14 18:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Zhang, Yanmin, Lin Ming, linux-mm, linux-kernel,
	Andrew Morton, Linus Torvalds

On Wed, 14 Jan 2009, Nick Piggin wrote:

> Right, but that regression isn't my only problem with SLUB. I think
> higher order allocations could be much more damaging for a wider
> class of users. It is less common to see higher order allocation failure
> reports in places other than lkml, where people tend to have systems
> stay up longer and/or do a wider range of things with them.

The higher orders can fail and will then result in the allocator doing
order 0 allocs. It is not a failure condition. Higher orders are an
advantage because they localize variables of the same type and therefore
reduce TLB pressure.

> The idea of removing queues doesn't seem so good to me. Queues are good.
> You amortize or avoid all sorts of things with queues. We have them
> everywhere in the kernel ;)

Queues require maintenance which introduces variability because queue
cleaning has to be done periodically and the queues grow in number if NUMA
scenarios have to be handled effectively. This is a big problem for low
latency applications (like in HPC). Spending far too much time optimizing
queue cleaning in SLAB led to the SLUB idea.

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-14 15:30                 ` Pekka Enberg
@ 2009-01-14 15:59                   ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-14 15:59 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Zhang, Yanmin, Lin Ming, Christoph Lameter, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Wed, Jan 14, 2009 at 05:30:48PM +0200, Pekka Enberg wrote:
> Hi Nick,
> 
> On Wed, Jan 14, 2009 at 5:22 PM, Nick Piggin <npiggin@suse.de> wrote:
> > And... IIRC, the Intel guys did make a stink but it wasn't considered
> > so important or worthwhile to fix for some reason? Anyway, the fact is
> > that it hadn't been fixed in SLUB. Hmm, I guess it is a significant
> > failure of SLUB that it hasn't managed to replace SLAB by this point.
> 
> Again, not speaking for Christoph, but *I* do consider the regression
> to be important and I do want it to be fixed. I have asked for a test
> case to reproduce the regression and/or oprofile reports but have yet
> to receive them. I did fix one regression I saw with the fio benchmark
> but unfortunately it wasn't the same regression the Intel guys are
> hitting. I suppose we're in limbo now because the people who are
> affected by the regression can simply turn on CONFIG_SLAB.

Mmm. SLES11 will ship with CONFIG_SLAB, FWIW. No, I actually had no
input into the decision. And I have mixed feelings about
that, because there are places where SLAB is better.

But I must say that SLAB seems to be a really good allocator, and
outside of some types of microbenchmarks where it would sometimes
be much slower, SLAB was often my main performance competitor
and often very hard to match with SLQB, let alone beat. That's not
to say SLUB wasn't also often the faster of the two, but I was
surprised at how good SLAB is.

 
> In any case, I do agree that the inability to replace SLAB with SLUB
> is a failure of the latter. I'm just not totally convinced that it's
> because the SLUB code is unfixable ;).

Well if you would like to consider SLQB as a fix for SLUB, that's
fine by me ;) Actually I guess it is a valid way to look at the problem:
SLQB solves the OLTP regression, so the only question is "what is the
downside of it?".




^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-14 15:22               ` Nick Piggin
@ 2009-01-14 15:30                 ` Pekka Enberg
  -1 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-01-14 15:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Zhang, Yanmin, Lin Ming, Christoph Lameter, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

Hi Nick,

On Wed, Jan 14, 2009 at 5:22 PM, Nick Piggin <npiggin@suse.de> wrote:
> And... IIRC, the Intel guys did make a stink but it wasn't considered
> so important or worthwhile to fix for some reason? Anyway, the fact is
> that it hadn't been fixed in SLUB. Hmm, I guess it is a significant
> failure of SLUB that it hasn't managed to replace SLAB by this point.

Again, not speaking for Christoph, but *I* do consider the regression
to be important and I do want it to be fixed. I have asked for a test
case to reproduce the regression and/or oprofile reports but have yet
to receive them. I did fix one regression I saw with the fio benchmark
but unfortunately it wasn't the same regression the Intel guys are
hitting. I suppose we're in limbo now because the people who are
affected by the regression can simply turn on CONFIG_SLAB.

In any case, I do agree that the inability to replace SLAB with SLUB
is a failure of the latter. I'm just not totally convinced that it's
because the SLUB code is unfixable ;).

                                Pekka

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-14 15:09             ` Nick Piggin
@ 2009-01-14 15:22               ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-14 15:22 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Zhang, Yanmin, Lin Ming, Christoph Lameter, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

Hit send a bit too early here.

On Wed, Jan 14, 2009 at 04:09:00PM +0100, Nick Piggin wrote:
> On Wed, Jan 14, 2009 at 04:45:15PM +0200, Pekka Enberg wrote:
> > 
> > Don't get me wrong, though. I am happy you are able to work with the
> > Intel engineers to fix the long standing issue (I want it fixed too!)
> > but I would be happier if the end-result was few simple patches
> > against mm/slub.c :-).
> 
> Right, but that regression isn't my only problem with SLUB. I think
> higher order allocations could be much more damaging for a wider
> class of users. It is less common to see higher order allocation failure

Sorry, *more* common.


> reports in places other than lkml, where people tend to have systems
> stay up longer and/or do a wider range of things with them.
> 
> The idea of removing queues doesn't seem so good to me. Queues are good.
> You amortize or avoid all sorts of things with queues. We have them
> everywhere in the kernel ;)
> 
>  
> > On Wed, Jan 14, 2009 at 4:22 PM, Nick Piggin <npiggin@suse.de> wrote:
> > > I'd love to be able to justify replacing SLAB and SLUB today, but actually
> > > it is simply never going to be trivial to discover performance regressions.
> > > So I don't think outright replacement is great either (consider if SLUB
> > > had replaced SLAB completely).
> > 
> > If you ask me, I wish we *had* removed SLAB so relevant people could
> > have made a huge stink out of it and the regression would have been
> > taken care of quickly ;-).
> 
> Well, presumably the stink was made because we've been stuck with SLAB
> for 2 years for some reason. But it is not only that one; there were
> other regressions too. The point simply is that it would have been much
> harder for users to detect if there even is a regression, what with all
> the other changes happening.

It would have been harder to detect SLUB vs SLAB regressions if users
had been forced to bisect (which could lead, e.g., to CFS or GSO) rather
than simply select between two allocators.

And... IIRC, the Intel guys did make a stink but it wasn't considered
so important or worthwhile to fix for some reason? Anyway, the fact is
that it hadn't been fixed in SLUB. Hmm, I guess it is a significant
failure of SLUB that it hasn't managed to replace SLAB by this point.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-14 14:45           ` Pekka Enberg
@ 2009-01-14 15:09             ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-14 15:09 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Zhang, Yanmin, Lin Ming, Christoph Lameter, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Wed, Jan 14, 2009 at 04:45:15PM +0200, Pekka Enberg wrote:
> Hi Nick,
> 
> On Wed, Jan 14, 2009 at 4:22 PM, Nick Piggin <npiggin@suse.de> wrote:
> > The problem is there was apparently no plan for resolving the SLAB vs SLUB
> > strategy. And then features and things were added to one or the other one.
> > But on the other hand, the SLUB experience was a success in a way because
> > there were a lot of performance regressions found and fixed after it was
> > merged, for example.
> 
> That's not completely true. I can't speak for Christoph, but the
> biggest problem I have is that I have _no way_ of reproducing or
> analyzing the regression. I've tried out various benchmarks I have
> access to but I haven't been able to find anything.
> 
> The hypothesis is that SLUB regresses because of kmalloc()/kfree()
> ping-pong between CPUs and as far as I understood, Christoph thinks we
> can improve SLUB with the per-cpu alloc patches and the freelist
> management rework.
> 
> Don't get me wrong, though. I am happy you are able to work with the
> Intel engineers to fix the long standing issue (I want it fixed too!)
> but I would be happier if the end-result was a few simple patches
> against mm/slub.c :-).

Right, but that regression isn't my only problem with SLUB. I think
higher order allocations could be much more damaging for a wider
class of users. It is less common to see higher order allocation failure
reports in places other than lkml, where people tend to have systems
stay up longer and/or do a wider range of things with them.

The idea of removing queues doesn't seem so good to me. Queues are good.
You amortize or avoid all sorts of things with queues. We have them
everywhere in the kernel ;)
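
The amortization argument, as a toy sketch (an illustration only, not
code from any of the allocators; refill_from_shared() is a made-up
stand-in for the expensive shared-structure work):

#define BATCH   32

struct local_queue {
        void *objs[BATCH];
        int nr;
};

/* stand-in for the slow part: shared locks, cold cachelines, asking
 * the page allocator, and so on */
int refill_from_shared(void **objs, int max);

static void *queue_alloc(struct local_queue *q)
{
        if (q->nr == 0)                 /* slow path, hit once per BATCH allocations */
                q->nr = refill_from_shared(q->objs, BATCH);
        if (q->nr == 0)
                return 0;               /* shared pool exhausted */
        return q->objs[--q->nr];        /* fast path: local state only */
}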

 
> On Wed, Jan 14, 2009 at 4:22 PM, Nick Piggin <npiggin@suse.de> wrote:
> > I'd love to be able to justify replacing SLAB and SLUB today, but actually
> > it is simply never going to be trivial to discover performance regressions.
> > So I don't think outright replacement is great either (consider if SLUB
> > had replaced SLAB completely).
> 
> If you ask me, I wish we *had* removed SLAB so relevant people could
> have made a huge stink out of it and the regression would have been
> taken care of quickly ;-).

Well, presumably the stink was made because we've been stuck with SLAB
for 2 years for some reason. But it is not only that one; there were
other regressions too. The point simply is that it would have been much
harder for users to detect if there even is a regression, what with all
the other changes happening.



^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-14 14:22         ` Nick Piggin
@ 2009-01-14 14:45           ` Pekka Enberg
  -1 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-01-14 14:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Zhang, Yanmin, Lin Ming, Christoph Lameter, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

Hi Nick,

On Wed, Jan 14, 2009 at 4:22 PM, Nick Piggin <npiggin@suse.de> wrote:
> The problem is there was apparently no plan for resolving the SLAB vs SLUB
> strategy. And then features and things were added to one or the other one.
> But on the other hand, the SLUB experience was a success in a way because
> there were a lot of performance regressions found and fixed after it was
> merged, for example.

That's not completely true. I can't speak for Christoph, but the
biggest problem I have is that I have _no way_ of reproducing or
analyzing the regression. I've tried out various benchmarks I have
access to but I haven't been able to find anything.

The hypothesis is that SLUB regresses because of kmalloc()/kfree()
ping-pong between CPUs and as far as I understood, Christoph thinks we
can improve SLUB with the per-cpu alloc patches and the freelist
management rework.
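
The access pattern in question looks roughly like the following
userspace toy. It is purely illustrative (not a reproducer of the OLTP
case): one thread allocates, the other thread, ideally pinned to a
different CPU, does all the frees, so every free is remote with respect
to the allocating CPU's caches.

#include <pthread.h>
#include <stdlib.h>

#define N 100000

static void * volatile slots[N];

static void *producer(void *arg)
{
        for (int i = 0; i < N; i++)
                slots[i] = malloc(128);         /* stands in for kmalloc() */
        return NULL;
}

static void *consumer(void *arg)
{
        for (int i = 0; i < N; i++) {
                while (slots[i] == NULL)        /* wait for the producer (assumes malloc succeeds) */
                        ;
                free(slots[i]);                 /* remote free, stands in for kfree() */
        }
        return NULL;
}

int main(void)
{
        pthread_t p, c;

        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
}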

Don't get me wrong, though. I am happy you are able to work with the
Intel engineers to fix the long standing issue (I want it fixed too!)
but I would be happier if the end-result was a few simple patches
against mm/slub.c :-).

On Wed, Jan 14, 2009 at 4:22 PM, Nick Piggin <npiggin@suse.de> wrote:
> I'd love to be able to justify replacing SLAB and SLUB today, but actually
> it is simply never going to be trivial to discover performance regressions.
> So I don't think outright replacement is great either (consider if SLUB
> had replaced SLAB completely).

If you ask me, I wish we *had* removed SLAB so relevant people could
have made a huge stink out of it and the regression would have been
taken care of quickly ;-).

                       Pekka

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-14 13:44       ` Pekka Enberg
@ 2009-01-14 14:22         ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-14 14:22 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Zhang, Yanmin, Lin Ming, Christoph Lameter, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Wed, Jan 14, 2009 at 03:44:44PM +0200, Pekka Enberg wrote:
> Hi Nick,
> 
> On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
> > The core allocator algorithms are so completely different that it is
> > obviously as different from SLUB as SLUB is from SLAB (apart from peripheral
> > support code and code structure). So it may as well be a patch against
> > SLAB.
> >
> > I will also prefer to maintain it myself because as I've said I don't
> > really agree with choices made in SLUB (and ergo SLUB developers don't
> > agree with SLQB).
> 
> Just for the record, I am only interested in getting rid of SLAB (and
> SLOB if we can serve the embedded folks as well in the future, sorry
> Matt). Now, if that means we have to replace SLUB with SLQB, I am fine
> with that. Judging from the SLAB -> SLUB experience, though, I am not
> so sure adding a completely separate allocator is the way to get
> there.

The problem is there was apparently no plan for resolving the SLAB vs SLUB
strategy. And then features and things were added to one or the other one.
But on the other hand, the SLUB experience was a success in a way because
there were a lot of performance regressions found and fixed after it was
merged, for example.

I'd love to be able to justify replacing SLAB and SLUB today, but actually
it is simply never going to be trivial to discover performance regressions.
So I don't think outright replacement is great either (consider if SLUB
had replaced SLAB completely).


> On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
> > Note that I'm not trying to be nasty here. Of course I raised objections
> > to things I don't like, and I don't think I'm right by default. Just IMO
> > SLUB has some problems. As do SLAB and SLQB of course. Nothing is
> > perfect.
> >
> > Also, I don't want to propose replacing any of the other allocators yet,
> > until more performance data is gathered. People need to compare each one.
> > SLQB definitely is not a clear winner in all tests. At the moment I want
> > to see healthy competition and hopefully one day decide on just one of
> > the main 3.
> 
> OK, then it is really up to Andrew and Linus to decide whether they
> want to merge it or not. I'm not violently against it, it's just that
> there's some maintenance overhead for API changes and for external
> code like kmemcheck, kmemtrace, and failslab, that need hooks in the
> slab allocator.

Sure. On the slab side, I would be happy to do that work.

We split the user base, which is a big problem if it drags out for years
like SLAB/SLUB. But if it is a very deliberate use of testing resources
in order to make progress on the issue, then that can be a positive.


 
> On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
> > Cache colouring was just brought over from SLAB. prefetching was done
> > by looking at cache misses generally, and attempting to reduce them.
> > But you end up barely making a significant difference or just pushing
> > the cost elsewhere really. Down to the level of prefetching it is
> > going to hugely depend on the exact behaviour of the workload and
> > the allocator.
> 
> As far as I understood, the prefetch optimizations can produce
> unexpected results on some systems (yes, bit of hand-waving here), so
> I would consider ripping them out. Even if cache coloring isn't a huge
> win on most systems, it's probably not going to hurt either.

I hit problems on some microbenchmark where I was prefetching a NULL pointer in
some cases, which must have been causing the CPU to trap internally, and alloc
overhead suddenly got much higher ;)

Possibly sometimes if you prefetch too early or when various queues are
already full, then it ends up causing slowdowns too.
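
The guarded form of the pattern, mirroring the hunk quoted just below,
is simply this (a sketch; __builtin_prefetch(p, 1) stands in for the
kernel's prefetchw()):

/* pop one object and warm the next one; assumes the list is not empty */
static void *pop_and_prefetch_next(void **freelist)
{
        void *object = *freelist;

        *freelist = *(void **)object;           /* next free object is stored inside this one */
        if (*freelist)                          /* never prefetch a NULL pointer... */
                __builtin_prefetch(*freelist, 1);  /* ...only a real next object, for write */
        return object;
}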

 
> On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
> >> > +   object = page->freelist;
> >> > +   page->freelist = get_freepointer(s, object);
> >> > +   if (page->freelist)
> >> > +           prefetchw(page->freelist);
> >>
> >> I don't understand this prefetchw(). Who exactly is going to be updating
> >> contents of page->freelist?
> >
> > Again, it is for the next allocation. This was shown to reduce cache
> > misses here in IIRC tbench, but I'm not sure if that translated to a
> > significant performance improvement.
> 
> I'm not sure why you would want to optimize for the next allocation. I
> mean, I'd expect us to optimize for the kmalloc() + do some work +
> kfree() case where prefetching is likely to hurt more than help. Not
> that I have any numbers on this.

That's true, OTOH if the time between allocations is large, then an
extra prefetch is just a small cost. If the time between them is
short, it might be a bigger win.


^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-14 11:47     ` Nick Piggin
@ 2009-01-14 13:44       ` Pekka Enberg
  -1 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-01-14 13:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Zhang, Yanmin, Lin Ming, Christoph Lameter, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

Hi Nick,

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
> The core allocator algorithms are so completely different that it is
> obviously as different from SLUB as SLUB is from SLAB (apart from peripheral
> support code and code structure). So it may as well be a patch against
> SLAB.
>
> I will also prefer to maintain it myself because as I've said I don't
> really agree with choices made in SLUB (and ergo SLUB developers don't
> agree with SLQB).

Just for the record, I am only interested in getting rid of SLAB (and
SLOB if we can serve the embedded folks as well in the future, sorry
Matt). Now, if that means we have to replace SLUB with SLQB, I am fine
with that. Judging from the SLAB -> SLUB experience, though, I am not
so sure adding a completely separate allocator is the way to get
there.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
> Note that I'm not trying to be nasty here. Of course I raised objections
> to things I don't like, and I don't think I'm right by default. Just IMO
> SLUB has some problems. As do SLAB and SLQB of course. Nothing is
> perfect.
>
> Also, I don't want to propose replacing any of the other allocators yet,
> until more performance data is gathered. People need to compare each one.
> SLQB definitely is not a clear winner in all tests. At the moment I want
> to see healthy competition and hopefully one day decide on just one of
> the main 3.

OK, then it is really up to Andrew and Linus to decide whether they
want to merge it or not. I'm not violently against it, it's just that
there's some maintenance overhead for API changes and for external
code like kmemcheck, kmemtrace, and failslab, that need hooks in the
slab allocator.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
>> One thing that puzzles me a bit is that in addition to the struct
>> kmem_cache_list caching, I also see things like cache coloring, avoiding
>> page allocator pass-through, and lots of prefetch hints in the code
>> which makes evaluating the performance differences quite difficult. If
>> these optimizations *are* a win, then why don't we add them to SLUB?
>
> I don't know. I don't have enough time of day to work on SLQB enough,
> let alone attempt to do all this for SLUB as well. Especially when I
> think there are fundamental problems with the basic design of it.
>
> None of those optimisations you mention really showed a noticeable win
> anywhere (except avoiding page allocator pass-through perhaps, simply
> because that is not an optimisation, rather it would be a de-optimisation
> to *add* page allocator pass-through for SLQB, so maybe it would slow
> down some loads).
>
> Cache colouring was just brought over from SLAB. prefetching was done
> by looking at cache misses generally, and attempting to reduce them.
> But you end up barely making a significant difference or just pushing
> the cost elsewhere really. Down to the level of prefetching it is
> going to hugely depend on the exact behaviour of the workload and
> the allocator.

As far as I understood, the prefetch optimizations can produce
unexpected results on some systems (yes, bit of hand-waving here), so
I would consider ripping them out. Even if cache coloring isn't a huge
win on most systems, it's probably not going to hurt either.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
>> A completely different topic is memory efficiency of SLQB. The current
>> situation is that SLOB out-performs SLAB by huge margin whereas SLUB is
>> usually quite close. With the introduction of kmemtrace, I'm hopeful
>> that we will be able to fix up many of the badly fitting allocations in
>> the kernel to narrow the gap between SLUB and SLOB even more and I worry
>> SLQB will take us back to the SLAB numbers.
>
> Fundamentally it is more like SLOB and SLUB in that it uses object
> pointers and can allocate down to very small sizes. It doesn't have
> O(NR_CPUS^2) type behaviours or preallocated array caches like SLAB.
> I didn't look closely at memory efficiency, but I have no reason to
> think it would be a problem.

Right, that's nice to hear.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
>> > +/*
>> > + * slqb_page overloads struct page, and is used to manage some slob allocation
>> > + * aspects, however to avoid the horrible mess in include/linux/mm_types.h,
>> > + * we'll just define our own struct slqb_page type variant here.
>> > + */
>>
>> You say horrible mess, I say convenient. I think it's good that core vm
>> hackers who have no interest in the slab allocator can clearly see we're
>> overloading some of the struct page fields.
>
> Yeah, but you can't really. There are so many places that overload them
> for different things and don't tell you about it right in that file. But
> it mostly works because we have nice layering and compartmentalisation.
>
> Anyway IIRC my initial patches to do some of these conversions actually
> either put the definitions into mm_types.h or at least added references
> to them in mm_types.h. It is the better way to go really because you get
> better type checking and it is readable. You may say the horrible mess is
> readable. Barely. Imagine how it would be if we put everything in there.

Well, if we only had one slab allocator... But yeah, point taken.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
>> > +   object = page->freelist;
>> > +   page->freelist = get_freepointer(s, object);
>> > +   if (page->freelist)
>> > +           prefetchw(page->freelist);
>>
>> I don't understand this prefetchw(). Who exactly is going to be updating
>> contents of page->freelist?
>
> Again, it is for the next allocation. This was shown to reduce cache
> misses here in IIRC tbench, but I'm not sure if that translated to a
> significant performance improvement.

I'm not sure why you would want to optimize for the next allocation. I
mean, I'd expect us to optimize for the kmalloc() + do some work +
kfree() case where prefetching is likely to hurt more than help. Not
that I have any numbers on this.

                                       Pekka

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
@ 2009-01-14 13:44       ` Pekka Enberg
  0 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-01-14 13:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Zhang, Yanmin, Lin Ming, Christoph Lameter, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

Hi Nick,

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
> The core allocator algorithms are so completely different that it is
> obviously as different from SLUB as SLUB is from SLAB (apart from peripheral
> support code and code structure). So it may as well be a patch against
> SLAB.
>
> I will also prefer to maintain it myself because as I've said I don't
> really agree with choices made in SLUB (and ergo SLUB developers don't
> agree with SLQB).

Just for the record, I am only interested in getting rid of SLAB (and
SLOB if we can serve the embedded folks as well in the future, sorry
Matt). Now, if that means we have to replace SLUB with SLQB, I am fine
with that. Judging from the SLAB -> SLUB experience, though, I am not
so sure adding a completely separate allocator is the way to get
there.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
> Note that I'm not trying to be nasty here. Of course I raised objections
> to things I don't like, and I don't think I'm right by default. Just IMO
> SLUB has some problems. As do SLAB and SLQB of course. Nothing is
> perfect.
>
> Also, I don't want to propose replacing any of the other allocators yet,
> until more performance data is gathered. People need to compare each one.
> SLQB definitely is not a clear winner in all tests. At the moment I want
> to see healthy competition and hopefully one day decide on just one of
> the main 3.

OK, then it is really up to Andrew and Linus to decide whether they
want to merge it or not. I'm not violently against it, it's just that
there's some maintenance overhead for API changes and for external
code like kmemcheck, kmemtrace, and failslab, that need hooks in the
slab allocator.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
>> One thing that puzzles me a bit is that in addition to the struct
>> kmem_cache_list caching, I also see things like cache coloring, avoiding
>> page allocator pass-through, and lots of prefetch hints in the code
>> which makes evaluating the performance differences quite difficult. If
>> these optimizations *are* a win, then why don't we add them to SLUB?
>
> I don't know. I don't have enough time of day to work on SLQB enough,
> let alone attempt to do all this for SLUB as well. Especially when I
> think there are fundamental problems with the basic design of it.
>
> None of those optimisations you mention really showed a noticable win
> anywhere (except avoiding page allocator pass-through perhaps, simply
> because that is not an optimisation, rather it would be a de-optimisation
> to *add* page allocator pass-through for SLQB, so maybe it would aslow
> down some loads).
>
> Cache colouring was just brought over from SLAB. prefetching was done
> by looking at cache misses generally, and attempting to reduce them.
> But you end up barely making a significant difference or just pushing
> the cost elsewhere really. Down to the level of prefetching it is
> going to hugely depend on the exact behaviour of the workload and
> the allocator.

As far as I understood, the prefetch optimizations can produce
unexpected results on some systems (yes, bit of hand-waving here), so
I would consider ripping them out. Even if cache coloring isn't a huge
win on most systems, it's probably not going to hurt either.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
>> A completely different topic is memory efficiency of SLQB. The current
>> situation is that SLOB out-performs SLAB by huge margin whereas SLUB is
>> usually quite close. With the introduction of kmemtrace, I'm hopeful
>> that we will be able to fix up many of the badly fitting allocations in
>> the kernel to narrow the gap between SLUB and SLOB even more and I worry
>> SLQB will take us back to the SLAB numbers.
>
> Fundamentally it is more like SLOB and SLUB in that it uses object
> pointers and can allocate down to very small sizes. It doesn't have
> O(NR_CPUS^2) type behaviours or preallocated array caches like SLAB.
> I didn't look closely at memory efficiency, but I have no reason to
> think it would be a problem.

Right, that's nice to hear.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
>> > +/*
>> > + * slqb_page overloads struct page, and is used to manage some slob allocation
>> > + * aspects, however to avoid the horrible mess in include/linux/mm_types.h,
>> > + * we'll just define our own struct slqb_page type variant here.
>> > + */
>>
>> You say horrible mess, I say convenient. I think it's good that core vm
>> hackers who have no interest in the slab allocator can clearly see we're
>> overloading some of the struct page fields.
>
> Yeah, but you can't really. There are so many places that overload them
> for different things and don't tell you about it right in that file. But
> it mostly works because we have nice layering and compartmentalisation.
>
> Anyway IIRC my initial patches to do some of these conversions actually
> either put the definitions into mm_types.h or at least added references
> to them in mm_types.h. It is the better way to go really because you get
> better type checking and it is readable. You may say the horrible mess is
> readable. Barely. Imagine how it would be if we put everything in there.

Well, if we only had one slab allocator... But yeah, point taken.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <npiggin@suse.de> wrote:
>> > +   object = page->freelist;
>> > +   page->freelist = get_freepointer(s, object);
>> > +   if (page->freelist)
>> > +           prefetchw(page->freelist);
>>
>> I don't understand this prefetchw(). Who exactly is going to be updating
>> contents of page->freelist?
>
> Again, it is for the next allocation. This was shown to reduce cache
> misses here in IIRC tbench, but I'm not sure if that translated to a
> significant performance improvement.

I'm not sure why you would want to optimize for the next allocation. I
mean, I'd expect us to optimize for the kmalloc() + do some work +
kfree() case where prefetching is likely to hurt more than help. Not
that I have any numbers on this.

                                       Pekka

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
  2009-01-14 10:53   ` Pekka Enberg
@ 2009-01-14 11:47     ` Nick Piggin
  -1 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-14 11:47 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Zhang, Yanmin, Lin Ming, Christoph Lameter, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

On Wed, Jan 14, 2009 at 12:53:18PM +0200, Pekka Enberg wrote:
> Hi Nick,
> 
> On Wed, Jan 14, 2009 at 11:04 AM, Nick Piggin <npiggin@suse.de> wrote:
> > This is the latest SLQB patch. Since last time, we have imported the sysfs
> > framework from SLUB, and added specific event counters things for SLQB. I
> > had initially been somewhat against this because it makes SLQB depend on
> > another complex subsystem (which itself depends back on the slab allocator).
> > But I guess it is not fundamentally different than /proc, and there needs to
> > be some reporting somewhere. The individual per-slab counters really do make
> > performance analysis much easier. There is a Documentation/vm/slqbinfo.c
> > file, which is a parser adapted from slabinfo.c for SLUB.
> >
> > Fixed some bugs, including a nasty one that was causing remote objects to
> > sneak onto local freelist, which would mean NUMA allocation was basically
> > broken.
> >
> > The NUMA side of things is now much more complete. NUMA policies are obeyed.
> > There is still a known bug where it won't run on a system with CPU-only
> > nodes.
> >
> > CONFIG options are improved.
> >
> > Credit to some of the engineers at Intel for helping run tests, contributing
> > ideas and patches to improve performance and fix bugs.
> >
> > I think it is getting to the point where it is stable and featureful. It
> > really needs to be further proven in the performance area. We'd welcome
> > any performance results or suggestions for tests to run.
> >
> > After this round of review/feedback, I plan to set about getting SLQB merged.
> 
> The code looks sane but I am still a bit unhappy it's not a patchset on top of
> SLUB. We've discussed this in the past and you mentioned that the design is
> "completely different." Looking at it, I don't see any fundamental reason we
> can't do a struct kmem_cache_list layer on top of SLUB which would make
> merging of all this much less painful. I mean, at least in the past Linus hasn't
> been too keen on adding yet another slab allocator to the kernel and I must
> say judging from the SLAB -> SLUB experience, I'm not looking forward to it
> either.

Well SLUB has all this stuff in it to attempt to make it "unqueued", or
semi unqueued. None of that is required with SLQB; after the object
queues go away, the rest of SLQB is little more than a per-CPU SLOB with
individual slabs. But it also has important differences: it is per-CPU,
obeys NUMA policies strongly, and frees unused pages immediately (after
they drop off the object lists, which happens via periodic reaping).
Another of the major things I specifically avoid, for example, is
higher-order allocations.
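
Roughly, the per-CPU queueing structures look like this (a simplified
sketch, not the literal definitions from the patch, which has more
members and a different layout):

struct kmem_cache_list {
	struct {
		void **head;		/* LIFO freelist of objects */
		void **tail;
		int nr;
	} freelist;

	int nr_partial;			/* number of partially-full slabs */
	struct list_head partial;	/* ... and the list of them */

	int remote_free_check;		/* other CPUs freed enough back here */
	struct {
		spinlock_t lock;
		struct {
			void **head;
			void **tail;
			int nr;
		} list;
	} remote_free;
};

struct kmem_cache_cpu {
	struct kmem_cache_list list;	/* list for node-local slabs */
	unsigned int colour_next;
};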

The core allocator algorithms are so completely different that it is
obviously as different from SLUB as SLUB is from SLAB (apart from peripheral
support code and code structure). So it may as well be a patch against
SLAB.

I will also prefer to maintain it myself because as I've said I don't
really agree with choices made in SLUB (and ergo SLUB developers don't
agree with SLQB).

Note that I'm not trying to be nasty here. Of course I raised objections
to things I don't like, and I don't think I'm right by default. Just IMO
SLUB has some problems. As do SLAB and SLQB of course. Nothing is
perfect.

Also, I don't want to propose replacing any of the other allocators yet,
until more performance data is gathered. People need to compare each one.
SLQB definitely is not a clear winner in all tests. At the moment I want
to see healthy competition and hopefully one day decide on just one of
the main 3.


> Also, to merge this, we need to see numbers. I assume SLQB fixes the
> long-standing SLUB vs. SLAB regression reported by Intel and doesn't
> introduce new performance regressions? Also, it would be nice for me to
> be able to reproduce the numbers, especially for those tests where SLUB
> performs worse.

It is comparable to SLAB on Intel's OLTP test. I don't know exactly
where SLUB lies, but I think it is several % below that.

No big obvious new regressions yet, but of course we won't know that
without a lot more testing. SLQB isn't an outright winner in all cases.
For example, on machine A, tbench may be faster with SLAB, but on
machine B it turns out to be faster on SLQB. Another test might show
SLUB is better.

 
> One thing that puzzles me a bit is that in addition to the struct
> kmem_cache_list caching, I also see things like cache coloring, avoiding
> page allocator pass-through, and lots of prefetch hints in the code
> which makes evaluating the performance differences quite difficult. If
> these optimizations *are* a win, then why don't we add them to SLUB?

I don't know. I don't have enough time in the day to work on SLQB as it is,
let alone attempt to do all this for SLUB as well. Especially when I
think there are fundamental problems with the basic design of it.

None of those optimisations you mention really showed a noticeable win
anywhere (except perhaps avoiding page allocator pass-through, simply
because that is not an optimisation at all; it would be a de-optimisation
to *add* page allocator pass-through to SLQB, so doing that might slow
down some loads).

Cache colouring was just brought over from SLAB. Prefetching was done
by looking at cache misses generally, and attempting to reduce them.
But you end up barely making a significant difference, or just pushing
the cost elsewhere. At the level of individual prefetches it is going
to depend hugely on the exact behaviour of the workload and the
allocator.


> A completely different topic is memory efficiency of SLQB. The current
> situation is that SLOB out-performs SLAB by a huge margin whereas SLUB is
> usually quite close. With the introduction of kmemtrace, I'm hopeful
> that we will be able to fix up many of the badly fitting allocations in
> the kernel to narrow the gap between SLUB and SLOB even more and I worry
> SLQB will take us back to the SLAB numbers.

Fundamentally it is more like SLOB and SLUB in that it uses object
pointers and can allocate down to very small sizes. It doesn't have
O(NR_CPUS^2) type behaviours or preallocated array caches like SLAB.
I didn't look closely at memory efficiency, but I have no reason to
think it would be a problem.


> > +/*
> > + * Primary per-cpu, per-kmem_cache structure.
> > + */
> > +struct kmem_cache_cpu {
> > +	struct kmem_cache_list list; /* List for node-local slabs. */
> > +
> > +	unsigned int colour_next;
> 
> Do you see a performance improvement with cache coloring? IIRC,
> Christoph has stated in the past that SLUB doesn't do it because newer
> associative cache designs take care of the issue.

No I haven't seen an improvement.
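
For anyone not familiar with the technique: colour_next just staggers the
offset of the first object in successive slabs, roughly like this (an
illustrative sketch; colour_off and colour_range are assumed names here,
not necessarily what the patch uses):

	/* Start each new slab's first object at a slightly different offset
	 * so that hot objects from different slabs do not all map onto the
	 * same cache sets. */
	offset = c->colour_next * s->colour_off;
	c->colour_next++;
	if (c->colour_next >= s->colour_range)
		c->colour_next = 0;
	start = page_address(&page->page) + offset;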

> > +/*
> > + * Constant size allocations use this path to find index into kmalloc caches
> > + * arrays. get_slab() function is used for non-constant sizes.
> > + */
> > +static __always_inline int kmalloc_index(size_t size)
> > +{
> > +	if (unlikely(!size))
> > +		return 0;
> > +	if (unlikely(size > 1UL << KMALLOC_SHIFT_SLQB_HIGH))
> > +		return 0;
> 
> SLUB doesn't have the above check. Does it fix an actual bug? Should we
> add that to SLUB as well?

I think SLUB is OK without it because of its page allocator pass-through.

 
> > +	if (size <=	 64) return 6;
> > +	if (size <=	128) return 7;
> > +	if (size <=	256) return 8;
> > +	if (size <=	512) return 9;
> > +	if (size <=       1024) return 10;
> > +	if (size <=   2 * 1024) return 11;
> > +	if (size <=   4 * 1024) return 12;
> > +	if (size <=   8 * 1024) return 13;
> > +	if (size <=  16 * 1024) return 14;
> > +	if (size <=  32 * 1024) return 15;
> > +	if (size <=  64 * 1024) return 16;
> > +	if (size <= 128 * 1024) return 17;
> > +	if (size <= 256 * 1024) return 18;
> > +	if (size <= 512 * 1024) return 19;
> > +	if (size <= 1024 * 1024) return 20;
> > +	if (size <=  2 * 1024 * 1024) return 21;
> > +	return -1;
> 
> I suppose we could just make this one return zero and drop the above
> check?

I guess so... although this is for the constant folded path anyway,
so efficiency is not an issue.

 
> > +#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? sizeof(void *) : ARCH_KMALLOC_MINALIGN)
> > +
> > +static __always_inline void *kmalloc(size_t size, gfp_t flags)
> > +{
> 
> So no page allocator pass-through, why is that? Looking at commit
> aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB: direct pass through of
> page size or higher kmalloc requests"), I'd assume SLQB would get many
> of the same benefits as well? It seems like a bad idea to hang onto
> large chunks of pages in caches, no?

I don't think so. From that commit:

   Advantages:
    - Reduces memory overhead for kmalloc array

Fair point. But I'm attempting to compete primarily with SLAB rather than SLOB.

    - Large kmalloc operations are faster since they do not
      need to pass through the slab allocator to get to the
      page allocator.

SLQB is faster than the page allocator.

    - Performance increase of 10%-20% on alloc and 50% on free for
      PAGE_SIZEd allocations.
      SLUB must call page allocator for each alloc anyways since
      the higher order pages which that allowed avoiding the page alloc calls
      are not available in a reliable way anymore. So we are basically removing
      useless slab allocator overhead.

SLQB is more like SLAB in this regard so it doesn't have this problem.

    - Large kmallocs yields page aligned object which is what
      SLAB did. Bad things like using page sized kmalloc allocations to
      stand in for page allocate allocs can be transparently handled and are not
      distinguishable from page allocator uses.

I don't understand this one. Definitely SLQB should give page aligned
objects for large kmallocs too.

    - Checking for too large objects can be removed since
      it is done by the page allocator.

But the check is made for size > PAGE_SIZE anyway, so I don't see the
win.

    Drawbacks:
    - No accounting for large kmalloc slab allocations anymore
    - No debugging of large kmalloc slab allocations.

And SLQB doesn't suffer these drawbacks either, of course.
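
To spell out the difference being argued over (an illustration of the
two behaviours, not code lifted from either allocator):

static void *kmalloc_with_passthrough(size_t size, gfp_t flags)
{
	/* SLUB-style: page-sized and larger requests go straight to the
	 * page allocator. */
	if (size >= PAGE_SIZE)
		return (void *)__get_free_pages(flags, get_order(size));
	return __kmalloc(size, flags);
}

static void *kmalloc_without_passthrough(size_t size, gfp_t flags)
{
	/* SLQB-style: every size up to 1 << KMALLOC_SHIFT_SLQB_HIGH has its
	 * own kmalloc cache, so even multi-page requests are served from the
	 * per-CPU object queues and only fall back to the page allocator
	 * when a queue needs a fresh slab. */
	return __kmalloc(size, flags);
}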

> > +/*
> > + * slqb_page overloads struct page, and is used to manage some slob allocation
> > + * aspects, however to avoid the horrible mess in include/linux/mm_types.h,
> > + * we'll just define our own struct slqb_page type variant here.
> > + */
> 
> You say horrible mess, I say convenient. I think it's good that core vm
> hackers who have no interest in the slab allocator can clearly see we're
> overloading some of the struct page fields.

Yeah, but you can't really. There are so many places that overload them
for different things and don't tell you about it right in that file. But
it mostly works because we have nice layering and compartmentalisation.

Anyway IIRC my initial patches to do some of these conversions actually
either put the definitions into mm_types.h or at least added references
to them in mm_types.h. It is the better way to go really because you get
better type checking and it is readable. You may say the horrible mess is
readable. Barely. Imagine how it would be if we put everything in there.


> But as SLOB does it like
> this as well, I suppose we can keep it as-is.

I added that ;)

 
> > +struct slqb_page {
> > +	union {
> > +		struct {
> > +			unsigned long flags;	/* mandatory */
> > +			atomic_t _count;	/* mandatory */
> > +			unsigned int inuse;	/* Nr of objects */
> > +		   	struct kmem_cache_list *list; /* Pointer to list */
> > +			void **freelist;	/* freelist req. slab lock */
> > +			union {
> > +				struct list_head lru; /* misc. list */
> > +				struct rcu_head rcu_head; /* for rcu freeing */
> > +			};
> > +		};
> > +		struct page page;
> > +	};
> > +};
> > +static inline void struct_slqb_page_wrong_size(void)
> > +{ BUILD_BUG_ON(sizeof(struct slqb_page) != sizeof(struct page)); }
> > +
> > +#define PG_SLQB_BIT (1 << PG_slab)
> > +
> > +static int kmem_size __read_mostly;
> > +#ifdef CONFIG_NUMA
> > +static int numa_platform __read_mostly;
> > +#else
> > +#define numa_platform 0
> > +#endif
> 
> Hmm, why do we want to do this? If someone is running a CONFIG_NUMA
> kernel on an UMA machine, let them suffer?

Distros, mainly. SLAB does the same thing of course. There is a tiny
downside for the NUMA case (not measurable, but obviously another branch).
Not worth another config option, although I guess there could be a
config option to basically say "this config is exactly my machine; not
the maximum capabilities of a machine intended to run on this kernel".
That could be useful to everyone, including here.
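
In other words the consumer side looks something like this (a sketch;
free_remote() and free_local() are made-up helper names):

	if (unlikely(numa_platform &&
		     slqb_page_to_nid(page) != numa_node_id())) {
		/* Genuinely remote object: queue it back to its home node. */
		free_remote(s, page, object);
	} else {
		/* On !CONFIG_NUMA builds numa_platform is the constant 0, so
		 * this whole test is compiled away; on a CONFIG_NUMA kernel
		 * booted on UMA hardware it costs one well-predicted branch. */
		free_local(s, page, object);
	}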


> And if we *do* need to do this, can we move numa_platform() logic out of
> the memory allocator?

Possible. If it is moved out of SLAB it would make my life (slightly)
easier.

 
> > +#ifdef CONFIG_SMP
> > +/*
> > + * If enough objects have been remotely freed back to this list,
> > + * remote_free_check will be set. In which case, we'll eventually come here
> > + * to take those objects off our remote_free list and onto our LIFO freelist.
> > + *
> > + * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
> > + * list_lock in the case of per-node list.
> > + */
> > +static void claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
> > +{
> > +	void **head, **tail;
> > +	int nr;
> > +
> > +	VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
> > +
> > +	if (!l->remote_free.list.nr)
> > +		return;
> > +
> > +	l->remote_free_check = 0;
> > +	head = l->remote_free.list.head;
> > +	prefetchw(head);
> 
> So this prefetchw() is for flush_free_list(), right? A comment would be
> nice.

Either the flush or the next allocation, whichever comes first.

Added a comment.
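
(Roughly along these lines; illustrative wording, not a literal quote of
the new comment:)

	l->remote_free_check = 0;
	head = l->remote_free.list.head;
	/* The claimed objects will be walked very soon, either by
	 * flush_free_list() or by the next allocation popping them off the
	 * LIFO freelist, so start pulling the first one in for write now. */
	prefetchw(head);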
 

> > +static __always_inline void *__cache_list_get_object(struct kmem_cache *s, struct kmem_cache_list *l)
> > +{
> > +	void *object;
> > +
> > +	object = l->freelist.head;
> > +	if (likely(object)) {
> > +		void *next = get_freepointer(s, object);
> > +		VM_BUG_ON(!l->freelist.nr);
> > +		l->freelist.nr--;
> > +		l->freelist.head = next;
> > +		if (next)
> > +			prefetchw(next);
> 
> Why do we need this prefetchw() here?

For the next allocation call. But TBH I have not seen a significant
difference in any test. I alternate between commenting it out and not.
I guess when in doubt there should be less code and complexity...

> > +			if (next)
> > +				prefetchw(next);
> 
> Or here?

Ditto.

 
> > +
> > +	object = page->freelist;
> > +	page->freelist = get_freepointer(s, object);
> > +	if (page->freelist)
> > +		prefetchw(page->freelist);
> 
> I don't understand this prefetchw(). Who exactly is going to be updating
> contents of page->freelist?

Again, it is for the next allocation. This was shown to reduce cache
misses here (in tbench, IIRC), but I'm not sure whether that translated
into a significant performance improvement.

An alternate approach I have is a patch called "batchfeed", which
basically loads the entire page freelist in this path. But it costs
complexity and the last free word in struct page (which could be
gained back at the cost of yet more complexity). So I'm still on
the fence with this. I will have to take reports of regressions and
see if things like this help.


> > +/*
> > + * Perform some interrupts-on processing around the main allocation path
> > + * (debug checking and memset()ing).
> > + */
> > +static __always_inline void *slab_alloc(struct kmem_cache *s,
> > +		gfp_t gfpflags, int node, void *addr)
> > +{
> > +	void *object;
> > +	unsigned long flags;
> > +
> > +again:
> > +	local_irq_save(flags);
> > +	object = __slab_alloc(s, gfpflags, node);
> > +	local_irq_restore(flags);
> > +
> 
> As a cleanup, you could just do:
> 
>     if (unlikely(object == NULL))
>             return NULL;
> 
> here to avoid the double comparison. Maybe it even generates better asm.

Sometimes the stupid compiler loads a new literal to return with code
like this. I'll see.
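
For what it's worth, the suggested shape would be something like this
(an untested sketch of the same slab_alloc() quoted above):

static __always_inline void *slab_alloc(struct kmem_cache *s,
		gfp_t gfpflags, int node, void *addr)
{
	void *object;
	unsigned long flags;

again:
	local_irq_save(flags);
	object = __slab_alloc(s, gfpflags, node);
	local_irq_restore(flags);

	/* Bail out once, instead of testing 'object' in every branch below. */
	if (unlikely(!object))
		return NULL;

	if (unlikely(slab_debug(s))) {
		if (unlikely(!alloc_debug_processing(s, object, addr)))
			goto again;
	}

	if (unlikely(gfpflags & __GFP_ZERO))
		memset(object, 0, s->objsize);

	return object;
}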

> 
> > +	if (unlikely(slab_debug(s)) && likely(object)) {
> > +		if (unlikely(!alloc_debug_processing(s, object, addr)))
> > +			goto again;
> > +	}
> > +
> > +	if (unlikely(gfpflags & __GFP_ZERO) && likely(object))
> > +		memset(object, 0, s->objsize);
> > +
> > +	return object;
> > +}
> > +
> > +void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> > +{
> > +	int node = -1;
> > +#ifdef CONFIG_NUMA
> > +	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
> > +		node = alternate_nid(s, gfpflags, node);
> > +#endif
> > +	return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
> 
> The return address is wrong when kmem_cache_alloc() is called through
> __kmalloc().

Ah, good catch.

 
> As a side note, you can use the shorter _RET_IP_ instead of
> builtin_return_address(0) everywhere.

OK.

Thanks for the comments and discussion so far.

Nick

^ permalink raw reply	[flat|nested] 197+ messages in thread

* Re: [patch] SLQB slab allocator
@ 2009-01-14 10:53   ` Pekka Enberg
  0 siblings, 0 replies; 197+ messages in thread
From: Pekka Enberg @ 2009-01-14 10:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Zhang, Yanmin, Lin Ming, Christoph Lameter, linux-mm,
	linux-kernel, Andrew Morton, Linus Torvalds

Hi Nick,

On Wed, Jan 14, 2009 at 11:04 AM, Nick Piggin <npiggin@suse.de> wrote:
> This is the latest SLQB patch. Since last time, we have imported the sysfs
> framework from SLUB, and added specific event counters things for SLQB. I
> had initially been somewhat against this because it makes SLQB depend on
> another complex subsystem (which itself depends back on the slab allocator).
> But I guess it is not fundamentally different than /proc, and there needs to
> be some reporting somewhere. The individual per-slab counters really do make
> performance analysis much easier. There is a Documentation/vm/slqbinfo.c
> file, which is a parser adapted from slabinfo.c for SLUB.
>
> Fixed some bugs, including a nasty one that was causing remote objects to
> sneak onto local freelist, which would mean NUMA allocation was basically
> broken.
>
> The NUMA side of things is now much more complete. NUMA policies are obeyed.
> There is still a known bug where it won't run on a system with CPU-only
> nodes.
>
> CONFIG options are improved.
>
> Credit to some of the engineers at Intel for helping run tests, contributing
> ideas and patches to improve performance and fix bugs.
>
> I think it is getting to the point where it is stable and featureful. It
> really needs to be further proven in the performance area. We'd welcome
> any performance results or suggestions for tests to run.
>
> After this round of review/feedback, I plan to set about getting SLQB merged.

The code looks sane but I am still a bit unhappy it's not a patchset on top of
SLUB. We've discussed this in the past and you mentioned that the design is
"completely different." Looking at it, I don't see any fundamental reason we
can't do a struct kmem_cache_list layer on top of SLUB which would make
merging of all this much less painful. I mean, at least in the past Linus hasn't
been too keen on adding yet another slab allocator to the kernel and I must
say judging from the SLAB -> SLUB experience, I'm not looking forward to it
either.

Also, to merge this, we need to see numbers. I assume SLQB fixes the
long-standing SLUB vs. SLAB regression reported by Intel and doesn't
introduce new performance regressions? Also, it would be nice for me to
be able to reproduce the numbers, especially for those tests where SLUB
performs worse.

One thing that puzzles me a bit is that in addition to the struct
kmem_cache_list caching, I also see things like cache coloring, avoiding
page allocator pass-through, and lots of prefetch hints in the code
which makes evaluating the performance differences quite difficult. If
these optimizations *are* a win, then why don't we add them to SLUB?

A completely different topic is memory efficiency of SLQB. The current
situation is that SLOB out-performs SLAB by a huge margin whereas SLUB is
usually quite close. With the introduction of kmemtrace, I'm hopeful
that we will be able to fix up many of the badly fitting allocations in
the kernel to narrow the gap between SLUB and SLOB even more and I worry
SLQB will take us back to the SLAB numbers.

> +/*
> + * Primary per-cpu, per-kmem_cache structure.
> + */
> +struct kmem_cache_cpu {
> +	struct kmem_cache_list list; /* List for node-local slabs. */
> +
> +	unsigned int colour_next;

Do you see a performance improvement with cache coloring? IIRC,
Christoph has stated in the past that SLUB doesn't do it because newer
associative cache designs take care of the issue.

> +/*
> + * Constant size allocations use this path to find index into kmalloc caches
> + * arrays. get_slab() function is used for non-constant sizes.
> + */
> +static __always_inline int kmalloc_index(size_t size)
> +{
> +	if (unlikely(!size))
> +		return 0;
> +	if (unlikely(size > 1UL << KMALLOC_SHIFT_SLQB_HIGH))
> +		return 0;

SLUB doesn't have the above check. Does it fix an actual bug? Should we
add that to SLUB as well?

> +
> +	if (unlikely(size <= KMALLOC_MIN_SIZE))
> +		return KMALLOC_SHIFT_LOW;
> +
> +#if L1_CACHE_BYTES < 64
> +	if (size > 64 && size <= 96)
> +		return 1;
> +#endif
> +#if L1_CACHE_BYTES < 128
> +	if (size > 128 && size <= 192)
> +		return 2;
> +#endif
> +	if (size <=	  8) return 3;
> +	if (size <=	 16) return 4;
> +	if (size <=	 32) return 5;
> +	if (size <=	 64) return 6;
> +	if (size <=	128) return 7;
> +	if (size <=	256) return 8;
> +	if (size <=	512) return 9;
> +	if (size <=       1024) return 10;
> +	if (size <=   2 * 1024) return 11;
> +	if (size <=   4 * 1024) return 12;
> +	if (size <=   8 * 1024) return 13;
> +	if (size <=  16 * 1024) return 14;
> +	if (size <=  32 * 1024) return 15;
> +	if (size <=  64 * 1024) return 16;
> +	if (size <= 128 * 1024) return 17;
> +	if (size <= 256 * 1024) return 18;
> +	if (size <= 512 * 1024) return 19;
> +	if (size <= 1024 * 1024) return 20;
> +	if (size <=  2 * 1024 * 1024) return 21;
> +	return -1;

I suppose we could just make this one return zero and drop the above
check?
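
IOW, something like this (untested sketch; it relies on kmalloc_slab()
already treating index 0 as "no kmalloc cache"):

    static __always_inline int kmalloc_index(size_t size)
    {
            if (unlikely(!size))
                    return 0;
            /* ... size ladder exactly as above ... */
            if (size <=  2 * 1024 * 1024) return 21;
            /* Too big for the kmalloc caches: no index. */
            return 0;
    }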

> +#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? sizeof(void *) : ARCH_KMALLOC_MINALIGN)
> +
> +static __always_inline void *kmalloc(size_t size, gfp_t flags)
> +{

So no page allocator pass-through, why is that? Looking at commit
aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB: direct pass through of
page size or higher kmalloc requests"), I'd assume SLQB would get many
of the same benefits as well? It seems like a bad idea to hang onto
large chunks of pages in caches, no?
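
For reference, the SLUB pass-through looks roughly like this (sketch from
memory, not the exact code):

    static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
    {
            return (void *)__get_free_pages(flags | __GFP_COMP,
                                            get_order(size));
    }

    static __always_inline void *kmalloc(size_t size, gfp_t flags)
    {
            if (__builtin_constant_p(size)) {
                    if (size > PAGE_SIZE)
                            return kmalloc_large(size, flags);
                    /* ... otherwise look up the kmalloc cache as usual ... */
            }
            return __kmalloc(size, flags);
    }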

> +	if (__builtin_constant_p(size)) {
> +		struct kmem_cache *s;
> +
> +		s = kmalloc_slab(size, flags);
> +		if (unlikely(ZERO_OR_NULL_PTR(s)))
> +			return s;
> +
> +		return kmem_cache_alloc(s, flags);
> +	}
> +	return __kmalloc(size, flags);
> +}

> Index: linux-2.6/mm/slqb.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6/mm/slqb.c
> @@ -0,0 +1,3368 @@
> +/*
> + * SLQB: A slab allocator that focuses on per-CPU scaling, and good performance
> + * with order-0 allocations. Fastpath emphasis is placed on local allocation
> + * and freeing, but with a secondary goal of good remote freeing (freeing on
> + * another CPU from that which allocated).
> + *
> + * Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/bit_spinlock.h>
> +#include <linux/interrupt.h>
> +#include <linux/bitops.h>
> +#include <linux/slab.h>
> +#include <linux/seq_file.h>
> +#include <linux/cpu.h>
> +#include <linux/cpuset.h>
> +#include <linux/mempolicy.h>
> +#include <linux/ctype.h>
> +#include <linux/kallsyms.h>
> +#include <linux/memory.h>
> +
> +static inline int slab_hiwater(struct kmem_cache *s)
> +{
> +	return s->hiwater;
> +}
> +
> +static inline int slab_freebatch(struct kmem_cache *s)
> +{
> +	return s->freebatch;
> +}
> +
> +/*
> + * slqb_page overloads struct page, and is used to manage some slab allocation
> + * aspects; however, to avoid the horrible mess in include/linux/mm_types.h,
> + * we'll just define our own struct slqb_page type variant here.
> + */

You say horrible mess, I say convenient. I think it's good that core vm
hackers who have no interest in the slab allocator can clearly see we're
overloading some of the struct page fields. But as SLOB does it like
this as well, I suppose we can keep it as-is.

> +struct slqb_page {
> +	union {
> +		struct {
> +			unsigned long flags;	/* mandatory */
> +			atomic_t _count;	/* mandatory */
> +			unsigned int inuse;	/* Nr of objects */
> +		   	struct kmem_cache_list *list; /* Pointer to list */
> +			void **freelist;	/* freelist req. slab lock */
> +			union {
> +				struct list_head lru; /* misc. list */
> +				struct rcu_head rcu_head; /* for rcu freeing */
> +			};
> +		};
> +		struct page page;
> +	};
> +};
> +static inline void struct_slqb_page_wrong_size(void)
> +{ BUILD_BUG_ON(sizeof(struct slqb_page) != sizeof(struct page)); }
> +
> +#define PG_SLQB_BIT (1 << PG_slab)
> +
> +static int kmem_size __read_mostly;
> +#ifdef CONFIG_NUMA
> +static int numa_platform __read_mostly;
> +#else
> +#define numa_platform 0
> +#endif

Hmm, why do we want to do this? If someone is running a CONFIG_NUMA
kernel on a UMA machine, let them suffer?

And if we *do* need to do this, can we move numa_platform() logic out of
the memory allocator?

> +#ifdef CONFIG_SMP
> +/*
> + * If enough objects have been remotely freed back to this list,
> + * remote_free_check will be set. In which case, we'll eventually come here
> + * to take those objects off our remote_free list and onto our LIFO freelist.
> + *
> + * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
> + * list_lock in the case of per-node list.
> + */
> +static void claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
> +{
> +	void **head, **tail;
> +	int nr;
> +
> +	VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
> +
> +	if (!l->remote_free.list.nr)
> +		return;
> +
> +	l->remote_free_check = 0;
> +	head = l->remote_free.list.head;
> +	prefetchw(head);

So this prefetchw() is for flush_free_list(), right? A comment would be
nice.

> +
> +	spin_lock(&l->remote_free.lock);
> +	l->remote_free.list.head = NULL;
> +	tail = l->remote_free.list.tail;
> +	l->remote_free.list.tail = NULL;
> +	nr = l->remote_free.list.nr;
> +	l->remote_free.list.nr = 0;
> +	spin_unlock(&l->remote_free.lock);
> +
> +	if (!l->freelist.nr)
> +		l->freelist.head = head;
> +	else
> +		set_freepointer(s, l->freelist.tail, head);
> +	l->freelist.tail = tail;
> +
> +	l->freelist.nr += nr;
> +
> +	slqb_stat_inc(l, CLAIM_REMOTE_LIST);
> +	slqb_stat_add(l, CLAIM_REMOTE_LIST_OBJECTS, nr);
> +}
> +#endif
> +
> +/*
> + * Allocation fastpath. Get an object from the list's LIFO freelist, or
> + * return NULL if it is empty.
> + *
> + * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
> + * list_lock in the case of per-node list.
> + */
> +static __always_inline void *__cache_list_get_object(struct kmem_cache *s, struct kmem_cache_list *l)
> +{
> +	void *object;
> +
> +	object = l->freelist.head;
> +	if (likely(object)) {
> +		void *next = get_freepointer(s, object);
> +		VM_BUG_ON(!l->freelist.nr);
> +		l->freelist.nr--;
> +		l->freelist.head = next;
> +		if (next)
> +			prefetchw(next);

Why do we need this prefetchw() here?

> +		return object;
> +	}
> +	VM_BUG_ON(l->freelist.nr);
> +
> +#ifdef CONFIG_SMP
> +	if (unlikely(l->remote_free_check)) {
> +		claim_remote_free_list(s, l);
> +
> +		if (l->freelist.nr > slab_hiwater(s))
> +			flush_free_list(s, l);
> +
> +		/* repetition here helps gcc :( */
> +		object = l->freelist.head;
> +		if (likely(object)) {
> +			void *next = get_freepointer(s, object);
> +			VM_BUG_ON(!l->freelist.nr);
> +			l->freelist.nr--;
> +			l->freelist.head = next;
> +			if (next)
> +				prefetchw(next);

Or here?

> +			return object;
> +		}
> +		VM_BUG_ON(l->freelist.nr);
> +	}
> +#endif
> +
> +	return NULL;
> +}
> +
> +/*
> + * Slow(er) path. Get a page from this list's existing pages. Will be a
> + * new empty page in the case that __slab_alloc_page has just been called
> + * (empty pages otherwise never get queued up on the lists), or a partial page
> + * already on the list.
> + *
> + * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
> + * list_lock in the case of per-node list.
> + */
> +static noinline void *__cache_list_get_page(struct kmem_cache *s, struct kmem_cache_list *l)
> +{
> +	struct slqb_page *page;
> +	void *object;
> +
> +	if (unlikely(!l->nr_partial))
> +		return NULL;
> +
> +	page = list_first_entry(&l->partial, struct slqb_page, lru);
> +	VM_BUG_ON(page->inuse == s->objects);
> +	if (page->inuse + 1 == s->objects) {
> +		l->nr_partial--;
> +		list_del(&page->lru);
> +/*XXX		list_move(&page->lru, &l->full); */
> +	}
> +
> +	VM_BUG_ON(!page->freelist);
> +
> +	page->inuse++;
> +
> +//	VM_BUG_ON(node != -1 && node != slqb_page_to_nid(page));
> +
> +	object = page->freelist;
> +	page->freelist = get_freepointer(s, object);
> +	if (page->freelist)
> +		prefetchw(page->freelist);

I don't understand this prefetchw(). Who exactly is going to be updating
contents of page->freelist?

> +/*
> + * Perform some interrupts-on processing around the main allocation path
> + * (debug checking and memset()ing).
> + */
> +static __always_inline void *slab_alloc(struct kmem_cache *s,
> +		gfp_t gfpflags, int node, void *addr)
> +{
> +	void *object;
> +	unsigned long flags;
> +
> +again:
> +	local_irq_save(flags);
> +	object = __slab_alloc(s, gfpflags, node);
> +	local_irq_restore(flags);
> +

As a cleanup, you could just do:

    if (unlikely(object == NULL))
            return NULL;

here to avoid the double comparison. Maybe it even generates better asm.
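
That is, the tail of slab_alloc() would then read something like (untested):

    object = __slab_alloc(s, gfpflags, node);
    local_irq_restore(flags);

    if (unlikely(object == NULL))
            return NULL;

    if (unlikely(slab_debug(s))) {
            if (unlikely(!alloc_debug_processing(s, object, addr)))
                    goto again;
    }

    if (unlikely(gfpflags & __GFP_ZERO))
            memset(object, 0, s->objsize);

    return object;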

> +	if (unlikely(slab_debug(s)) && likely(object)) {
> +		if (unlikely(!alloc_debug_processing(s, object, addr)))
> +			goto again;
> +	}
> +
> +	if (unlikely(gfpflags & __GFP_ZERO) && likely(object))
> +		memset(object, 0, s->objsize);
> +
> +	return object;
> +}
> +
> +void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> +{
> +	int node = -1;
> +#ifdef CONFIG_NUMA
> +	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
> +		node = alternate_nid(s, gfpflags, node);
> +#endif
> +	return slab_alloc(s, gfpflags, node, __builtin_return_address(0));

The return address is wrong when kmem_cache_alloc() is called through
__kmalloc().

As a side note, you can use the shorter _RET_IP_ instead of
builtin_return_address(0) everywhere.
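
And __kmalloc() should then pass the caller down itself, something like
this (sketch; I'm assuming get_slab() is the non-constant-size lookup the
header comment refers to):

    void *__kmalloc(size_t size, gfp_t flags)
    {
            struct kmem_cache *s = get_slab(size, flags);

            if (unlikely(ZERO_OR_NULL_PTR(s)))
                    return s;

            return slab_alloc(s, flags, -1, (void *)_RET_IP_);
    }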

                            Pekka


^ permalink raw reply	[flat|nested] 197+ messages in thread

* [patch] SLQB slab allocator
@ 2009-01-14  9:04 ` Nick Piggin
  0 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-14  9:04 UTC (permalink / raw)
  To: Zhang, Yanmin, Lin Ming, Christoph Lameter, Pekka Enberg,
	linux-mm, linux-kernel, Andrew Morton

Hi,

This is the latest SLQB patch. Since last time, we have imported the sysfs
framework from SLUB, and added specific event counters things for SLQB. I
had initially been somewhat against this because it makes SLQB depend on
another complex subsystem (which itself depends back on the slab allocator).
But I guess it is not fundamentally different than /proc, and there needs to
be some reporting somewhere. The individual per-slab counters really do make
performance analysis much easier. There is a Documentation/vm/slqbinfo.c
file, which is a parser adapted from slabinfo.c for SLUB.

Fixed some bugs, including a nasty one that was causing remote objects to
sneak onto local freelist, which would mean NUMA allocation was basically
broken.

The NUMA side of things is now much more complete. NUMA policies are obeyed.
There is still a known bug where it won't run on a system with CPU-only
nodes.

CONFIG options are improved.

Credit to some of the engineers at Intel for helping run tests, contributing
ideas and patches to improve performance and fix bugs.

I think it is getting to the point where it is stable and featureful. It
really needs to be further proven in the performance area. We'd welcome
any performance results or suggestions for tests to run.

After this round of review/feedback, I plan to set about getting SLQB merged.
---
Index: linux-2.6/include/linux/rcupdate.h
===================================================================
--- linux-2.6.orig/include/linux/rcupdate.h
+++ linux-2.6/include/linux/rcupdate.h
@@ -33,6 +33,7 @@
 #ifndef __LINUX_RCUPDATE_H
 #define __LINUX_RCUPDATE_H
 
+#include <linux/rcu_types.h>
 #include <linux/cache.h>
 #include <linux/spinlock.h>
 #include <linux/threads.h>
@@ -42,16 +43,6 @@
 #include <linux/lockdep.h>
 #include <linux/completion.h>
 
-/**
- * struct rcu_head - callback structure for use with RCU
- * @next: next update requests in a list
- * @func: actual update function to call after the grace period.
- */
-struct rcu_head {
-	struct rcu_head *next;
-	void (*func)(struct rcu_head *head);
-};
-
 #if defined(CONFIG_CLASSIC_RCU)
 #include <linux/rcuclassic.h>
 #elif defined(CONFIG_TREE_RCU)
Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/slqb_def.h
@@ -0,0 +1,283 @@
+#ifndef _LINUX_SLQB_DEF_H
+#define _LINUX_SLQB_DEF_H
+
+/*
+ * SLQB : A slab allocator with object queues.
+ *
+ * (C) 2008 Nick Piggin <npiggin@suse.de>
+ */
+#include <linux/types.h>
+#include <linux/gfp.h>
+#include <linux/workqueue.h>
+#include <linux/kobject.h>
+#include <linux/rcu_types.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+#include <linux/kobject.h>
+
+enum stat_item {
+	ALLOC,			/* Allocation count */
+	ALLOC_SLAB_FILL,	/* Fill freelist from page list */
+	ALLOC_SLAB_NEW,		/* New slab acquired from page allocator */
+	FREE,			/* Free count */
+	FREE_REMOTE,		/* NUMA: freeing to remote list */
+	FLUSH_FREE_LIST,	/* Freelist flushed */
+	FLUSH_FREE_LIST_OBJECTS, /* Objects flushed from freelist */
+	FLUSH_FREE_LIST_REMOTE,	/* Objects flushed from freelist to remote */
+	FLUSH_SLAB_PARTIAL,	/* Freeing moves slab to partial list */
+	FLUSH_SLAB_FREE,	/* Slab freed to the page allocator */
+	FLUSH_RFREE_LIST,	/* Rfree list flushed */
+	FLUSH_RFREE_LIST_OBJECTS, /* Rfree objects flushed */
+	CLAIM_REMOTE_LIST,	/* Remote freed list claimed */
+	CLAIM_REMOTE_LIST_OBJECTS, /* Remote freed objects claimed */
+	NR_SLQB_STAT_ITEMS
+};
+
+/*
+ * Singly-linked list with head, tail, and nr
+ */
+struct kmlist {
+	unsigned long nr;
+	void **head, **tail;
+};
+
+/*
+ * Every kmem_cache_list has a kmem_cache_remote_free structure, by which
+ * objects can be returned to the kmem_cache_list from remote CPUs.
+ */
+struct kmem_cache_remote_free {
+	spinlock_t lock;
+	struct kmlist list;
+} ____cacheline_aligned;
+
+/*
+ * A kmem_cache_list manages all the slabs and objects allocated from a given
+ * source. Per-cpu kmem_cache_lists allow node-local allocations. Per-node
+ * kmem_cache_lists allow off-node allocations (but require locking).
+ */
+struct kmem_cache_list {
+	struct kmlist freelist;	/* Fastpath LIFO freelist of objects */
+#ifdef CONFIG_SMP
+	int remote_free_check;	/* remote_free has reached a watermark */
+#endif
+	struct kmem_cache *cache; /* kmem_cache corresponding to this list */
+
+	unsigned long nr_partial; /* Number of partial slabs (pages) */
+	struct list_head partial; /* Slabs which have some free objects */
+
+	unsigned long nr_slabs;	/* Total number of slabs allocated */
+
+	//struct list_head full;
+
+#ifdef CONFIG_SMP
+	/*
+	 * In the case of per-cpu lists, remote_free is for objects freed by
+	 * non-owner CPU back to its home list. For per-node lists, remote_free
+	 * is always used to free objects.
+	 */
+	struct kmem_cache_remote_free remote_free;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+	unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Primary per-cpu, per-kmem_cache structure.
+ */
+struct kmem_cache_cpu {
+	struct kmem_cache_list list; /* List for node-local slabs. */
+
+	unsigned int colour_next;
+
+#ifdef CONFIG_SMP
+	/*
+	 * rlist is a list of objects that don't fit on list.freelist (ie.
+	 * wrong node). The objects all correspond to a given kmem_cache_list,
+	 * remote_cache_list. To free objects to another list, we must first
+	 * flush the existing objects, then switch remote_cache_list.
+	 *
+	 * An NR_CPUS or MAX_NUMNODES array would be nice here, but then we
+	 * get to O(NR_CPUS^2) memory consumption situation.
+	 */
+	struct kmlist rlist;
+	struct kmem_cache_list *remote_cache_list;
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Per-node, per-kmem_cache structure.
+ */
+struct kmem_cache_node {
+	struct kmem_cache_list list;
+	spinlock_t list_lock; /* protects access to list */
+} ____cacheline_aligned;
+
+/*
+ * Management object for a slab cache.
+ */
+struct kmem_cache {
+	unsigned long flags;
+	int hiwater;		/* LIFO list high watermark */
+	int freebatch;		/* LIFO freelist batch flush size */
+	int objsize;		/* The size of an object without meta data */
+	int offset;		/* Free pointer offset. */
+	int objects;		/* Number of objects in slab */
+
+	int size;		/* The size of an object including meta data */
+	int order;		/* Allocation order */
+	gfp_t allocflags;	/* gfp flags to use on allocation */
+	unsigned int colour_range;	/* range of colour counter */
+	unsigned int colour_off;		/* offset per colour */
+	void (*ctor)(void *);
+
+	const char *name;	/* Name (only for display!) */
+	struct list_head list;	/* List of slab caches */
+
+	int align;		/* Alignment */
+	int inuse;		/* Offset to metadata */
+
+#ifdef CONFIG_SLQB_SYSFS
+	struct kobject kobj;	/* For sysfs */
+#endif
+#ifdef CONFIG_NUMA
+	struct kmem_cache_node *node[MAX_NUMNODES];
+#endif
+#ifdef CONFIG_SMP
+	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
+#else
+	struct kmem_cache_cpu cpu_slab;
+#endif
+};
+
+/*
+ * Kmalloc subsystem.
+ */
+#if defined(ARCH_KMALLOC_MINALIGN) && ARCH_KMALLOC_MINALIGN > 8
+#define KMALLOC_MIN_SIZE ARCH_KMALLOC_MINALIGN
+#else
+#define KMALLOC_MIN_SIZE 8
+#endif
+
+#define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
+#define KMALLOC_SHIFT_SLQB_HIGH (PAGE_SHIFT + 9)
+
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1];
+extern struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1];
+
+/*
+ * Constant size allocations use this path to find index into kmalloc caches
+ * arrays. get_slab() function is used for non-constant sizes.
+ */
+static __always_inline int kmalloc_index(size_t size)
+{
+	if (unlikely(!size))
+		return 0;
+	if (unlikely(size > 1UL << KMALLOC_SHIFT_SLQB_HIGH))
+		return 0;
+
+	if (unlikely(size <= KMALLOC_MIN_SIZE))
+		return KMALLOC_SHIFT_LOW;
+
+#if L1_CACHE_BYTES < 64
+	if (size > 64 && size <= 96)
+		return 1;
+#endif
+#if L1_CACHE_BYTES < 128
+	if (size > 128 && size <= 192)
+		return 2;
+#endif
+	if (size <=	  8) return 3;
+	if (size <=	 16) return 4;
+	if (size <=	 32) return 5;
+	if (size <=	 64) return 6;
+	if (size <=	128) return 7;
+	if (size <=	256) return 8;
+	if (size <=	512) return 9;
+	if (size <=       1024) return 10;
+	if (size <=   2 * 1024) return 11;
+	if (size <=   4 * 1024) return 12;
+	if (size <=   8 * 1024) return 13;
+	if (size <=  16 * 1024) return 14;
+	if (size <=  32 * 1024) return 15;
+	if (size <=  64 * 1024) return 16;
+	if (size <= 128 * 1024) return 17;
+	if (size <= 256 * 1024) return 18;
+	if (size <= 512 * 1024) return 19;
+	if (size <= 1024 * 1024) return 20;
+	if (size <=  2 * 1024 * 1024) return 21;
+	return -1;
+}
+
+#ifdef CONFIG_ZONE_DMA
+#define SLQB_DMA __GFP_DMA
+#else
+/* Disable "DMA slabs" */
+#define SLQB_DMA (__force gfp_t)0
+#endif
+
+/*
+ * Find the kmalloc slab cache for a given combination of allocation flags and
+ * size.
+ */
+static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
+{
+	int index = kmalloc_index(size);
+
+	if (unlikely(index == 0))
+		return NULL;
+
+	if (likely(!(flags & SLQB_DMA)))
+		return &kmalloc_caches[index];
+	else
+		return &kmalloc_caches_dma[index];
+}
+
+void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
+void *__kmalloc(size_t size, gfp_t flags);
+
+#ifndef ARCH_KMALLOC_MINALIGN
+#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#ifndef ARCH_SLAB_MINALIGN
+#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? sizeof(void *) : ARCH_KMALLOC_MINALIGN)
+
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
+{
+	if (__builtin_constant_p(size)) {
+		struct kmem_cache *s;
+
+		s = kmalloc_slab(size, flags);
+		if (unlikely(ZERO_OR_NULL_PTR(s)))
+			return s;
+
+		return kmem_cache_alloc(s, flags);
+	}
+	return __kmalloc(size, flags);
+}
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node);
+void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
+
+static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+{
+	if (__builtin_constant_p(size)) {
+		struct kmem_cache *s;
+
+		s = kmalloc_slab(size, flags);
+		if (unlikely(ZERO_OR_NULL_PTR(s)))
+			return s;
+
+		return kmem_cache_alloc_node(s, flags, node);
+	}
+	return __kmalloc_node(size, flags, node);
+}
+#endif
+
+#endif /* _LINUX_SLQB_DEF_H */
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -828,6 +828,9 @@ config SLUB
 	   and has enhanced diagnostics. SLUB is the default choice for
 	   a slab allocator.
 
+config SLQB
+	bool "SLQB (Queued allocator)"
+
 config SLOB
 	depends on EMBEDDED
 	bool "SLOB (Simple Allocator)"
@@ -869,7 +872,7 @@ config HAVE_GENERIC_DMA_COHERENT
 config SLABINFO
 	bool
 	depends on PROC_FS
-	depends on SLAB || SLUB_DEBUG
+	depends on SLAB || SLUB_DEBUG || SLQB
 	default y
 
 config RT_MUTEXES
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug
+++ linux-2.6/lib/Kconfig.debug
@@ -298,6 +298,26 @@ config SLUB_STATS
 	  out which slabs are relevant to a particular load.
 	  Try running: slabinfo -DA
 
+config SLQB_DEBUG
+	default y
+	bool "Enable SLQB debugging support"
+	depends on SLQB
+
+config SLQB_DEBUG_ON
+	default n
+	bool "SLQB debugging on by default"
+	depends on SLQB_DEBUG
+
+config SLQB_SYSFS
+	bool "Create SYSFS entries for slab caches"
+	default n
+	depends on SLQB
+
+config SLQB_STATS
+	bool "Enable SLQB performance statistics"
+	default n
+	depends on SLQB_SYSFS
+
 config DEBUG_PREEMPT
 	bool "Debug preemptible kernel"
 	depends on DEBUG_KERNEL && PREEMPT && (TRACE_IRQFLAGS_SUPPORT || PPC64)
Index: linux-2.6/mm/slqb.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/slqb.c
@@ -0,0 +1,3368 @@
+/*
+ * SLQB: A slab allocator that focuses on per-CPU scaling, and good performance
+ * with order-0 allocations. Fastpath emphasis is placed on local allocation
+ * and freeing, but with a secondary goal of good remote freeing (freeing on
+ * another CPU from that which allocated).
+ *
+ * Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/bit_spinlock.h>
+#include <linux/interrupt.h>
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/ctype.h>
+#include <linux/kallsyms.h>
+#include <linux/memory.h>
+
+static inline int slab_hiwater(struct kmem_cache *s)
+{
+	return s->hiwater;
+}
+
+static inline int slab_freebatch(struct kmem_cache *s)
+{
+	return s->freebatch;
+}
+
+/*
+ * slqb_page overloads struct page, and is used to manage some slab allocation
+ * aspects; however, to avoid the horrible mess in include/linux/mm_types.h,
+ * we'll just define our own struct slqb_page type variant here.
+ */
+struct slqb_page {
+	union {
+		struct {
+			unsigned long flags;	/* mandatory */
+			atomic_t _count;	/* mandatory */
+			unsigned int inuse;	/* Nr of objects */
+		   	struct kmem_cache_list *list; /* Pointer to list */
+			void **freelist;	/* freelist req. slab lock */
+			union {
+				struct list_head lru; /* misc. list */
+				struct rcu_head rcu_head; /* for rcu freeing */
+			};
+		};
+		struct page page;
+	};
+};
+static inline void struct_slqb_page_wrong_size(void)
+{ BUILD_BUG_ON(sizeof(struct slqb_page) != sizeof(struct page)); }
+
+#define PG_SLQB_BIT (1 << PG_slab)
+
+static int kmem_size __read_mostly;
+#ifdef CONFIG_NUMA
+static int numa_platform __read_mostly;
+#else
+#define numa_platform 0
+#endif
+
+/*
+ * Lock order:
+ * kmem_cache_node->list_lock
+ *   kmem_cache_remote_free->lock
+ *
+ * Data structures:
+ * SLQB is primarily per-cpu. For each kmem_cache, each CPU has:
+ *
+ * - A LIFO list of node-local objects. Allocation and freeing of node local
+ *   objects goes first to this list.
+ *
+ * - 2 Lists of slab pages, free and partial pages. If an allocation misses
+ *   the object list, it tries from the partial list, then the free list.
+ *   After freeing an object to the object list, if it is over a watermark,
+ *   some objects are freed back to pages. If an allocation misses these lists,
+ *   a new slab page is allocated from the page allocator. If the free list
+ *   reaches a watermark, some of its pages are returned to the page allocator.
+ *
+ * - A remote free queue, where objects freed that did not come from the local
+ *   node are queued to. When this reaches a watermark, the objects are
+ *   flushed.
+ *
+ * - A remotely freed queue, where objects allocated from this CPU are flushed
+ *   to from other CPUs' remote free queues. kmem_cache_remote_free->lock is
+ *   used to protect access to this queue.
+ *
+ *   When the remotely freed queue reaches a watermark, a flag is set to tell
+ *   the owner CPU to check it. The owner CPU will then check the queue on the
+ *   next allocation that misses the object list. It will move all objects from
+ *   this list onto the object list and then allocate one.
+ *
+ *   This system of remote queueing is intended to reduce lock and remote
+ *   cacheline acquisitions, and give a cooling off period for remotely freed
+ *   objects before they are re-allocated.
+ *
+ * node specific allocations from somewhere other than the local node are
+ * handled by a per-node list which is the same as the above per-CPU data
+ * structures except for the following differences:
+ *
+ * - kmem_cache_node->list_lock is used to protect access for multiple CPUs to
+ *   allocate from a given node.
+ *
+ * - There is no remote free queue. Nodes don't free objects, CPUs do.
+ */
+
+static inline void slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
+{
+#ifdef CONFIG_SLQB_STATS
+	list->stats[si]++;
+#endif
+}
+
+static inline void slqb_stat_add(struct kmem_cache_list *list, enum stat_item si,
+					unsigned long nr)
+{
+#ifdef CONFIG_SLQB_STATS
+	list->stats[si] += nr;
+#endif
+}
+
+static inline int slqb_page_to_nid(struct slqb_page *page)
+{
+	return page_to_nid(&page->page);
+}
+
+static inline void *slqb_page_address(struct slqb_page *page)
+{
+	return page_address(&page->page);
+}
+
+static inline struct zone *slqb_page_zone(struct slqb_page *page)
+{
+	return page_zone(&page->page);
+}
+
+static inline int virt_to_nid(const void *addr)
+{
+#ifdef virt_to_page_fast
+	return page_to_nid(virt_to_page_fast(addr));
+#else
+	return page_to_nid(virt_to_page(addr));
+#endif
+}
+
+static inline struct slqb_page *virt_to_head_slqb_page(const void *addr)
+{
+	struct page *p;
+
+	p = virt_to_head_page(addr);
+	return (struct slqb_page *)p;
+}
+
+static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
+						unsigned int order)
+{
+	struct page *p;
+
+	if (nid == -1)
+		p = alloc_pages(flags, order);
+	else
+		p = alloc_pages_node(nid, flags, order);
+
+	return (struct slqb_page *)p;
+}
+
+static inline void __free_slqb_pages(struct slqb_page *page, unsigned int order)
+{
+	struct page *p = &page->page;
+
+	reset_page_mapcount(p);
+	p->mapping = NULL;
+	VM_BUG_ON(!(p->flags & PG_SLQB_BIT));
+	p->flags &= ~PG_SLQB_BIT;
+
+	__free_pages(p, order);
+}
+
+#ifdef CONFIG_SLQB_DEBUG
+static inline int slab_debug(struct kmem_cache *s)
+{
+	return (s->flags &
+			(SLAB_DEBUG_FREE |
+			 SLAB_RED_ZONE |
+			 SLAB_POISON |
+			 SLAB_STORE_USER |
+			 SLAB_TRACE));
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+	return s->flags & SLAB_POISON;
+}
+#else
+static inline int slab_debug(struct kmem_cache *s)
+{
+	return 0;
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+	return 0;
+}
+#endif
+
+#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
+				SLAB_POISON | SLAB_STORE_USER)
+
+/* Internal SLQB flags */
+#define __OBJECT_POISON		0x80000000 /* Poison object */
+
+/* Not all arches define cache_line_size */
+#ifndef cache_line_size
+#define cache_line_size()	L1_CACHE_BYTES
+#endif
+
+#ifdef CONFIG_SMP
+static struct notifier_block slab_notifier;
+#endif
+
+/* A list of all slab caches on the system */
+static DECLARE_RWSEM(slqb_lock);
+static LIST_HEAD(slab_caches);
+
+/*
+ * Tracking user of a slab.
+ */
+struct track {
+	void *addr;		/* Called from address */
+	int cpu;		/* Was running on cpu */
+	int pid;		/* Pid context */
+	unsigned long when;	/* When did the operation occur */
+};
+
+enum track_item { TRACK_ALLOC, TRACK_FREE };
+
+static struct kmem_cache kmem_cache_cache;
+
+#ifdef CONFIG_SLQB_SYSFS
+static int sysfs_slab_add(struct kmem_cache *s);
+static void sysfs_slab_remove(struct kmem_cache *s);
+#else
+static inline int sysfs_slab_add(struct kmem_cache *s)
+{
+	return 0;
+}
+static inline void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kmem_cache_free(&kmem_cache_cache, s);
+}
+#endif
+
+/********************************************************************
+ * 			Core slab cache functions
+ *******************************************************************/
+
+static int __slab_is_available __read_mostly;
+int slab_is_available(void)
+{
+	return __slab_is_available;
+}
+
+static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
+{
+#ifdef CONFIG_SMP
+	VM_BUG_ON(!s->cpu_slab[cpu]);
+	return s->cpu_slab[cpu];
+#else
+	return &s->cpu_slab;
+#endif
+}
+
+static inline int check_valid_pointer(struct kmem_cache *s,
+				struct slqb_page *page, const void *object)
+{
+	void *base;
+
+	base = slqb_page_address(page);
+	if (object < base || object >= base + s->objects * s->size ||
+		(object - base) % s->size) {
+		return 0;
+	}
+
+	return 1;
+}
+
+static inline void *get_freepointer(struct kmem_cache *s, void *object)
+{
+	return *(void **)(object + s->offset);
+}
+
+static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
+{
+	*(void **)(object + s->offset) = fp;
+}
+
+/* Loop over all objects in a slab */
+#define for_each_object(__p, __s, __addr) \
+	for (__p = (__addr); __p < (__addr) + (__s)->objects * (__s)->size;\
+			__p += (__s)->size)
+
+/* Scan freelist */
+#define for_each_free_object(__p, __s, __free) \
+	for (__p = (__free); (__p) != NULL; __p = get_freepointer((__s),\
+		__p))
+
+#ifdef CONFIG_SLQB_DEBUG
+/*
+ * Debug settings:
+ */
+#ifdef CONFIG_SLQB_DEBUG_ON
+static int slqb_debug __read_mostly = DEBUG_DEFAULT_FLAGS;
+#else
+static int slqb_debug __read_mostly;
+#endif
+
+static char *slqb_debug_slabs;
+
+/*
+ * Object debugging
+ */
+static void print_section(char *text, u8 *addr, unsigned int length)
+{
+	int i, offset;
+	int newline = 1;
+	char ascii[17];
+
+	ascii[16] = 0;
+
+	for (i = 0; i < length; i++) {
+		if (newline) {
+			printk(KERN_ERR "%8s 0x%p: ", text, addr + i);
+			newline = 0;
+		}
+		printk(KERN_CONT " %02x", addr[i]);
+		offset = i % 16;
+		ascii[offset] = isgraph(addr[i]) ? addr[i] : '.';
+		if (offset == 15) {
+			printk(KERN_CONT " %s\n", ascii);
+			newline = 1;
+		}
+	}
+	if (!newline) {
+		i %= 16;
+		while (i < 16) {
+			printk(KERN_CONT "   ");
+			ascii[i] = ' ';
+			i++;
+		}
+		printk(KERN_CONT " %s\n", ascii);
+	}
+}
+
+static struct track *get_track(struct kmem_cache *s, void *object,
+	enum track_item alloc)
+{
+	struct track *p;
+
+	if (s->offset)
+		p = object + s->offset + sizeof(void *);
+	else
+		p = object + s->inuse;
+
+	return p + alloc;
+}
+
+static void set_track(struct kmem_cache *s, void *object,
+				enum track_item alloc, void *addr)
+{
+	struct track *p;
+
+	if (s->offset)
+		p = object + s->offset + sizeof(void *);
+	else
+		p = object + s->inuse;
+
+	p += alloc;
+	if (addr) {
+		p->addr = addr;
+		p->cpu = raw_smp_processor_id();
+		p->pid = current ? current->pid : -1;
+		p->when = jiffies;
+	} else
+		memset(p, 0, sizeof(struct track));
+}
+
+static void init_tracking(struct kmem_cache *s, void *object)
+{
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
+	set_track(s, object, TRACK_FREE, NULL);
+	set_track(s, object, TRACK_ALLOC, NULL);
+}
+
+static void print_track(const char *s, struct track *t)
+{
+	if (!t->addr)
+		return;
+
+	printk(KERN_ERR "INFO: %s in ", s);
+	__print_symbol("%s", (unsigned long)t->addr);
+	printk(" age=%lu cpu=%u pid=%d\n", jiffies - t->when, t->cpu, t->pid);
+}
+
+static void print_tracking(struct kmem_cache *s, void *object)
+{
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
+	print_track("Allocated", get_track(s, object, TRACK_ALLOC));
+	print_track("Freed", get_track(s, object, TRACK_FREE));
+}
+
+static void print_page_info(struct slqb_page *page)
+{
+	printk(KERN_ERR "INFO: Slab 0x%p used=%u fp=0x%p flags=0x%04lx\n",
+		page, page->inuse, page->freelist, page->flags);
+
+}
+
+static void slab_bug(struct kmem_cache *s, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printk(KERN_ERR "========================================"
+			"=====================================\n");
+	printk(KERN_ERR "BUG %s: %s\n", s->name, buf);
+	printk(KERN_ERR "----------------------------------------"
+			"-------------------------------------\n\n");
+}
+
+static void slab_fix(struct kmem_cache *s, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printk(KERN_ERR "FIX %s: %s\n", s->name, buf);
+}
+
+static void print_trailer(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+	unsigned int off;	/* Offset of last byte */
+	u8 *addr = slqb_page_address(page);
+
+	print_tracking(s, p);
+
+	print_page_info(page);
+
+	printk(KERN_ERR "INFO: Object 0x%p @offset=%tu fp=0x%p\n\n",
+			p, p - addr, get_freepointer(s, p));
+
+	if (p > addr + 16)
+		print_section("Bytes b4", p - 16, 16);
+
+	print_section("Object", p, min(s->objsize, 128));
+
+	if (s->flags & SLAB_RED_ZONE)
+		print_section("Redzone", p + s->objsize,
+			s->inuse - s->objsize);
+
+	if (s->offset)
+		off = s->offset + sizeof(void *);
+	else
+		off = s->inuse;
+
+	if (s->flags & SLAB_STORE_USER)
+		off += 2 * sizeof(struct track);
+
+	if (off != s->size)
+		/* Beginning of the filler is the free pointer */
+		print_section("Padding", p + off, s->size - off);
+
+	dump_stack();
+}
+
+static void object_err(struct kmem_cache *s, struct slqb_page *page,
+			u8 *object, char *reason)
+{
+	slab_bug(s, reason);
+	print_trailer(s, page, object);
+}
+
+static void slab_err(struct kmem_cache *s, struct slqb_page *page, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	slab_bug(s, fmt);
+	print_page_info(page);
+	dump_stack();
+}
+
+static void init_object(struct kmem_cache *s, void *object, int active)
+{
+	u8 *p = object;
+
+	if (s->flags & __OBJECT_POISON) {
+		memset(p, POISON_FREE, s->objsize - 1);
+		p[s->objsize - 1] = POISON_END;
+	}
+
+	if (s->flags & SLAB_RED_ZONE)
+		memset(p + s->objsize,
+			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
+			s->inuse - s->objsize);
+}
+
+static u8 *check_bytes(u8 *start, unsigned int value, unsigned int bytes)
+{
+	while (bytes) {
+		if (*start != (u8)value)
+			return start;
+		start++;
+		bytes--;
+	}
+	return NULL;
+}
+
+static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
+						void *from, void *to)
+{
+	slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
+	memset(from, data, to - from);
+}
+
+static int check_bytes_and_report(struct kmem_cache *s, struct slqb_page *page,
+			u8 *object, char *what,
+			u8 *start, unsigned int value, unsigned int bytes)
+{
+	u8 *fault;
+	u8 *end;
+
+	fault = check_bytes(start, value, bytes);
+	if (!fault)
+		return 1;
+
+	end = start + bytes;
+	while (end > fault && end[-1] == value)
+		end--;
+
+	slab_bug(s, "%s overwritten", what);
+	printk(KERN_ERR "INFO: 0x%p-0x%p. First byte 0x%x instead of 0x%x\n",
+					fault, end - 1, fault[0], value);
+	print_trailer(s, page, object);
+
+	restore_bytes(s, what, value, fault, end);
+	return 0;
+}
+
+/*
+ * Object layout:
+ *
+ * object address
+ * 	Bytes of the object to be managed.
+ * 	If the freepointer may overlay the object then the free
+ * 	pointer is the first word of the object.
+ *
+ * 	Poisoning uses 0x6b (POISON_FREE) and the last byte is
+ * 	0xa5 (POISON_END)
+ *
+ * object + s->objsize
+ * 	Padding to reach word boundary. This is also used for Redzoning.
+ * 	Padding is extended by another word if Redzoning is enabled and
+ * 	objsize == inuse.
+ *
+ * 	We fill with 0xbb (RED_INACTIVE) for inactive objects and with
+ * 	0xcc (RED_ACTIVE) for objects in use.
+ *
+ * object + s->inuse
+ * 	Meta data starts here.
+ *
+ * 	A. Free pointer (if we cannot overwrite object on free)
+ * 	B. Tracking data for SLAB_STORE_USER
+ * 	C. Padding to reach required alignment boundary or at minimum
+ * 		one word if debugging is on to be able to detect writes
+ * 		before the word boundary.
+ *
+ *	Padding is done using 0x5a (POISON_INUSE)
+ *
+ * object + s->size
+ * 	Nothing is used beyond s->size.
+ */
+
+static int check_pad_bytes(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+	unsigned long off = s->inuse;	/* The end of info */
+
+	if (s->offset)
+		/* Freepointer is placed after the object. */
+		off += sizeof(void *);
+
+	if (s->flags & SLAB_STORE_USER)
+		/* We also have user information there */
+		off += 2 * sizeof(struct track);
+
+	if (s->size == off)
+		return 1;
+
+	return check_bytes_and_report(s, page, p, "Object padding",
+				p + off, POISON_INUSE, s->size - off);
+}
+
+static int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+{
+	u8 *start;
+	u8 *fault;
+	u8 *end;
+	int length;
+	int remainder;
+
+	if (!(s->flags & SLAB_POISON))
+		return 1;
+
+	start = slqb_page_address(page);
+	end = start + (PAGE_SIZE << s->order);
+	length = s->objects * s->size;
+	remainder = end - (start + length);
+	if (!remainder)
+		return 1;
+
+	fault = check_bytes(start + length, POISON_INUSE, remainder);
+	if (!fault)
+		return 1;
+	while (end > fault && end[-1] == POISON_INUSE)
+		end--;
+
+	slab_err(s, page, "Padding overwritten. 0x%p-0x%p", fault, end - 1);
+	print_section("Padding", start, length);
+
+	restore_bytes(s, "slab padding", POISON_INUSE, start, end);
+	return 0;
+}
+
+static int check_object(struct kmem_cache *s, struct slqb_page *page,
+					void *object, int active)
+{
+	u8 *p = object;
+	u8 *endobject = object + s->objsize;
+
+	if (s->flags & SLAB_RED_ZONE) {
+		unsigned int red =
+			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE;
+
+		if (!check_bytes_and_report(s, page, object, "Redzone",
+			endobject, red, s->inuse - s->objsize))
+			return 0;
+	} else {
+		if ((s->flags & SLAB_POISON) && s->objsize < s->inuse) {
+			check_bytes_and_report(s, page, p, "Alignment padding",
+				endobject, POISON_INUSE, s->inuse - s->objsize);
+		}
+	}
+
+	if (s->flags & SLAB_POISON) {
+		if (!active && (s->flags & __OBJECT_POISON) &&
+			(!check_bytes_and_report(s, page, p, "Poison", p,
+					POISON_FREE, s->objsize - 1) ||
+			 !check_bytes_and_report(s, page, p, "Poison",
+				p + s->objsize - 1, POISON_END, 1)))
+			return 0;
+		/*
+		 * check_pad_bytes cleans up on its own.
+		 */
+		check_pad_bytes(s, page, p);
+	}
+
+	return 1;
+}
+
+static int check_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	if (!(page->flags & PG_SLQB_BIT)) {
+		slab_err(s, page, "Not a valid slab page");
+		return 0;
+	}
+	if (page->inuse == 0) {
+		slab_err(s, page, "inuse before free / after alloc", s->name);
+		return 0;
+	}
+	if (page->inuse > s->objects) {
+		slab_err(s, page, "inuse %u > max %u",
+			s->name, page->inuse, s->objects);
+		return 0;
+	}
+	/* Slab_pad_check fixes things up after itself */
+	slab_pad_check(s, page);
+	return 1;
+}
+
+static void trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
+{
+	if (s->flags & SLAB_TRACE) {
+		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+			s->name,
+			alloc ? "alloc" : "free",
+			object, page->inuse,
+			page->freelist);
+
+		if (!alloc)
+			print_section("Object", (void *)object, s->objsize);
+
+		dump_stack();
+	}
+}
+
+static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
+								void *object)
+{
+	if (!slab_debug(s))
+		return;
+
+	if (!(s->flags & (SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON)))
+		return;
+
+	init_object(s, object, 0);
+	init_tracking(s, object);
+}
+
+static int alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
+{
+	struct slqb_page *page;
+	page = virt_to_head_slqb_page(object);
+
+	if (!check_slab(s, page))
+		goto bad;
+
+	if (!check_valid_pointer(s, page, object)) {
+		object_err(s, page, object, "Freelist Pointer check fails");
+		goto bad;
+	}
+
+	if (object && !check_object(s, page, object, 0))
+		goto bad;
+
+	/* Success perform special debug activities for allocs */
+	if (s->flags & SLAB_STORE_USER)
+		set_track(s, object, TRACK_ALLOC, addr);
+	trace(s, page, object, 1);
+	init_object(s, object, 1);
+	return 1;
+
+bad:
+	return 0;
+}
+
+static int free_debug_processing(struct kmem_cache *s, void *object, void *addr)
+{
+	struct slqb_page *page;
+	page = virt_to_head_slqb_page(object);
+
+	if (!check_slab(s, page))
+		goto fail;
+
+	if (!check_valid_pointer(s, page, object)) {
+		slab_err(s, page, "Invalid object pointer 0x%p", object);
+		goto fail;
+	}
+
+	if (!check_object(s, page, object, 1))
+		return 0;
+
+	/* Special debug activities for freeing objects */
+	if (s->flags & SLAB_STORE_USER)
+		set_track(s, object, TRACK_FREE, addr);
+	trace(s, page, object, 0);
+	init_object(s, object, 0);
+	return 1;
+
+fail:
+	slab_fix(s, "Object at 0x%p not freed", object);
+	return 0;
+}
+
+static int __init setup_slqb_debug(char *str)
+{
+	slqb_debug = DEBUG_DEFAULT_FLAGS;
+	if (*str++ != '=' || !*str)
+		/*
+		 * No options specified. Switch on full debugging.
+		 */
+		goto out;
+
+	if (*str == ',')
+		/*
+		 * No options but restriction on slabs. This means full
+		 * debugging for slabs matching a pattern.
+		 */
+		goto check_slabs;
+
+	slqb_debug = 0;
+	if (*str == '-')
+		/*
+		 * Switch off all debugging measures.
+		 */
+		goto out;
+
+	/*
+	 * Determine which debug features should be switched on
+	 */
+	for (; *str && *str != ','; str++) {
+		switch (tolower(*str)) {
+		case 'f':
+			slqb_debug |= SLAB_DEBUG_FREE;
+			break;
+		case 'z':
+			slqb_debug |= SLAB_RED_ZONE;
+			break;
+		case 'p':
+			slqb_debug |= SLAB_POISON;
+			break;
+		case 'u':
+			slqb_debug |= SLAB_STORE_USER;
+			break;
+		case 't':
+			slqb_debug |= SLAB_TRACE;
+			break;
+		default:
+			printk(KERN_ERR "slqb_debug option '%c' "
+				"unknown. skipped\n", *str);
+		}
+	}
+
+check_slabs:
+	if (*str == ',')
+		slqb_debug_slabs = str + 1;
+out:
+	return 1;
+}
+
+__setup("slqb_debug", setup_slqb_debug);
+
+static unsigned long kmem_cache_flags(unsigned long objsize,
+	unsigned long flags, const char *name,
+	void (*ctor)(void *))
+{
+	/*
+	 * Enable debugging if selected on the kernel commandline.
+	 */
+	if (slqb_debug && (!slqb_debug_slabs ||
+	    strncmp(slqb_debug_slabs, name,
+		strlen(slqb_debug_slabs)) == 0))
+			flags |= slqb_debug;
+
+	return flags;
+}
+#else
+static inline void setup_object_debug(struct kmem_cache *s,
+			struct slqb_page *page, void *object) {}
+
+static inline int alloc_debug_processing(struct kmem_cache *s,
+	void *object, void *addr) { return 0; }
+
+static inline int free_debug_processing(struct kmem_cache *s,
+	void *object, void *addr) { return 0; }
+
+static inline int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+			{ return 1; }
+static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
+			void *object, int active) { return 1; }
+static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page) {}
+static inline unsigned long kmem_cache_flags(unsigned long objsize,
+	unsigned long flags, const char *name, void (*ctor)(void *))
+{
+	return flags;
+}
+#define slqb_debug 0
+#endif
+
+/*
+ * allocate a new slab (return its corresponding struct slqb_page)
+ */
+static struct slqb_page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+{
+	struct slqb_page *page;
+	int pages = 1 << s->order;
+
+	flags |= s->allocflags;
+
+	page = alloc_slqb_pages_node(node, flags, s->order);
+	if (!page)
+		return NULL;
+
+	mod_zone_page_state(slqb_page_zone(page),
+		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+		pages);
+
+	return page;
+}
+
+/*
+ * Called once for each object on a new slab page
+ */
+static void setup_object(struct kmem_cache *s, struct slqb_page *page,
+				void *object)
+{
+	setup_object_debug(s, page, object);
+	if (unlikely(s->ctor))
+		s->ctor(object);
+}
+
+/*
+ * Allocate a new slab, set up its object list.
+ */
+static struct slqb_page *new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
+{
+	struct slqb_page *page;
+	void *start;
+	void *last;
+	void *p;
+
+	BUG_ON(flags & GFP_SLAB_BUG_MASK);
+
+	page = allocate_slab(s,
+		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
+	if (!page)
+		goto out;
+
+	page->flags |= PG_SLQB_BIT;
+
+	start = page_address(&page->page);
+
+	if (unlikely(slab_poison(s)))
+		memset(start, POISON_INUSE, PAGE_SIZE << s->order);
+
+	start += colour;
+
+	last = start;
+	for_each_object(p, s, start) {
+		setup_object(s, page, p);
+		set_freepointer(s, last, p);
+		last = p;
+	}
+	set_freepointer(s, last, NULL);
+
+	page->freelist = start;
+	page->inuse = 0;
+out:
+	return page;
+}
+
+/*
+ * Free a slab page back to the page allocator
+ */
+static void __free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	int pages = 1 << s->order;
+
+	if (unlikely(slab_debug(s))) {
+		void *p;
+
+		slab_pad_check(s, page);
+		for_each_free_object(p, s, page->freelist)
+			check_object(s, page, p, 0);
+	}
+
+	mod_zone_page_state(slqb_page_zone(page),
+		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+		-pages);
+
+	__free_slqb_pages(page, s->order);
+}
+
+static void rcu_free_slab(struct rcu_head *h)
+{
+	struct slqb_page *page;
+
+	page = container_of((struct list_head *)h, struct slqb_page, lru);
+	__free_slab(page->list->cache, page);
+}
+
+static void free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	VM_BUG_ON(page->inuse);
+	if (unlikely(s->flags & SLAB_DESTROY_BY_RCU))
+		call_rcu(&page->rcu_head, rcu_free_slab);
+	else
+		__free_slab(s, page);
+}
+
+/*
+ * Return an object to its slab.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static int free_object_to_page(struct kmem_cache *s, struct kmem_cache_list *l, struct slqb_page *page, void *object)
+{
+	VM_BUG_ON(page->list != l);
+
+	set_freepointer(s, object, page->freelist);
+	page->freelist = object;
+	page->inuse--;
+
+	if (!page->inuse) {
+		if (likely(s->objects > 1)) {
+			l->nr_partial--;
+			list_del(&page->lru);
+		}
+		l->nr_slabs--;
+		free_slab(s, page);
+		slqb_stat_inc(l, FLUSH_SLAB_FREE);
+		return 1;
+	} else if (page->inuse + 1 == s->objects) {
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, FLUSH_SLAB_PARTIAL);
+		return 0;
+	}
+	return 0;
+}
+
+#ifdef CONFIG_SMP
+static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c);
+#endif
+
+/*
+ * Flush the LIFO list of objects on a list. They are sent back to their pages
+ * if those pages also belong to this list, or to our CPU's remote-free list
+ * if they do not.
+ *
+ * Doesn't flush the entire list. flush_free_list_all does.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void flush_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	struct kmem_cache_cpu *c;
+	void **head;
+	int nr;
+
+	nr = l->freelist.nr;
+	if (unlikely(!nr))
+		return;
+
+	nr = min(slab_freebatch(s), nr);
+
+	slqb_stat_inc(l, FLUSH_FREE_LIST);
+	slqb_stat_add(l, FLUSH_FREE_LIST_OBJECTS, nr);
+
+	c = get_cpu_slab(s, smp_processor_id());
+
+	l->freelist.nr -= nr;
+	head = l->freelist.head;
+
+	do {
+		struct slqb_page *page;
+		void **object;
+
+		object = head;
+		VM_BUG_ON(!object);
+		head = get_freepointer(s, object);
+		page = virt_to_head_slqb_page(object);
+
+#ifdef CONFIG_SMP
+		if (page->list != l) {
+			slab_free_to_remote(s, page, object, c);
+			slqb_stat_inc(l, FLUSH_FREE_LIST_REMOTE);
+		} else
+#endif
+			free_object_to_page(s, l, page, object);
+
+		nr--;
+	} while (nr);
+
+	l->freelist.head = head;
+	if (!l->freelist.nr)
+		l->freelist.tail = NULL;
+}
+
+static void flush_free_list_all(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	while (l->freelist.nr)
+		flush_free_list(s, l);
+}
+
+#ifdef CONFIG_SMP
+/*
+ * If enough objects have been remotely freed back to this list,
+ * remote_free_check will be set. In which case, we'll eventually come here
+ * to take those objects off our remote_free list and onto our LIFO freelist.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	void **head, **tail;
+	int nr;
+
+	VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
+
+	if (!l->remote_free.list.nr)
+		return;
+
+	l->remote_free_check = 0;
+	head = l->remote_free.list.head;
+	prefetchw(head);
+
+	spin_lock(&l->remote_free.lock);
+	l->remote_free.list.head = NULL;
+	tail = l->remote_free.list.tail;
+	l->remote_free.list.tail = NULL;
+	nr = l->remote_free.list.nr;
+	l->remote_free.list.nr = 0;
+	spin_unlock(&l->remote_free.lock);
+
+	if (!l->freelist.nr)
+		l->freelist.head = head;
+	else
+		set_freepointer(s, l->freelist.tail, head);
+	l->freelist.tail = tail;
+
+	l->freelist.nr += nr;
+
+	slqb_stat_inc(l, CLAIM_REMOTE_LIST);
+	slqb_stat_add(l, CLAIM_REMOTE_LIST_OBJECTS, nr);
+}
+#endif
+
+/*
+ * Allocation fastpath. Get an object from the list's LIFO freelist, or
+ * return NULL if it is empty.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static __always_inline void *__cache_list_get_object(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	void *object;
+
+	object = l->freelist.head;
+	if (likely(object)) {
+		void *next = get_freepointer(s, object);
+		VM_BUG_ON(!l->freelist.nr);
+		l->freelist.nr--;
+		l->freelist.head = next;
+		if (next)
+			prefetchw(next);
+		return object;
+	}
+	VM_BUG_ON(l->freelist.nr);
+
+#ifdef CONFIG_SMP
+	if (unlikely(l->remote_free_check)) {
+		claim_remote_free_list(s, l);
+
+		if (l->freelist.nr > slab_hiwater(s))
+			flush_free_list(s, l);
+
+		/* repetition here helps gcc :( */
+		object = l->freelist.head;
+		if (likely(object)) {
+			void *next = get_freepointer(s, object);
+			VM_BUG_ON(!l->freelist.nr);
+			l->freelist.nr--;
+			l->freelist.head = next;
+			if (next)
+				prefetchw(next);
+			return object;
+		}
+		VM_BUG_ON(l->freelist.nr);
+	}
+#endif
+
+	return NULL;
+}
+
+/*
+ * Slow(er) path. Get a page from this list's existing pages. Will be a
+ * new empty page in the case that __slab_alloc_page has just been called
+ * (empty pages otherwise never get queued up on the lists), or a partial page
+ * already on the list.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static noinline void *__cache_list_get_page(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	struct slqb_page *page;
+	void *object;
+
+	if (unlikely(!l->nr_partial))
+		return NULL;
+
+	page = list_first_entry(&l->partial, struct slqb_page, lru);
+	VM_BUG_ON(page->inuse == s->objects);
+	if (page->inuse + 1 == s->objects) {
+		l->nr_partial--;
+		list_del(&page->lru);
+/*XXX		list_move(&page->lru, &l->full); */
+	}
+
+	VM_BUG_ON(!page->freelist);
+
+	page->inuse++;
+
+//	VM_BUG_ON(node != -1 && node != slqb_page_to_nid(page));
+
+	object = page->freelist;
+	page->freelist = get_freepointer(s, object);
+	if (page->freelist)
+		prefetchw(page->freelist);
+	VM_BUG_ON((page->inuse == s->objects) != (page->freelist == NULL));
+	slqb_stat_inc(l, ALLOC_SLAB_FILL);
+
+	return object;
+}
+
+/*
+ * Allocation slowpath. Allocate a new slab page from the page allocator, and
+ * put it on the list's partial list. Must be followed by an allocation so
+ * that we don't have dangling empty pages on the partial list.
+ *
+ * Returns 0 on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline int __slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+	struct slqb_page *page;
+	struct kmem_cache_list *l;
+	struct kmem_cache_cpu *c;
+	unsigned int colour;
+
+	c = get_cpu_slab(s, smp_processor_id());
+	colour = c->colour_next;
+	c->colour_next += s->colour_off;
+	if (c->colour_next >= s->colour_range)
+		c->colour_next = 0;
+
+	/* XXX: load any partial? */
+
+	/* Caller handles __GFP_ZERO */
+	gfpflags &= ~__GFP_ZERO;
+
+	if (gfpflags & __GFP_WAIT)
+		local_irq_enable();
+	page = new_slab_page(s, gfpflags, node, colour);
+	if (gfpflags & __GFP_WAIT)
+		local_irq_disable();
+	if (unlikely(!page))
+		return 0;
+
+	if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == numa_node_id())) {
+		struct kmem_cache_cpu *c;
+		int cpu = smp_processor_id();
+
+		c = get_cpu_slab(s, cpu);
+		l = &c->list;
+		page->list = l;
+
+		if (unlikely(l->nr_partial)) {
+			__free_slqb_pages(page, s->order);
+			return 1;
+		}
+		l->nr_slabs++;
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, ALLOC_SLAB_NEW);
+#ifdef CONFIG_NUMA
+	} else {
+		struct kmem_cache_node *n;
+
+		n = s->node[slqb_page_to_nid(page)];
+		l = &n->list;
+		page->list = l;
+
+		spin_lock(&n->list_lock);
+		if (unlikely(l->nr_partial)) {
+			spin_unlock(&n->list_lock);
+			__free_slqb_pages(page, s->order);
+			return 1;
+		}
+		l->nr_slabs++;
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, ALLOC_SLAB_NEW);
+		spin_unlock(&n->list_lock);
+		/* XXX: could have a race here where a full page is left on
+		 * the list if we subsequently migrate to or from the node.
+		 * Should make the above node selection and stick to it.
+		 */
+#endif
+	}
+	return 1;
+}
+
+#ifdef CONFIG_NUMA
+/*
+ * Allocate an object from a remote node. Return NULL if none could be found
+ * (in which case, caller should allocate a new slab)
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void *__remote_slab_alloc(struct kmem_cache *s, int node)
+{
+	struct kmem_cache_node *n;
+	struct kmem_cache_list *l;
+	void *object;
+
+	n = s->node[node];
+	VM_BUG_ON(!n);
+	l = &n->list;
+
+//	if (unlikely(!(l->freelist.nr | l->nr_partial | l->remote_free_check)))
+//		return NULL;
+
+	spin_lock(&n->list_lock);
+
+	object = __cache_list_get_object(s, l);
+	if (unlikely(!object))
+		object = __cache_list_get_page(s, l);
+	slqb_stat_inc(l, ALLOC);
+	spin_unlock(&n->list_lock);
+	return object;
+}
+
+static noinline int alternate_nid(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+	if (in_interrupt() || (gfpflags & __GFP_THISNODE))
+		return node;
+	if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD))
+		return cpuset_mem_spread_node();
+	else if (current->mempolicy)
+		return slab_node(current->mempolicy);
+	return node;
+}
+#endif
+
+/*
+ * Main allocation path. Return an object, or NULL on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void *__slab_alloc(struct kmem_cache *s,
+		gfp_t gfpflags, int node)
+{
+	void *object;
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_list *l;
+
+again:
+#ifdef CONFIG_NUMA
+	if (unlikely(node != -1) && unlikely(node != numa_node_id())) {
+		object = __remote_slab_alloc(s, node);
+		if (unlikely(!object))
+			goto alloc_new;
+		return object;
+	}
+#endif
+
+	c = get_cpu_slab(s, smp_processor_id());
+	VM_BUG_ON(!c);
+	l = &c->list;
+	object = __cache_list_get_object(s, l);
+	if (unlikely(!object)) {
+		object = __cache_list_get_page(s, l);
+		if (unlikely(!object))
+			goto alloc_new;
+	}
+	slqb_stat_inc(l, ALLOC);
+	return object;
+
+alloc_new:
+	if (unlikely(!__slab_alloc_page(s, gfpflags, node)))
+		return NULL;
+	goto again;
+}
+
+/*
+ * Perform some interrupts-on processing around the main allocation path
+ * (debug checking and memset()ing).
+ */
+static __always_inline void *slab_alloc(struct kmem_cache *s,
+		gfp_t gfpflags, int node, void *addr)
+{
+	void *object;
+	unsigned long flags;
+
+again:
+	local_irq_save(flags);
+	object = __slab_alloc(s, gfpflags, node);
+	local_irq_restore(flags);
+
+	if (unlikely(slab_debug(s)) && likely(object)) {
+		if (unlikely(!alloc_debug_processing(s, object, addr)))
+			goto again;
+	}
+
+	if (unlikely(gfpflags & __GFP_ZERO) && likely(object))
+		memset(object, 0, s->objsize);
+
+	return object;
+}
+
+void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+{
+	int node = -1;
+#ifdef CONFIG_NUMA
+	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+		node = alternate_nid(s, gfpflags, node);
+#endif
+	return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc);
+
+#ifdef CONFIG_NUMA
+void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+	return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node);
+#endif
+
+#ifdef CONFIG_SMP
+/*
+ * Flush this CPU's remote free list of objects back to the list from which
+ * they originate. They end up on that list's remotely freed list, and
+ * eventually we set its remote_free_check if there are enough objects on it.
+ *
+ * This seems convoluted, but it keeps us from stomping on the target CPU's
+ * fastpath cachelines.
+ *
+ * Must be called with interrupts disabled.
+ */
+static void flush_remote_free_cache(struct kmem_cache *s, struct kmem_cache_cpu *c)
+{
+	struct kmlist *src;
+	struct kmem_cache_list *dst;
+	unsigned int nr;
+	int set;
+
+	src = &c->rlist;
+	nr = src->nr;
+	if (unlikely(!nr))
+		return;
+
+#ifdef CONFIG_SLQB_STATS
+	{
+		struct kmem_cache_list *l = &c->list;
+		slqb_stat_inc(l, FLUSH_RFREE_LIST);
+		slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);
+	}
+#endif
+
+	dst = c->remote_cache_list;
+
+	spin_lock(&dst->remote_free.lock);
+	if (!dst->remote_free.list.head)
+		dst->remote_free.list.head = src->head;
+	else
+		set_freepointer(s, dst->remote_free.list.tail, src->head);
+	dst->remote_free.list.tail = src->tail;
+
+	src->head = NULL;
+	src->tail = NULL;
+	src->nr = 0;
+
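+	/*
+	 * Only raise remote_free_check when this flush takes the remote
+	 * free list across the freebatch threshold, so the owning CPU's
+	 * flag is written once per batch rather than on every flush.
+	 */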
+	if (dst->remote_free.list.nr < slab_freebatch(s))
+		set = 1;
+	else
+		set = 0;
+
+	dst->remote_free.list.nr += nr;
+
+	if (unlikely(dst->remote_free.list.nr >= slab_freebatch(s) && set))
+		dst->remote_free_check = 1;
+
+	spin_unlock(&dst->remote_free.lock);
+}
+
+/*
+ * Free an object to this CPU's remote free list.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c)
+{
+	struct kmlist *r;
+
+	/*
+	 * Our remote free list corresponds to a different list. Must
+	 * flush it and switch.
+	 */
+	if (page->list != c->remote_cache_list) {
+		flush_remote_free_cache(s, c);
+		c->remote_cache_list = page->list;
+	}
+
+	r = &c->rlist;
+	if (!r->head)
+		r->head = object;
+	else
+		set_freepointer(s, r->tail, object);
+	set_freepointer(s, object, NULL);
+	r->tail = object;
+	r->nr++;
+
+	if (unlikely(r->nr > slab_freebatch(s)))
+		flush_remote_free_cache(s, c);
+}
+#endif
+
+/*
+ * Main freeing path. Return the object to the appropriate local or remote
+ * free list.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void __slab_free(struct kmem_cache *s,
+		struct slqb_page *page, void *object)
+{
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_list *l;
+	int thiscpu = smp_processor_id();
+
+	c = get_cpu_slab(s, thiscpu);
+	l = &c->list;
+
+	slqb_stat_inc(l, FREE);
+
+	if (!NUMA_BUILD || !numa_platform ||
+			likely(slqb_page_to_nid(page) == numa_node_id())) {
+		/*
+		 * Freeing fastpath. Collects all local-node objects, not
+		 * just those allocated from our per-CPU list. This allows
+		 * fast transfer of objects from one CPU to another within
+		 * a given node.
+		 */
+		set_freepointer(s, object, l->freelist.head);
+		l->freelist.head = object;
+		if (!l->freelist.nr)
+			l->freelist.tail = object;
+		l->freelist.nr++;
+
+		if (unlikely(l->freelist.nr > slab_hiwater(s)))
+			flush_free_list(s, l);
+
+#ifdef CONFIG_NUMA
+	} else {
+		/*
+		 * Freeing an object that was allocated on a remote node.
+		 */
+		slab_free_to_remote(s, page, object, c);
+		slqb_stat_inc(l, FREE_REMOTE);
+#endif
+	}
+}
+
+/*
+ * Perform some interrupts-on processing around the main freeing path
+ * (debug checking).
+ */
+static __always_inline void slab_free(struct kmem_cache *s,
+		struct slqb_page *page, void *object)
+{
+	unsigned long flags;
+
+	prefetchw(object);
+
+	debug_check_no_locks_freed(object, s->objsize);
+	if (likely(object) && unlikely(slab_debug(s))) {
+		if (unlikely(!free_debug_processing(s, object, __builtin_return_address(0))))
+			return;
+	}
+
+	local_irq_save(flags);
+	__slab_free(s, page, object);
+	local_irq_restore(flags);
+}
+
+void kmem_cache_free(struct kmem_cache *s, void *object)
+{
+	struct slqb_page *page = NULL;
+	if (numa_platform)
+		page = virt_to_head_slqb_page(object);
+	slab_free(s, page, object);
+}
+EXPORT_SYMBOL(kmem_cache_free);
+
+/*
+ * Calculate the order of allocation given a slab object size.
+ *
+ * Order 0 allocations are preferred since order 0 does not cause fragmentation
+ * in the page allocator and has fastpaths there. But we also want to limit the
+ * space wasted within each slab when objects are large, so higher orders may
+ * be used for those.
+ */
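+/*
+ * Worked example (assuming 4KB pages): a 1400-byte object wastes 1296 bytes
+ * in an order-0 page and 1296 * 4 > 4096, so slab_order(1400, 1, 4) moves on
+ * to order 1, where 5 objects waste 1192 bytes and 1192 * 4 <= 8192.
+ */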
+static inline int slab_order(int size, int max_order, int frac)
+{
+	int order;
+
+	if (fls(size - 1) <= PAGE_SHIFT)
+		order = 0;
+	else
+		order = fls(size - 1) - PAGE_SHIFT;
+	while (order <= max_order) {
+		unsigned long slab_size = PAGE_SIZE << order;
+		unsigned long objects;
+		unsigned long waste;
+
+		objects = slab_size / size;
+		if (!objects) {
+			order++;
+			continue;
+		}
+
+		waste = slab_size - (objects * size);
+
+		if (waste * frac <= slab_size)
+			break;
+
+		order++;
+	}
+
+	return order;
+}
+
+static inline int calculate_order(int size)
+{
+	int order;
+
+	/*
+	 * Attempt to find best configuration for a slab. This
+	 * works by first attempting to generate a layout with
+	 * the best configuration and backing off gradually.
+	 */
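+	/*
+	 * In other words: first try to stay at order 0 or 1 while wasting at
+	 * most a quarter of the slab; failing that, accept any order up to
+	 * MAX_ORDER with no constraint on wasted space.
+	 */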
+	order = slab_order(size, 1, 4);
+	if (order <= 1)
+		return order;
+
+	/*
+	 * This size cannot fit in order-1. Allow bigger orders, but
+	 * forget about trying to save space.
+	 */
+	order = slab_order(size, MAX_ORDER, 0);
+	if (order <= MAX_ORDER)
+		return order;
+
+	return -ENOSYS;
+}
+
+/*
+ * Figure out what the alignment of the objects will be.
+ */
+static unsigned long calculate_alignment(unsigned long flags,
+		unsigned long align, unsigned long size)
+{
+	/*
+	 * If the user wants hardware cache aligned objects then follow that
+	 * suggestion if the object is sufficiently large.
+	 *
+	 * The hardware cache alignment cannot override the specified
+	 * alignment though. If that is greater then use it.
+	 */
+	if (flags & SLAB_HWCACHE_ALIGN) {
+		unsigned long ralign = cache_line_size();
+		while (size <= ralign / 2)
+			ralign /= 2;
+		align = max(align, ralign);
+	}
+
+	if (align < ARCH_SLAB_MINALIGN)
+		align = ARCH_SLAB_MINALIGN;
+
+	return ALIGN(align, sizeof(void *));
+}
+
+static void init_kmem_cache_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	l->cache = s;
+	l->freelist.nr = 0;
+	l->freelist.head = NULL;
+	l->freelist.tail = NULL;
+	l->nr_partial = 0;
+	l->nr_slabs = 0;
+	INIT_LIST_HEAD(&l->partial);
+//	INIT_LIST_HEAD(&l->full);
+
+#ifdef CONFIG_SMP
+	l->remote_free_check = 0;
+	spin_lock_init(&l->remote_free.lock);
+	l->remote_free.list.nr = 0;
+	l->remote_free.list.head = NULL;
+	l->remote_free.list.tail = NULL;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+	memset(l->stats, 0, sizeof(l->stats));
+#endif
+}
+
+static void init_kmem_cache_cpu(struct kmem_cache *s,
+			struct kmem_cache_cpu *c)
+{
+	init_kmem_cache_list(s, &c->list);
+
+	c->colour_next = 0;
+#ifdef CONFIG_SMP
+	c->rlist.nr = 0;
+	c->rlist.head = NULL;
+	c->rlist.tail = NULL;
+	c->remote_cache_list = NULL;
+#endif
+}
+
+#ifdef CONFIG_NUMA
+static void init_kmem_cache_node(struct kmem_cache *s, struct kmem_cache_node *n)
+{
+	spin_lock_init(&n->list_lock);
+	init_kmem_cache_list(s, &n->list);
+}
+#endif
+
+/* Initial slabs */
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
+#endif
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache kmem_cpu_cache;
+static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
+#endif
+#endif
+
+#ifdef CONFIG_NUMA
+static struct kmem_cache kmem_node_cache;
+static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
+static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
+{
+	struct kmem_cache_cpu *c;
+
+	c = kmem_cache_alloc_node(&kmem_cpu_cache, GFP_KERNEL, cpu_to_node(cpu));
+	if (!c)
+		return NULL;
+
+	init_kmem_cache_cpu(s, c);
+	return c;
+}
+
+static void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+
+		c = s->cpu_slab[cpu];
+		if (c) {
+			kmem_cache_free(&kmem_cpu_cache, c);
+			s->cpu_slab[cpu] = NULL;
+		}
+	}
+}
+
+static int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+
+		c = s->cpu_slab[cpu];
+		if (c)
+			continue;
+
+		c = alloc_kmem_cache_cpu(s, cpu);
+		if (!c) {
+			free_kmem_cache_cpus(s);
+			return 0;
+		}
+		s->cpu_slab[cpu] = c;
+	}
+	return 1;
+}
+
+#else
+static inline void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+	init_kmem_cache_cpu(s, &s->cpu_slab);
+	return 1;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = s->node[node];
+		if (n) {
+			kmem_cache_free(&kmem_node_cache, n);
+			s->node[node] = NULL;
+		}
+	}
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = kmem_cache_alloc_node(&kmem_node_cache, GFP_KERNEL, node);
+		if (!n) {
+			free_kmem_cache_nodes(s);
+			return 0;
+		}
+		init_kmem_cache_node(s, n);
+		s->node[node] = n;
+	}
+	return 1;
+}
+#else
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+	return 1;
+}
+#endif
+
+/*
+ * calculate_sizes() determines the order and the distribution of data within
+ * a slab object.
+ */
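+/*
+ * Rough layout of a debug-enabled object (a sketch; fields are only present
+ * when the corresponding flags are set):
+ *
+ *	object data (objsize, rounded up to a word)
+ *	red zone word (SLAB_RED_ZONE, if the rounding left no spare space)
+ *	free pointer (when the object's first word must not be overwritten)
+ *	2 x struct track (SLAB_STORE_USER)
+ *	red zone padding word (SLAB_RED_ZONE)
+ *	padding up to the cache alignment
+ */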
+static int calculate_sizes(struct kmem_cache *s)
+{
+	unsigned long flags = s->flags;
+	unsigned long size = s->objsize;
+	unsigned long align = s->align;
+
+	/*
+	 * Determine if we can poison the object itself. If the user of
+	 * the slab may touch the object after free or before allocation
+	 * then we should never poison the object itself.
+	 */
+	if (slab_poison(s) && !(flags & SLAB_DESTROY_BY_RCU) && !s->ctor)
+		s->flags |= __OBJECT_POISON;
+	else
+		s->flags &= ~__OBJECT_POISON;
+
+	/*
+	 * Round up object size to the next word boundary. We can only
+	 * place the free pointer at word boundaries and this determines
+	 * the possible location of the free pointer.
+	 */
+	size = ALIGN(size, sizeof(void *));
+
+#ifdef CONFIG_SLQB_DEBUG
+	/*
+	 * If we are Redzoning then check if there is some space between the
+	 * end of the object and the free pointer. If not then add an
+	 * additional word to have some bytes to store Redzone information.
+	 */
+	if ((flags & SLAB_RED_ZONE) && size == s->objsize)
+		size += sizeof(void *);
+#endif
+
+	/*
+	 * With that we have determined the number of bytes in actual use
+	 * by the object. This is the potential offset to the free pointer.
+	 */
+	s->inuse = size;
+
+	if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) || s->ctor)) {
+		/*
+		 * Relocate free pointer after the object if it is not
+		 * permitted to overwrite the first word of the object on
+		 * kmem_cache_free.
+		 *
+		 * This is the case if we use RCU, have a constructor, or
+		 * are poisoning the objects.
+		 */
+		s->offset = size;
+		size += sizeof(void *);
+	}
+
+#ifdef CONFIG_SLQB_DEBUG
+	if (flags & SLAB_STORE_USER)
+		/*
+		 * Need to store information about allocs and frees after
+		 * the object.
+		 */
+		size += 2 * sizeof(struct track);
+
+	if (flags & SLAB_RED_ZONE)
+		/*
+		 * Add some empty padding so that we can catch
+		 * overwrites from earlier objects rather than let
+		 * tracking information or the free pointer be
+		 * corrupted if a user writes before the start
+		 * of the object.
+		 */
+		size += sizeof(void *);
+#endif
+
+	/*
+	 * Determine the alignment based on various parameters that the
+	 * user specified and the dynamic determination of cache line size
+	 * on bootup.
+	 */
+	align = calculate_alignment(flags, align, s->objsize);
+
+	/*
+	 * SLQB stores one object immediately after another beginning from
+	 * offset 0. In order to align the objects we have to simply size
+	 * each object to conform to the alignment.
+	 */
+	size = ALIGN(size, align);
+	s->size = size;
+	s->order = calculate_order(size);
+
+	if (s->order < 0)
+		return 0;
+
+	s->allocflags = 0;
+	if (s->order)
+		s->allocflags |= __GFP_COMP;
+
+	if (s->flags & SLAB_CACHE_DMA)
+		s->allocflags |= SLQB_DMA;
+
+	if (s->flags & SLAB_RECLAIM_ACCOUNT)
+		s->allocflags |= __GFP_RECLAIMABLE;
+
+	/*
+	 * Determine the number of objects per slab
+	 */
+	s->objects = (PAGE_SIZE << s->order) / size;
+
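+	/*
+	 * Eg. with 4KB pages and 256-byte objects this gives freebatch = 256
+	 * and hiwater = 1024 objects on the per-CPU freelist before a flush.
+	 */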
+	s->freebatch = max(4UL*PAGE_SIZE / size, min(256UL, 64*PAGE_SIZE / size));
+	if (!s->freebatch)
+		s->freebatch = 1;
+	s->hiwater = s->freebatch << 2;
+
+	return !!s->objects;
+}
+
+static int kmem_cache_open(struct kmem_cache *s,
+		const char *name, size_t size,
+		size_t align, unsigned long flags,
+		void (*ctor)(void *), int alloc)
+{
+	unsigned int left_over;
+
+	memset(s, 0, kmem_size);
+	s->name = name;
+	s->ctor = ctor;
+	s->objsize = size;
+	s->align = align;
+	s->flags = kmem_cache_flags(size, flags, name, ctor);
+
+	if (!calculate_sizes(s))
+		goto error;
+
+	if (!slab_debug(s)) {
+		left_over = (PAGE_SIZE << s->order) - (s->objects * s->size);
+		s->colour_off = max(cache_line_size(), s->align);
+		s->colour_range = left_over;
+	} else {
+		s->colour_off = 0;
+		s->colour_range = 0;
+	}
+
+	if (likely(alloc)) {
+		if (!alloc_kmem_cache_nodes(s))
+			goto error;
+
+		if (!alloc_kmem_cache_cpus(s))
+			goto error_nodes;
+	}
+
+	/* XXX: perform some basic checks like SLAB does, eg. duplicate names */
+	down_write(&slqb_lock);
+	sysfs_slab_add(s);
+	list_add(&s->list, &slab_caches);
+	up_write(&slqb_lock);
+
+	return 1;
+
+error_nodes:
+	free_kmem_cache_nodes(s);
+error:
+	if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+	return 0;
+}
+
+/*
+ * Check if a given pointer is valid
+ */
+int kmem_ptr_validate(struct kmem_cache *s, const void *object)
+{
+	struct slqb_page *page = virt_to_head_slqb_page(object);
+
+	if (!(page->flags & PG_SLQB_BIT))
+		return 0;
+
+	/*
+	 * We could also check if the object is on the slab's freelist.
+	 * But this would be too expensive and it seems that the main
+	 * purpose of kmem_ptr_validate() is to check if the object belongs
+	 * to a certain slab.
+	 */
+	return 1;
+}
+EXPORT_SYMBOL(kmem_ptr_validate);
+
+/*
+ * Determine the size of a slab object
+ */
+unsigned int kmem_cache_size(struct kmem_cache *s)
+{
+	return s->objsize;
+}
+EXPORT_SYMBOL(kmem_cache_size);
+
+const char *kmem_cache_name(struct kmem_cache *s)
+{
+	return s->name;
+}
+EXPORT_SYMBOL(kmem_cache_name);
+
+/*
+ * Release all resources used by a slab cache. No more concurrency on the
+ * slab, so we can touch remote kmem_cache_cpu structures.
+ */
+void kmem_cache_destroy(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+	int cpu;
+
+	down_write(&slqb_lock);
+	list_del(&s->list);
+	up_write(&slqb_lock);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		flush_free_list_all(s, l);
+		flush_remote_free_cache(s, c);
+	}
+#endif
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+		claim_remote_free_list(s, l);
+#endif
+		flush_free_list_all(s, l);
+
+		WARN_ON(l->freelist.nr);
+		WARN_ON(l->nr_slabs);
+		WARN_ON(l->nr_partial);
+	}
+
+	free_kmem_cache_cpus(s);
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		claim_remote_free_list(s, l);
+		flush_free_list_all(s, l);
+
+		WARN_ON(l->freelist.nr);
+		WARN_ON(l->nr_slabs);
+		WARN_ON(l->nr_partial);
+	}
+
+	free_kmem_cache_nodes(s);
+#endif
+
+	sysfs_slab_remove(s);
+}
+EXPORT_SYMBOL(kmem_cache_destroy);
+
+/********************************************************************
+ *		Kmalloc subsystem
+ *******************************************************************/
+
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches);
+
+#ifdef CONFIG_ZONE_DMA
+struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches_dma);
+#endif
+
+#ifndef ARCH_KMALLOC_FLAGS
+#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
+#endif
+
+static struct kmem_cache *open_kmalloc_cache(struct kmem_cache *s,
+		const char *name, int size, gfp_t gfp_flags)
+{
+	unsigned int flags = ARCH_KMALLOC_FLAGS | SLAB_PANIC;
+
+	if (gfp_flags & SLQB_DMA)
+		flags |= SLAB_CACHE_DMA;
+
+	kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN, flags, NULL, 1);
+
+	return s;
+}
+
+/*
+ * Conversion table for small slab sizes / 8 to the index in the
+ * kmalloc array. This is necessary for slabs < 192 since we have
+ * non-power-of-two cache sizes there. The size of larger slabs can be
+ * determined using fls.
+ */
+static s8 size_index[24] __cacheline_aligned = {
+	3,	/* 8 */
+	4,	/* 16 */
+	5,	/* 24 */
+	5,	/* 32 */
+	6,	/* 40 */
+	6,	/* 48 */
+	6,	/* 56 */
+	6,	/* 64 */
+#if L1_CACHE_BYTES < 64
+	1,	/* 72 */
+	1,	/* 80 */
+	1,	/* 88 */
+	1,	/* 96 */
+#else
+	7,
+	7,
+	7,
+	7,
+#endif
+	7,	/* 104 */
+	7,	/* 112 */
+	7,	/* 120 */
+	7,	/* 128 */
+#if L1_CACHE_BYTES < 128
+	2,	/* 136 */
+	2,	/* 144 */
+	2,	/* 152 */
+	2,	/* 160 */
+	2,	/* 168 */
+	2,	/* 176 */
+	2,	/* 184 */
+	2	/* 192 */
+#else
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1
+#endif
+};
+
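+/*
+ * Example lookups (independent of L1 cache size): kmalloc(100) maps via
+ * size_index[(100 - 1) / 8] == size_index[12] == 7 to the kmalloc-128 cache;
+ * kmalloc(1000) takes the fls() path, fls(999) == 10, ie. kmalloc-1024.
+ */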
+static struct kmem_cache *get_slab(size_t size, gfp_t flags)
+{
+	int index;
+
+#if L1_CACHE_BYTES >= 128
+	if (size <= 128) {
+#else
+	if (size <= 192) {
+#endif
+		if (unlikely(!size))
+			return ZERO_SIZE_PTR;
+
+		index = size_index[(size - 1) / 8];
+	} else
+		index = fls(size - 1);
+
+	if (unlikely((flags & SLQB_DMA)))
+		return &kmalloc_caches_dma[index];
+	else
+		return &kmalloc_caches[index];
+}
+
+void *__kmalloc(size_t size, gfp_t flags)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return kmem_cache_alloc(s, flags);
+}
+EXPORT_SYMBOL(__kmalloc);
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return kmem_cache_alloc_node(s, flags, node);
+}
+EXPORT_SYMBOL(__kmalloc_node);
+#endif
+
+size_t ksize(const void *object)
+{
+	struct slqb_page *page;
+	struct kmem_cache *s;
+
+	BUG_ON(!object);
+	if (unlikely(object == ZERO_SIZE_PTR))
+		return 0;
+
+	page = virt_to_head_slqb_page(object);
+	BUG_ON(!(page->flags & PG_SLQB_BIT));
+
+	s = page->list->cache;
+
+	/*
+	 * Debugging requires use of the padding between the object
+	 * and whatever may come after it.
+	 */
+	if (s->flags & (SLAB_RED_ZONE | SLAB_POISON))
+		return s->objsize;
+
+	/*
+	 * If we have the need to store the freelist pointer
+	 * back there or track user information then we can
+	 * only use the space before that information.
+	 */
+	if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
+		return s->inuse;
+
+	/*
+	 * Else we can use all the padding etc for the allocation
+	 */
+	return s->size;
+}
+EXPORT_SYMBOL(ksize);
+
+void kfree(const void *object)
+{
+	struct kmem_cache *s;
+	struct slqb_page *page;
+
+	if (unlikely(ZERO_OR_NULL_PTR(object)))
+		return;
+
+	page = virt_to_head_slqb_page(object);
+	s = page->list->cache;
+
+	slab_free(s, page, (void *)object);
+}
+EXPORT_SYMBOL(kfree);
+
+static void kmem_cache_trim_percpu(void *arg)
+{
+	int cpu = smp_processor_id();
+	struct kmem_cache *s = arg;
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+	claim_remote_free_list(s, l);
+#endif
+	flush_free_list(s, l);
+#ifdef CONFIG_SMP
+	flush_remote_free_cache(s, c);
+#endif
+}
+
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+
+	on_each_cpu(kmem_cache_trim_percpu, s, 1);
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		spin_lock_irq(&n->list_lock);
+		claim_remote_free_list(s, l);
+		flush_free_list(s, l);
+		spin_unlock_irq(&n->list_lock);
+	}
+#endif
+
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_shrink);
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
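+/*
+ * Reaping is done in two passes: pass 0 flushes each CPU's local free list
+ * and pushes its remotely-freed objects back to their home lists; pass 1
+ * then claims those remote frees and flushes again, so objects that were in
+ * transit between CPUs during pass 0 are not left behind.
+ */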
+static void kmem_cache_reap_percpu(void *arg)
+{
+	int cpu = smp_processor_id();
+	struct kmem_cache *s;
+	long phase = (long)arg;
+
+	list_for_each_entry(s, &slab_caches, list) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		if (phase == 0) {
+			flush_free_list_all(s, l);
+			flush_remote_free_cache(s, c);
+		}
+
+		if (phase == 1) {
+			claim_remote_free_list(s, l);
+			flush_free_list_all(s, l);
+		}
+	}
+}
+
+static void kmem_cache_reap(void)
+{
+	struct kmem_cache *s;
+	int node;
+
+	down_read(&slqb_lock);
+	on_each_cpu(kmem_cache_reap_percpu, (void *)0, 1);
+	on_each_cpu(kmem_cache_reap_percpu, (void *)1, 1);
+
+	list_for_each_entry(s, &slab_caches, list) {
+		for_each_node_state(node, N_NORMAL_MEMORY) {
+			struct kmem_cache_node *n = s->node[node];
+			struct kmem_cache_list *l = &n->list;
+
+			spin_lock_irq(&n->list_lock);
+			claim_remote_free_list(s, l);
+			flush_free_list_all(s, l);
+			spin_unlock_irq(&n->list_lock);
+		}
+	}
+	up_read(&slqb_lock);
+}
+#endif
+
+static void cache_trim_worker(struct work_struct *w)
+{
+	struct delayed_work *work =
+		container_of(w, struct delayed_work, work);
+	struct kmem_cache *s;
+	int node;
+
+	if (!down_read_trylock(&slqb_lock))
+		goto out;
+
+	node = numa_node_id();
+	list_for_each_entry(s, &slab_caches, list) {
+#ifdef CONFIG_NUMA
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		spin_lock_irq(&n->list_lock);
+		claim_remote_free_list(s, l);
+		flush_free_list(s, l);
+		spin_unlock_irq(&n->list_lock);
+#endif
+
+		local_irq_disable();
+		kmem_cache_trim_percpu(s);
+		local_irq_enable();
+	}
+
+	up_read(&slqb_lock);
+out:
+	schedule_delayed_work(work, round_jiffies_relative(3*HZ));
+}
+
+static DEFINE_PER_CPU(struct delayed_work, cache_trim_work);
+
+static void __cpuinit start_cpu_timer(int cpu)
+{
+	struct delayed_work *cache_trim_work = &per_cpu(cache_trim_work, cpu);
+
+	/*
+	 * When this gets called from do_initcalls via cpucache_init(),
+	 * init_workqueues() has already run, so keventd will already be
+	 * set up at that time.
+	 */
+	if (keventd_up() && cache_trim_work->work.func == NULL) {
+		INIT_DELAYED_WORK(cache_trim_work, cache_trim_worker);
+		schedule_delayed_work_on(cpu, cache_trim_work,
+					__round_jiffies_relative(HZ, cpu));
+	}
+}
+
+static int __init cpucache_init(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		start_cpu_timer(cpu);
+	return 0;
+}
+__initcall(cpucache_init);
+
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void slab_mem_going_offline_callback(void *arg)
+{
+	kmem_cache_reap();
+}
+
+static void slab_mem_offline_callback(void *arg)
+{
+	struct kmem_cache *s;
+	struct memory_notify *marg = arg;
+	int nid = marg->status_change_nid;
+
+	/*
+	 * If the node still has available memory, we still need its
+	 * kmem_cache_node, so there is nothing to do here.
+	 */
+	if (nid < 0)
+		return;
+
+#if 0 // XXX: see cpu offline comment
+	down_read(&slqb_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		struct kmem_cache_node *n;
+		n = s->node[nid];
+		if (n) {
+			s->node[nid] = NULL;
+			kmem_cache_free(&kmem_node_cache, n);
+		}
+	}
+	up_read(&slqb_lock);
+#endif
+}
+
+static int slab_mem_going_online_callback(void *arg)
+{
+	struct kmem_cache *s;
+	struct kmem_cache_node *n;
+	struct memory_notify *marg = arg;
+	int nid = marg->status_change_nid;
+	int ret = 0;
+
+	/*
+	 * If the node's memory is already available, then kmem_cache_node is
+	 * already created. Nothing to do.
+	 */
+	if (nid < 0)
+		return 0;
+
+	/*
+	 * We are bringing a node online. No memory is available yet. We must
+	 * allocate a kmem_cache_node structure in order to bring the node
+	 * online.
+	 */
+	down_read(&slqb_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		/*
+		 * XXX: kmem_cache_alloc_node will fall back to other nodes
+		 *      since memory is not yet available from the node that
+		 *      is brought up.
+		 */
+		n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
+		if (!n) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		init_kmem_cache_node(s, n);
+		s->node[nid] = n;
+	}
+out:
+	up_read(&slqb_lock);
+	return ret;
+}
+
+static int slab_memory_callback(struct notifier_block *self,
+				unsigned long action, void *arg)
+{
+	int ret = 0;
+
+	switch (action) {
+	case MEM_GOING_ONLINE:
+		ret = slab_mem_going_online_callback(arg);
+		break;
+	case MEM_GOING_OFFLINE:
+		slab_mem_going_offline_callback(arg);
+		break;
+	case MEM_OFFLINE:
+	case MEM_CANCEL_ONLINE:
+		slab_mem_offline_callback(arg);
+		break;
+	case MEM_ONLINE:
+	case MEM_CANCEL_OFFLINE:
+		break;
+	}
+
+	ret = notifier_from_errno(ret);
+	return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+/********************************************************************
+ *			Basic setup of slabs
+ *******************************************************************/
+
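+/*
+ * Boot ordering, roughly: the internal kmem_cache, kmem_cache_cpu and
+ * kmem_cache_node caches are opened first using statically allocated per-CPU
+ * and per-node structures (alloc == 0), so that the kmalloc caches and later
+ * kmem_cache_create() callers can allocate their metadata dynamically.
+ */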
+void __init kmem_cache_init(void)
+{
+	int i;
+	unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
+
+#ifdef CONFIG_NUMA
+	if (num_possible_nodes() == 1)
+		numa_platform = 0;
+	else
+		numa_platform = 1;
+#endif
+
+#ifdef CONFIG_SMP
+	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
+				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#else
+	kmem_size = sizeof(struct kmem_cache);
+#endif
+
+	kmem_cache_open(&kmem_cache_cache, "kmem_cache", kmem_size, 0, flags, NULL, 0);
+#ifdef CONFIG_SMP
+	kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu", sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_cache_open(&kmem_node_cache, "kmem_cache_node", sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
+#endif
+
+#ifdef CONFIG_SMP
+	for_each_possible_cpu(i) {
+		init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cpus[i]);
+		kmem_cache_cache.cpu_slab[i] = &kmem_cache_cpus[i];
+
+		init_kmem_cache_cpu(&kmem_cpu_cache, &kmem_cpu_cpus[i]);
+		kmem_cpu_cache.cpu_slab[i] = &kmem_cpu_cpus[i];
+
+#ifdef CONFIG_NUMA
+		init_kmem_cache_cpu(&kmem_node_cache, &kmem_node_cpus[i]);
+		kmem_node_cache.cpu_slab[i] = &kmem_node_cpus[i];
+#endif
+	}
+#else
+	init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cache.cpu_slab);
+#endif
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(i, N_NORMAL_MEMORY) {
+		init_kmem_cache_node(&kmem_cache_cache, &kmem_cache_nodes[i]);
+		kmem_cache_cache.node[i] = &kmem_cache_nodes[i];
+
+		init_kmem_cache_node(&kmem_cpu_cache, &kmem_cpu_nodes[i]);
+		kmem_cpu_cache.node[i] = &kmem_cpu_nodes[i];
+
+		init_kmem_cache_node(&kmem_node_cache, &kmem_node_nodes[i]);
+		kmem_node_cache.node[i] = &kmem_node_nodes[i];
+	}
+#endif
+
+	/* Caches that are not of the two-to-the-power-of size */
+	if (L1_CACHE_BYTES < 64 && KMALLOC_MIN_SIZE <= 64) {
+		open_kmalloc_cache(&kmalloc_caches[1],
+				"kmalloc-96", 96, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[1],
+				"kmalloc_dma-96", 96, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+	if (L1_CACHE_BYTES < 128 && KMALLOC_MIN_SIZE <= 128) {
+		open_kmalloc_cache(&kmalloc_caches[2],
+				"kmalloc-192", 192, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[2],
+				"kmalloc_dma-192", 192, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+		open_kmalloc_cache(&kmalloc_caches[i],
+			"kmalloc", 1 << i, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[i],
+				"kmalloc_dma", 1 << i, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+
+
+	/*
+	 * Patch up the size_index table if we have strange large alignment
+	 * requirements for the kmalloc array. This is only the case for
+	 * mips it seems. The standard arches will not generate any code here.
+	 *
+	 * Largest permitted alignment is 256 bytes due to the way we
+	 * handle the index determination for the smaller caches.
+	 *
+	 * Make sure that nothing crazy happens if someone starts tinkering
+	 * around with ARCH_KMALLOC_MINALIGN
+	 */
+	BUILD_BUG_ON(KMALLOC_MIN_SIZE > 256 ||
+		(KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
+
+	for (i = 8; i < KMALLOC_MIN_SIZE; i += 8)
+		size_index[(i - 1) / 8] = KMALLOC_SHIFT_LOW;
+
+	/* Provide the correct kmalloc names now that the caches are up */
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+		kmalloc_caches[i].name =
+			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
+#ifdef CONFIG_ZONE_DMA
+		kmalloc_caches_dma[i].name =
+			kasprintf(GFP_KERNEL, "kmalloc_dma-%d", 1 << i);
+#endif
+	}
+
+#ifdef CONFIG_SMP
+	register_cpu_notifier(&slab_notifier);
+#endif
+#ifdef CONFIG_NUMA
+	hotplug_memory_notifier(slab_memory_callback, 1);
+#endif
+	/*
+	 * smp_init() has not yet been called, so no worries about memory
+	 * ordering here (eg. slab_is_available vs numa_platform)
+	 */
+	__slab_is_available = 1;
+}
+
+struct kmem_cache *kmem_cache_create(const char *name, size_t size,
+		size_t align, unsigned long flags, void (*ctor)(void *))
+{
+	struct kmem_cache *s;
+
+	s = kmem_cache_alloc(&kmem_cache_cache, GFP_KERNEL);
+	if (!s)
+		goto err;
+
+	if (kmem_cache_open(s, name, size, align, flags, ctor, 1))
+		return s;
+
+	kmem_cache_free(&kmem_cache_cache, s);
+
+err:
+	if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+	return NULL;
+}
+EXPORT_SYMBOL(kmem_cache_create);
+
+#ifdef CONFIG_SMP
+/*
+ * Use the cpu notifier to ensure that the cpu slabs are flushed when
+ * necessary.
+ */
+static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
+		unsigned long action, void *hcpu)
+{
+	long cpu = (long)hcpu;
+	struct kmem_cache *s;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		down_read(&slqb_lock);
+		list_for_each_entry(s, &slab_caches, list) {
+			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu);
+			if (!s->cpu_slab[cpu]) {
+				up_read(&slqb_lock);
+				return NOTIFY_BAD;
+			}
+		}
+		up_read(&slqb_lock);
+		break;
+
+	case CPU_ONLINE:
+	case CPU_ONLINE_FROZEN:
+	case CPU_DOWN_FAILED:
+	case CPU_DOWN_FAILED_FROZEN:
+		start_cpu_timer(cpu);
+		break;
+
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		cancel_rearming_delayed_work(&per_cpu(cache_trim_work, cpu));
+		per_cpu(cache_trim_work, cpu).work.func = NULL;
+		break;
+
+	case CPU_UP_CANCELED:
+	case CPU_UP_CANCELED_FROZEN:
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+#if 0
+		down_read(&slqb_lock);
+		/* XXX: this doesn't work because objects can still be on this
+		 * CPU's list. The periodic timer needs to check whether a CPU
+		 * is offline and then clean up from there. Same for node
+		 * offline.
+		 */
+		list_for_each_entry(s, &slab_caches, list) {
+			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+			if (c) {
+				kmem_cache_free(&kmem_cpu_cache, c);
+				s->cpu_slab[cpu] = NULL;
+			}
+		}
+
+		up_read(&slqb_lock);
+#endif
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata slab_notifier = {
+	.notifier_call = slab_cpuup_callback
+};
+
+#endif
+
+#ifdef CONFIG_SLQB_DEBUG
+void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
+{
+	struct kmem_cache *s;
+	int node = -1;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+#ifdef CONFIG_NUMA
+	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+		node = alternate_nid(s, flags, node);
+#endif
+	return slab_alloc(s, flags, node, (void *)caller);
+}
+
+void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
+				unsigned long caller)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return slab_alloc(s, flags, node, (void *)caller);
+}
+#endif
+
+#if defined(CONFIG_SLQB_SYSFS) || defined(CONFIG_SLABINFO)
+struct stats_gather {
+	struct kmem_cache *s;
+	spinlock_t lock;
+	unsigned long nr_slabs;
+	unsigned long nr_partial;
+	unsigned long nr_inuse;
+	unsigned long nr_objects;
+
+#ifdef CONFIG_SLQB_STATS
+	unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+};
+
+static void __gather_stats(void *arg)
+{
+	unsigned long nr_slabs;
+	unsigned long nr_partial;
+	unsigned long nr_inuse;
+	struct stats_gather *gather = arg;
+	int cpu = smp_processor_id();
+	struct kmem_cache *s = gather->s;
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_list *l = &c->list;
+	struct slqb_page *page;
+#ifdef CONFIG_SLQB_STATS
+	int i;
+#endif
+
+	nr_slabs = l->nr_slabs;
+	nr_partial = l->nr_partial;
+	nr_inuse = (nr_slabs - nr_partial) * s->objects;
+
+	list_for_each_entry(page, &l->partial, lru) {
+		nr_inuse += page->inuse;
+	}
+
+	spin_lock(&gather->lock);
+	gather->nr_slabs += nr_slabs;
+	gather->nr_partial += nr_partial;
+	gather->nr_inuse += nr_inuse;
+#ifdef CONFIG_SLQB_STATS
+	for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+		gather->stats[i] += l->stats[i];
+	}
+#endif
+	spin_unlock(&gather->lock);
+}
+
+static void gather_stats(struct kmem_cache *s, struct stats_gather *stats)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+
+	memset(stats, 0, sizeof(struct stats_gather));
+	stats->s = s;
+	spin_lock_init(&stats->lock);
+
+	on_each_cpu(__gather_stats, stats, 1);
+
+#ifdef CONFIG_NUMA
+	for_each_online_node(node) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+		struct slqb_page *page;
+		unsigned long flags;
+#ifdef CONFIG_SLQB_STATS
+		int i;
+#endif
+
+		spin_lock_irqsave(&n->list_lock, flags);
+#ifdef CONFIG_SLQB_STATS
+		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+			stats->stats[i] += l->stats[i];
+		}
+#endif
+		stats->nr_slabs += l->nr_slabs;
+		stats->nr_partial += l->nr_partial;
+		stats->nr_inuse += (l->nr_slabs - l->nr_partial) * s->objects;
+
+		list_for_each_entry(page, &l->partial, lru) {
+			stats->nr_inuse += page->inuse;
+		}
+		spin_unlock_irqrestore(&n->list_lock, flags);
+	}
+#endif
+
+	stats->nr_objects = stats->nr_slabs * s->objects;
+}
+#endif
+
+/*
+ * The /proc/slabinfo ABI
+ */
+#ifdef CONFIG_SLABINFO
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+ssize_t slabinfo_write(struct file *file, const char __user * buffer,
+		       size_t count, loff_t *ppos)
+{
+	return -EINVAL;
+}
+
+static void print_slabinfo_header(struct seq_file *m)
+{
+	seq_puts(m, "slabinfo - version: 2.1\n");
+	seq_puts(m, "# name	    <active_objs> <num_objs> <objsize> "
+		 "<objperslab> <pagesperslab>");
+	seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
+	seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
+	seq_putc(m, '\n');
+}
+
+static void *s_start(struct seq_file *m, loff_t *pos)
+{
+	loff_t n = *pos;
+
+	down_read(&slqb_lock);
+	if (!n)
+		print_slabinfo_header(m);
+
+	return seq_list_start(&slab_caches, *pos);
+}
+
+static void *s_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	return seq_list_next(p, &slab_caches, pos);
+}
+
+static void s_stop(struct seq_file *m, void *p)
+{
+	up_read(&slqb_lock);
+}
+
+static int s_show(struct seq_file *m, void *p)
+{
+	struct stats_gather stats;
+	struct kmem_cache *s;
+
+	s = list_entry(p, struct kmem_cache, list);
+
+	gather_stats(s, &stats);
+
+	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, stats.nr_inuse,
+		   stats.nr_objects, s->size, s->objects, (1 << s->order));
+	seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s), slab_freebatch(s), 0);
+	seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs, stats.nr_slabs,
+		   0UL);
+	seq_putc(m, '\n');
+	return 0;
+}
+
+static const struct seq_operations slabinfo_op = {
+	.start = s_start,
+	.next = s_next,
+	.stop = s_stop,
+	.show = s_show,
+};
+
+static int slabinfo_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &slabinfo_op);
+}
+
+static const struct file_operations proc_slabinfo_operations = {
+	.open		= slabinfo_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static int __init slab_proc_init(void)
+{
+	proc_create("slabinfo",S_IWUSR|S_IRUGO,NULL,&proc_slabinfo_operations);
+	return 0;
+}
+module_init(slab_proc_init);
+#endif /* CONFIG_SLABINFO */
+
+#ifdef CONFIG_SLQB_SYSFS
+/*
+ * sysfs API
+ */
+#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
+#define to_slab(n) container_of(n, struct kmem_cache, kobj)
+
+struct slab_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct kmem_cache *s, char *buf);
+	ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
+};
+
+#define SLAB_ATTR_RO(_name) \
+	static struct slab_attribute _name##_attr = __ATTR_RO(_name)
+
+#define SLAB_ATTR(_name) \
+	static struct slab_attribute _name##_attr =  \
+	__ATTR(_name, 0644, _name##_show, _name##_store)
+
+static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->size);
+}
+SLAB_ATTR_RO(slab_size);
+
+static ssize_t align_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->align);
+}
+SLAB_ATTR_RO(align);
+
+static ssize_t object_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->objsize);
+}
+SLAB_ATTR_RO(object_size);
+
+static ssize_t objs_per_slab_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->objects);
+}
+SLAB_ATTR_RO(objs_per_slab);
+
+static ssize_t order_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->order);
+}
+SLAB_ATTR_RO(order);
+
+static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+{
+	if (s->ctor) {
+		int n = sprint_symbol(buf, (unsigned long)s->ctor);
+
+		return n + sprintf(buf + n, "\n");
+	}
+	return 0;
+}
+SLAB_ATTR_RO(ctor);
+
+static ssize_t slabs_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+	gather_stats(s, &stats);
+	return sprintf(buf, "%lu\n", stats.nr_slabs);
+}
+SLAB_ATTR_RO(slabs);
+
+static ssize_t objects_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+	gather_stats(s, &stats);
+	return sprintf(buf, "%lu\n", stats.nr_inuse);
+}
+SLAB_ATTR_RO(objects);
+
+static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+	gather_stats(s, &stats);
+	return sprintf(buf, "%lu\n", stats.nr_objects);
+}
+SLAB_ATTR_RO(total_objects);
+
+static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
+}
+SLAB_ATTR_RO(reclaim_account);
+
+static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_HWCACHE_ALIGN));
+}
+SLAB_ATTR_RO(hwcache_align);
+
+#ifdef CONFIG_ZONE_DMA
+static ssize_t cache_dma_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA));
+}
+SLAB_ATTR_RO(cache_dma);
+#endif
+
+static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
+}
+SLAB_ATTR_RO(destroy_by_rcu);
+
+static ssize_t red_zone_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE));
+}
+SLAB_ATTR_RO(red_zone);
+
+static ssize_t poison_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_POISON));
+}
+SLAB_ATTR_RO(poison);
+
+static ssize_t store_user_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_STORE_USER));
+}
+SLAB_ATTR_RO(store_user);
+
+static ssize_t hiwater_store(struct kmem_cache *s, const char *buf, size_t length)
+{
+	long hiwater;
+	int err;
+
+	err = strict_strtol(buf, 10, &hiwater);
+	if (err)
+		return err;
+
+	if (hiwater < 0)
+		return -EINVAL;
+
+	s->hiwater = hiwater;
+
+	return length;
+}
+
+static ssize_t hiwater_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", slab_hiwater(s));
+}
+SLAB_ATTR(hiwater);
+
+static ssize_t freebatch_store(struct kmem_cache *s, const char *buf, size_t length)
+{
+	long freebatch;
+	int err;
+
+	err = strict_strtol(buf, 10, &freebatch);
+	if (err)
+		return err;
+
+	if (freebatch <= 0 || freebatch - 1 > s->hiwater)
+		return -EINVAL;
+
+	s->freebatch = freebatch;
+
+	return length;
+}
+
+static ssize_t freebatch_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", slab_freebatch(s));
+}
+SLAB_ATTR(freebatch);
+#ifdef CONFIG_SLQB_STATS
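+/*
+ * Emits the summed counter followed by per-CPU values, eg. a line like
+ * "1234 C0=600 C1=634".
+ */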
+static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
+{
+	struct stats_gather stats;
+	int len;
+#ifdef CONFIG_SMP
+	int cpu;
+#endif
+
+	gather_stats(s, &stats);
+
+	len = sprintf(buf, "%lu", stats.stats[si]);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+		if (len < PAGE_SIZE - 20)
+			len += sprintf(buf + len, " C%d=%lu", cpu, l->stats[si]);
+	}
+#endif
+	return len + sprintf(buf + len, "\n");
+}
+
+#define STAT_ATTR(si, text) 					\
+static ssize_t text##_show(struct kmem_cache *s, char *buf)	\
+{								\
+	return show_stat(s, buf, si);				\
+}								\
+SLAB_ATTR_RO(text);						\
+
+STAT_ATTR(ALLOC, alloc);
+STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
+STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
+STAT_ATTR(FREE, free);
+STAT_ATTR(FREE_REMOTE, free_remote);
+STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
+STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
+STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
+STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
+STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
+STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
+STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
+STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
+STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
+#endif
+
+static struct attribute *slab_attrs[] = {
+	&slab_size_attr.attr,
+	&object_size_attr.attr,
+	&objs_per_slab_attr.attr,
+	&order_attr.attr,
+	&objects_attr.attr,
+	&total_objects_attr.attr,
+	&slabs_attr.attr,
+	&ctor_attr.attr,
+	&align_attr.attr,
+	&hwcache_align_attr.attr,
+	&reclaim_account_attr.attr,
+	&destroy_by_rcu_attr.attr,
+	&red_zone_attr.attr,
+	&poison_attr.attr,
+	&store_user_attr.attr,
+	&hiwater_attr.attr,
+	&freebatch_attr.attr,
+#ifdef CONFIG_ZONE_DMA
+	&cache_dma_attr.attr,
+#endif
+#ifdef CONFIG_SLQB_STATS
+	&alloc_attr.attr,
+	&alloc_slab_fill_attr.attr,
+	&alloc_slab_new_attr.attr,
+	&free_attr.attr,
+	&free_remote_attr.attr,
+	&flush_free_list_attr.attr,
+	&flush_free_list_objects_attr.attr,
+	&flush_free_list_remote_attr.attr,
+	&flush_slab_partial_attr.attr,
+	&flush_slab_free_attr.attr,
+	&flush_rfree_list_attr.attr,
+	&flush_rfree_list_objects_attr.attr,
+	&claim_remote_list_attr.attr,
+	&claim_remote_list_objects_attr.attr,
+#endif
+	NULL
+};
+
+static struct attribute_group slab_attr_group = {
+	.attrs = slab_attrs,
+};
+
+static ssize_t slab_attr_show(struct kobject *kobj,
+				struct attribute *attr,
+				char *buf)
+{
+	struct slab_attribute *attribute;
+	struct kmem_cache *s;
+	int err;
+
+	attribute = to_slab_attr(attr);
+	s = to_slab(kobj);
+
+	if (!attribute->show)
+		return -EIO;
+
+	err = attribute->show(s, buf);
+
+	return err;
+}
+
+static ssize_t slab_attr_store(struct kobject *kobj,
+				struct attribute *attr,
+				const char *buf, size_t len)
+{
+	struct slab_attribute *attribute;
+	struct kmem_cache *s;
+	int err;
+
+	attribute = to_slab_attr(attr);
+	s = to_slab(kobj);
+
+	if (!attribute->store)
+		return -EIO;
+
+	err = attribute->store(s, buf, len);
+
+	return err;
+}
+
+static void kmem_cache_release(struct kobject *kobj)
+{
+	struct kmem_cache *s = to_slab(kobj);
+
+	kmem_cache_free(&kmem_cache_cache, s);
+}
+
+static struct sysfs_ops slab_sysfs_ops = {
+	.show = slab_attr_show,
+	.store = slab_attr_store,
+};
+
+static struct kobj_type slab_ktype = {
+	.sysfs_ops = &slab_sysfs_ops,
+	.release = kmem_cache_release
+};
+
+static int uevent_filter(struct kset *kset, struct kobject *kobj)
+{
+	struct kobj_type *ktype = get_ktype(kobj);
+
+	if (ktype == &slab_ktype)
+		return 1;
+	return 0;
+}
+
+static struct kset_uevent_ops slab_uevent_ops = {
+	.filter = uevent_filter,
+};
+
+static struct kset *slab_kset;
+
+static int sysfs_available __read_mostly = 0;
+
+static int sysfs_slab_add(struct kmem_cache *s)
+{
+	int err;
+
+	if (!sysfs_available)
+		return 0;
+
+	s->kobj.kset = slab_kset;
+	err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", s->name);
+	if (err) {
+		kobject_put(&s->kobj);
+		return err;
+	}
+
+	err = sysfs_create_group(&s->kobj, &slab_attr_group);
+	if (err)
+		return err;
+	kobject_uevent(&s->kobj, KOBJ_ADD);
+
+	return 0;
+}
+
+static void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kobject_uevent(&s->kobj, KOBJ_REMOVE);
+	kobject_del(&s->kobj);
+	kobject_put(&s->kobj);
+}
+
+static int __init slab_sysfs_init(void)
+{
+	struct kmem_cache *s;
+	int err;
+
+	slab_kset = kset_create_and_add("slab", &slab_uevent_ops, kernel_kobj);
+	if (!slab_kset) {
+		printk(KERN_ERR "Cannot register slab subsystem.\n");
+		return -ENOSYS;
+	}
+
+	down_write(&slqb_lock);
+	sysfs_available = 1;
+	list_for_each_entry(s, &slab_caches, list) {
+		err = sysfs_slab_add(s);
+		if (err)
+			printk(KERN_ERR "SLQB: Unable to add boot slab %s"
+						" to sysfs\n", s->name);
+	}
+	up_write(&slqb_lock);
+
+	return 0;
+}
+
+__initcall(slab_sysfs_init);
+#endif
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -150,6 +150,8 @@ size_t ksize(const void *);
  */
 #ifdef CONFIG_SLUB
 #include <linux/slub_def.h>
+#elif defined(CONFIG_SLQB)
+#include <linux/slqb_def.h>
 #elif defined(CONFIG_SLOB)
 #include <linux/slob_def.h>
 #else
@@ -252,7 +254,7 @@ static inline void *kmem_cache_alloc_nod
  * allocator where we care about the real place the memory allocation
  * request comes from.
  */
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
 extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
 #define kmalloc_track_caller(size, flags) \
 	__kmalloc_track_caller(size, flags, _RET_IP_)
@@ -270,7 +272,7 @@ extern void *__kmalloc_track_caller(size
  * standard allocator where we care about the real place the memory
  * allocation request comes from.
  */
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
 extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long);
 #define kmalloc_node_track_caller(size, flags, node) \
 	__kmalloc_node_track_caller(size, flags, node, \
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
+obj-$(CONFIG_SLQB) += slqb.o
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
Index: linux-2.6/include/linux/rcu_types.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rcu_types.h
@@ -0,0 +1,18 @@
+#ifndef __LINUX_RCU_TYPES_H
+#define __LINUX_RCU_TYPES_H
+
+#ifdef __KERNEL__
+
+/**
+ * struct rcu_head - callback structure for use with RCU
+ * @next: next update requests in a list
+ * @func: actual update function to call after the grace period.
+ */
+struct rcu_head {
+	struct rcu_head *next;
+	void (*func)(struct rcu_head *head);
+};
+
+#endif
+
+#endif
Index: linux-2.6/arch/x86/include/asm/page.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/page.h
+++ linux-2.6/arch/x86/include/asm/page.h
@@ -194,6 +194,7 @@ static inline pteval_t native_pte_flags(
  * virt_addr_valid(kaddr) returns true.
  */
 #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
+#define virt_to_page_fast(kaddr) pfn_to_page(((unsigned long)(kaddr) - PAGE_OFFSET) >> PAGE_SHIFT)
 #define pfn_to_kaddr(pfn)      __va((pfn) << PAGE_SHIFT)
 extern bool __virt_addr_valid(unsigned long kaddr);
 #define virt_addr_valid(kaddr)	__virt_addr_valid((unsigned long) (kaddr))
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -306,7 +306,11 @@ static inline void get_page(struct page
 
 static inline struct page *virt_to_head_page(const void *x)
 {
+#ifdef virt_to_page_fast
+	struct page *page = virt_to_page_fast(x);
+#else
 	struct page *page = virt_to_page(x);
+#endif
 	return compound_head(page);
 }
 
Index: linux-2.6/Documentation/vm/slqbinfo.c
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/vm/slqbinfo.c
@@ -0,0 +1,1054 @@
+/*
+ * Slabinfo: Tool to get reports about slabs
+ *
+ * (C) 2007 sgi, Christoph Lameter
+ *
+ * Reworked by Lin Ming <ming.m.lin@intel.com> for SLQB
+ *
+ * Compile by:
+ *
+ * gcc -o slabinfo slabinfo.c
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <strings.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <getopt.h>
+#include <regex.h>
+#include <errno.h>
+
+#define MAX_SLABS 500
+#define MAX_ALIASES 500
+#define MAX_NODES 1024
+
+struct slabinfo {
+	char *name;
+	int align, cache_dma, destroy_by_rcu;
+	int hwcache_align, object_size, objs_per_slab;
+	int slab_size, store_user;
+	int order, poison, reclaim_account, red_zone;
+	int batch;
+	unsigned long objects, slabs, total_objects;
+	unsigned long alloc, alloc_slab_fill, alloc_slab_new;
+	unsigned long free, free_remote;
+	unsigned long claim_remote_list, claim_remote_list_objects;
+	unsigned long flush_free_list, flush_free_list_objects, flush_free_list_remote;
+	unsigned long flush_rfree_list, flush_rfree_list_objects;
+	unsigned long flush_slab_free, flush_slab_partial;
+	int numa[MAX_NODES];
+	int numa_partial[MAX_NODES];
+} slabinfo[MAX_SLABS];
+
+int slabs = 0;
+int actual_slabs = 0;
+int highest_node = 0;
+
+char buffer[4096];
+
+int show_empty = 0;
+int show_report = 0;
+int show_slab = 0;
+int skip_zero = 1;
+int show_numa = 0;
+int show_track = 0;
+int validate = 0;
+int shrink = 0;
+int show_inverted = 0;
+int show_totals = 0;
+int sort_size = 0;
+int sort_active = 0;
+int set_debug = 0;
+int show_ops = 0;
+int show_activity = 0;
+
+/* Debug options */
+int sanity = 0;
+int redzone = 0;
+int poison = 0;
+int tracking = 0;
+int tracing = 0;
+
+int page_size;
+
+regex_t pattern;
+
+void fatal(const char *x, ...)
+{
+	va_list ap;
+
+	va_start(ap, x);
+	vfprintf(stderr, x, ap);
+	va_end(ap);
+	exit(EXIT_FAILURE);
+}
+
+void usage(void)
+{
+	printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
+		"slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+		"-A|--activity          Most active slabs first\n"
+		"-d<options>|--debug=<options> Set/Clear Debug options\n"
+		"-D|--display-active    Switch line format to activity\n"
+		"-e|--empty             Show empty slabs\n"
+		"-h|--help              Show usage information\n"
+		"-i|--inverted          Inverted list\n"
+		"-l|--slabs             Show slabs\n"
+		"-n|--numa              Show NUMA information\n"
+		"-o|--ops		Show kmem_cache_ops\n"
+		"-s|--shrink            Shrink slabs\n"
+		"-r|--report		Detailed report on single slabs\n"
+		"-S|--Size              Sort by size\n"
+		"-t|--tracking          Show alloc/free information\n"
+		"-T|--Totals            Show summary information\n"
+		"-v|--validate          Validate slabs\n"
+		"-z|--zero              Include empty slabs\n"
+		"\nValid debug options (FZPUT may be combined)\n"
+		"a / A          Switch on all debug options (=FZUP)\n"
+		"-              Switch off all debug options\n"
+		"f / F          Sanity Checks (SLAB_DEBUG_FREE)\n"
+		"z / Z          Redzoning\n"
+		"p / P          Poisoning\n"
+		"u / U          Tracking\n"
+		"t / T          Tracing\n"
+	);
+}
+
+unsigned long read_obj(const char *name)
+{
+	FILE *f = fopen(name, "r");
+
+	if (!f)
+		buffer[0] = 0;
+	else {
+		if (!fgets(buffer, sizeof(buffer), f))
+			buffer[0] = 0;
+		fclose(f);
+		if (buffer[0] && buffer[strlen(buffer) - 1] == '\n')
+			buffer[strlen(buffer) - 1] = 0;
+	}
+	return strlen(buffer);
+}
+
+
+/*
+ * Get the contents of an attribute
+ */
+unsigned long get_obj(const char *name)
+{
+	if (!read_obj(name))
+		return 0;
+
+	return atol(buffer);
+}
+
+unsigned long get_obj_and_str(const char *name, char **x)
+{
+	unsigned long result = 0;
+	char *p;
+
+	*x = NULL;
+
+	if (!read_obj(name)) {
+		x = NULL;
+		return 0;
+	}
+	result = strtoul(buffer, &p, 10);
+	while (*p == ' ')
+		p++;
+	if (*p)
+		*x = strdup(p);
+	return result;
+}
+
+void set_obj(struct slabinfo *s, const char *name, int n)
+{
+	char x[100];
+	FILE *f;
+
+	snprintf(x, 100, "%s/%s", s->name, name);
+	f = fopen(x, "w");
+	if (!f)
+		fatal("Cannot write to %s\n", x);
+
+	fprintf(f, "%d\n", n);
+	fclose(f);
+}
+
+unsigned long read_slab_obj(struct slabinfo *s, const char *name)
+{
+	char x[100];
+	FILE *f;
+	size_t l;
+
+	snprintf(x, 100, "%s/%s", s->name, name);
+	f = fopen(x, "r");
+	if (!f) {
+		buffer[0] = 0;
+		l = 0;
+	} else {
+		l = fread(buffer, 1, sizeof(buffer), f);
+		buffer[l] = 0;
+		fclose(f);
+	}
+	return l;
+}
+
+
+/*
+ * Put a size string together
+ */
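+/* Eg. store_size(buf, 1500000) stores "1.5M": scale down, then splice in '.' */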
+int store_size(char *buffer, unsigned long value)
+{
+	unsigned long divisor = 1;
+	char trailer = 0;
+	int n;
+
+	if (value > 1000000000UL) {
+		divisor = 100000000UL;
+		trailer = 'G';
+	} else if (value > 1000000UL) {
+		divisor = 100000UL;
+		trailer = 'M';
+	} else if (value > 1000UL) {
+		divisor = 100;
+		trailer = 'K';
+	}
+
+	value /= divisor;
+	n = sprintf(buffer, "%ld",value);
+	if (trailer) {
+		buffer[n] = trailer;
+		n++;
+		buffer[n] = 0;
+	}
+	if (divisor != 1) {
+		memmove(buffer + n - 2, buffer + n - 3, 4);
+		buffer[n-2] = '.';
+		n++;
+	}
+	return n;
+}
+
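+/* Parse a per-node count string of the form "N0=123 N1=45 ..." into numa[]. */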
+void decode_numa_list(int *numa, char *t)
+{
+	int node;
+	int nr;
+
+	memset(numa, 0, MAX_NODES * sizeof(int));
+
+	if (!t)
+		return;
+
+	while (*t == 'N') {
+		t++;
+		node = strtoul(t, &t, 10);
+		if (*t == '=') {
+			t++;
+			nr = strtoul(t, &t, 10);
+			numa[node] = nr;
+			if (node > highest_node)
+				highest_node = node;
+		}
+		while (*t == ' ')
+			t++;
+	}
+}
+
+void slab_validate(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	set_obj(s, "validate", 1);
+}
+
+void slab_shrink(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	set_obj(s, "shrink", 1);
+}
+
+int line = 0;
+
+void first_line(void)
+{
+	if (show_activity)
+		printf("Name                   Objects      Alloc       Free   %%Fill %%New  "
+			"FlushR %%FlushR FlushR_Objs O\n");
+	else
+		printf("Name                   Objects Objsize    Space "
+			" O/S O %%Ef Batch Flg\n");
+}
+
+unsigned long slab_size(struct slabinfo *s)
+{
+	return 	s->slabs * (page_size << s->order);
+}
+
+unsigned long slab_activity(struct slabinfo *s)
+{
+	return 	s->alloc + s->free;
+}
+
+void slab_numa(struct slabinfo *s, int mode)
+{
+	int node;
+
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (!highest_node) {
+		printf("\n%s: No NUMA information available.\n", s->name);
+		return;
+	}
+
+	if (skip_zero && !s->slabs)
+		return;
+
+	if (!line) {
+		printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+		for(node = 0; node <= highest_node; node++)
+			printf(" %4d", node);
+		printf("\n----------------------");
+		for(node = 0; node <= highest_node; node++)
+			printf("-----");
+		printf("\n");
+	}
+	printf("%-21s ", mode ? "All slabs" : s->name);
+	for(node = 0; node <= highest_node; node++) {
+		char b[20];
+
+		store_size(b, s->numa[node]);
+		printf(" %4s", b);
+	}
+	printf("\n");
+	if (mode) {
+		printf("%-21s ", "Partial slabs");
+		for(node = 0; node <= highest_node; node++) {
+			char b[20];
+
+			store_size(b, s->numa_partial[node]);
+			printf(" %4s", b);
+		}
+		printf("\n");
+	}
+	line++;
+}
+
+void show_tracking(struct slabinfo *s)
+{
+	printf("\n%s: Kernel object allocation\n", s->name);
+	printf("-----------------------------------------------------------------------\n");
+	if (read_slab_obj(s, "alloc_calls"))
+		printf("%s", buffer);
+	else
+		printf("No Data\n");
+
+	printf("\n%s: Kernel object freeing\n", s->name);
+	printf("------------------------------------------------------------------------\n");
+	if (read_slab_obj(s, "free_calls"))
+		printf("%s", buffer);
+	else
+		printf("No Data\n");
+
+}
+
+void ops(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (read_slab_obj(s, "ops")) {
+		printf("\n%s: kmem_cache operations\n", s->name);
+		printf("--------------------------------------------\n");
+		printf("%s", buffer);
+	} else
+		printf("\n%s has no kmem_cache operations\n", s->name);
+}
+
+const char *onoff(int x)
+{
+	if (x)
+		return "On ";
+	return "Off";
+}
+
+void slab_stats(struct slabinfo *s)
+{
+	unsigned long total_alloc;
+	unsigned long total_free;
+	unsigned long total;
+
+	total_alloc = s->alloc;
+	total_free = s->free;
+
+	if (!total_alloc)
+		return;
+
+	printf("\n");
+	printf("Slab Perf Counter\n");
+	printf("------------------------------------------------------------------------\n");
+	printf("Alloc: %8lu, partial %8lu, page allocator %8lu\n",
+		total_alloc,
+		s->alloc_slab_fill, s->alloc_slab_new);
+	printf("Free:  %8lu, partial %8lu, page allocator %8lu, remote %5lu\n",
+		total_free,
+		s->flush_slab_partial,
+		s->flush_slab_free,
+		s->free_remote);
+	printf("Claim: %8lu, objects %8lu\n",
+		s->claim_remote_list,
+		s->claim_remote_list_objects);
+	printf("Flush: %8lu, objects %8lu, remote: %8lu\n",
+		s->flush_free_list,
+		s->flush_free_list_objects,
+		s->flush_free_list_remote);
+	printf("FlushR:%8lu, objects %8lu\n",
+		s->flush_rfree_list,
+		s->flush_rfree_list_objects);
+}
+
+void report(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	printf("\nSlabcache: %-20s  Order : %2d Objects: %lu\n",
+		s->name, s->order, s->objects);
+	if (s->hwcache_align)
+		printf("** Hardware cacheline aligned\n");
+	if (s->cache_dma)
+		printf("** Memory is allocated in a special DMA zone\n");
+	if (s->destroy_by_rcu)
+		printf("** Slabs are destroyed via RCU\n");
+	if (s->reclaim_account)
+		printf("** Reclaim accounting active\n");
+
+	printf("\nSizes (bytes)     Slabs              Debug                Memory\n");
+	printf("------------------------------------------------------------------------\n");
+	printf("Object : %7d  Total  : %7ld   Sanity Checks : %s  Total: %7ld\n",
+			s->object_size, s->slabs, "N/A",
+			s->slabs * (page_size << s->order));
+	printf("SlabObj: %7d  Full   : %7s   Redzoning     : %s  Used : %7ld\n",
+			s->slab_size, "N/A",
+			onoff(s->red_zone), s->objects * s->object_size);
+	printf("SlabSiz: %7d  Partial: %7s   Poisoning     : %s  Loss : %7ld\n",
+			page_size << s->order, "N/A", onoff(s->poison),
+			s->slabs * (page_size << s->order) - s->objects * s->object_size);
+	printf("Loss   : %7d  CpuSlab: %7s   Tracking      : %s  Lalig: %7ld\n",
+			s->slab_size - s->object_size, "N/A", onoff(s->store_user),
+			(s->slab_size - s->object_size) * s->objects);
+	printf("Align  : %7d  Objects: %7d   Tracing       : %s  Lpadd: %7ld\n",
+			s->align, s->objs_per_slab, "N/A",
+			((page_size << s->order) - s->objs_per_slab * s->slab_size) *
+			s->slabs);
+
+	ops(s);
+	show_tracking(s);
+	slab_numa(s, 1);
+	slab_stats(s);
+}
+
+void slabcache(struct slabinfo *s)
+{
+	char size_str[20];
+	char flags[20];
+	char *p = flags;
+
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (actual_slabs == 1) {
+		report(s);
+		return;
+	}
+
+	if (skip_zero && !show_empty && !s->slabs)
+		return;
+
+	if (show_empty && s->slabs)
+		return;
+
+	store_size(size_str, slab_size(s));
+
+	if (!line++)
+		first_line();
+
+	if (s->cache_dma)
+		*p++ = 'd';
+	if (s->hwcache_align)
+		*p++ = 'A';
+	if (s->poison)
+		*p++ = 'P';
+	if (s->reclaim_account)
+		*p++ = 'a';
+	if (s->red_zone)
+		*p++ = 'Z';
+	if (s->store_user)
+		*p++ = 'U';
+
+	*p = 0;
+	if (show_activity) {
+		unsigned long total_alloc;
+		unsigned long total_free;
+
+		total_alloc = s->alloc;
+		total_free = s->free;
+
+		printf("%-21s %8ld %10ld %10ld %5ld %5ld %7ld %5d %7ld %8d\n",
+			s->name, s->objects,
+			total_alloc, total_free,
+			total_alloc ? (s->alloc_slab_fill * 100 / total_alloc) : 0,
+			total_alloc ? (s->alloc_slab_new * 100 / total_alloc) : 0,
+			s->flush_rfree_list,
+			(total_alloc + total_free) ? (s->flush_rfree_list * 100 /
+				(total_alloc + total_free)) : 0,
+			s->flush_rfree_list_objects,
+			s->order);
+	}
+	else
+		printf("%-21s %8ld %7d %8s %4d %1d %3ld %4ld %s\n",
+			s->name, s->objects, s->object_size, size_str,
+			s->objs_per_slab, s->order,
+			s->slabs ? (s->objects * s->object_size * 100) /
+				(s->slabs * (page_size << s->order)) : 100,
+			s->batch, flags);
+}
+
+/*
+ * Analyze debug options. Return false if something is amiss.
+ */
+int debug_opt_scan(char *opt)
+{
+	if (!opt || !opt[0] || strcmp(opt, "-") == 0)
+		return 1;
+
+	if (strcasecmp(opt, "a") == 0) {
+		sanity = 1;
+		poison = 1;
+		redzone = 1;
+		tracking = 1;
+		return 1;
+	}
+
+	for ( ; *opt; opt++)
+	 	switch (*opt) {
+		case 'F' : case 'f':
+			if (sanity)
+				return 0;
+			sanity = 1;
+			break;
+		case 'P' : case 'p':
+			if (poison)
+				return 0;
+			poison = 1;
+			break;
+
+		case 'Z' : case 'z':
+			if (redzone)
+				return 0;
+			redzone = 1;
+			break;
+
+		case 'U' : case 'u':
+			if (tracking)
+				return 0;
+			tracking = 1;
+			break;
+
+		case 'T' : case 't':
+			if (tracing)
+				return 0;
+			tracing = 1;
+			break;
+		default:
+			return 0;
+		}
+	return 1;
+}
+
+int slab_empty(struct slabinfo *s)
+{
+	if (s->objects > 0)
+		return 0;
+
+	/*
+	 * We may still have slabs even if there are no objects. Shrinking will
+	 * remove them.
+	 */
+	if (s->slabs != 0)
+		set_obj(s, "shrink", 1);
+
+	return 1;
+}
+
+void slab_debug(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (redzone && !s->red_zone) {
+		if (slab_empty(s))
+			set_obj(s, "red_zone", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
+	}
+	if (!redzone && s->red_zone) {
+		if (slab_empty(s))
+			set_obj(s, "red_zone", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
+	}
+	if (poison && !s->poison) {
+		if (slab_empty(s))
+			set_obj(s, "poison", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
+	}
+	if (!poison && s->poison) {
+		if (slab_empty(s))
+			set_obj(s, "poison", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
+	}
+	if (tracking && !s->store_user) {
+		if (slab_empty(s))
+			set_obj(s, "store_user", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
+	}
+	if (!tracking && s->store_user) {
+		if (slab_empty(s))
+			set_obj(s, "store_user", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
+	}
+}
+
+void totals(void)
+{
+	struct slabinfo *s;
+
+	int used_slabs = 0;
+	char b1[20], b2[20], b3[20], b4[20];
+	unsigned long long max = 1ULL << 63;
+
+	/* Object size */
+	unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
+
+	/* Number of partial slabs in a slabcache */
+	unsigned long long min_partial = max, max_partial = 0,
+				avg_partial, total_partial = 0;
+
+	/* Number of slabs in a slab cache */
+	unsigned long long min_slabs = max, max_slabs = 0,
+				avg_slabs, total_slabs = 0;
+
+	/* Size of the whole slab */
+	unsigned long long min_size = max, max_size = 0,
+				avg_size, total_size = 0;
+
+	/* Bytes used for object storage in a slab */
+	unsigned long long min_used = max, max_used = 0,
+				avg_used, total_used = 0;
+
+	/* Waste: Bytes used for alignment and padding */
+	unsigned long long min_waste = max, max_waste = 0,
+				avg_waste, total_waste = 0;
+	/* Number of objects in a slab */
+	unsigned long long min_objects = max, max_objects = 0,
+				avg_objects, total_objects = 0;
+	/* Waste per object */
+	unsigned long long min_objwaste = max,
+				max_objwaste = 0, avg_objwaste,
+				total_objwaste = 0;
+
+	/* Memory per object */
+	unsigned long long min_memobj = max,
+				max_memobj = 0, avg_memobj,
+				total_objsize = 0;
+
+	for (s = slabinfo; s < slabinfo + slabs; s++) {
+		unsigned long long size;
+		unsigned long used;
+		unsigned long long wasted;
+		unsigned long long objwaste;
+
+		if (!s->slabs || !s->objects)
+			continue;
+
+		used_slabs++;
+
+		size = slab_size(s);
+		used = s->objects * s->object_size;
+		wasted = size - used;
+		objwaste = s->slab_size - s->object_size;
+
+		if (s->object_size < min_objsize)
+			min_objsize = s->object_size;
+		if (s->slabs < min_slabs)
+			min_slabs = s->slabs;
+		if (size < min_size)
+			min_size = size;
+		if (wasted < min_waste)
+			min_waste = wasted;
+		if (objwaste < min_objwaste)
+			min_objwaste = objwaste;
+		if (s->objects < min_objects)
+			min_objects = s->objects;
+		if (used < min_used)
+			min_used = used;
+		if (s->slab_size < min_memobj)
+			min_memobj = s->slab_size;
+
+		if (s->object_size > max_objsize)
+			max_objsize = s->object_size;
+		if (s->slabs > max_slabs)
+			max_slabs = s->slabs;
+		if (size > max_size)
+			max_size = size;
+		if (wasted > max_waste)
+			max_waste = wasted;
+		if (objwaste > max_objwaste)
+			max_objwaste = objwaste;
+		if (s->objects > max_objects)
+			max_objects = s->objects;
+		if (used > max_used)
+			max_used = used;
+		if (s->slab_size > max_memobj)
+			max_memobj = s->slab_size;
+
+		total_slabs += s->slabs;
+		total_size += size;
+		total_waste += wasted;
+
+		total_objects += s->objects;
+		total_used += used;
+
+		total_objwaste += s->objects * objwaste;
+		total_objsize += s->objects * s->slab_size;
+	}
+
+	if (!total_objects) {
+		printf("No objects\n");
+		return;
+	}
+	if (!used_slabs) {
+		printf("No slabs\n");
+		return;
+	}
+
+	/* Per slab averages */
+	avg_slabs = total_slabs / used_slabs;
+	avg_size = total_size / used_slabs;
+	avg_waste = total_waste / used_slabs;
+
+	avg_objects = total_objects / used_slabs;
+	avg_used = total_used / used_slabs;
+
+	/* Per object object sizes */
+	avg_objsize = total_used / total_objects;
+	avg_objwaste = total_objwaste / total_objects;
+	avg_memobj = total_objsize / total_objects;
+
+	printf("Slabcache Totals\n");
+	printf("----------------\n");
+	printf("Slabcaches : %3d      Active: %3d\n",
+			slabs, used_slabs);
+
+	store_size(b1, total_size);store_size(b2, total_waste);
+	store_size(b3, total_waste * 100 / total_used);
+	printf("Memory used: %6s   # Loss   : %6s   MRatio:%6s%%\n", b1, b2, b3);
+
+	store_size(b1, total_objects);
+	printf("# Objects  : %6s\n", b1);
+
+	printf("\n");
+	printf("Per Cache    Average         Min         Max       Total\n");
+	printf("---------------------------------------------------------\n");
+
+	store_size(b1, avg_objects);store_size(b2, min_objects);
+	store_size(b3, max_objects);store_size(b4, total_objects);
+	printf("#Objects  %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_slabs);store_size(b2, min_slabs);
+	store_size(b3, max_slabs);store_size(b4, total_slabs);
+	printf("#Slabs    %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_size);store_size(b2, min_size);
+	store_size(b3, max_size);store_size(b4, total_size);
+	printf("Memory    %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_used);store_size(b2, min_used);
+	store_size(b3, max_used);store_size(b4, total_used);
+	printf("Used      %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_waste);store_size(b2, min_waste);
+	store_size(b3, max_waste);store_size(b4, total_waste);
+	printf("Loss      %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	printf("\n");
+	printf("Per Object   Average         Min         Max\n");
+	printf("---------------------------------------------\n");
+
+	store_size(b1, avg_memobj);store_size(b2, min_memobj);
+	store_size(b3, max_memobj);
+	printf("Memory    %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+	store_size(b1, avg_objsize);store_size(b2, min_objsize);
+	store_size(b3, max_objsize);
+	printf("User      %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+
+	store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
+	store_size(b3, max_objwaste);
+	printf("Loss      %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+}
+
+void sort_slabs(void)
+{
+	struct slabinfo *s1,*s2;
+
+	for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
+		for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
+			int result;
+
+			if (sort_size)
+				result = slab_size(s1) < slab_size(s2);
+			else if (sort_active)
+				result = slab_activity(s1) < slab_activity(s2);
+			else
+				result = strcasecmp(s1->name, s2->name);
+
+			if (show_inverted)
+				result = -result;
+
+			if (result > 0) {
+				struct slabinfo t;
+
+				memcpy(&t, s1, sizeof(struct slabinfo));
+				memcpy(s1, s2, sizeof(struct slabinfo));
+				memcpy(s2, &t, sizeof(struct slabinfo));
+			}
+		}
+	}
+}
+
+int slab_mismatch(char *slab)
+{
+	return regexec(&pattern, slab, 0, NULL, 0);
+}
+
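+/*
+ * Scan /sys/kernel/slab (or the alternative /sys/slab location) and read
+ * each matching cache's attributes into the slabinfo[] array.
+ */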
+void read_slab_dir(void)
+{
+	DIR *dir;
+	struct dirent *de;
+	struct slabinfo *slab = slabinfo;
+	char *p;
+	char *t;
+	int count;
+
+	if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
+		fatal("SYSFS support for SLQB not active\n");
+
+	dir = opendir(".");
+	while ((de = readdir(dir))) {
+		if (de->d_name[0] == '.' ||
+			(de->d_name[0] != ':' && slab_mismatch(de->d_name)))
+				continue;
+		switch (de->d_type) {
+		   case DT_DIR:
+			if (chdir(de->d_name))
+				fatal("Unable to access slab %s\n", de->d_name);
+		   	slab->name = strdup(de->d_name);
+			slab->align = get_obj("align");
+			slab->cache_dma = get_obj("cache_dma");
+			slab->destroy_by_rcu = get_obj("destroy_by_rcu");
+			slab->hwcache_align = get_obj("hwcache_align");
+			slab->object_size = get_obj("object_size");
+			slab->objects = get_obj("objects");
+			slab->total_objects = get_obj("total_objects");
+			slab->objs_per_slab = get_obj("objs_per_slab");
+			slab->order = get_obj("order");
+			slab->poison = get_obj("poison");
+			slab->reclaim_account = get_obj("reclaim_account");
+			slab->red_zone = get_obj("red_zone");
+			slab->slab_size = get_obj("slab_size");
+			slab->slabs = get_obj_and_str("slabs", &t);
+			decode_numa_list(slab->numa, t);
+			free(t);
+			slab->store_user = get_obj("store_user");
+			slab->batch = get_obj("batch");
+			slab->alloc = get_obj("alloc");
+			slab->alloc_slab_fill = get_obj("alloc_slab_fill");
+			slab->alloc_slab_new = get_obj("alloc_slab_new");
+			slab->free = get_obj("free");
+			slab->free_remote = get_obj("free_remote");
+			slab->claim_remote_list = get_obj("claim_remote_list");
+			slab->claim_remote_list_objects = get_obj("claim_remote_list_objects");
+			slab->flush_free_list = get_obj("flush_free_list");
+			slab->flush_free_list_objects = get_obj("flush_free_list_objects");
+			slab->flush_free_list_remote = get_obj("flush_free_list_remote");
+			slab->flush_rfree_list = get_obj("flush_rfree_list");
+			slab->flush_rfree_list_objects = get_obj("flush_rfree_list_objects");
+			slab->flush_slab_free = get_obj("flush_slab_free");
+			slab->flush_slab_partial = get_obj("flush_slab_partial");
+			
+			chdir("..");
+			slab++;
+			break;
+		   default :
+			fatal("Unknown file type %x\n", de->d_type);
+		}
+	}
+	closedir(dir);
+	slabs = slab - slabinfo;
+	actual_slabs = slabs;
+	if (slabs > MAX_SLABS)
+		fatal("Too many slabs\n");
+}
+
+void output_slabs(void)
+{
+	struct slabinfo *slab;
+
+	for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
+
+		if (show_numa)
+			slab_numa(slab, 0);
+		else if (show_track)
+			show_tracking(slab);
+		else if (validate)
+			slab_validate(slab);
+		else if (shrink)
+			slab_shrink(slab);
+		else if (set_debug)
+			slab_debug(slab);
+		else if (show_ops)
+			ops(slab);
+		else if (show_slab)
+			slabcache(slab);
+		else if (show_report)
+			report(slab);
+	}
+}
+
+struct option opts[] = {
+	{ "activity", 0, NULL, 'A' },
+	{ "debug", 2, NULL, 'd' },
+	{ "display-activity", 0, NULL, 'D' },
+	{ "empty", 0, NULL, 'e' },
+	{ "help", 0, NULL, 'h' },
+	{ "inverted", 0, NULL, 'i'},
+	{ "numa", 0, NULL, 'n' },
+	{ "ops", 0, NULL, 'o' },
+	{ "report", 0, NULL, 'r' },
+	{ "shrink", 0, NULL, 's' },
+	{ "slabs", 0, NULL, 'l' },
+	{ "track", 0, NULL, 't'},
+	{ "validate", 0, NULL, 'v' },
+	{ "zero", 0, NULL, 'z' },
+	{ "1ref", 0, NULL, '1'},
+	{ NULL, 0, NULL, 0 }
+};
+
+int main(int argc, char *argv[])
+{
+	int c;
+	int err;
+	char *pattern_source;
+
+	page_size = getpagesize();
+
+	while ((c = getopt_long(argc, argv, "Ad::Dehil1noprstvzTS",
+						opts, NULL)) != -1)
+		switch (c) {
+		case 'A':
+			sort_active = 1;
+			break;
+		case 'd':
+			set_debug = 1;
+			if (!debug_opt_scan(optarg))
+				fatal("Invalid debug option '%s'\n", optarg);
+			break;
+		case 'D':
+			show_activity = 1;
+			break;
+		case 'e':
+			show_empty = 1;
+			break;
+		case 'h':
+			usage();
+			return 0;
+		case 'i':
+			show_inverted = 1;
+			break;
+		case 'n':
+			show_numa = 1;
+			break;
+		case 'o':
+			show_ops = 1;
+			break;
+		case 'r':
+			show_report = 1;
+			break;
+		case 's':
+			shrink = 1;
+			break;
+		case 'l':
+			show_slab = 1;
+			break;
+		case 't':
+			show_track = 1;
+			break;
+		case 'v':
+			validate = 1;
+			break;
+		case 'z':
+			skip_zero = 0;
+			break;
+		case 'T':
+			show_totals = 1;
+			break;
+		case 'S':
+			sort_size = 1;
+			break;
+
+		default:
+			fatal("%s: Invalid option '%c'\n", argv[0], optopt);
+
+	}
+
+	if (!show_slab && !show_track && !show_report
+		&& !validate && !shrink && !set_debug && !show_ops)
+			show_slab = 1;
+
+	if (argc > optind)
+		pattern_source = argv[optind];
+	else
+		pattern_source = ".*";
+
+	err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
+	if (err)
+		fatal("%s: Invalid pattern '%s' code %d\n",
+			argv[0], pattern_source, err);
+	read_slab_dir();
+	if (show_totals)
+		totals();
+	else {
+		sort_slabs();
+		output_slabs();
+	}
+	return 0;
+}

^ permalink raw reply	[flat|nested] 197+ messages in thread

* [patch] SLQB slab allocator
@ 2009-01-14  9:04 ` Nick Piggin
  0 siblings, 0 replies; 197+ messages in thread
From: Nick Piggin @ 2009-01-14  9:04 UTC (permalink / raw)
  To: Zhang, Yanmin, Lin Ming, Christoph Lameter, Pekka Enberg,
	linux-mm, linux-kernel, Andrew Morton

Hi,

This is the latest SLQB patch. Since last time, we have imported the sysfs
framework from SLUB, and added specific event counters things for SLQB. I
had initially been somewhat against this because it makes SLQB depend on
another complex subsystem (which itself depends back on the slab allocator).
But I guess it is not fundamentally different than /proc, and there needs to
be some reporting somewhere. The individual per-slab counters really do make
performance analysis much easier. There is a Documentation/vm/slqbinfo.c
file, which is a parser adapted from slabinfo.c for SLUB.
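
The counters are plain sysfs attributes, so they can also be read without
the parser. A minimal sketch (assuming CONFIG_SLQB_SYSFS and CONFIG_SLQB_STATS
are enabled; "kmalloc-64" is only an example cache name):

	#include <stdio.h>

	int main(void)
	{
		char buf[64];
		FILE *f = fopen("/sys/kernel/slab/kmalloc-64/alloc", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		if (fgets(buf, sizeof(buf), f))
			printf("kmalloc-64 allocations: %s", buf);
		fclose(f);
		return 0;
	}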

Fixed some bugs, including a nasty one that was causing remote objects to
sneak onto local freelist, which would mean NUMA allocation was basically
broken.

The NUMA side of things is now much more complete. NUMA policies are obeyed.
There is still a known bug where it won't run on a system with CPU-only
nodes.

CONFIG options are improved.

Credit to some of the engineers at Intel for helping run tests, contributing
ideas and patches to improve performance and fix bugs.

I think it is getting to the point where it is stable and featureful. It
really needs to be further proven in the performance area. We'd welcome
any performance results or suggestions for tests to run.

After this round of review/feedback, I plan to set about getting SLQB merged.
---
Index: linux-2.6/include/linux/rcupdate.h
===================================================================
--- linux-2.6.orig/include/linux/rcupdate.h
+++ linux-2.6/include/linux/rcupdate.h
@@ -33,6 +33,7 @@
 #ifndef __LINUX_RCUPDATE_H
 #define __LINUX_RCUPDATE_H
 
+#include <linux/rcu_types.h>
 #include <linux/cache.h>
 #include <linux/spinlock.h>
 #include <linux/threads.h>
@@ -42,16 +43,6 @@
 #include <linux/lockdep.h>
 #include <linux/completion.h>
 
-/**
- * struct rcu_head - callback structure for use with RCU
- * @next: next update requests in a list
- * @func: actual update function to call after the grace period.
- */
-struct rcu_head {
-	struct rcu_head *next;
-	void (*func)(struct rcu_head *head);
-};
-
 #if defined(CONFIG_CLASSIC_RCU)
 #include <linux/rcuclassic.h>
 #elif defined(CONFIG_TREE_RCU)
Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/slqb_def.h
@@ -0,0 +1,283 @@
+#ifndef _LINUX_SLQB_DEF_H
+#define _LINUX_SLQB_DEF_H
+
+/*
+ * SLQB : A slab allocator with object queues.
+ *
+ * (C) 2008 Nick Piggin <npiggin@suse.de>
+ */
+#include <linux/types.h>
+#include <linux/gfp.h>
+#include <linux/workqueue.h>
+#include <linux/kobject.h>
+#include <linux/rcu_types.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+
+enum stat_item {
+	ALLOC,			/* Allocation count */
+	ALLOC_SLAB_FILL,	/* Fill freelist from page list */
+	ALLOC_SLAB_NEW,		/* New slab acquired from page allocator */
+	FREE,			/* Free count */
+	FREE_REMOTE,		/* NUMA: freeing to remote list */
+	FLUSH_FREE_LIST,	/* Freelist flushed */
+	FLUSH_FREE_LIST_OBJECTS, /* Objects flushed from freelist */
+	FLUSH_FREE_LIST_REMOTE,	/* Objects flushed from freelist to remote */
+	FLUSH_SLAB_PARTIAL,	/* Freeing moves slab to partial list */
+	FLUSH_SLAB_FREE,	/* Slab freed to the page allocator */
+	FLUSH_RFREE_LIST,	/* Rfree list flushed */
+	FLUSH_RFREE_LIST_OBJECTS, /* Rfree objects flushed */
+	CLAIM_REMOTE_LIST,	/* Remote freed list claimed */
+	CLAIM_REMOTE_LIST_OBJECTS, /* Remote freed objects claimed */
+	NR_SLQB_STAT_ITEMS
+};
+
+/*
+ * Singly-linked list with head, tail, and nr
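+ * of free objects. Objects are chained through a pointer stored in the free
+ * object itself (at kmem_cache->offset); see get_freepointer() and
+ * set_freepointer() in mm/slqb.c.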
+ */
+struct kmlist {
+	unsigned long nr;
+	void **head, **tail;
+};
+
+/*
+ * Every kmem_cache_list has a kmem_cache_remote_free structure, by which
+ * objects can be returned to the kmem_cache_list from remote CPUs.
+ */
+struct kmem_cache_remote_free {
+	spinlock_t lock;
+	struct kmlist list;
+} ____cacheline_aligned;
+
+/*
+ * A kmem_cache_list manages all the slabs and objects allocated from a given
+ * source. Per-cpu kmem_cache_lists allow node-local allocations. Per-node
+ * kmem_cache_lists allow off-node allocations (but require locking).
+ */
+struct kmem_cache_list {
+	struct kmlist freelist;	/* Fastpath LIFO freelist of objects */
+#ifdef CONFIG_SMP
+	int remote_free_check;	/* remote_free has reached a watermark */
+#endif
+	struct kmem_cache *cache; /* kmem_cache corresponding to this list */
+
+	unsigned long nr_partial; /* Number of partial slabs (pages) */
+	struct list_head partial; /* Slabs which have some free objects */
+
+	unsigned long nr_slabs;	/* Total number of slabs allocated */
+
+	//struct list_head full;
+
+#ifdef CONFIG_SMP
+	/*
+	 * In the case of per-cpu lists, remote_free is for objects freed by
+	 * non-owner CPU back to its home list. For per-node lists, remote_free
+	 * is always used to free objects.
+	 */
+	struct kmem_cache_remote_free remote_free;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+	unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Primary per-cpu, per-kmem_cache structure.
+ */
+struct kmem_cache_cpu {
+	struct kmem_cache_list list; /* List for node-local slabs. */
+
+	unsigned int colour_next;
+
+#ifdef CONFIG_SMP
+	/*
+	 * rlist is a list of objects that don't fit on list.freelist (ie.
+	 * wrong node). The objects all correspond to a given kmem_cache_list,
+	 * remote_cache_list. To free objects to another list, we must first
+	 * flush the existing objects, then switch remote_cache_list.
+	 *
+	 * An NR_CPUS or MAX_NUMNODES array would be nice here, but then we
+	 * would end up in an O(NR_CPUS^2) memory consumption situation.
+	 */
+	struct kmlist rlist;
+	struct kmem_cache_list *remote_cache_list;
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Per-node, per-kmem_cache structure.
+ */
+struct kmem_cache_node {
+	struct kmem_cache_list list;
+	spinlock_t list_lock; /* protects access to list */
+} ____cacheline_aligned;
+
+/*
+ * Management object for a slab cache.
+ */
+struct kmem_cache {
+	unsigned long flags;
+	int hiwater;		/* LIFO list high watermark */
+	int freebatch;		/* LIFO freelist batch flush size */
+	int objsize;		/* The size of an object without meta data */
+	int offset;		/* Free pointer offset. */
+	int objects;		/* Number of objects in slab */
+
+	int size;		/* The size of an object including meta data */
+	int order;		/* Allocation order */
+	gfp_t allocflags;	/* gfp flags to use on allocation */
+	unsigned int colour_range;	/* range of colour counter */
+	unsigned int colour_off;		/* offset per colour */
+	void (*ctor)(void *);
+
+	const char *name;	/* Name (only for display!) */
+	struct list_head list;	/* List of slab caches */
+
+	int align;		/* Alignment */
+	int inuse;		/* Offset to metadata */
+
+#ifdef CONFIG_SLQB_SYSFS
+	struct kobject kobj;	/* For sysfs */
+#endif
+#ifdef CONFIG_NUMA
+	struct kmem_cache_node *node[MAX_NUMNODES];
+#endif
+#ifdef CONFIG_SMP
+	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
+#else
+	struct kmem_cache_cpu cpu_slab;
+#endif
+};
+
+/*
+ * Kmalloc subsystem.
+ */
+#if defined(ARCH_KMALLOC_MINALIGN) && ARCH_KMALLOC_MINALIGN > 8
+#define KMALLOC_MIN_SIZE ARCH_KMALLOC_MINALIGN
+#else
+#define KMALLOC_MIN_SIZE 8
+#endif
+
+#define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
+#define KMALLOC_SHIFT_SLQB_HIGH (PAGE_SHIFT + 9)
+
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1];
+extern struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1];
+
+/*
+ * Constant size allocations use this path to find index into kmalloc caches
+ * arrays. get_slab() function is used for non-constant sizes.
+ */
+static __always_inline int kmalloc_index(size_t size)
+{
+	if (unlikely(!size))
+		return 0;
+	if (unlikely(size > 1UL << KMALLOC_SHIFT_SLQB_HIGH))
+		return 0;
+
+	if (unlikely(size <= KMALLOC_MIN_SIZE))
+		return KMALLOC_SHIFT_LOW;
+
+#if L1_CACHE_BYTES < 64
+	if (size > 64 && size <= 96)
+		return 1;
+#endif
+#if L1_CACHE_BYTES < 128
+	if (size > 128 && size <= 192)
+		return 2;
+#endif
+	if (size <=	  8) return 3;
+	if (size <=	 16) return 4;
+	if (size <=	 32) return 5;
+	if (size <=	 64) return 6;
+	if (size <=	128) return 7;
+	if (size <=	256) return 8;
+	if (size <=	512) return 9;
+	if (size <=       1024) return 10;
+	if (size <=   2 * 1024) return 11;
+	if (size <=   4 * 1024) return 12;
+	if (size <=   8 * 1024) return 13;
+	if (size <=  16 * 1024) return 14;
+	if (size <=  32 * 1024) return 15;
+	if (size <=  64 * 1024) return 16;
+	if (size <= 128 * 1024) return 17;
+	if (size <= 256 * 1024) return 18;
+	if (size <= 512 * 1024) return 19;
+	if (size <= 1024 * 1024) return 20;
+	if (size <=  2 * 1024 * 1024) return 21;
+	return -1;
+}
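+
+/*
+ * For example, kmalloc_index(100) falls through to the "size <= 128" test and
+ * returns 7, so a 100-byte request is served from the 128-byte kmalloc cache;
+ * kmalloc_index(200) returns 8 (256 bytes).
+ */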
+
+#ifdef CONFIG_ZONE_DMA
+#define SLQB_DMA __GFP_DMA
+#else
+/* Disable "DMA slabs" */
+#define SLQB_DMA (__force gfp_t)0
+#endif
+
+/*
+ * Find the kmalloc slab cache for a given combination of allocation flags and
+ * size.
+ */
+static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
+{
+	int index = kmalloc_index(size);
+
+	if (unlikely(index == 0))
+		return NULL;
+
+	if (likely(!(flags & SLQB_DMA)))
+		return &kmalloc_caches[index];
+	else
+		return &kmalloc_caches_dma[index];
+}
+
+void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
+void *__kmalloc(size_t size, gfp_t flags);
+
+#ifndef ARCH_KMALLOC_MINALIGN
+#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#ifndef ARCH_SLAB_MINALIGN
+#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? sizeof(void *) : ARCH_KMALLOC_MINALIGN)
+
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
+{
+	if (__builtin_constant_p(size)) {
+		struct kmem_cache *s;
+
+		s = kmalloc_slab(size, flags);
+		if (unlikely(ZERO_OR_NULL_PTR(s)))
+			return s;
+
+		return kmem_cache_alloc(s, flags);
+	}
+	return __kmalloc(size, flags);
+}
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node);
+void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
+
+static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+{
+	if (__builtin_constant_p(size)) {
+		struct kmem_cache *s;
+
+		s = kmalloc_slab(size, flags);
+		if (unlikely(ZERO_OR_NULL_PTR(s)))
+			return s;
+
+		return kmem_cache_alloc_node(s, flags, node);
+	}
+	return __kmalloc_node(size, flags, node);
+}
+#endif
+
+#endif /* _LINUX_SLQB_DEF_H */
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -828,6 +828,9 @@ config SLUB
 	   and has enhanced diagnostics. SLUB is the default choice for
 	   a slab allocator.
 
+config SLQB
+	bool "SLQB (Queued allocator)"
+
 config SLOB
 	depends on EMBEDDED
 	bool "SLOB (Simple Allocator)"
@@ -869,7 +872,7 @@ config HAVE_GENERIC_DMA_COHERENT
 config SLABINFO
 	bool
 	depends on PROC_FS
-	depends on SLAB || SLUB_DEBUG
+	depends on SLAB || SLUB_DEBUG || SLQB
 	default y
 
 config RT_MUTEXES
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug
+++ linux-2.6/lib/Kconfig.debug
@@ -298,6 +298,26 @@ config SLUB_STATS
 	  out which slabs are relevant to a particular load.
 	  Try running: slabinfo -DA
 
+config SLQB_DEBUG
+	default y
+	bool "Enable SLQB debugging support"
+	depends on SLQB
+
+config SLQB_DEBUG_ON
+	default n
+	bool "SLQB debugging on by default"
+	depends on SLQB_DEBUG
+
+config SLQB_SYSFS
+	bool "Create SYSFS entries for slab caches"
+	default n
+	depends on SLQB
+
+config SLQB_STATS
+	bool "Enable SLQB performance statistics"
+	default n
+	depends on SLQB_SYSFS
+
 config DEBUG_PREEMPT
 	bool "Debug preemptible kernel"
 	depends on DEBUG_KERNEL && PREEMPT && (TRACE_IRQFLAGS_SUPPORT || PPC64)
Index: linux-2.6/mm/slqb.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/slqb.c
@@ -0,0 +1,3368 @@
+/*
+ * SLQB: A slab allocator that focuses on per-CPU scaling, and good performance
+ * with order-0 allocations. Fastpath emphasis is placed on local allocation
+ * and freeing, but with a secondary goal of good remote freeing (freeing on
+ * another CPU from that which allocated).
+ *
+ * Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/bit_spinlock.h>
+#include <linux/interrupt.h>
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/ctype.h>
+#include <linux/kallsyms.h>
+#include <linux/memory.h>
+
+static inline int slab_hiwater(struct kmem_cache *s)
+{
+	return s->hiwater;
+}
+
+static inline int slab_freebatch(struct kmem_cache *s)
+{
+	return s->freebatch;
+}
+
+/*
+ * slqb_page overloads struct page, and is used to manage some slab allocation
+ * aspects. However, to avoid the horrible mess in include/linux/mm_types.h,
+ * we'll just define our own struct slqb_page type variant here.
+ */
+struct slqb_page {
+	union {
+		struct {
+			unsigned long flags;	/* mandatory */
+			atomic_t _count;	/* mandatory */
+			unsigned int inuse;	/* Nr of objects */
+		   	struct kmem_cache_list *list; /* Pointer to list */
+			void **freelist;	/* freelist req. slab lock */
+			union {
+				struct list_head lru; /* misc. list */
+				struct rcu_head rcu_head; /* for rcu freeing */
+			};
+		};
+		struct page page;
+	};
+};
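+/* Compile-time check that the overlay really matches struct page's size */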
+static inline void struct_slqb_page_wrong_size(void)
+{ BUILD_BUG_ON(sizeof(struct slqb_page) != sizeof(struct page)); }
+
+#define PG_SLQB_BIT (1 << PG_slab)
+
+static int kmem_size __read_mostly;
+#ifdef CONFIG_NUMA
+static int numa_platform __read_mostly;
+#else
+#define numa_platform 0
+#endif
+
+/*
+ * Lock order:
+ * kmem_cache_node->list_lock
+ *   kmem_cache_remote_free->lock
+ *
+ * Data structures:
+ * SLQB is primarily per-cpu. For each kmem_cache, each CPU has:
+ *
+ * - A LIFO list of node-local objects. Allocation and freeing of node local
+ *   objects goes first to this list.
+ *
+ * - 2 Lists of slab pages, free and partial pages. If an allocation misses
+ *   the object list, it tries from the partial list, then the free list.
+ *   After freeing an object to the object list, if it is over a watermark,
+ *   some objects are freed back to pages. If an allocation misses these lists,
+ *   a new slab page is allocated from the page allocator. If the free list
+ *   reaches a watermark, some of its pages are returned to the page allocator.
+ *
+ * - A remote free queue, where objects freed that did not come from the local
+ *   node are queued to. When this reaches a watermark, the objects are
+ *   flushed.
+ *
+ * - A remotely freed queue, where objects allocated from this CPU are flushed
+ *   to from other CPUs' remote free queues. kmem_cache_remote_free->lock is
+ *   used to protect access to this queue.
+ *
+ *   When the remotely freed queue reaches a watermark, a flag is set to tell
+ *   the owner CPU to check it. The owner CPU will then check the queue on the
+ *   next allocation that misses the object list. It will move all objects from
+ *   this list onto the object list and then allocate one.
+ *
+ *   This system of remote queueing is intended to reduce lock and remote
+ *   cacheline acquisitions, and give a cooling off period for remotely freed
+ *   objects before they are re-allocated.
+ *
+ * node specific allocations from somewhere other than the local node are
+ * handled by a per-node list which is the same as the above per-CPU data
+ * structures except for the following differences:
+ *
+ * - kmem_cache_node->list_lock is used to protect access for multiple CPUs to
+ *   allocate from a given node.
+ *
+ * - There is no remote free queue. Nodes don't free objects, CPUs do.
+ */
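+
+/*
+ * None of this queueing is visible to callers; the cache is used through the
+ * usual slab interface, for example (struct my_obj and "my_cache" are just
+ * placeholder names):
+ *
+ *	cachep = kmem_cache_create("my_cache", sizeof(struct my_obj),
+ *				   0, SLAB_HWCACHE_ALIGN, NULL);
+ *	obj = kmem_cache_alloc(cachep, GFP_KERNEL);
+ *	...
+ *	kmem_cache_free(cachep, obj);
+ */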
+
+static inline void slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
+{
+#ifdef CONFIG_SLQB_STATS
+	list->stats[si]++;
+#endif
+}
+
+static inline void slqb_stat_add(struct kmem_cache_list *list, enum stat_item si,
+					unsigned long nr)
+{
+#ifdef CONFIG_SLQB_STATS
+	list->stats[si] += nr;
+#endif
+}
+
+static inline int slqb_page_to_nid(struct slqb_page *page)
+{
+	return page_to_nid(&page->page);
+}
+
+static inline void *slqb_page_address(struct slqb_page *page)
+{
+	return page_address(&page->page);
+}
+
+static inline struct zone *slqb_page_zone(struct slqb_page *page)
+{
+	return page_zone(&page->page);
+}
+
+static inline int virt_to_nid(const void *addr)
+{
+#ifdef virt_to_page_fast
+	return page_to_nid(virt_to_page_fast(addr));
+#else
+	return page_to_nid(virt_to_page(addr));
+#endif
+}
+
+static inline struct slqb_page *virt_to_head_slqb_page(const void *addr)
+{
+	struct page *p;
+
+	p = virt_to_head_page(addr);
+	return (struct slqb_page *)p;
+}
+
+static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
+						unsigned int order)
+{
+	struct page *p;
+
+	if (nid == -1)
+		p = alloc_pages(flags, order);
+	else
+		p = alloc_pages_node(nid, flags, order);
+
+	return (struct slqb_page *)p;
+}
+
+static inline void __free_slqb_pages(struct slqb_page *page, unsigned int order)
+{
+	struct page *p = &page->page;
+
+	reset_page_mapcount(p);
+	p->mapping = NULL;
+	VM_BUG_ON(!(p->flags & PG_SLQB_BIT));
+	p->flags &= ~PG_SLQB_BIT;
+
+	__free_pages(p, order);
+}
+
+#ifdef CONFIG_SLQB_DEBUG
+static inline int slab_debug(struct kmem_cache *s)
+{
+	return (s->flags &
+			(SLAB_DEBUG_FREE |
+			 SLAB_RED_ZONE |
+			 SLAB_POISON |
+			 SLAB_STORE_USER |
+			 SLAB_TRACE));
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+	return s->flags & SLAB_POISON;
+}
+#else
+static inline int slab_debug(struct kmem_cache *s)
+{
+	return 0;
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+	return 0;
+}
+#endif
+
+#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
+				SLAB_POISON | SLAB_STORE_USER)
+
+/* Internal SLQB flags */
+#define __OBJECT_POISON		0x80000000 /* Poison object */
+
+/* Not all arches define cache_line_size */
+#ifndef cache_line_size
+#define cache_line_size()	L1_CACHE_BYTES
+#endif
+
+#ifdef CONFIG_SMP
+static struct notifier_block slab_notifier;
+#endif
+
+/* A list of all slab caches on the system */
+static DECLARE_RWSEM(slqb_lock);
+static LIST_HEAD(slab_caches);
+
+/*
+ * Tracking user of a slab.
+ */
+struct track {
+	void *addr;		/* Called from address */
+	int cpu;		/* Was running on cpu */
+	int pid;		/* Pid context */
+	unsigned long when;	/* When did the operation occur */
+};
+
+enum track_item { TRACK_ALLOC, TRACK_FREE };
+
+static struct kmem_cache kmem_cache_cache;
+
+#ifdef CONFIG_SLQB_SYSFS
+static int sysfs_slab_add(struct kmem_cache *s);
+static void sysfs_slab_remove(struct kmem_cache *s);
+#else
+static inline int sysfs_slab_add(struct kmem_cache *s)
+{
+	return 0;
+}
+static inline void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kmem_cache_free(&kmem_cache_cache, s);
+}
+#endif
+
+/********************************************************************
+ * 			Core slab cache functions
+ *******************************************************************/
+
+static int __slab_is_available __read_mostly;
+int slab_is_available(void)
+{
+	return __slab_is_available;
+}
+
+static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
+{
+#ifdef CONFIG_SMP
+	VM_BUG_ON(!s->cpu_slab[cpu]);
+	return s->cpu_slab[cpu];
+#else
+	return &s->cpu_slab;
+#endif
+}
+
+static inline int check_valid_pointer(struct kmem_cache *s,
+				struct slqb_page *page, const void *object)
+{
+	void *base;
+
+	base = slqb_page_address(page);
+	if (object < base || object >= base + s->objects * s->size ||
+		(object - base) % s->size) {
+		return 0;
+	}
+
+	return 1;
+}
+
+static inline void *get_freepointer(struct kmem_cache *s, void *object)
+{
+	return *(void **)(object + s->offset);
+}
+
+static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
+{
+	*(void **)(object + s->offset) = fp;
+}
+
+/* Loop over all objects in a slab */
+#define for_each_object(__p, __s, __addr) \
+	for (__p = (__addr); __p < (__addr) + (__s)->objects * (__s)->size;\
+			__p += (__s)->size)
+
+/* Scan freelist */
+#define for_each_free_object(__p, __s, __free) \
+	for (__p = (__free); (__p) != NULL; __p = get_freepointer((__s),\
+		__p))
+
+#ifdef CONFIG_SLQB_DEBUG
+/*
+ * Debug settings:
+ */
+#ifdef CONFIG_SLQB_DEBUG_ON
+static int slqb_debug __read_mostly = DEBUG_DEFAULT_FLAGS;
+#else
+static int slqb_debug __read_mostly;
+#endif
+
+static char *slqb_debug_slabs;
+
+/*
+ * Object debugging
+ */
+static void print_section(char *text, u8 *addr, unsigned int length)
+{
+	int i, offset;
+	int newline = 1;
+	char ascii[17];
+
+	ascii[16] = 0;
+
+	for (i = 0; i < length; i++) {
+		if (newline) {
+			printk(KERN_ERR "%8s 0x%p: ", text, addr + i);
+			newline = 0;
+		}
+		printk(KERN_CONT " %02x", addr[i]);
+		offset = i % 16;
+		ascii[offset] = isgraph(addr[i]) ? addr[i] : '.';
+		if (offset == 15) {
+			printk(KERN_CONT " %s\n", ascii);
+			newline = 1;
+		}
+	}
+	if (!newline) {
+		i %= 16;
+		while (i < 16) {
+			printk(KERN_CONT "   ");
+			ascii[i] = ' ';
+			i++;
+		}
+		printk(KERN_CONT " %s\n", ascii);
+	}
+}
+
+static struct track *get_track(struct kmem_cache *s, void *object,
+	enum track_item alloc)
+{
+	struct track *p;
+
+	if (s->offset)
+		p = object + s->offset + sizeof(void *);
+	else
+		p = object + s->inuse;
+
+	return p + alloc;
+}
+
+static void set_track(struct kmem_cache *s, void *object,
+				enum track_item alloc, void *addr)
+{
+	struct track *p;
+
+	if (s->offset)
+		p = object + s->offset + sizeof(void *);
+	else
+		p = object + s->inuse;
+
+	p += alloc;
+	if (addr) {
+		p->addr = addr;
+		p->cpu = raw_smp_processor_id();
+		p->pid = current ? current->pid : -1;
+		p->when = jiffies;
+	} else
+		memset(p, 0, sizeof(struct track));
+}
+
+static void init_tracking(struct kmem_cache *s, void *object)
+{
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
+	set_track(s, object, TRACK_FREE, NULL);
+	set_track(s, object, TRACK_ALLOC, NULL);
+}
+
+static void print_track(const char *s, struct track *t)
+{
+	if (!t->addr)
+		return;
+
+	printk(KERN_ERR "INFO: %s in ", s);
+	__print_symbol("%s", (unsigned long)t->addr);
+	printk(" age=%lu cpu=%u pid=%d\n", jiffies - t->when, t->cpu, t->pid);
+}
+
+static void print_tracking(struct kmem_cache *s, void *object)
+{
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
+	print_track("Allocated", get_track(s, object, TRACK_ALLOC));
+	print_track("Freed", get_track(s, object, TRACK_FREE));
+}
+
+static void print_page_info(struct slqb_page *page)
+{
+	printk(KERN_ERR "INFO: Slab 0x%p used=%u fp=0x%p flags=0x%04lx\n",
+		page, page->inuse, page->freelist, page->flags);
+
+}
+
+static void slab_bug(struct kmem_cache *s, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printk(KERN_ERR "========================================"
+			"=====================================\n");
+	printk(KERN_ERR "BUG %s: %s\n", s->name, buf);
+	printk(KERN_ERR "----------------------------------------"
+			"-------------------------------------\n\n");
+}
+
+static void slab_fix(struct kmem_cache *s, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printk(KERN_ERR "FIX %s: %s\n", s->name, buf);
+}
+
+static void print_trailer(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+	unsigned int off;	/* Offset of last byte */
+	u8 *addr = slqb_page_address(page);
+
+	print_tracking(s, p);
+
+	print_page_info(page);
+
+	printk(KERN_ERR "INFO: Object 0x%p @offset=%tu fp=0x%p\n\n",
+			p, p - addr, get_freepointer(s, p));
+
+	if (p > addr + 16)
+		print_section("Bytes b4", p - 16, 16);
+
+	print_section("Object", p, min(s->objsize, 128));
+
+	if (s->flags & SLAB_RED_ZONE)
+		print_section("Redzone", p + s->objsize,
+			s->inuse - s->objsize);
+
+	if (s->offset)
+		off = s->offset + sizeof(void *);
+	else
+		off = s->inuse;
+
+	if (s->flags & SLAB_STORE_USER)
+		off += 2 * sizeof(struct track);
+
+	if (off != s->size)
+		/* Beginning of the filler is the free pointer */
+		print_section("Padding", p + off, s->size - off);
+
+	dump_stack();
+}
+
+static void object_err(struct kmem_cache *s, struct slqb_page *page,
+			u8 *object, char *reason)
+{
+	slab_bug(s, reason);
+	print_trailer(s, page, object);
+}
+
+static void slab_err(struct kmem_cache *s, struct slqb_page *page, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	slab_bug(s, "%s", buf);
+	print_page_info(page);
+	dump_stack();
+}
+
+static void init_object(struct kmem_cache *s, void *object, int active)
+{
+	u8 *p = object;
+
+	if (s->flags & __OBJECT_POISON) {
+		memset(p, POISON_FREE, s->objsize - 1);
+		p[s->objsize - 1] = POISON_END;
+	}
+
+	if (s->flags & SLAB_RED_ZONE)
+		memset(p + s->objsize,
+			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
+			s->inuse - s->objsize);
+}
+
+static u8 *check_bytes(u8 *start, unsigned int value, unsigned int bytes)
+{
+	while (bytes) {
+		if (*start != (u8)value)
+			return start;
+		start++;
+		bytes--;
+	}
+	return NULL;
+}
+
+static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
+						void *from, void *to)
+{
+	slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
+	memset(from, data, to - from);
+}
+
+static int check_bytes_and_report(struct kmem_cache *s, struct slqb_page *page,
+			u8 *object, char *what,
+			u8 *start, unsigned int value, unsigned int bytes)
+{
+	u8 *fault;
+	u8 *end;
+
+	fault = check_bytes(start, value, bytes);
+	if (!fault)
+		return 1;
+
+	end = start + bytes;
+	while (end > fault && end[-1] == value)
+		end--;
+
+	slab_bug(s, "%s overwritten", what);
+	printk(KERN_ERR "INFO: 0x%p-0x%p. First byte 0x%x instead of 0x%x\n",
+					fault, end - 1, fault[0], value);
+	print_trailer(s, page, object);
+
+	restore_bytes(s, what, value, fault, end);
+	return 0;
+}
+
+/*
+ * Object layout:
+ *
+ * object address
+ * 	Bytes of the object to be managed.
+ * 	If the freepointer may overlay the object then the free
+ * 	pointer is the first word of the object.
+ *
+ * 	Poisoning uses 0x6b (POISON_FREE) and the last byte is
+ * 	0xa5 (POISON_END)
+ *
+ * object + s->objsize
+ * 	Padding to reach word boundary. This is also used for Redzoning.
+ * 	Padding is extended by another word if Redzoning is enabled and
+ * 	objsize == inuse.
+ *
+ * 	We fill with 0xbb (RED_INACTIVE) for inactive objects and with
+ * 	0xcc (RED_ACTIVE) for objects in use.
+ *
+ * object + s->inuse
+ * 	Meta data starts here.
+ *
+ * 	A. Free pointer (if we cannot overwrite object on free)
+ * 	B. Tracking data for SLAB_STORE_USER
+ * 	C. Padding to reach required alignment boundary, or at minimum
+ * 		one word if debugging is on, to be able to detect writes
+ * 		before the word boundary.
+ *
+ *	Padding is done using 0x5a (POISON_INUSE)
+ *
+ * object + s->size
+ * 	Nothing is used beyond s->size.
+ */
+
+static int check_pad_bytes(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+	unsigned long off = s->inuse;	/* The end of info */
+
+	if (s->offset)
+		/* Freepointer is placed after the object. */
+		off += sizeof(void *);
+
+	if (s->flags & SLAB_STORE_USER)
+		/* We also have user information there */
+		off += 2 * sizeof(struct track);
+
+	if (s->size == off)
+		return 1;
+
+	return check_bytes_and_report(s, page, p, "Object padding",
+				p + off, POISON_INUSE, s->size - off);
+}
+
+static int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+{
+	u8 *start;
+	u8 *fault;
+	u8 *end;
+	int length;
+	int remainder;
+
+	if (!(s->flags & SLAB_POISON))
+		return 1;
+
+	start = slqb_page_address(page);
+	end = start + (PAGE_SIZE << s->order);
+	length = s->objects * s->size;
+	remainder = end - (start + length);
+	if (!remainder)
+		return 1;
+
+	fault = check_bytes(start + length, POISON_INUSE, remainder);
+	if (!fault)
+		return 1;
+	while (end > fault && end[-1] == POISON_INUSE)
+		end--;
+
+	slab_err(s, page, "Padding overwritten. 0x%p-0x%p", fault, end - 1);
+	print_section("Padding", start, length);
+
+	restore_bytes(s, "slab padding", POISON_INUSE, start, end);
+	return 0;
+}
+
+static int check_object(struct kmem_cache *s, struct slqb_page *page,
+					void *object, int active)
+{
+	u8 *p = object;
+	u8 *endobject = object + s->objsize;
+
+	if (s->flags & SLAB_RED_ZONE) {
+		unsigned int red =
+			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE;
+
+		if (!check_bytes_and_report(s, page, object, "Redzone",
+			endobject, red, s->inuse - s->objsize))
+			return 0;
+	} else {
+		if ((s->flags & SLAB_POISON) && s->objsize < s->inuse) {
+			check_bytes_and_report(s, page, p, "Alignment padding",
+				endobject, POISON_INUSE, s->inuse - s->objsize);
+		}
+	}
+
+	if (s->flags & SLAB_POISON) {
+		if (!active && (s->flags & __OBJECT_POISON) &&
+			(!check_bytes_and_report(s, page, p, "Poison", p,
+					POISON_FREE, s->objsize - 1) ||
+			 !check_bytes_and_report(s, page, p, "Poison",
+				p + s->objsize - 1, POISON_END, 1)))
+			return 0;
+		/*
+		 * check_pad_bytes cleans up on its own.
+		 */
+		check_pad_bytes(s, page, p);
+	}
+
+	return 1;
+}
+
+static int check_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	if (!(page->flags & PG_SLQB_BIT)) {
+		slab_err(s, page, "Not a valid slab page");
+		return 0;
+	}
+	if (page->inuse == 0) {
+		slab_err(s, page, "inuse before free / after alloc");
+		return 0;
+	}
+	if (page->inuse > s->objects) {
+		slab_err(s, page, "inuse %u > max %u",
+			page->inuse, s->objects);
+		return 0;
+	}
+	/* Slab_pad_check fixes things up after itself */
+	slab_pad_check(s, page);
+	return 1;
+}
+
+static void trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
+{
+	if (s->flags & SLAB_TRACE) {
+		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+			s->name,
+			alloc ? "alloc" : "free",
+			object, page->inuse,
+			page->freelist);
+
+		if (!alloc)
+			print_section("Object", (void *)object, s->objsize);
+
+		dump_stack();
+	}
+}
+
+static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
+								void *object)
+{
+	if (!slab_debug(s))
+		return;
+
+	if (!(s->flags & (SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON)))
+		return;
+
+	init_object(s, object, 0);
+	init_tracking(s, object);
+}
+
+static int alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
+{
+	struct slqb_page *page;
+	page = virt_to_head_slqb_page(object);
+
+	if (!check_slab(s, page))
+		goto bad;
+
+	if (!check_valid_pointer(s, page, object)) {
+		object_err(s, page, object, "Freelist Pointer check fails");
+		goto bad;
+	}
+
+	if (object && !check_object(s, page, object, 0))
+		goto bad;
+
+	/* Success perform special debug activities for allocs */
+	if (s->flags & SLAB_STORE_USER)
+		set_track(s, object, TRACK_ALLOC, addr);
+	trace(s, page, object, 1);
+	init_object(s, object, 1);
+	return 1;
+
+bad:
+	return 0;
+}
+
+static int free_debug_processing(struct kmem_cache *s, void *object, void *addr)
+{
+	struct slqb_page *page;
+	page = virt_to_head_slqb_page(object);
+
+	if (!check_slab(s, page))
+		goto fail;
+
+	if (!check_valid_pointer(s, page, object)) {
+		slab_err(s, page, "Invalid object pointer 0x%p", object);
+		goto fail;
+	}
+
+	if (!check_object(s, page, object, 1))
+		return 0;
+
+	/* Special debug activities for freeing objects */
+	if (s->flags & SLAB_STORE_USER)
+		set_track(s, object, TRACK_FREE, addr);
+	trace(s, page, object, 0);
+	init_object(s, object, 0);
+	return 1;
+
+fail:
+	slab_fix(s, "Object at 0x%p not freed", object);
+	return 0;
+}
+
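+/*
+ * Boot parameter: slqb_debug[=<flags>[,<slab name prefix>]]
+ * "slqb_debug" alone switches on all debug options globally, while e.g.
+ * "slqb_debug=zp,kmalloc-" enables red zoning and poisoning only for caches
+ * whose names start with "kmalloc-".
+ */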
+static int __init setup_slqb_debug(char *str)
+{
+	slqb_debug = DEBUG_DEFAULT_FLAGS;
+	if (*str++ != '=' || !*str)
+		/*
+		 * No options specified. Switch on full debugging.
+		 */
+		goto out;
+
+	if (*str == ',')
+		/*
+		 * No options but restriction on slabs. This means full
+		 * debugging for slabs matching a pattern.
+		 */
+		goto check_slabs;
+
+	slqb_debug = 0;
+	if (*str == '-')
+		/*
+		 * Switch off all debugging measures.
+		 */
+		goto out;
+
+	/*
+	 * Determine which debug features should be switched on
+	 */
+	for (; *str && *str != ','; str++) {
+		switch (tolower(*str)) {
+		case 'f':
+			slqb_debug |= SLAB_DEBUG_FREE;
+			break;
+		case 'z':
+			slqb_debug |= SLAB_RED_ZONE;
+			break;
+		case 'p':
+			slqb_debug |= SLAB_POISON;
+			break;
+		case 'u':
+			slqb_debug |= SLAB_STORE_USER;
+			break;
+		case 't':
+			slqb_debug |= SLAB_TRACE;
+			break;
+		default:
+			printk(KERN_ERR "slqb_debug option '%c' "
+				"unknown. skipped\n", *str);
+		}
+	}
+
+check_slabs:
+	if (*str == ',')
+		slqb_debug_slabs = str + 1;
+out:
+	return 1;
+}
+
+__setup("slqb_debug", setup_slqb_debug);
+
+static unsigned long kmem_cache_flags(unsigned long objsize,
+	unsigned long flags, const char *name,
+	void (*ctor)(void *))
+{
+	/*
+	 * Enable debugging if selected on the kernel commandline.
+	 */
+	if (slqb_debug && (!slqb_debug_slabs ||
+	    strncmp(slqb_debug_slabs, name,
+		strlen(slqb_debug_slabs)) == 0))
+			flags |= slqb_debug;
+
+	return flags;
+}
+#else
+static inline void setup_object_debug(struct kmem_cache *s,
+			struct slqb_page *page, void *object) {}
+
+static inline int alloc_debug_processing(struct kmem_cache *s,
+	void *object, void *addr) { return 0; }
+
+static inline int free_debug_processing(struct kmem_cache *s,
+	void *object, void *addr) { return 0; }
+
+static inline int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+			{ return 1; }
+static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
+			void *object, int active) { return 1; }
+static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page) {}
+static inline unsigned long kmem_cache_flags(unsigned long objsize,
+	unsigned long flags, const char *name, void (*ctor)(void *))
+{
+	return flags;
+}
+#define slqb_debug 0
+#endif
+
+/*
+ * allocate a new slab (return its corresponding struct slqb_page)
+ */
+static struct slqb_page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+{
+	struct slqb_page *page;
+	int pages = 1 << s->order;
+
+	flags |= s->allocflags;
+
+	page = alloc_slqb_pages_node(node, flags, s->order);
+	if (!page)
+		return NULL;
+
+	mod_zone_page_state(slqb_page_zone(page),
+		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+		pages);
+
+	return page;
+}
+
+/*
+ * Called once for each object on a new slab page
+ */
+static void setup_object(struct kmem_cache *s, struct slqb_page *page,
+				void *object)
+{
+	setup_object_debug(s, page, object);
+	if (unlikely(s->ctor))
+		s->ctor(object);
+}
+
+/*
+ * Allocate a new slab, set up its object list.
+ */
+static struct slqb_page *new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
+{
+	struct slqb_page *page;
+	void *start;
+	void *last;
+	void *p;
+
+	BUG_ON(flags & GFP_SLAB_BUG_MASK);
+
+	page = allocate_slab(s,
+		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
+	if (!page)
+		goto out;
+
+	page->flags |= PG_SLQB_BIT;
+
+	start = page_address(&page->page);
+
+	if (unlikely(slab_poison(s)))
+		memset(start, POISON_INUSE, PAGE_SIZE << s->order);
+
+	start += colour;
+
+	last = start;
+	for_each_object(p, s, start) {
+		setup_object(s, page, p);
+		set_freepointer(s, last, p);
+		last = p;
+	}
+	set_freepointer(s, last, NULL);
+
+	page->freelist = start;
+	page->inuse = 0;
+out:
+	return page;
+}
+
+/*
+ * Free a slab page back to the page allocator
+ */
+static void __free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	int pages = 1 << s->order;
+
+	if (unlikely(slab_debug(s))) {
+		void *p;
+
+		slab_pad_check(s, page);
+		for_each_free_object(p, s, page->freelist)
+			check_object(s, page, p, 0);
+	}
+
+	mod_zone_page_state(slqb_page_zone(page),
+		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+		-pages);
+
+	__free_slqb_pages(page, s->order);
+}
+
+static void rcu_free_slab(struct rcu_head *h)
+{
+	struct slqb_page *page;
+
+	page = container_of(h, struct slqb_page, rcu_head);
+	__free_slab(page->list->cache, page);
+}
+
+static void free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	VM_BUG_ON(page->inuse);
+	if (unlikely(s->flags & SLAB_DESTROY_BY_RCU))
+		call_rcu(&page->rcu_head, rcu_free_slab);
+	else
+		__free_slab(s, page);
+}
+
+/*
+ * Return an object to its slab.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static int free_object_to_page(struct kmem_cache *s, struct kmem_cache_list *l, struct slqb_page *page, void *object)
+{
+	VM_BUG_ON(page->list != l);
+
+	set_freepointer(s, object, page->freelist);
+	page->freelist = object;
+	page->inuse--;
+
+	if (!page->inuse) {
+		if (likely(s->objects > 1)) {
+			l->nr_partial--;
+			list_del(&page->lru);
+		}
+		l->nr_slabs--;
+		free_slab(s, page);
+		slqb_stat_inc(l, FLUSH_SLAB_FREE);
+		return 1;
+	} else if (page->inuse + 1 == s->objects) {
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, FLUSH_SLAB_PARTIAL);
+		return 0;
+	}
+	return 0;
+}
+
+#ifdef CONFIG_SMP
+static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c);
+#endif
+
+/*
+ * Flush the LIFO list of objects on a list. They are sent back to their pages
+ * in case the pages also belong to the list, or to our CPU's remote-free list
+ * in the case they do not.
+ *
+ * Doesn't flush the entire list. flush_free_list_all does.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void flush_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	struct kmem_cache_cpu *c;
+	void **head;
+	int nr;
+
+	nr = l->freelist.nr;
+	if (unlikely(!nr))
+		return;
+
+	nr = min(slab_freebatch(s), nr);
+
+	slqb_stat_inc(l, FLUSH_FREE_LIST);
+	slqb_stat_add(l, FLUSH_FREE_LIST_OBJECTS, nr);
+
+	c = get_cpu_slab(s, smp_processor_id());
+
+	l->freelist.nr -= nr;
+	head = l->freelist.head;
+
+	do {
+		struct slqb_page *page;
+		void **object;
+
+		object = head;
+		VM_BUG_ON(!object);
+		head = get_freepointer(s, object);
+		page = virt_to_head_slqb_page(object);
+
+#ifdef CONFIG_SMP
+		if (page->list != l) {
+			slab_free_to_remote(s, page, object, c);
+			slqb_stat_inc(l, FLUSH_FREE_LIST_REMOTE);
+		} else
+#endif
+			free_object_to_page(s, l, page, object);
+
+		nr--;
+	} while (nr);
+
+	l->freelist.head = head;
+	if (!l->freelist.nr)
+		l->freelist.tail = NULL;
+}
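+
+/*
+ * Worked example, assuming slab_freebatch(s) == 16: with 40 objects on the
+ * LIFO freelist, one call flushes 16 of them (each back to its page, or to
+ * the remote-free cache if its page belongs to another list) and leaves 24
+ * queued. flush_free_list_all() below simply repeats this until the list is
+ * empty.
+ */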
+
+static void flush_free_list_all(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	while (l->freelist.nr)
+		flush_free_list(s, l);
+}
+
+#ifdef CONFIG_SMP
+/*
+ * If enough objects have been remotely freed back to this list,
+ * remote_free_check will be set, in which case we will eventually come here
+ * to move those objects off our remote_free list and onto our LIFO freelist.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	void **head, **tail;
+	int nr;
+
+	VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
+
+	if (!l->remote_free.list.nr)
+		return;
+
+	l->remote_free_check = 0;
+	head = l->remote_free.list.head;
+	prefetchw(head);
+
+	spin_lock(&l->remote_free.lock);
+	l->remote_free.list.head = NULL;
+	tail = l->remote_free.list.tail;
+	l->remote_free.list.tail = NULL;
+	nr = l->remote_free.list.nr;
+	l->remote_free.list.nr = 0;
+	spin_unlock(&l->remote_free.lock);
+
+	if (!l->freelist.nr)
+		l->freelist.head = head;
+	else
+		set_freepointer(s, l->freelist.tail, head);
+	l->freelist.tail = tail;
+
+	l->freelist.nr += nr;
+
+	slqb_stat_inc(l, CLAIM_REMOTE_LIST);
+	slqb_stat_add(l, CLAIM_REMOTE_LIST_OBJECTS, nr);
+}
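+
+/*
+ * Note on locking above: remote_free.lock only covers detaching the
+ * head/tail/nr of the remote_free list. Splicing those objects onto our
+ * LIFO freelist is then a couple of pointer assignments done without the
+ * lock, since only the list owner touches the freelist. E.g. 32 remotely
+ * freed objects are claimed with one splice rather than 32 individual frees.
+ */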
+#endif
+
+/*
+ * Allocation fastpath. Get an object from the list's LIFO freelist, or
+ * return NULL if it is empty.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static __always_inline void *__cache_list_get_object(struct kmem_cache *s,
+		struct kmem_cache_list *l)
+{
+	void *object;
+
+	object = l->freelist.head;
+	if (likely(object)) {
+		void *next = get_freepointer(s, object);
+		VM_BUG_ON(!l->freelist.nr);
+		l->freelist.nr--;
+		l->freelist.head = next;
+		if (next)
+			prefetchw(next);
+		return object;
+	}
+	VM_BUG_ON(l->freelist.nr);
+
+#ifdef CONFIG_SMP
+	if (unlikely(l->remote_free_check)) {
+		claim_remote_free_list(s, l);
+
+		if (l->freelist.nr > slab_hiwater(s))
+			flush_free_list(s, l);
+
+		/* repetition here helps gcc :( */
+		object = l->freelist.head;
+		if (likely(object)) {
+			void *next = get_freepointer(s, object);
+			VM_BUG_ON(!l->freelist.nr);
+			l->freelist.nr--;
+			l->freelist.head = next;
+			if (next)
+				prefetchw(next);
+			return object;
+		}
+		VM_BUG_ON(l->freelist.nr);
+	}
+#endif
+
+	return NULL;
+}
+
+/*
+ * Slow(er) path. Take an object from one of this list's pages. The page will
+ * be a freshly allocated, empty one if __slab_alloc_page has just been
+ * called (empty pages otherwise never get queued up on the lists), or a
+ * partial page already on the list.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static noinline void *__cache_list_get_page(struct kmem_cache *s,
+		struct kmem_cache_list *l)
+{
+	struct slqb_page *page;
+	void *object;
+
+	if (unlikely(!l->nr_partial))
+		return NULL;
+
+	page = list_first_entry(&l->partial, struct slqb_page, lru);
+	VM_BUG_ON(page->inuse == s->objects);
+	if (page->inuse + 1 == s->objects) {
+		l->nr_partial--;
+		list_del(&page->lru);
+/*XXX		list_move(&page->lru, &l->full); */
+	}
+
+	VM_BUG_ON(!page->freelist);
+
+	page->inuse++;
+
+//	VM_BUG_ON(node != -1 && node != slqb_page_to_nid(page));
+
+	object = page->freelist;
+	page->freelist = get_freepointer(s, object);
+	if (page->freelist)
+		prefetchw(page->freelist);
+	VM_BUG_ON((page->inuse == s->objects) != (page->freelist == NULL));
+	slqb_stat_inc(l, ALLOC_SLAB_FILL);
+
+	return object;
+}
+
+/*
+ * Allocation slowpath. Allocate a new slab page from the page allocator, and
+ * put it on the list's partial list. Must be followed by an allocation so
+ * that we don't have dangling empty pages on the partial list.
+ *
+ * Returns 0 on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline int __slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+	struct slqb_page *page;
+	struct kmem_cache_list *l;
+	struct kmem_cache_cpu *c;
+	unsigned int colour;
+
+	c = get_cpu_slab(s, smp_processor_id());
+	colour = c->colour_next;
+	c->colour_next += s->colour_off;
+	if (c->colour_next >= s->colour_range)
+		c->colour_next = 0;
+
+	/* XXX: load any partial? */
+
+	/* Caller handles __GFP_ZERO */
+	gfpflags &= ~__GFP_ZERO;
+
+	if (gfpflags & __GFP_WAIT)
+		local_irq_enable();
+	page = new_slab_page(s, gfpflags, node, colour);
+	if (gfpflags & __GFP_WAIT)
+		local_irq_disable();
+	if (unlikely(!page))
+		return 0;
+
+	if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == numa_node_id())) {
+		struct kmem_cache_cpu *c;
+		int cpu = smp_processor_id();
+
+		c = get_cpu_slab(s, cpu);
+		l = &c->list;
+		page->list = l;
+
+		if (unlikely(l->nr_partial)) {
+			__free_slqb_pages(page, s->order);
+			return 1;
+		}
+		l->nr_slabs++;
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, ALLOC_SLAB_NEW);
+#ifdef CONFIG_NUMA
+	} else {
+		struct kmem_cache_node *n;
+
+		n = s->node[slqb_page_to_nid(page)];
+		l = &n->list;
+		page->list = l;
+
+		spin_lock(&n->list_lock);
+		if (unlikely(l->nr_partial)) {
+			spin_unlock(&n->list_lock);
+			__free_slqb_pages(page, s->order);
+			return 1;
+		}
+		l->nr_slabs++;
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, ALLOC_SLAB_NEW);
+		spin_unlock(&n->list_lock);
+		/* XXX: could have a race here where a full page is left on
+		 * the list if we subsequently migrate to or from the node.
+		 * Should make the above node selection and stick to it.
+		 */
+#endif
+	}
+	return 1;
+}
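+
+/*
+ * Cache colouring example (illustrative numbers, debugging off): for a cache
+ * whose final object size works out to 720 bytes, an order-0 slab holds 5
+ * objects with 496 bytes left over, so colour_range = 496. With 64-byte
+ * cache lines (and alignment no larger than that), colour_off = 64 and
+ * successive new slabs allocated by this CPU start their objects at offsets
+ * 0, 64, 128, ..., 448, then wrap back to 0.
+ */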
+
+#ifdef CONFIG_NUMA
+/*
+ * Allocate an object from a remote node. Return NULL if none could be found
+ * (in which case, caller should allocate a new slab)
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void *__remote_slab_alloc(struct kmem_cache *s, int node)
+{
+	struct kmem_cache_node *n;
+	struct kmem_cache_list *l;
+	void *object;
+
+	n = s->node[node];
+	VM_BUG_ON(!n);
+	l = &n->list;
+
+//	if (unlikely(!(l->freelist.nr | l->nr_partial | l->remote_free_check)))
+//		return NULL;
+
+	spin_lock(&n->list_lock);
+
+	object = __cache_list_get_object(s, l);
+	if (unlikely(!object))
+		object = __cache_list_get_page(s, l);
+	slqb_stat_inc(l, ALLOC);
+	spin_unlock(&n->list_lock);
+	return object;
+}
+
+static noinline int alternate_nid(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+	if (in_interrupt() || (gfpflags & __GFP_THISNODE))
+		return node;
+	if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD))
+		return cpuset_mem_spread_node();
+	else if (current->mempolicy)
+		return slab_node(current->mempolicy);
+	return node;
+}
+#endif
+
+/*
+ * Main allocation path. Return an object, or NULL on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void *__slab_alloc(struct kmem_cache *s,
+		gfp_t gfpflags, int node)
+{
+	void *object;
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_list *l;
+
+again:
+#ifdef CONFIG_NUMA
+	if (unlikely(node != -1) && unlikely(node != numa_node_id())) {
+		object = __remote_slab_alloc(s, node);
+		if (unlikely(!object))
+			goto alloc_new;
+		return object;
+	}
+#endif
+
+	c = get_cpu_slab(s, smp_processor_id());
+	VM_BUG_ON(!c);
+	l = &c->list;
+	object = __cache_list_get_object(s, l);
+	if (unlikely(!object)) {
+		object = __cache_list_get_page(s, l);
+		if (unlikely(!object))
+			goto alloc_new;
+	}
+	slqb_stat_inc(l, ALLOC);
+	return object;
+
+alloc_new:
+	if (unlikely(!__slab_alloc_page(s, gfpflags, node)))
+		return NULL;
+	goto again;
+}
+
+/*
+ * Perform some interrupts-on processing around the main allocation path
+ * (debug checking and memset()ing).
+ */
+static __always_inline void *slab_alloc(struct kmem_cache *s,
+		gfp_t gfpflags, int node, void *addr)
+{
+	void *object;
+	unsigned long flags;
+
+again:
+	local_irq_save(flags);
+	object = __slab_alloc(s, gfpflags, node);
+	local_irq_restore(flags);
+
+	if (unlikely(slab_debug(s)) && likely(object)) {
+		if (unlikely(!alloc_debug_processing(s, object, addr)))
+			goto again;
+	}
+
+	if (unlikely(gfpflags & __GFP_ZERO) && likely(object))
+		memset(object, 0, s->objsize);
+
+	return object;
+}
+
+void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+{
+	int node = -1;
+#ifdef CONFIG_NUMA
+	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+		node = alternate_nid(s, gfpflags, node);
+#endif
+	return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc);
+
+#ifdef CONFIG_NUMA
+void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+	return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node);
+#endif
+
+#ifdef CONFIG_SMP
+/*
+ * Flush this CPU's remote-free cache of objects back to the list they
+ * originate from. They end up on that list's remotely freed list, and we set
+ * its remote_free_check once enough objects have accumulated there.
+ *
+ * This seems convoluted, but it keeps us from stomping on the target CPU's
+ * fastpath cachelines.
+ *
+ * Must be called with interrupts disabled.
+ */
+static void flush_remote_free_cache(struct kmem_cache *s, struct kmem_cache_cpu *c)
+{
+	struct kmlist *src;
+	struct kmem_cache_list *dst;
+	unsigned int nr;
+	int set;
+
+	src = &c->rlist;
+	nr = src->nr;
+	if (unlikely(!nr))
+		return;
+
+#ifdef CONFIG_SLQB_STATS
+	{
+		struct kmem_cache_list *l = &c->list;
+		slqb_stat_inc(l, FLUSH_RFREE_LIST);
+		slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);
+	}
+#endif
+
+	dst = c->remote_cache_list;
+
+	spin_lock(&dst->remote_free.lock);
+	if (!dst->remote_free.list.head)
+		dst->remote_free.list.head = src->head;
+	else
+		set_freepointer(s, dst->remote_free.list.tail, src->head);
+	dst->remote_free.list.tail = src->tail;
+
+	src->head = NULL;
+	src->tail = NULL;
+	src->nr = 0;
+
+	if (dst->remote_free.list.nr < slab_freebatch(s))
+		set = 1;
+	else
+		set = 0;
+
+	dst->remote_free.list.nr += nr;
+
+	if (unlikely(dst->remote_free.list.nr >= slab_freebatch(s) && set))
+		dst->remote_free_check = 1;
+
+	spin_unlock(&dst->remote_free.lock);
+}
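+
+/*
+ * The 'set' logic above means remote_free_check is raised only by the
+ * transfer that first pushes the target's remote_free list past
+ * slab_freebatch(s), so the owner's cacheline is dirtied once per batch
+ * rather than on every flush. E.g. with freebatch 16, growing the list from
+ * 10 to 26 objects sets the flag; a later growth from 26 to 42 does not.
+ */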
+
+/*
+ * Free an object to this CPU's remote free list.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void slab_free_to_remote(struct kmem_cache *s,
+		struct slqb_page *page, void *object, struct kmem_cache_cpu *c)
+{
+	struct kmlist *r;
+
+	/*
+	 * Our remote free list corresponds to a different list. Must
+	 * flush it and switch.
+	 */
+	if (page->list != c->remote_cache_list) {
+		flush_remote_free_cache(s, c);
+		c->remote_cache_list = page->list;
+	}
+
+	r = &c->rlist;
+	if (!r->head)
+		r->head = object;
+	else
+		set_freepointer(s, r->tail, object);
+	set_freepointer(s, object, NULL);
+	r->tail = object;
+	r->nr++;
+
+	if (unlikely(r->nr > slab_freebatch(s)))
+		flush_remote_free_cache(s, c);
+}
+#endif
+
+/*
+ * Main freeing path. Returns the object to the local LIFO freelist, or to
+ * our CPU's remote-free cache if it was allocated on another node.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void __slab_free(struct kmem_cache *s,
+		struct slqb_page *page, void *object)
+{
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_list *l;
+	int thiscpu = smp_processor_id();
+
+	c = get_cpu_slab(s, thiscpu);
+	l = &c->list;
+
+	slqb_stat_inc(l, FREE);
+
+	if (!NUMA_BUILD || !numa_platform ||
+			likely(slqb_page_to_nid(page) == numa_node_id())) {
+		/*
+		 * Freeing fastpath. Collects all local-node objects, not
+		 * just those allocated from our per-CPU list. This allows
+		 * fast transfer of objects from one CPU to another within
+		 * a given node.
+		 */
+		set_freepointer(s, object, l->freelist.head);
+		l->freelist.head = object;
+		if (!l->freelist.nr)
+			l->freelist.tail = object;
+		l->freelist.nr++;
+
+		if (unlikely(l->freelist.nr > slab_hiwater(s)))
+			flush_free_list(s, l);
+
+#ifdef CONFIG_NUMA
+	} else {
+		/*
+		 * Freeing an object that was allocated on a remote node.
+		 */
+		slab_free_to_remote(s, page, object, c);
+		slqb_stat_inc(l, FREE_REMOTE);
+#endif
+	}
+}
+
+/*
+ * Perform some interrupts-on processing around the main freeing path
+ * (debug checking).
+ */
+static __always_inline void slab_free(struct kmem_cache *s,
+		struct slqb_page *page, void *object)
+{
+	unsigned long flags;
+
+	prefetchw(object);
+
+	debug_check_no_locks_freed(object, s->objsize);
+	if (likely(object) && unlikely(slab_debug(s))) {
+		if (unlikely(!free_debug_processing(s, object, __builtin_return_address(0))))
+			return;
+	}
+
+	local_irq_save(flags);
+	__slab_free(s, page, object);
+	local_irq_restore(flags);
+}
+
+void kmem_cache_free(struct kmem_cache *s, void *object)
+{
+	struct slqb_page *page = NULL;
+	if (numa_platform)
+		page = virt_to_head_slqb_page(object);
+	slab_free(s, page, object);
+}
+EXPORT_SYMBOL(kmem_cache_free);
+
+/*
+ * Calculate the allocation order for a given slab object size.
+ *
+ * Order-0 allocations are preferred, since they do not cause fragmentation
+ * in the page allocator and have page allocator fastpaths. For larger
+ * objects, though, higher orders are used to limit wasted space in the slab.
+ */
+static inline int slab_order(int size, int max_order, int frac)
+{
+	int order;
+
+	if (fls(size - 1) <= PAGE_SHIFT)
+		order = 0;
+	else
+		order = fls(size - 1) - PAGE_SHIFT;
+	while (order <= max_order) {
+		unsigned long slab_size = PAGE_SIZE << order;
+		unsigned long objects;
+		unsigned long waste;
+
+		objects = slab_size / size;
+		if (!objects) {
+			order++;
+			continue;
+		}
+
+		waste = slab_size - (objects * size);
+
+		if (waste * frac <= slab_size)
+			break;
+
+		order++;
+	}
+
+	return order;
+}
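+
+/*
+ * Worked example (assuming 4KiB pages): slab_order(1500, 1, 4) starts at
+ * order 0 since fls(1499) == 11 <= PAGE_SHIFT. At order 0 a slab fits 2
+ * objects and wastes 1096 bytes; 1096 * 4 > 4096, so try order 1. There 5
+ * objects fit and 692 bytes are wasted; 692 * 4 <= 8192, so order 1 is
+ * accepted.
+ */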
+
+static inline int calculate_order(int size)
+{
+	int order;
+
+	/*
+	 * Attempt to find best configuration for a slab. This
+	 * works by first attempting to generate a layout with
+	 * the best configuration and backing off gradually.
+	 */
+	order = slab_order(size, 1, 4);
+	if (order <= 1)
+		return order;
+
+	/*
+	 * This size cannot fit in order-1. Allow bigger orders, but
+	 * forget about trying to save space.
+	 */
+	order = slab_order(size, MAX_ORDER, 0);
+	if (order <= MAX_ORDER)
+		return order;
+
+	return -ENOSYS;
+}
+
+/*
+ * Figure out what the alignment of the objects will be.
+ */
+static unsigned long calculate_alignment(unsigned long flags,
+		unsigned long align, unsigned long size)
+{
+	/*
+	 * If the user wants hardware cache aligned objects then follow that
+	 * suggestion if the object is sufficiently large.
+	 *
+	 * The hardware cache alignment cannot override the specified
+	 * alignment, though. If that is greater, use it.
+	 */
+	if (flags & SLAB_HWCACHE_ALIGN) {
+		unsigned long ralign = cache_line_size();
+		while (size <= ralign / 2)
+			ralign /= 2;
+		align = max(align, ralign);
+	}
+
+	if (align < ARCH_SLAB_MINALIGN)
+		align = ARCH_SLAB_MINALIGN;
+
+	return ALIGN(align, sizeof(void *));
+}
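+
+/*
+ * Example, assuming 64-byte cache lines and ARCH_SLAB_MINALIGN == 8: for a
+ * 20-byte object created with SLAB_HWCACHE_ALIGN and align == 0, ralign is
+ * halved from 64 to 32 (since 20 <= 32 but not <= 16), so the function
+ * returns 32. A 200-byte object in the same situation keeps the full
+ * 64-byte alignment.
+ */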
+
+static void init_kmem_cache_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	l->cache = s;
+	l->freelist.nr = 0;
+	l->freelist.head = NULL;
+	l->freelist.tail = NULL;
+	l->nr_partial = 0;
+	l->nr_slabs = 0;
+	INIT_LIST_HEAD(&l->partial);
+//	INIT_LIST_HEAD(&l->full);
+
+#ifdef CONFIG_SMP
+	l->remote_free_check = 0;
+	spin_lock_init(&l->remote_free.lock);
+	l->remote_free.list.nr = 0;
+	l->remote_free.list.head = NULL;
+	l->remote_free.list.tail = NULL;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+	memset(l->stats, 0, sizeof(l->stats));
+#endif
+}
+
+static void init_kmem_cache_cpu(struct kmem_cache *s,
+			struct kmem_cache_cpu *c)
+{
+	init_kmem_cache_list(s, &c->list);
+
+	c->colour_next = 0;
+#ifdef CONFIG_SMP
+	c->rlist.nr = 0;
+	c->rlist.head = NULL;
+	c->rlist.tail = NULL;
+	c->remote_cache_list = NULL;
+#endif
+}
+
+#ifdef CONFIG_NUMA
+static void init_kmem_cache_node(struct kmem_cache *s, struct kmem_cache_node *n)
+{
+	spin_lock_init(&n->list_lock);
+	init_kmem_cache_list(s, &n->list);
+}
+#endif
+
+/* Initial slabs */
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
+#endif
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache kmem_cpu_cache;
+static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
+#endif
+#endif
+
+#ifdef CONFIG_NUMA
+static struct kmem_cache kmem_node_cache;
+static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
+static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
+{
+	struct kmem_cache_cpu *c;
+
+	c = kmem_cache_alloc_node(&kmem_cpu_cache, GFP_KERNEL, cpu_to_node(cpu));
+	if (!c)
+		return NULL;
+
+	init_kmem_cache_cpu(s, c);
+	return c;
+}
+
+static void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+
+		c = s->cpu_slab[cpu];
+		if (c) {
+			kmem_cache_free(&kmem_cpu_cache, c);
+			s->cpu_slab[cpu] = NULL;
+		}
+	}
+}
+
+static int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+
+		c = s->cpu_slab[cpu];
+		if (c)
+			continue;
+
+		c = alloc_kmem_cache_cpu(s, cpu);
+		if (!c) {
+			free_kmem_cache_cpus(s);
+			return 0;
+		}
+		s->cpu_slab[cpu] = c;
+	}
+	return 1;
+}
+
+#else
+static inline void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+	init_kmem_cache_cpu(s, &s->cpu_slab);
+	return 1;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = s->node[node];
+		if (n) {
+			kmem_cache_free(&kmem_node_cache, n);
+			s->node[node] = NULL;
+		}
+	}
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = kmem_cache_alloc_node(&kmem_node_cache, GFP_KERNEL, node);
+		if (!n) {
+			free_kmem_cache_nodes(s);
+			return 0;
+		}
+		init_kmem_cache_node(s, n);
+		s->node[node] = n;
+	}
+	return 1;
+}
+#else
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+	return 1;
+}
+#endif
+
+/*
+ * calculate_sizes() determines the order and the distribution of data within
+ * a slab object.
+ */
+static int calculate_sizes(struct kmem_cache *s)
+{
+	unsigned long flags = s->flags;
+	unsigned long size = s->objsize;
+	unsigned long align = s->align;
+
+	/*
+	 * Determine if we can poison the object itself. If the user of
+	 * the slab may touch the object after free or before allocation
+	 * then we should never poison the object itself.
+	 */
+	if (slab_poison(s) && !(flags & SLAB_DESTROY_BY_RCU) && !s->ctor)
+		s->flags |= __OBJECT_POISON;
+	else
+		s->flags &= ~__OBJECT_POISON;
+
+	/*
+	 * Round up object size to the next word boundary. We can only
+	 * place the free pointer at word boundaries and this determines
+	 * the possible location of the free pointer.
+	 */
+	size = ALIGN(size, sizeof(void *));
+
+#ifdef CONFIG_SLQB_DEBUG
+	/*
+	 * If we are Redzoning then check if there is some space between the
+	 * end of the object and the free pointer. If not then add an
+	 * additional word to have some bytes to store Redzone information.
+	 */
+	if ((flags & SLAB_RED_ZONE) && size == s->objsize)
+		size += sizeof(void *);
+#endif
+
+	/*
+	 * With that we have determined the number of bytes in actual use
+	 * by the object. This is the potential offset to the free pointer.
+	 */
+	s->inuse = size;
+
+	if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) || s->ctor)) {
+		/*
+		 * Relocate free pointer after the object if it is not
+		 * permitted to overwrite the first word of the object on
+		 * kmem_cache_free.
+		 *
+		 * This is the case if we use RCU, have a constructor, or
+		 * are poisoning the objects.
+		 */
+		s->offset = size;
+		size += sizeof(void *);
+	}
+
+#ifdef CONFIG_SLQB_DEBUG
+	if (flags & SLAB_STORE_USER)
+		/*
+		 * Need to store information about allocs and frees after
+		 * the object.
+		 */
+		size += 2 * sizeof(struct track);
+
+	if (flags & SLAB_RED_ZONE)
+		/*
+		 * Add some empty padding so that we can catch
+		 * overwrites from earlier objects rather than let
+		 * tracking information or the free pointer be
+		 * corrupted if a user writes before the start
+		 * of the object.
+		 */
+		size += sizeof(void *);
+#endif
+
+	/*
+	 * Determine the alignment based on various parameters that the
+	 * user specified and the dynamic determination of cache line size
+	 * on bootup.
+	 */
+	align = calculate_alignment(flags, align, s->objsize);
+
+	/*
+	 * SLQB stores one object immediately after another beginning from
+	 * offset 0. In order to align the objects we have to simply size
+	 * each object to conform to the alignment.
+	 */
+	size = ALIGN(size, align);
+	s->size = size;
+	s->order = calculate_order(size);
+
+	if (s->order < 0)
+		return 0;
+
+	s->allocflags = 0;
+	if (s->order)
+		s->allocflags |= __GFP_COMP;
+
+	if (s->flags & SLAB_CACHE_DMA)
+		s->allocflags |= SLQB_DMA;
+
+	if (s->flags & SLAB_RECLAIM_ACCOUNT)
+		s->allocflags |= __GFP_RECLAIMABLE;
+
+	/*
+	 * Determine the number of objects per slab
+	 */
+	s->objects = (PAGE_SIZE << s->order) / size;
+
+	s->freebatch = max(4UL*PAGE_SIZE / size,
+			min(256UL, 64*PAGE_SIZE / size));
+	if (!s->freebatch)
+		s->freebatch = 1;
+	s->hiwater = s->freebatch << 2;
+
+	return !!s->objects;
+}
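+
+/*
+ * Example of the resulting tuning, assuming 4KiB pages and 256-byte objects
+ * with no debugging: order 0, 16 objects per slab, freebatch =
+ * max(16384/256, min(256, 262144/256)) = 256 and hiwater = 1024, i.e. up to
+ * 1024 objects may sit on a CPU's LIFO freelist before a batch of 256 is
+ * flushed back.
+ */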
+
+static int kmem_cache_open(struct kmem_cache *s,
+		const char *name, size_t size,
+		size_t align, unsigned long flags,
+		void (*ctor)(void *), int alloc)
+{
+	unsigned int left_over;
+
+	memset(s, 0, kmem_size);
+	s->name = name;
+	s->ctor = ctor;
+	s->objsize = size;
+	s->align = align;
+	s->flags = kmem_cache_flags(size, flags, name, ctor);
+
+	if (!calculate_sizes(s))
+		goto error;
+
+	if (!slab_debug(s)) {
+		left_over = (PAGE_SIZE << s->order) - (s->objects * s->size);
+		s->colour_off = max(cache_line_size(), s->align);
+		s->colour_range = left_over;
+	} else {
+		s->colour_off = 0;
+		s->colour_range = 0;
+	}
+
+	if (likely(alloc)) {
+		if (!alloc_kmem_cache_nodes(s))
+			goto error;
+
+		if (!alloc_kmem_cache_cpus(s))
+			goto error_nodes;
+	}
+
+	/* XXX: perform some basic checks like SLAB does, eg. duplicate names */
+	down_write(&slqb_lock);
+	sysfs_slab_add(s);
+	list_add(&s->list, &slab_caches);
+	up_write(&slqb_lock);
+
+	return 1;
+
+error_nodes:
+	free_kmem_cache_nodes(s);
+error:
+	if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+	return 0;
+}
+
+/*
+ * Check if a given pointer is valid
+ */
+int kmem_ptr_validate(struct kmem_cache *s, const void *object)
+{
+	struct slqb_page *page = virt_to_head_slqb_page(object);
+
+	if (!(page->flags & PG_SLQB_BIT))
+		return 0;
+
+	/*
+	 * We could also check whether the object is on the slab's freelist,
+	 * but that would be too expensive, and the main purpose of
+	 * kmem_ptr_validate is to check whether the object belongs to a
+	 * certain slab.
+	 */
+	return 1;
+}
+EXPORT_SYMBOL(kmem_ptr_validate);
+
+/*
+ * Determine the size of a slab object
+ */
+unsigned int kmem_cache_size(struct kmem_cache *s)
+{
+	return s->objsize;
+}
+EXPORT_SYMBOL(kmem_cache_size);
+
+const char *kmem_cache_name(struct kmem_cache *s)
+{
+	return s->name;
+}
+EXPORT_SYMBOL(kmem_cache_name);
+
+/*
+ * Release all resources used by a slab cache. No more concurrency on the
+ * slab, so we can touch remote kmem_cache_cpu structures.
+ */
+void kmem_cache_destroy(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+	int cpu;
+
+	down_write(&slqb_lock);
+	list_del(&s->list);
+	up_write(&slqb_lock);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		flush_free_list_all(s, l);
+		flush_remote_free_cache(s, c);
+	}
+#endif
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+		claim_remote_free_list(s, l);
+#endif
+		flush_free_list_all(s, l);
+
+		WARN_ON(l->freelist.nr);
+		WARN_ON(l->nr_slabs);
+		WARN_ON(l->nr_partial);
+	}
+
+	free_kmem_cache_cpus(s);
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		claim_remote_free_list(s, l);
+		flush_free_list_all(s, l);
+
+		WARN_ON(l->freelist.nr);
+		WARN_ON(l->nr_slabs);
+		WARN_ON(l->nr_partial);
+	}
+
+	free_kmem_cache_nodes(s);
+#endif
+
+	sysfs_slab_remove(s);
+}
+EXPORT_SYMBOL(kmem_cache_destroy);
+
+/********************************************************************
+ *		Kmalloc subsystem
+ *******************************************************************/
+
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches);
+
+#ifdef CONFIG_ZONE_DMA
+struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches_dma);
+#endif
+
+#ifndef ARCH_KMALLOC_FLAGS
+#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
+#endif
+
+static struct kmem_cache *open_kmalloc_cache(struct kmem_cache *s,
+		const char *name, int size, gfp_t gfp_flags)
+{
+	unsigned int flags = ARCH_KMALLOC_FLAGS | SLAB_PANIC;
+
+	if (gfp_flags & SLQB_DMA)
+		flags |= SLAB_CACHE_DMA;
+
+	kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN, flags, NULL, 1);
+
+	return s;
+}
+
+/*
+ * Conversion table from small slab sizes (divided by 8) to the index in the
+ * kmalloc array. This is necessary for slabs below 192 bytes, since we have
+ * non-power-of-two cache sizes there. The size of larger slabs can be
+ * determined using fls.
+ */
+static s8 size_index[24] __cacheline_aligned = {
+	3,	/* 8 */
+	4,	/* 16 */
+	5,	/* 24 */
+	5,	/* 32 */
+	6,	/* 40 */
+	6,	/* 48 */
+	6,	/* 56 */
+	6,	/* 64 */
+#if L1_CACHE_BYTES < 64
+	1,	/* 72 */
+	1,	/* 80 */
+	1,	/* 88 */
+	1,	/* 96 */
+#else
+	7,
+	7,
+	7,
+	7,
+#endif
+	7,	/* 104 */
+	7,	/* 112 */
+	7,	/* 120 */
+	7,	/* 128 */
+#if L1_CACHE_BYTES < 128
+	2,	/* 136 */
+	2,	/* 144 */
+	2,	/* 152 */
+	2,	/* 160 */
+	2,	/* 168 */
+	2,	/* 176 */
+	2,	/* 184 */
+	2	/* 192 */
+#else
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1
+#endif
+};
+
+static struct kmem_cache *get_slab(size_t size, gfp_t flags)
+{
+	int index;
+
+#if L1_CACHE_BYTES >= 128
+	if (size <= 128) {
+#else
+	if (size <= 192) {
+#endif
+		if (unlikely(!size))
+			return ZERO_SIZE_PTR;
+
+		index = size_index[(size - 1) / 8];
+	} else
+		index = fls(size - 1);
+
+	if (unlikely((flags & SLQB_DMA)))
+		return &kmalloc_caches_dma[index];
+	else
+		return &kmalloc_caches[index];
+}
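+
+/*
+ * Index lookup examples (assuming the size_index table has not been patched
+ * for a large KMALLOC_MIN_SIZE): kmalloc(100, GFP_KERNEL) uses
+ * size_index[(100 - 1) / 8] == size_index[12] == 7, i.e. the kmalloc-128
+ * cache; kmalloc(1000, GFP_KERNEL) uses fls(999) == 10, i.e. kmalloc-1024.
+ */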
+
+void *__kmalloc(size_t size, gfp_t flags)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return kmem_cache_alloc(s, flags);
+}
+EXPORT_SYMBOL(__kmalloc);
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return kmem_cache_alloc_node(s, flags, node);
+}
+EXPORT_SYMBOL(__kmalloc_node);
+#endif
+
+size_t ksize(const void *object)
+{
+	struct slqb_page *page;
+	struct kmem_cache *s;
+
+	BUG_ON(!object);
+	if (unlikely(object == ZERO_SIZE_PTR))
+		return 0;
+
+	page = virt_to_head_slqb_page(object);
+	BUG_ON(!(page->flags & PG_SLQB_BIT));
+
+	s = page->list->cache;
+
+	/*
+	 * Debugging requires use of the padding between object
+	 * and whatever may come after it.
+	 */
+	if (s->flags & (SLAB_RED_ZONE | SLAB_POISON))
+		return s->objsize;
+
+	/*
+	 * If we have the need to store the freelist pointer
+	 * back there or track user information then we can
+	 * only use the space before that information.
+	 */
+	if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
+		return s->inuse;
+
+	/*
+	 * Else we can use all the padding etc for the allocation
+	 */
+	return s->size;
+}
+EXPORT_SYMBOL(ksize);
+
+void kfree(const void *object)
+{
+	struct kmem_cache *s;
+	struct slqb_page *page;
+
+	if (unlikely(ZERO_OR_NULL_PTR(object)))
+		return;
+
+	page = virt_to_head_slqb_page(object);
+	s = page->list->cache;
+
+	slab_free(s, page, (void *)object);
+}
+EXPORT_SYMBOL(kfree);
+
+static void kmem_cache_trim_percpu(void *arg)
+{
+	int cpu = smp_processor_id();
+	struct kmem_cache *s = arg;
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+	claim_remote_free_list(s, l);
+#endif
+	flush_free_list(s, l);
+#ifdef CONFIG_SMP
+	flush_remote_free_cache(s, c);
+#endif
+}
+
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+
+	on_each_cpu(kmem_cache_trim_percpu, s, 1);
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		spin_lock_irq(&n->list_lock);
+		claim_remote_free_list(s, l);
+		flush_free_list(s, l);
+		spin_unlock_irq(&n->list_lock);
+	}
+#endif
+
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_shrink);
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void kmem_cache_reap_percpu(void *arg)
+{
+	int cpu = smp_processor_id();
+	struct kmem_cache *s;
+	long phase = (long)arg;
+
+	list_for_each_entry(s, &slab_caches, list) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		if (phase == 0) {
+			flush_free_list_all(s, l);
+			flush_remote_free_cache(s, c);
+		}
+
+		if (phase == 1) {
+			claim_remote_free_list(s, l);
+			flush_free_list_all(s, l);
+		}
+	}
+}
+
+static void kmem_cache_reap(void)
+{
+	struct kmem_cache *s;
+	int node;
+
+	down_read(&slqb_lock);
+	on_each_cpu(kmem_cache_reap_percpu, (void *)0, 1);
+	on_each_cpu(kmem_cache_reap_percpu, (void *)1, 1);
+
+	list_for_each_entry(s, &slab_caches, list) {
+		for_each_node_state(node, N_NORMAL_MEMORY) {
+			struct kmem_cache_node *n = s->node[node];
+			struct kmem_cache_list *l = &n->list;
+
+			spin_lock_irq(&n->list_lock);
+			claim_remote_free_list(s, l);
+			flush_free_list_all(s, l);
+			spin_unlock_irq(&n->list_lock);
+		}
+	}
+	up_read(&slqb_lock);
+}
+#endif
+
+static void cache_trim_worker(struct work_struct *w)
+{
+	struct delayed_work *work =
+		container_of(w, struct delayed_work, work);
+	struct kmem_cache *s;
+	int node;
+
+	if (!down_read_trylock(&slqb_lock))
+		goto out;
+
+	node = numa_node_id();
+	list_for_each_entry(s, &slab_caches, list) {
+#ifdef CONFIG_NUMA
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+
+		spin_lock_irq(&n->list_lock);
+		claim_remote_free_list(s, l);
+		flush_free_list(s, l);
+		spin_unlock_irq(&n->list_lock);
+#endif
+
+		local_irq_disable();
+		kmem_cache_trim_percpu(s);
+		local_irq_enable();
+	}
+
+	up_read(&slqb_lock);
+out:
+	schedule_delayed_work(work, round_jiffies_relative(3*HZ));
+}
+
+static DEFINE_PER_CPU(struct delayed_work, cache_trim_work);
+
+static void __cpuinit start_cpu_timer(int cpu)
+{
+	struct delayed_work *cache_trim_work = &per_cpu(cache_trim_work, cpu);
+
+	/*
+	 * When this gets called from do_initcalls via cpucache_init(),
+	 * init_workqueues() has already run, so keventd will already be
+	 * set up by then.
+	 */
+	if (keventd_up() && cache_trim_work->work.func == NULL) {
+		INIT_DELAYED_WORK(cache_trim_work, cache_trim_worker);
+		schedule_delayed_work_on(cpu, cache_trim_work,
+					__round_jiffies_relative(HZ, cpu));
+	}
+}
+
+static int __init cpucache_init(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		start_cpu_timer(cpu);
+	return 0;
+}
+__initcall(cpucache_init);
+
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void slab_mem_going_offline_callback(void *arg)
+{
+	kmem_cache_reap();
+}
+
+static void slab_mem_offline_callback(void *arg)
+{
+	struct kmem_cache *s;
+	struct memory_notify *marg = arg;
+	int nid = marg->status_change_nid;
+
+	/*
+	 * If the node still has available memory, we still need its
+	 * kmem_cache_node, so there is nothing to do here.
+	 */
+	if (nid < 0)
+		return;
+
+#if 0 // XXX: see cpu offline comment
+	down_read(&slqb_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		struct kmem_cache_node *n;
+		n = s->node[nid];
+		if (n) {
+			s->node[nid] = NULL;
+			kmem_cache_free(&kmem_node_cache, n);
+		}
+	}
+	up_read(&slqb_lock);
+#endif
+}
+
+static int slab_mem_going_online_callback(void *arg)
+{
+	struct kmem_cache *s;
+	struct kmem_cache_node *n;
+	struct memory_notify *marg = arg;
+	int nid = marg->status_change_nid;
+	int ret = 0;
+
+	/*
+	 * If the node's memory is already available, then kmem_cache_node is
+	 * already created. Nothing to do.
+	 */
+	if (nid < 0)
+		return 0;
+
+	/*
+	 * We are bringing a node online. No memory is available yet. We must
+	 * allocate a kmem_cache_node structure in order to bring the node
+	 * online.
+	 */
+	down_read(&slqb_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		/*
+		 * XXX: kmem_cache_alloc_node will fall back to other nodes
+		 *      since memory is not yet available from the node that
+		 *      is being brought up.
+		 */
+		n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
+		if (!n) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		init_kmem_cache_node(s, n);
+		s->node[nid] = n;
+	}
+out:
+	up_read(&slqb_lock);
+	return ret;
+}
+
+static int slab_memory_callback(struct notifier_block *self,
+				unsigned long action, void *arg)
+{
+	int ret = 0;
+
+	switch (action) {
+	case MEM_GOING_ONLINE:
+		ret = slab_mem_going_online_callback(arg);
+		break;
+	case MEM_GOING_OFFLINE:
+		slab_mem_going_offline_callback(arg);
+		break;
+	case MEM_OFFLINE:
+	case MEM_CANCEL_ONLINE:
+		slab_mem_offline_callback(arg);
+		break;
+	case MEM_ONLINE:
+	case MEM_CANCEL_OFFLINE:
+		break;
+	}
+
+	ret = notifier_from_errno(ret);
+	return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+/********************************************************************
+ *			Basic setup of slabs
+ *******************************************************************/
+
+void __init kmem_cache_init(void)
+{
+	int i;
+	unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
+
+#ifdef CONFIG_NUMA
+	if (num_possible_nodes() == 1)
+		numa_platform = 0;
+	else
+		numa_platform = 1;
+#endif
+
+#ifdef CONFIG_SMP
+	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
+				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#else
+	kmem_size = sizeof(struct kmem_cache);
+#endif
+
+	kmem_cache_open(&kmem_cache_cache, "kmem_cache", kmem_size,
+			0, flags, NULL, 0);
+#ifdef CONFIG_SMP
+	kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu",
+			sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_cache_open(&kmem_node_cache, "kmem_cache_node",
+			sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
+#endif
+
+#ifdef CONFIG_SMP
+	for_each_possible_cpu(i) {
+		init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cpus[i]);
+		kmem_cache_cache.cpu_slab[i] = &kmem_cache_cpus[i];
+
+		init_kmem_cache_cpu(&kmem_cpu_cache, &kmem_cpu_cpus[i]);
+		kmem_cpu_cache.cpu_slab[i] = &kmem_cpu_cpus[i];
+
+#ifdef CONFIG_NUMA
+		init_kmem_cache_cpu(&kmem_node_cache, &kmem_node_cpus[i]);
+		kmem_node_cache.cpu_slab[i] = &kmem_node_cpus[i];
+#endif
+	}
+#else
+	init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cache.cpu_slab);
+#endif
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(i, N_NORMAL_MEMORY) {
+		init_kmem_cache_node(&kmem_cache_cache, &kmem_cache_nodes[i]);
+		kmem_cache_cache.node[i] = &kmem_cache_nodes[i];
+
+		init_kmem_cache_node(&kmem_cpu_cache, &kmem_cpu_nodes[i]);
+		kmem_cpu_cache.node[i] = &kmem_cpu_nodes[i];
+
+		init_kmem_cache_node(&kmem_node_cache, &kmem_node_nodes[i]);
+		kmem_node_cache.node[i] = &kmem_node_nodes[i];
+	}
+#endif
+
+	/* Caches that are not of a power-of-two size */
+	if (L1_CACHE_BYTES < 64 && KMALLOC_MIN_SIZE <= 64) {
+		open_kmalloc_cache(&kmalloc_caches[1],
+				"kmalloc-96", 96, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[1],
+				"kmalloc_dma-96", 96, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+	if (L1_CACHE_BYTES < 128 && KMALLOC_MIN_SIZE <= 128) {
+		open_kmalloc_cache(&kmalloc_caches[2],
+				"kmalloc-192", 192, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[2],
+				"kmalloc_dma-192", 192, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+		open_kmalloc_cache(&kmalloc_caches[i],
+			"kmalloc", 1 << i, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[i],
+				"kmalloc_dma", 1 << i, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+
+	/*
+	 * Patch up the size_index table if we have strange large alignment
+	 * requirements for the kmalloc array. This is only the case for
+	 * MIPS, it seems. The standard arches will not generate any code here.
+	 *
+	 * Largest permitted alignment is 256 bytes due to the way we
+	 * handle the index determination for the smaller caches.
+	 *
+	 * Make sure that nothing crazy happens if someone starts tinkering
+	 * around with ARCH_KMALLOC_MINALIGN
+	 */
+	BUILD_BUG_ON(KMALLOC_MIN_SIZE > 256 ||
+		(KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
+
+	for (i = 8; i < KMALLOC_MIN_SIZE; i += 8)
+		size_index[(i - 1) / 8] = KMALLOC_SHIFT_LOW;
+
+	/* Provide the correct kmalloc names now that the caches are up */
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+		kmalloc_caches[i].name =
+			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
+#ifdef CONFIG_ZONE_DMA
+		kmalloc_caches_dma[i].name =
+			kasprintf(GFP_KERNEL, "kmalloc_dma-%d", 1 << i);
+#endif
+	}
+
+#ifdef CONFIG_SMP
+	register_cpu_notifier(&slab_notifier);
+#endif
+#ifdef CONFIG_NUMA
+	hotplug_memory_notifier(slab_memory_callback, 1);
+#endif
+	/*
+	 * smp_init() has not yet been called, so no worries about memory
+	 * ordering here (eg. slab_is_available vs numa_platform)
+	 */
+	__slab_is_available = 1;
+}
+
+struct kmem_cache *kmem_cache_create(const char *name, size_t size,
+		size_t align, unsigned long flags, void (*ctor)(void *))
+{
+	struct kmem_cache *s;
+
+	s = kmem_cache_alloc(&kmem_cache_cache, GFP_KERNEL);
+	if (!s)
+		goto err;
+
+	if (kmem_cache_open(s, name, size, align, flags, ctor, 1))
+		return s;
+
+	kmem_cache_free(&kmem_cache_cache, s);
+
+err:
+	if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+	return NULL;
+}
+EXPORT_SYMBOL(kmem_cache_create);
+
+#ifdef CONFIG_SMP
+/*
+ * Use the cpu notifier to ensure that the cpu slabs are flushed when
+ * necessary.
+ */
+static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
+		unsigned long action, void *hcpu)
+{
+	long cpu = (long)hcpu;
+	struct kmem_cache *s;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		down_read(&slqb_lock);
+		list_for_each_entry(s, &slab_caches, list) {
+			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu);
+			if (!s->cpu_slab[cpu]) {
+				up_read(&slqb_lock);
+				return NOTIFY_BAD;
+			}
+		}
+		up_read(&slqb_lock);
+		break;
+
+	case CPU_ONLINE:
+	case CPU_ONLINE_FROZEN:
+	case CPU_DOWN_FAILED:
+	case CPU_DOWN_FAILED_FROZEN:
+		start_cpu_timer(cpu);
+		break;
+
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		cancel_rearming_delayed_work(&per_cpu(cache_trim_work, cpu));
+		per_cpu(cache_trim_work, cpu).work.func = NULL;
+		break;
+
+	case CPU_UP_CANCELED:
+	case CPU_UP_CANCELED_FROZEN:
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+#if 0
+		down_read(&slqb_lock);
+		/* XXX: this doesn't work because objects can still be on this
+		 * CPU's list. periodic timer needs to check if a CPU is offline
+		 * and then try to cleanup from there. Same for node offline.
+		 */
+		list_for_each_entry(s, &slab_caches, list) {
+			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+			if (c) {
+				kmem_cache_free(&kmem_cpu_cache, c);
+				s->cpu_slab[cpu] = NULL;
+			}
+		}
+
+		up_read(&slqb_lock);
+#endif
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata slab_notifier = {
+	.notifier_call = slab_cpuup_callback
+};
+
+#endif
+
+#ifdef CONFIG_SLQB_DEBUG
+void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
+{
+	struct kmem_cache *s;
+	int node = -1;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+#ifdef CONFIG_NUMA
+	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+		node = alternate_nid(s, flags, node);
+#endif
+	return slab_alloc(s, flags, node, (void *)caller);
+}
+
+void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
+				unsigned long caller)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return slab_alloc(s, flags, node, (void *)caller);
+}
+#endif
+
+#if defined(CONFIG_SLQB_SYSFS) || defined(CONFIG_SLABINFO)
+struct stats_gather {
+	struct kmem_cache *s;
+	spinlock_t lock;
+	unsigned long nr_slabs;
+	unsigned long nr_partial;
+	unsigned long nr_inuse;
+	unsigned long nr_objects;
+
+#ifdef CONFIG_SLQB_STATS
+	unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+};
+
+static void __gather_stats(void *arg)
+{
+	unsigned long nr_slabs;
+	unsigned long nr_partial;
+	unsigned long nr_inuse;
+	struct stats_gather *gather = arg;
+	int cpu = smp_processor_id();
+	struct kmem_cache *s = gather->s;
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_list *l = &c->list;
+	struct slqb_page *page;
+#ifdef CONFIG_SLQB_STATS
+	int i;
+#endif
+
+	nr_slabs = l->nr_slabs;
+	nr_partial = l->nr_partial;
+	nr_inuse = (nr_slabs - nr_partial) * s->objects;
+
+	list_for_each_entry(page, &l->partial, lru) {
+		nr_inuse += page->inuse;
+	}
+
+	spin_lock(&gather->lock);
+	gather->nr_slabs += nr_slabs;
+	gather->nr_partial += nr_partial;
+	gather->nr_inuse += nr_inuse;
+#ifdef CONFIG_SLQB_STATS
+	for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+		gather->stats[i] += l->stats[i];
+	}
+#endif
+	spin_unlock(&gather->lock);
+}
+
+static void gather_stats(struct kmem_cache *s, struct stats_gather *stats)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+
+	memset(stats, 0, sizeof(struct stats_gather));
+	stats->s = s;
+	spin_lock_init(&stats->lock);
+
+	on_each_cpu(__gather_stats, stats, 1);
+
+#ifdef CONFIG_NUMA
+	for_each_online_node(node) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l = &n->list;
+		struct slqb_page *page;
+		unsigned long flags;
+#ifdef CONFIG_SLQB_STATS
+		int i;
+#endif
+
+		spin_lock_irqsave(&n->list_lock, flags);
+#ifdef CONFIG_SLQB_STATS
+		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+			stats->stats[i] += l->stats[i];
+		}
+#endif
+		stats->nr_slabs += l->nr_slabs;
+		stats->nr_partial += l->nr_partial;
+		stats->nr_inuse += (l->nr_slabs - l->nr_partial) * s->objects;
+
+		list_for_each_entry(page, &l->partial, lru) {
+			stats->nr_inuse += page->inuse;
+		}
+		spin_unlock_irqrestore(&n->list_lock, flags);
+	}
+#endif
+
+	stats->nr_objects = stats->nr_slabs * s->objects;
+}
+#endif
+
+/*
+ * The /proc/slabinfo ABI
+ */
+#ifdef CONFIG_SLABINFO
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+ssize_t slabinfo_write(struct file *file, const char __user * buffer,
+		       size_t count, loff_t *ppos)
+{
+	return -EINVAL;
+}
+
+static void print_slabinfo_header(struct seq_file *m)
+{
+	seq_puts(m, "slabinfo - version: 2.1\n");
+	seq_puts(m, "# name	    <active_objs> <num_objs> <objsize> "
+		 "<objperslab> <pagesperslab>");
+	seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
+	seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
+	seq_putc(m, '\n');
+}
+
+static void *s_start(struct seq_file *m, loff_t *pos)
+{
+	loff_t n = *pos;
+
+	down_read(&slqb_lock);
+	if (!n)
+		print_slabinfo_header(m);
+
+	return seq_list_start(&slab_caches, *pos);
+}
+
+static void *s_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	return seq_list_next(p, &slab_caches, pos);
+}
+
+static void s_stop(struct seq_file *m, void *p)
+{
+	up_read(&slqb_lock);
+}
+
+static int s_show(struct seq_file *m, void *p)
+{
+	struct stats_gather stats;
+	struct kmem_cache *s;
+
+	s = list_entry(p, struct kmem_cache, list);
+
+	gather_stats(s, &stats);
+
+	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, stats.nr_inuse,
+		   stats.nr_objects, s->size, s->objects, (1 << s->order));
+	seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s), slab_freebatch(s), 0);
+	seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs, stats.nr_slabs,
+		   0UL);
+	seq_putc(m, '\n');
+	return 0;
+}
+
+static const struct seq_operations slabinfo_op = {
+	.start = s_start,
+	.next = s_next,
+	.stop = s_stop,
+	.show = s_show,
+};
+
+static int slabinfo_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &slabinfo_op);
+}
+
+static const struct file_operations proc_slabinfo_operations = {
+	.open		= slabinfo_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static int __init slab_proc_init(void)
+{
+	proc_create("slabinfo",S_IWUSR|S_IRUGO,NULL,&proc_slabinfo_operations);
+	return 0;
+}
+module_init(slab_proc_init);
+#endif /* CONFIG_SLABINFO */
+
+#ifdef CONFIG_SLQB_SYSFS
+/*
+ * sysfs API
+ */
+#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
+#define to_slab(n) container_of(n, struct kmem_cache, kobj)
+
+struct slab_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct kmem_cache *s, char *buf);
+	ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
+};
+
+#define SLAB_ATTR_RO(_name) \
+	static struct slab_attribute _name##_attr = __ATTR_RO(_name)
+
+#define SLAB_ATTR(_name) \
+	static struct slab_attribute _name##_attr =  \
+	__ATTR(_name, 0644, _name##_show, _name##_store)
+
+static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->size);
+}
+SLAB_ATTR_RO(slab_size);
+
+static ssize_t align_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->align);
+}
+SLAB_ATTR_RO(align);
+
+static ssize_t object_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->objsize);
+}
+SLAB_ATTR_RO(object_size);
+
+static ssize_t objs_per_slab_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->objects);
+}
+SLAB_ATTR_RO(objs_per_slab);
+
+static ssize_t order_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->order);
+}
+SLAB_ATTR_RO(order);
+
+static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+{
+	if (s->ctor) {
+		int n = sprint_symbol(buf, (unsigned long)s->ctor);
+
+		return n + sprintf(buf + n, "\n");
+	}
+	return 0;
+}
+SLAB_ATTR_RO(ctor);
+
+static ssize_t slabs_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+	gather_stats(s, &stats);
+	return sprintf(buf, "%lu\n", stats.nr_slabs);
+}
+SLAB_ATTR_RO(slabs);
+
+static ssize_t objects_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+	gather_stats(s, &stats);
+	return sprintf(buf, "%lu\n", stats.nr_inuse);
+}
+SLAB_ATTR_RO(objects);
+
+static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+	gather_stats(s, &stats);
+	return sprintf(buf, "%lu\n", stats.nr_objects);
+}
+SLAB_ATTR_RO(total_objects);
+
+static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
+}
+SLAB_ATTR_RO(reclaim_account);
+
+static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_HWCACHE_ALIGN));
+}
+SLAB_ATTR_RO(hwcache_align);
+
+#ifdef CONFIG_ZONE_DMA
+static ssize_t cache_dma_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA));
+}
+SLAB_ATTR_RO(cache_dma);
+#endif
+
+static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
+}
+SLAB_ATTR_RO(destroy_by_rcu);
+
+static ssize_t red_zone_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE));
+}
+SLAB_ATTR_RO(red_zone);
+
+static ssize_t poison_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_POISON));
+}
+SLAB_ATTR_RO(poison);
+
+static ssize_t store_user_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_STORE_USER));
+}
+SLAB_ATTR_RO(store_user);
+
+static ssize_t hiwater_store(struct kmem_cache *s, const char *buf, size_t length)
+{
+	long hiwater;
+	int err;
+
+	err = strict_strtol(buf, 10, &hiwater);
+	if (err)
+		return err;
+
+	if (hiwater < 0)
+		return -EINVAL;
+
+	s->hiwater = hiwater;
+
+	return length;
+}
+
+static ssize_t hiwater_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", slab_hiwater(s));
+}
+SLAB_ATTR(hiwater);
+
+static ssize_t freebatch_store(struct kmem_cache *s, const char *buf, size_t length)
+{
+	long freebatch;
+	int err;
+
+	err = strict_strtol(buf, 10, &freebatch);
+	if (err)
+		return err;
+
+	if (freebatch <= 0 || freebatch - 1 > s->hiwater)
+		return -EINVAL;
+
+	s->freebatch = freebatch;
+
+	return length;
+}
+
+static ssize_t freebatch_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", slab_freebatch(s));
+}
+SLAB_ATTR(freebatch);
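+
+/*
+ * The two writable attributes above can be tuned at runtime through sysfs,
+ * e.g. (hypothetical cache name):
+ *
+ *	echo 2048 > /sys/kernel/slab/kmalloc-64/hiwater
+ *	echo 512  > /sys/kernel/slab/kmalloc-64/freebatch
+ *
+ * freebatch must be positive and no more than hiwater + 1, as enforced by
+ * freebatch_store() above.
+ */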
+#ifdef CONFIG_SLQB_STATS
+static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
+{
+	struct stats_gather stats;
+	int len;
+#ifdef CONFIG_SMP
+	int cpu;
+#endif
+
+	gather_stats(s, &stats);
+
+	len = sprintf(buf, "%lu", stats.stats[si]);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+		if (len < PAGE_SIZE - 20)
+			len += sprintf(buf + len, " C%d=%lu", cpu, l->stats[si]);
+	}
+#endif
+	return len + sprintf(buf + len, "\n");
+}
+
+#define STAT_ATTR(si, text) 					\
+static ssize_t text##_show(struct kmem_cache *s, char *buf)	\
+{								\
+	return show_stat(s, buf, si);				\
+}								\
+SLAB_ATTR_RO(text)
+
+STAT_ATTR(ALLOC, alloc);
+STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
+STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
+STAT_ATTR(FREE, free);
+STAT_ATTR(FREE_REMOTE, free_remote);
+STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
+STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
+STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
+STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
+STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
+STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
+STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
+STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
+STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
+#endif
+
+static struct attribute *slab_attrs[] = {
+	&slab_size_attr.attr,
+	&object_size_attr.attr,
+	&objs_per_slab_attr.attr,
+	&order_attr.attr,
+	&objects_attr.attr,
+	&total_objects_attr.attr,
+	&slabs_attr.attr,
+	&ctor_attr.attr,
+	&align_attr.attr,
+	&hwcache_align_attr.attr,
+	&reclaim_account_attr.attr,
+	&destroy_by_rcu_attr.attr,
+	&red_zone_attr.attr,
+	&poison_attr.attr,
+	&store_user_attr.attr,
+	&hiwater_attr.attr,
+	&freebatch_attr.attr,
+#ifdef CONFIG_ZONE_DMA
+	&cache_dma_attr.attr,
+#endif
+#ifdef CONFIG_SLQB_STATS
+	&alloc_attr.attr,
+	&alloc_slab_fill_attr.attr,
+	&alloc_slab_new_attr.attr,
+	&free_attr.attr,
+	&free_remote_attr.attr,
+	&flush_free_list_attr.attr,
+	&flush_free_list_objects_attr.attr,
+	&flush_free_list_remote_attr.attr,
+	&flush_slab_partial_attr.attr,
+	&flush_slab_free_attr.attr,
+	&flush_rfree_list_attr.attr,
+	&flush_rfree_list_objects_attr.attr,
+	&claim_remote_list_attr.attr,
+	&claim_remote_list_objects_attr.attr,
+#endif
+	NULL
+};
+
+static struct attribute_group slab_attr_group = {
+	.attrs = slab_attrs,
+};
+
+static ssize_t slab_attr_show(struct kobject *kobj,
+				struct attribute *attr,
+				char *buf)
+{
+	struct slab_attribute *attribute;
+	struct kmem_cache *s;
+	int err;
+
+	attribute = to_slab_attr(attr);
+	s = to_slab(kobj);
+
+	if (!attribute->show)
+		return -EIO;
+
+	err = attribute->show(s, buf);
+
+	return err;
+}
+
+static ssize_t slab_attr_store(struct kobject *kobj,
+				struct attribute *attr,
+				const char *buf, size_t len)
+{
+	struct slab_attribute *attribute;
+	struct kmem_cache *s;
+	int err;
+
+	attribute = to_slab_attr(attr);
+	s = to_slab(kobj);
+
+	if (!attribute->store)
+		return -EIO;
+
+	err = attribute->store(s, buf, len);
+
+	return err;
+}
+
+static void kmem_cache_release(struct kobject *kobj)
+{
+	struct kmem_cache *s = to_slab(kobj);
+
+	kmem_cache_free(&kmem_cache_cache, s);
+}
+
+static struct sysfs_ops slab_sysfs_ops = {
+	.show = slab_attr_show,
+	.store = slab_attr_store,
+};
+
+static struct kobj_type slab_ktype = {
+	.sysfs_ops = &slab_sysfs_ops,
+	.release = kmem_cache_release
+};
+
+static int uevent_filter(struct kset *kset, struct kobject *kobj)
+{
+	struct kobj_type *ktype = get_ktype(kobj);
+
+	if (ktype == &slab_ktype)
+		return 1;
+	return 0;
+}
+
+static struct kset_uevent_ops slab_uevent_ops = {
+	.filter = uevent_filter,
+};
+
+static struct kset *slab_kset;
+
+static int sysfs_available __read_mostly = 0;
+
+static int sysfs_slab_add(struct kmem_cache *s)
+{
+	int err;
+
+	if (!sysfs_available)
+		return 0;
+
+	s->kobj.kset = slab_kset;
+	err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, s->name);
+	if (err) {
+		kobject_put(&s->kobj);
+		return err;
+	}
+
+	err = sysfs_create_group(&s->kobj, &slab_attr_group);
+	if (err)
+		return err;
+	kobject_uevent(&s->kobj, KOBJ_ADD);
+
+	return 0;
+}
+
+static void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kobject_uevent(&s->kobj, KOBJ_REMOVE);
+	kobject_del(&s->kobj);
+	kobject_put(&s->kobj);
+}
+
+static int __init slab_sysfs_init(void)
+{
+	struct kmem_cache *s;
+	int err;
+
+	slab_kset = kset_create_and_add("slab", &slab_uevent_ops, kernel_kobj);
+	if (!slab_kset) {
+		printk(KERN_ERR "Cannot register slab subsystem.\n");
+		return -ENOSYS;
+	}
+
+	down_write(&slqb_lock);
+	sysfs_available = 1;
+	list_for_each_entry(s, &slab_caches, list) {
+		err = sysfs_slab_add(s);
+		if (err)
+			printk(KERN_ERR "SLQB: Unable to add boot slab %s"
+						" to sysfs\n", s->name);
+	}
+	up_write(&slqb_lock);
+
+	return 0;
+}
+
+__initcall(slab_sysfs_init);
+#endif
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -150,6 +150,8 @@ size_t ksize(const void *);
  */
 #ifdef CONFIG_SLUB
 #include <linux/slub_def.h>
+#elif defined(CONFIG_SLQB)
+#include <linux/slqb_def.h>
 #elif defined(CONFIG_SLOB)
 #include <linux/slob_def.h>
 #else
@@ -252,7 +254,7 @@ static inline void *kmem_cache_alloc_nod
  * allocator where we care about the real place the memory allocation
  * request comes from.
  */
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
 extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
 #define kmalloc_track_caller(size, flags) \
 	__kmalloc_track_caller(size, flags, _RET_IP_)
@@ -270,7 +272,7 @@ extern void *__kmalloc_track_caller(size
  * standard allocator where we care about the real place the memory
  * allocation request comes from.
  */
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
 extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long);
 #define kmalloc_node_track_caller(size, flags, node) \
 	__kmalloc_node_track_caller(size, flags, node, \
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
+obj-$(CONFIG_SLQB) += slqb.o
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
Index: linux-2.6/include/linux/rcu_types.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rcu_types.h
@@ -0,0 +1,18 @@
+#ifndef __LINUX_RCU_TYPES_H
+#define __LINUX_RCU_TYPES_H
+
+#ifdef __KERNEL__
+
+/**
+ * struct rcu_head - callback structure for use with RCU
+ * @next: next update requests in a list
+ * @func: actual update function to call after the grace period.
+ */
+struct rcu_head {
+	struct rcu_head *next;
+	void (*func)(struct rcu_head *head);
+};
+
+#endif /* __KERNEL__ */
+
+#endif /* __LINUX_RCU_TYPES_H */
Index: linux-2.6/arch/x86/include/asm/page.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/page.h
+++ linux-2.6/arch/x86/include/asm/page.h
@@ -194,6 +194,7 @@ static inline pteval_t native_pte_flags(
  * virt_addr_valid(kaddr) returns true.
  */
 #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
+#define virt_to_page_fast(kaddr) pfn_to_page(((unsigned long)(kaddr) - PAGE_OFFSET) >> PAGE_SHIFT)
 #define pfn_to_kaddr(pfn)      __va((pfn) << PAGE_SHIFT)
 extern bool __virt_addr_valid(unsigned long kaddr);
 #define virt_addr_valid(kaddr)	__virt_addr_valid((unsigned long) (kaddr))
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -306,7 +306,11 @@ static inline void get_page(struct page
 
 static inline struct page *virt_to_head_page(const void *x)
 {
+#ifdef virt_to_page_fast
+	struct page *page = virt_to_page_fast(x);
+#else
 	struct page *page = virt_to_page(x);
+#endif
 	return compound_head(page);
 }
 
Index: linux-2.6/Documentation/vm/slqbinfo.c
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/vm/slqbinfo.c
@@ -0,0 +1,1054 @@
+/*
+ * Slabinfo: Tool to get reports about slabs
+ *
+ * (C) 2007 sgi, Christoph Lameter
+ *
+ * Reworked by Lin Ming <ming.m.lin@intel.com> for SLQB
+ *
+ * Compile by:
+ *
+ * gcc -o slqbinfo slqbinfo.c
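+ *
+ * Example invocations:
+ *
+ *	./slqbinfo		# one line per active cache
+ *	./slqbinfo -T		# summary totals across all caches
+ *	./slqbinfo -n kmalloc	# NUMA breakdown for caches matching "kmalloc"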
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <strings.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <getopt.h>
+#include <regex.h>
+#include <errno.h>
+
+#define MAX_SLABS 500
+#define MAX_ALIASES 500
+#define MAX_NODES 1024
+
+struct slabinfo {
+	char *name;
+	int align, cache_dma, destroy_by_rcu;
+	int hwcache_align, object_size, objs_per_slab;
+	int slab_size, store_user;
+	int order, poison, reclaim_account, red_zone;
+	int batch;
+	unsigned long objects, slabs, total_objects;
+	unsigned long alloc, alloc_slab_fill, alloc_slab_new;
+	unsigned long free, free_remote;
+	unsigned long claim_remote_list, claim_remote_list_objects;
+	unsigned long flush_free_list, flush_free_list_objects, flush_free_list_remote;
+	unsigned long flush_rfree_list, flush_rfree_list_objects;
+	unsigned long flush_slab_free, flush_slab_partial;
+	int numa[MAX_NODES];
+	int numa_partial[MAX_NODES];
+} slabinfo[MAX_SLABS];
+
+int slabs = 0;
+int actual_slabs = 0;
+int highest_node = 0;
+
+char buffer[4096];
+
+int show_empty = 0;
+int show_report = 0;
+int show_slab = 0;
+int skip_zero = 1;
+int show_numa = 0;
+int show_track = 0;
+int validate = 0;
+int shrink = 0;
+int show_inverted = 0;
+int show_totals = 0;
+int sort_size = 0;
+int sort_active = 0;
+int set_debug = 0;
+int show_ops = 0;
+int show_activity = 0;
+
+/* Debug options */
+int sanity = 0;
+int redzone = 0;
+int poison = 0;
+int tracking = 0;
+int tracing = 0;
+
+int page_size;
+
+regex_t pattern;
+
+void fatal(const char *x, ...)
+{
+	va_list ap;
+
+	va_start(ap, x);
+	vfprintf(stderr, x, ap);
+	va_end(ap);
+	exit(EXIT_FAILURE);
+}
+
+void usage(void)
+{
+	printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
+		"slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+		"-A|--activity          Most active slabs first\n"
+		"-d<options>|--debug=<options> Set/Clear Debug options\n"
+		"-D|--display-active    Switch line format to activity\n"
+		"-e|--empty             Show empty slabs\n"
+		"-h|--help              Show usage information\n"
+		"-i|--inverted          Inverted list\n"
+		"-l|--slabs             Show slabs\n"
+		"-n|--numa              Show NUMA information\n"
+		"-o|--ops		Show kmem_cache_ops\n"
+		"-s|--shrink            Shrink slabs\n"
+		"-r|--report		Detailed report on single slabs\n"
+		"-S|--Size              Sort by size\n"
+		"-t|--tracking          Show alloc/free information\n"
+		"-T|--Totals            Show summary information\n"
+		"-v|--validate          Validate slabs\n"
+		"-z|--zero              Include empty slabs\n"
+		"\nValid debug options (FZPUT may be combined)\n"
+		"a / A          Switch on all debug options (=FZUP)\n"
+		"-              Switch off all debug options\n"
+		"f / F          Sanity Checks (SLAB_DEBUG_FREE)\n"
+		"z / Z          Redzoning\n"
+		"p / P          Poisoning\n"
+		"u / U          Tracking\n"
+		"t / T          Tracing\n"
+	);
+}
+
+unsigned long read_obj(const char *name)
+{
+	FILE *f = fopen(name, "r");
+
+	if (!f)
+		buffer[0] = 0;
+	else {
+		if (!fgets(buffer, sizeof(buffer), f))
+			buffer[0] = 0;
+		fclose(f);
+		if (strlen(buffer) && buffer[strlen(buffer) - 1] == '\n')
+			buffer[strlen(buffer) - 1] = 0;
+	}
+	return strlen(buffer);
+}
+
+
+/*
+ * Get the contents of an attribute
+ */
+unsigned long get_obj(const char *name)
+{
+	if (!read_obj(name))
+		return 0;
+
+	return atol(buffer);
+}
+
+unsigned long get_obj_and_str(const char *name, char **x)
+{
+	unsigned long result = 0;
+	char *p;
+
+	*x = NULL;
+
+	if (!read_obj(name))
+		return 0;
+	result = strtoul(buffer, &p, 10);
+	while (*p == ' ')
+		p++;
+	if (*p)
+		*x = strdup(p);
+	return result;
+}
+
+void set_obj(struct slabinfo *s, const char *name, int n)
+{
+	char x[100];
+	FILE *f;
+
+	snprintf(x, 100, "%s/%s", s->name, name);
+	f = fopen(x, "w");
+	if (!f)
+		fatal("Cannot write to %s\n", x);
+
+	fprintf(f, "%d\n", n);
+	fclose(f);
+}
+
+unsigned long read_slab_obj(struct slabinfo *s, const char *name)
+{
+	char x[100];
+	FILE *f;
+	size_t l;
+
+	snprintf(x, 100, "%s/%s", s->name, name);
+	f = fopen(x, "r");
+	if (!f) {
+		buffer[0] = 0;
+		l = 0;
+	} else {
+		l = fread(buffer, 1, sizeof(buffer), f);
+		buffer[l] = 0;
+		fclose(f);
+	}
+	return l;
+}
+
+
+/*
+ * Put a size string together
+ */
+int store_size(char *buffer, unsigned long value)
+{
+	unsigned long divisor = 1;
+	char trailer = 0;
+	int n;
+
+	if (value > 1000000000UL) {
+		divisor = 100000000UL;
+		trailer = 'G';
+	} else if (value > 1000000UL) {
+		divisor = 100000UL;
+		trailer = 'M';
+	} else if (value > 1000UL) {
+		divisor = 100;
+		trailer = 'K';
+	}
+
+	value /= divisor;
+	n = sprintf(buffer, "%lu", value);
+	if (trailer) {
+		buffer[n] = trailer;
+		n++;
+		buffer[n] = 0;
+	}
+	if (divisor != 1) {
+		memmove(buffer + n - 2, buffer + n - 3, 4);
+		buffer[n-2] = '.';
+		n++;
+	}
+	return n;
+}
+
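+/*
+ * Parse the "N<node>=<count> ..." tail of a sysfs attribute into a
+ * per-node count array, tracking the highest node seen.
+ */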
+void decode_numa_list(int *numa, char *t)
+{
+	int node;
+	int nr;
+
+	memset(numa, 0, MAX_NODES * sizeof(int));
+
+	if (!t)
+		return;
+
+	while (*t == 'N') {
+		t++;
+		node = strtoul(t, &t, 10);
+		if (*t == '=') {
+			t++;
+			nr = strtoul(t, &t, 10);
+			numa[node] = nr;
+			if (node > highest_node)
+				highest_node = node;
+		}
+		while (*t == ' ')
+			t++;
+	}
+}
+
+void slab_validate(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	set_obj(s, "validate", 1);
+}
+
+void slab_shrink(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	set_obj(s, "shrink", 1);
+}
+
+int line = 0;
+
+void first_line(void)
+{
+	if (show_activity)
+		printf("Name                   Objects      Alloc       Free   %%Fill %%New  "
+			"FlushR %%FlushR FlushR_Objs O\n");
+	else
+		printf("Name                   Objects Objsize    Space "
+			" O/S O %%Ef Batch Flg\n");
+}
+
+unsigned long slab_size(struct slabinfo *s)
+{
+	return s->slabs * (page_size << s->order);
+}
+
+unsigned long slab_activity(struct slabinfo *s)
+{
+	return s->alloc + s->free;
+}
+
+void slab_numa(struct slabinfo *s, int mode)
+{
+	int node;
+
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (!highest_node) {
+		printf("\n%s: No NUMA information available.\n", s->name);
+		return;
+	}
+
+	if (skip_zero && !s->slabs)
+		return;
+
+	if (!line) {
+		printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+		for(node = 0; node <= highest_node; node++)
+			printf(" %4d", node);
+		printf("\n----------------------");
+		for(node = 0; node <= highest_node; node++)
+			printf("-----");
+		printf("\n");
+	}
+	printf("%-21s ", mode ? "All slabs" : s->name);
+	for(node = 0; node <= highest_node; node++) {
+		char b[20];
+
+		store_size(b, s->numa[node]);
+		printf(" %4s", b);
+	}
+	printf("\n");
+	if (mode) {
+		printf("%-21s ", "Partial slabs");
+		for(node = 0; node <= highest_node; node++) {
+			char b[20];
+
+			store_size(b, s->numa_partial[node]);
+			printf(" %4s", b);
+		}
+		printf("\n");
+	}
+	line++;
+}
+
+void show_tracking(struct slabinfo *s)
+{
+	printf("\n%s: Kernel object allocation\n", s->name);
+	printf("-----------------------------------------------------------------------\n");
+	if (read_slab_obj(s, "alloc_calls"))
+		printf("%s", buffer);
+	else
+		printf("No Data\n");
+
+	printf("\n%s: Kernel object freeing\n", s->name);
+	printf("------------------------------------------------------------------------\n");
+	if (read_slab_obj(s, "free_calls"))
+		printf("%s", buffer);
+	else
+		printf("No Data\n");
+
+}
+
+void ops(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (read_slab_obj(s, "ops")) {
+		printf("\n%s: kmem_cache operations\n", s->name);
+		printf("--------------------------------------------\n");
+		printf("%s", buffer);
+	} else
+		printf("\n%s has no kmem_cache operations\n", s->name);
+}
+
+const char *onoff(int x)
+{
+	if (x)
+		return "On ";
+	return "Off";
+}
+
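+/*
+ * Print the per-cache event counters (allocations, frees, remote frees
+ * and list flushes) read from the cache's sysfs directory.
+ */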
+void slab_stats(struct slabinfo *s)
+{
+	unsigned long total_alloc;
+	unsigned long total_free;
+	unsigned long total;
+
+	total_alloc = s->alloc;
+	total_free = s->free;
+
+	if (!total_alloc)
+		return;
+
+	printf("\n");
+	printf("Slab Perf Counter\n");
+	printf("------------------------------------------------------------------------\n");
+	printf("Alloc: %8lu, partial %8lu, page allocator %8lu\n",
+		total_alloc,
+		s->alloc_slab_fill, s->alloc_slab_new);
+	printf("Free:  %8lu, partial %8lu, page allocator %8lu, remote %5lu\n",
+		total_free,
+		s->flush_slab_partial,
+		s->flush_slab_free,
+		s->free_remote);
+	printf("Claim: %8lu, objects %8lu\n",
+		s->claim_remote_list,
+		s->claim_remote_list_objects);
+	printf("Flush: %8lu, objects %8lu, remote: %8lu\n",
+		s->flush_free_list,
+		s->flush_free_list_objects,
+		s->flush_free_list_remote);
+	printf("FlushR:%8lu, objects %8lu\n",
+		s->flush_rfree_list,
+		s->flush_rfree_list_objects);
+}
+
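+/*
+ * Detailed report on a single cache: sizes, debug settings, NUMA
+ * distribution and event counters.
+ */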
+void report(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	printf("\nSlabcache: %-20s  Order : %2d Objects: %lu\n",
+		s->name, s->order, s->objects);
+	if (s->hwcache_align)
+		printf("** Hardware cacheline aligned\n");
+	if (s->cache_dma)
+		printf("** Memory is allocated in a special DMA zone\n");
+	if (s->destroy_by_rcu)
+		printf("** Slabs are destroyed via RCU\n");
+	if (s->reclaim_account)
+		printf("** Reclaim accounting active\n");
+
+	printf("\nSizes (bytes)     Slabs              Debug                Memory\n");
+	printf("------------------------------------------------------------------------\n");
+	printf("Object : %7d  Total  : %7ld   Sanity Checks : %s  Total: %7ld\n",
+			s->object_size, s->slabs, "N/A",
+			s->slabs * (page_size << s->order));
+	printf("SlabObj: %7d  Full   : %7s   Redzoning     : %s  Used : %7ld\n",
+			s->slab_size, "N/A",
+			onoff(s->red_zone), s->objects * s->object_size);
+	printf("SlabSiz: %7d  Partial: %7s   Poisoning     : %s  Loss : %7ld\n",
+			page_size << s->order, "N/A", onoff(s->poison),
+			s->slabs * (page_size << s->order) - s->objects * s->object_size);
+	printf("Loss   : %7d  CpuSlab: %7s   Tracking      : %s  Lalig: %7ld\n",
+			s->slab_size - s->object_size, "N/A", onoff(s->store_user),
+			(s->slab_size - s->object_size) * s->objects);
+	printf("Align  : %7d  Objects: %7d   Tracing       : %s  Lpadd: %7ld\n",
+			s->align, s->objs_per_slab, "N/A",
+			((page_size << s->order) - s->objs_per_slab * s->slab_size) *
+			s->slabs);
+
+	ops(s);
+	show_tracking(s);
+	slab_numa(s, 1);
+	slab_stats(s);
+}
+
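+/*
+ * Print the one-line summary for a cache.  With -D the activity oriented
+ * format is used instead, and a single matching cache gets the full report.
+ */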
+void slabcache(struct slabinfo *s)
+{
+	char size_str[20];
+	char flags[20];
+	char *p = flags;
+
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (actual_slabs == 1) {
+		report(s);
+		return;
+	}
+
+	if (skip_zero && !show_empty && !s->slabs)
+		return;
+
+	if (show_empty && s->slabs)
+		return;
+
+	store_size(size_str, slab_size(s));
+
+	if (!line++)
+		first_line();
+
+	if (s->cache_dma)
+		*p++ = 'd';
+	if (s->hwcache_align)
+		*p++ = 'A';
+	if (s->poison)
+		*p++ = 'P';
+	if (s->reclaim_account)
+		*p++ = 'a';
+	if (s->red_zone)
+		*p++ = 'Z';
+	if (s->store_user)
+		*p++ = 'U';
+
+	*p = 0;
+	if (show_activity) {
+		unsigned long total_alloc;
+		unsigned long total_free;
+
+		total_alloc = s->alloc;
+		total_free = s->free;
+
+		printf("%-21s %8ld %10ld %10ld %5ld %5ld %7ld %5d %7ld %8d\n",
+			s->name, s->objects,
+			total_alloc, total_free,
+			total_alloc ? (s->alloc_slab_fill * 100 / total_alloc) : 0,
+			total_alloc ? (s->alloc_slab_new * 100 / total_alloc) : 0,
+			s->flush_rfree_list,
+			(total_alloc + total_free) ? (s->flush_rfree_list * 100 /
+				(total_alloc + total_free)) : 0,
+			s->flush_rfree_list_objects,
+			s->order);
+	}
+	else
+		printf("%-21s %8ld %7d %8s %4d %1d %3ld %4ld %s\n",
+			s->name, s->objects, s->object_size, size_str,
+			s->objs_per_slab, s->order,
+			s->slabs ? (s->objects * s->object_size * 100) /
+				(s->slabs * (page_size << s->order)) : 100,
+			s->batch, flags);
+}
+
+/*
+ * Analyze debug options. Return false if something is amiss.
+ */
+int debug_opt_scan(char *opt)
+{
+	if (!opt || !opt[0] || strcmp(opt, "-") == 0)
+		return 1;
+
+	if (strcasecmp(opt, "a") == 0) {
+		sanity = 1;
+		poison = 1;
+		redzone = 1;
+		tracking = 1;
+		return 1;
+	}
+
+	for ( ; *opt; opt++)
+	 	switch (*opt) {
+		case 'F' : case 'f':
+			if (sanity)
+				return 0;
+			sanity = 1;
+			break;
+		case 'P' : case 'p':
+			if (poison)
+				return 0;
+			poison = 1;
+			break;
+
+		case 'Z' : case 'z':
+			if (redzone)
+				return 0;
+			redzone = 1;
+			break;
+
+		case 'U' : case 'u':
+			if (tracking)
+				return 0;
+			tracking = 1;
+			break;
+
+		case 'T' : case 't':
+			if (tracing)
+				return 0;
+			tracing = 1;
+			break;
+		default:
+			return 0;
+		}
+	return 1;
+}
+
+int slab_empty(struct slabinfo *s)
+{
+	if (s->objects > 0)
+		return 0;
+
+	/*
+	 * We may still have slabs even if there are no objects. Shrinking will
+	 * remove them.
+	 */
+	if (s->slabs != 0)
+		set_obj(s, "shrink", 1);
+
+	return 1;
+}
+
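+/*
+ * Apply the -d debug option changes through the cache's sysfs attributes.
+ * A cache must be empty before its debug settings can be changed.
+ */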
+void slab_debug(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (redzone && !s->red_zone) {
+		if (slab_empty(s))
+			set_obj(s, "red_zone", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
+	}
+	if (!redzone && s->red_zone) {
+		if (slab_empty(s))
+			set_obj(s, "red_zone", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
+	}
+	if (poison && !s->poison) {
+		if (slab_empty(s))
+			set_obj(s, "poison", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
+	}
+	if (!poison && s->poison) {
+		if (slab_empty(s))
+			set_obj(s, "poison", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
+	}
+	if (tracking && !s->store_user) {
+		if (slab_empty(s))
+			set_obj(s, "store_user", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
+	}
+	if (!tracking && s->store_user) {
+		if (slab_empty(s))
+			set_obj(s, "store_user", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
+	}
+}
+
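+/*
+ * Aggregate minimum/maximum/average statistics across all caches for the
+ * -T summary.
+ */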
+void totals(void)
+{
+	struct slabinfo *s;
+
+	int used_slabs = 0;
+	char b1[20], b2[20], b3[20], b4[20];
+	unsigned long long max = 1ULL << 63;
+
+	/* Object size */
+	unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
+
+	/* Number of partial slabs in a slabcache */
+	unsigned long long min_partial = max, max_partial = 0,
+				avg_partial, total_partial = 0;
+
+	/* Number of slabs in a slab cache */
+	unsigned long long min_slabs = max, max_slabs = 0,
+				avg_slabs, total_slabs = 0;
+
+	/* Size of the whole slab */
+	unsigned long long min_size = max, max_size = 0,
+				avg_size, total_size = 0;
+
+	/* Bytes used for object storage in a slab */
+	unsigned long long min_used = max, max_used = 0,
+				avg_used, total_used = 0;
+
+	/* Waste: Bytes used for alignment and padding */
+	unsigned long long min_waste = max, max_waste = 0,
+				avg_waste, total_waste = 0;
+	/* Number of objects in a slab */
+	unsigned long long min_objects = max, max_objects = 0,
+				avg_objects, total_objects = 0;
+	/* Waste per object */
+	unsigned long long min_objwaste = max,
+				max_objwaste = 0, avg_objwaste,
+				total_objwaste = 0;
+
+	/* Memory per object */
+	unsigned long long min_memobj = max,
+				max_memobj = 0, avg_memobj,
+				total_objsize = 0;
+
+	for (s = slabinfo; s < slabinfo + slabs; s++) {
+		unsigned long long size;
+		unsigned long used;
+		unsigned long long wasted;
+		unsigned long long objwaste;
+
+		if (!s->slabs || !s->objects)
+			continue;
+
+		used_slabs++;
+
+		size = slab_size(s);
+		used = s->objects * s->object_size;
+		wasted = size - used;
+		objwaste = s->slab_size - s->object_size;
+
+		if (s->object_size < min_objsize)
+			min_objsize = s->object_size;
+		if (s->slabs < min_slabs)
+			min_slabs = s->slabs;
+		if (size < min_size)
+			min_size = size;
+		if (wasted < min_waste)
+			min_waste = wasted;
+		if (objwaste < min_objwaste)
+			min_objwaste = objwaste;
+		if (s->objects < min_objects)
+			min_objects = s->objects;
+		if (used < min_used)
+			min_used = used;
+		if (s->slab_size < min_memobj)
+			min_memobj = s->slab_size;
+
+		if (s->object_size > max_objsize)
+			max_objsize = s->object_size;
+		if (s->slabs > max_slabs)
+			max_slabs = s->slabs;
+		if (size > max_size)
+			max_size = size;
+		if (wasted > max_waste)
+			max_waste = wasted;
+		if (objwaste > max_objwaste)
+			max_objwaste = objwaste;
+		if (s->objects > max_objects)
+			max_objects = s->objects;
+		if (used > max_used)
+			max_used = used;
+		if (s->slab_size > max_memobj)
+			max_memobj = s->slab_size;
+
+		total_slabs += s->slabs;
+		total_size += size;
+		total_waste += wasted;
+
+		total_objects += s->objects;
+		total_used += used;
+
+		total_objwaste += s->objects * objwaste;
+		total_objsize += s->objects * s->slab_size;
+	}
+
+	if (!total_objects) {
+		printf("No objects\n");
+		return;
+	}
+	if (!used_slabs) {
+		printf("No slabs\n");
+		return;
+	}
+
+	/* Per slab averages */
+	avg_slabs = total_slabs / used_slabs;
+	avg_size = total_size / used_slabs;
+	avg_waste = total_waste / used_slabs;
+
+	avg_objects = total_objects / used_slabs;
+	avg_used = total_used / used_slabs;
+
+	/* Per object object sizes */
+	avg_objsize = total_used / total_objects;
+	avg_objwaste = total_objwaste / total_objects;
+	avg_memobj = total_objsize / total_objects;
+
+	printf("Slabcache Totals\n");
+	printf("----------------\n");
+	printf("Slabcaches : %3d      Active: %3d\n",
+			slabs, used_slabs);
+
+	store_size(b1, total_size);store_size(b2, total_waste);
+	store_size(b3, total_waste * 100 / total_used);
+	printf("Memory used: %6s   # Loss   : %6s   MRatio:%6s%%\n", b1, b2, b3);
+
+	store_size(b1, total_objects);
+	printf("# Objects  : %6s\n", b1);
+
+	printf("\n");
+	printf("Per Cache    Average         Min         Max       Total\n");
+	printf("---------------------------------------------------------\n");
+
+	store_size(b1, avg_objects);store_size(b2, min_objects);
+	store_size(b3, max_objects);store_size(b4, total_objects);
+	printf("#Objects  %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_slabs);store_size(b2, min_slabs);
+	store_size(b3, max_slabs);store_size(b4, total_slabs);
+	printf("#Slabs    %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_size);store_size(b2, min_size);
+	store_size(b3, max_size);store_size(b4, total_size);
+	printf("Memory    %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_used);store_size(b2, min_used);
+	store_size(b3, max_used);store_size(b4, total_used);
+	printf("Used      %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_waste);store_size(b2, min_waste);
+	store_size(b3, max_waste);store_size(b4, total_waste);
+	printf("Loss      %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	printf("\n");
+	printf("Per Object   Average         Min         Max\n");
+	printf("---------------------------------------------\n");
+
+	store_size(b1, avg_memobj);store_size(b2, min_memobj);
+	store_size(b3, max_memobj);
+	printf("Memory    %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+	store_size(b1, avg_objsize);store_size(b2, min_objsize);
+	store_size(b3, max_objsize);
+	printf("User      %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+
+	store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
+	store_size(b3, max_objwaste);
+	printf("Loss      %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+}
+
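+/*
+ * Sort the slabinfo array by size (-S), activity (-A) or name using a
+ * simple exchange sort; -i inverts the order.
+ */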
+void sort_slabs(void)
+{
+	struct slabinfo *s1,*s2;
+
+	for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
+		for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
+			int result;
+
+			if (sort_size)
+				result = slab_size(s1) < slab_size(s2);
+			else if (sort_active)
+				result = slab_activity(s1) < slab_activity(s2);
+			else
+				result = strcasecmp(s1->name, s2->name);
+
+			if (show_inverted)
+				result = -result;
+
+			if (result > 0) {
+				struct slabinfo t;
+
+				memcpy(&t, s1, sizeof(struct slabinfo));
+				memcpy(s1, s2, sizeof(struct slabinfo));
+				memcpy(s2, &t, sizeof(struct slabinfo));
+			}
+		}
+	}
+}
+
+int slab_mismatch(char *slab)
+{
+	return regexec(&pattern, slab, 0, NULL, 0);
+}
+
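+/*
+ * Walk /sys/kernel/slab (falling back to /sys/slab) and read each cache's
+ * attributes into the slabinfo array.
+ */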
+void read_slab_dir(void)
+{
+	DIR *dir;
+	struct dirent *de;
+	struct slabinfo *slab = slabinfo;
+	char *p;
+	char *t;
+	int count;
+
+	if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
+		fatal("SYSFS support for SLUB not active\n");
+
+	dir = opendir(".");
+	while ((de = readdir(dir))) {
+		if (de->d_name[0] == '.' ||
+			(de->d_name[0] != ':' && slab_mismatch(de->d_name)))
+				continue;
+		switch (de->d_type) {
+		   case DT_DIR:
+			if (chdir(de->d_name))
+				fatal("Unable to access slab %s\n", de->d_name);
+		   	slab->name = strdup(de->d_name);
+			slab->align = get_obj("align");
+			slab->cache_dma = get_obj("cache_dma");
+			slab->destroy_by_rcu = get_obj("destroy_by_rcu");
+			slab->hwcache_align = get_obj("hwcache_align");
+			slab->object_size = get_obj("object_size");
+			slab->objects = get_obj("objects");
+			slab->total_objects = get_obj("total_objects");
+			slab->objs_per_slab = get_obj("objs_per_slab");
+			slab->order = get_obj("order");
+			slab->poison = get_obj("poison");
+			slab->reclaim_account = get_obj("reclaim_account");
+			slab->red_zone = get_obj("red_zone");
+			slab->slab_size = get_obj("slab_size");
+			slab->slabs = get_obj_and_str("slabs", &t);
+			decode_numa_list(slab->numa, t);
+			free(t);
+			slab->store_user = get_obj("store_user");
+			slab->batch = get_obj("batch");
+			slab->alloc = get_obj("alloc");
+			slab->alloc_slab_fill = get_obj("alloc_slab_fill");
+			slab->alloc_slab_new = get_obj("alloc_slab_new");
+			slab->free = get_obj("free");
+			slab->free_remote = get_obj("free_remote");
+			slab->claim_remote_list = get_obj("claim_remote_list");
+			slab->claim_remote_list_objects = get_obj("claim_remote_list_objects");
+			slab->flush_free_list = get_obj("flush_free_list");
+			slab->flush_free_list_objects = get_obj("flush_free_list_objects");
+			slab->flush_free_list_remote = get_obj("flush_free_list_remote");
+			slab->flush_rfree_list = get_obj("flush_rfree_list");
+			slab->flush_rfree_list_objects = get_obj("flush_rfree_list_objects");
+			slab->flush_slab_free = get_obj("flush_slab_free");
+			slab->flush_slab_partial = get_obj("flush_slab_partial");
+
+			chdir("..");
+			slab++;
+			break;
+		   default :
+			fatal("Unknown file type %lx\n", de->d_type);
+		}
+	}
+	closedir(dir);
+	slabs = slab - slabinfo;
+	actual_slabs = slabs;
+	if (slabs > MAX_SLABS)
+		fatal("Too many slabs\n");
+}
+
+void output_slabs(void)
+{
+	struct slabinfo *slab;
+
+	for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
+
+		if (show_numa)
+			slab_numa(slab, 0);
+		else if (show_track)
+			show_tracking(slab);
+		else if (validate)
+			slab_validate(slab);
+		else if (shrink)
+			slab_shrink(slab);
+		else if (set_debug)
+			slab_debug(slab);
+		else if (show_ops)
+			ops(slab);
+		else if (show_slab)
+			slabcache(slab);
+		else if (show_report)
+			report(slab);
+	}
+}
+
+struct option opts[] = {
+	{ "activity", 0, NULL, 'A' },
+	{ "debug", 2, NULL, 'd' },
+	{ "display-activity", 0, NULL, 'D' },
+	{ "empty", 0, NULL, 'e' },
+	{ "help", 0, NULL, 'h' },
+	{ "inverted", 0, NULL, 'i'},
+	{ "numa", 0, NULL, 'n' },
+	{ "ops", 0, NULL, 'o' },
+	{ "report", 0, NULL, 'r' },
+	{ "shrink", 0, NULL, 's' },
+	{ "slabs", 0, NULL, 'l' },
+	{ "track", 0, NULL, 't'},
+	{ "validate", 0, NULL, 'v' },
+	{ "zero", 0, NULL, 'z' },
+	{ "1ref", 0, NULL, '1'},
+	{ NULL, 0, NULL, 0 }
+};
+
+int main(int argc, char *argv[])
+{
+	int c;
+	int err;
+	char *pattern_source;
+
+	page_size = getpagesize();
+
+	while ((c = getopt_long(argc, argv, "Ad::Dehil1noprstvzTS",
+						opts, NULL)) != -1)
+		switch (c) {
+		case 'A':
+			sort_active = 1;
+			break;
+		case 'd':
+			set_debug = 1;
+			if (!debug_opt_scan(optarg))
+				fatal("Invalid debug option '%s'\n", optarg);
+			break;
+		case 'D':
+			show_activity = 1;
+			break;
+		case 'e':
+			show_empty = 1;
+			break;
+		case 'h':
+			usage();
+			return 0;
+		case 'i':
+			show_inverted = 1;
+			break;
+		case 'n':
+			show_numa = 1;
+			break;
+		case 'o':
+			show_ops = 1;
+			break;
+		case 'r':
+			show_report = 1;
+			break;
+		case 's':
+			shrink = 1;
+			break;
+		case 'l':
+			show_slab = 1;
+			break;
+		case 't':
+			show_track = 1;
+			break;
+		case 'v':
+			validate = 1;
+			break;
+		case 'z':
+			skip_zero = 0;
+			break;
+		case 'T':
+			show_totals = 1;
+			break;
+		case 'S':
+			sort_size = 1;
+			break;
+
+		default:
+			fatal("%s: Invalid option '%c'\n", argv[0], optopt);
+
+	}
+
+	if (!show_slab && !show_track && !show_report
+		&& !validate && !shrink && !set_debug && !show_ops)
+			show_slab = 1;
+
+	if (argc > optind)
+		pattern_source = argv[optind];
+	else
+		pattern_source = ".*";
+
+	err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
+	if (err)
+		fatal("%s: Invalid pattern '%s' code %d\n",
+			argv[0], pattern_source, err);
+	read_slab_dir();
+	if (show_totals)
+		totals();
+	else {
+		sort_slabs();
+		output_slabs();
+	}
+	return 0;
+}


end of thread, other threads:[~2009-02-10  8:57 UTC | newest]

Thread overview: 197+ messages
2009-01-21 14:30 [patch] SLQB slab allocator Nick Piggin
2009-01-21 14:30 ` Nick Piggin
2009-01-21 14:59 ` Ingo Molnar
2009-01-21 14:59   ` Ingo Molnar
2009-01-21 15:17   ` Nick Piggin
2009-01-21 15:17     ` Nick Piggin
2009-01-21 16:56   ` Nick Piggin
2009-01-21 16:56     ` Nick Piggin
2009-01-21 17:40     ` Ingo Molnar
2009-01-21 17:40       ` Ingo Molnar
2009-01-23  3:31       ` Nick Piggin
2009-01-23  3:31         ` Nick Piggin
2009-01-23  6:14       ` Nick Piggin
2009-01-23  6:14         ` Nick Piggin
2009-01-23 12:56         ` Ingo Molnar
2009-01-23 12:56           ` Ingo Molnar
2009-01-21 17:59 ` Joe Perches
2009-01-21 17:59   ` Joe Perches
2009-01-23  3:35   ` Nick Piggin
2009-01-23  3:35     ` Nick Piggin
2009-01-23  4:00     ` Joe Perches
2009-01-23  4:00       ` Joe Perches
2009-01-21 18:10 ` Hugh Dickins
2009-01-21 18:10   ` Hugh Dickins
2009-01-22 10:01   ` Pekka Enberg
2009-01-22 10:01     ` Pekka Enberg
2009-01-22 12:47     ` Hugh Dickins
2009-01-22 12:47       ` Hugh Dickins
2009-01-23 14:23       ` Hugh Dickins
2009-01-23 14:23         ` Hugh Dickins
2009-01-23 14:30         ` Pekka Enberg
2009-01-23 14:30           ` Pekka Enberg
2009-02-02  3:38         ` Zhang, Yanmin
2009-02-02  3:38           ` Zhang, Yanmin
2009-02-02  9:00           ` Pekka Enberg
2009-02-02  9:00             ` Pekka Enberg
2009-02-02 15:00             ` Christoph Lameter
2009-02-02 15:00               ` Christoph Lameter
2009-02-03  1:34               ` Zhang, Yanmin
2009-02-03  1:34                 ` Zhang, Yanmin
2009-02-03  7:29             ` Zhang, Yanmin
2009-02-03  7:29               ` Zhang, Yanmin
2009-02-03 12:18               ` Hugh Dickins
2009-02-03 12:18                 ` Hugh Dickins
2009-02-04  2:21                 ` Zhang, Yanmin
2009-02-04  2:21                   ` Zhang, Yanmin
2009-02-05 19:04                   ` Hugh Dickins
2009-02-05 19:04                     ` Hugh Dickins
2009-02-06  0:47                     ` Zhang, Yanmin
2009-02-06  0:47                       ` Zhang, Yanmin
2009-02-06  8:57                     ` Pekka Enberg
2009-02-06  8:57                       ` Pekka Enberg
2009-02-06 12:33                       ` Hugh Dickins
2009-02-06 12:33                         ` Hugh Dickins
2009-02-10  8:56                         ` Zhang, Yanmin
2009-02-10  8:56                           ` Zhang, Yanmin
2009-02-02 11:50           ` Hugh Dickins
2009-01-23  3:55   ` Nick Piggin
2009-01-23  3:55     ` Nick Piggin
2009-01-23 13:57     ` Hugh Dickins
2009-01-23 13:57       ` Hugh Dickins
2009-01-22  8:45 ` Zhang, Yanmin
2009-01-22  8:45   ` Zhang, Yanmin
2009-01-23  3:57   ` Nick Piggin
2009-01-23  3:57     ` Nick Piggin
2009-01-23  9:00   ` Nick Piggin
2009-01-23  9:00     ` Nick Piggin
2009-01-23 13:34     ` Hugh Dickins
2009-01-23 13:34       ` Hugh Dickins
2009-01-23 13:44       ` Nick Piggin
2009-01-23 13:44         ` Nick Piggin
2009-01-23  9:55 ` Andi Kleen
2009-01-23  9:55   ` Andi Kleen
2009-01-23 10:13   ` Pekka Enberg
2009-01-23 10:13     ` Pekka Enberg
2009-01-23 11:25   ` Nick Piggin
2009-01-23 11:25     ` Nick Piggin
2009-01-23 11:57     ` Andi Kleen
2009-01-23 11:57       ` Andi Kleen
2009-01-23 13:18       ` Nick Piggin
2009-01-23 13:18         ` Nick Piggin
2009-01-23 14:04         ` Andi Kleen
2009-01-23 14:04           ` Andi Kleen
2009-01-23 14:27           ` Nick Piggin
2009-01-23 14:27             ` Nick Piggin
2009-01-23 15:06             ` Andi Kleen
2009-01-23 15:06               ` Andi Kleen
2009-01-23 15:15               ` Nick Piggin
2009-01-23 15:15                 ` Nick Piggin
2009-01-23 12:55   ` Nick Piggin
2009-01-23 12:55     ` Nick Piggin
  -- strict thread matches above, loose matches on Subject: below --
2009-01-14  9:04 Nick Piggin
2009-01-14  9:04 ` Nick Piggin
2009-01-14 10:53 ` Pekka Enberg
2009-01-14 10:53   ` Pekka Enberg
2009-01-14 11:47   ` Nick Piggin
2009-01-14 11:47     ` Nick Piggin
2009-01-14 13:44     ` Pekka Enberg
2009-01-14 13:44       ` Pekka Enberg
2009-01-14 14:22       ` Nick Piggin
2009-01-14 14:22         ` Nick Piggin
2009-01-14 14:45         ` Pekka Enberg
2009-01-14 14:45           ` Pekka Enberg
2009-01-14 15:09           ` Nick Piggin
2009-01-14 15:09             ` Nick Piggin
2009-01-14 15:22             ` Nick Piggin
2009-01-14 15:22               ` Nick Piggin
2009-01-14 15:30               ` Pekka Enberg
2009-01-14 15:30                 ` Pekka Enberg
2009-01-14 15:59                 ` Nick Piggin
2009-01-14 15:59                   ` Nick Piggin
2009-01-14 18:40                   ` Christoph Lameter
2009-01-14 18:40                     ` Christoph Lameter
2009-01-15  6:19                     ` Nick Piggin
2009-01-15  6:19                       ` Nick Piggin
2009-01-15 20:47                       ` Christoph Lameter
2009-01-15 20:47                         ` Christoph Lameter
2009-01-16  3:43                         ` Nick Piggin
2009-01-16  3:43                           ` Nick Piggin
2009-01-16 21:25                           ` Christoph Lameter
2009-01-16 21:25                             ` Christoph Lameter
2009-01-19  6:18                             ` Nick Piggin
2009-01-19  6:18                               ` Nick Piggin
2009-01-22  0:13                               ` Christoph Lameter
2009-01-22  0:13                                 ` Christoph Lameter
2009-01-22  9:27                                 ` Pekka Enberg
2009-01-22  9:27                                   ` Pekka Enberg
2009-01-22  9:30                                   ` Zhang, Yanmin
2009-01-22  9:30                                     ` Zhang, Yanmin
2009-01-22  9:33                                     ` Pekka Enberg
2009-01-22  9:33                                       ` Pekka Enberg
2009-01-23 15:32                                       ` Christoph Lameter
2009-01-23 15:32                                         ` Christoph Lameter
2009-01-23 15:37                                         ` Pekka Enberg
2009-01-23 15:37                                           ` Pekka Enberg
2009-01-23 15:42                                           ` Christoph Lameter
2009-01-23 15:42                                             ` Christoph Lameter
2009-01-23 15:32                                   ` Christoph Lameter
2009-01-23 15:32                                     ` Christoph Lameter
2009-01-23  4:09                                 ` Nick Piggin
2009-01-23  4:09                                   ` Nick Piggin
2009-01-23 15:41                                   ` Christoph Lameter
2009-01-23 15:41                                     ` Christoph Lameter
2009-01-23 15:53                                     ` Nick Piggin
2009-01-23 15:53                                       ` Nick Piggin
2009-01-26 17:28                                       ` Christoph Lameter
2009-01-26 17:28                                         ` Christoph Lameter
2009-02-03  1:53                                         ` Nick Piggin
2009-02-03  1:53                                           ` Nick Piggin
2009-02-03 17:33                                           ` Christoph Lameter
2009-02-03 17:33                                             ` Christoph Lameter
2009-02-03 18:42                                             ` Pekka Enberg
2009-02-03 18:42                                               ` Pekka Enberg
2009-02-03 18:47                                               ` Pekka Enberg
2009-02-03 18:47                                                 ` Pekka Enberg
2009-02-04  4:22                                                 ` Nick Piggin
2009-02-04  4:22                                                   ` Nick Piggin
2009-02-04 20:09                                                   ` Christoph Lameter
2009-02-04 20:09                                                     ` Christoph Lameter
2009-02-05  3:18                                                     ` Nick Piggin
2009-02-05  3:18                                                       ` Nick Piggin
2009-02-04 20:10                                               ` Christoph Lameter
2009-02-04 20:10                                                 ` Christoph Lameter
2009-02-05  3:14                                                 ` Nick Piggin
2009-02-05  3:14                                                   ` Nick Piggin
2009-02-04  4:07                                             ` Nick Piggin
2009-02-04  4:07                                               ` Nick Piggin
2009-01-14 18:01             ` Christoph Lameter
2009-01-14 18:01               ` Christoph Lameter
2009-01-15  6:03               ` Nick Piggin
2009-01-15  6:03                 ` Nick Piggin
2009-01-15 20:05                 ` Christoph Lameter
2009-01-15 20:05                   ` Christoph Lameter
2009-01-16  3:19                   ` Nick Piggin
2009-01-16  3:19                     ` Nick Piggin
2009-01-16 21:07                     ` Christoph Lameter
2009-01-16 21:07                       ` Christoph Lameter
2009-01-19  5:47                       ` Nick Piggin
2009-01-19  5:47                         ` Nick Piggin
2009-01-22  0:19                         ` Christoph Lameter
2009-01-22  0:19                           ` Christoph Lameter
2009-01-23  4:17                           ` Nick Piggin
2009-01-23  4:17                             ` Nick Piggin
2009-01-23 15:52                             ` Christoph Lameter
2009-01-23 15:52                               ` Christoph Lameter
2009-01-23 16:10                               ` Nick Piggin
2009-01-23 16:10                                 ` Nick Piggin
2009-01-23 17:09                                 ` Nick Piggin
2009-01-23 17:09                                   ` Nick Piggin
2009-01-26 17:46                                   ` Christoph Lameter
2009-01-26 17:46                                     ` Christoph Lameter
2009-02-03  1:42                                     ` Nick Piggin
2009-02-03  1:42                                       ` Nick Piggin
2009-01-26 17:34                                 ` Christoph Lameter
2009-01-26 17:34                                   ` Christoph Lameter
2009-02-03  1:48                                   ` Nick Piggin
2009-02-03  1:48                                     ` Nick Piggin
