* [patch] SLQB slab allocator (try 2)
@ 2009-01-23 15:46 Nick Piggin
  2009-01-24  2:38 ` Zhang, Yanmin
  2009-01-26  8:48 ` Pekka Enberg
  0 siblings, 2 replies; 55+ messages in thread
From: Nick Piggin @ 2009-01-23 15:46 UTC (permalink / raw)
  To: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

Hi,

Since last time, I have fixed the bugs pointed out by Hugh and Andi and cleaned
up the code as suggested by Ingo (haven't yet incorporated Ingo's last patch).

This should also fix the crash reported by Yanmin (I was able to reproduce and
fix it on an ia64 system).

Significantly reduced static footprint of init arrays, thanks to Andi's
suggestion.

Please consider this for a trial merge into linux-next.

Thanks,
Nick

--
Introducing the SLQB slab allocator.

SLQB takes code and ideas from all other slab allocators in the tree.

The primary method for keeping lists of free objects within the allocator
is a singly-linked list, storing a pointer within the object memory itself
(or in a small additional space in the case of RCU-destroyed slabs). This is
like SLOB and SLUB, as opposed to SLAB, which uses arrays of objects plus
external metadata. This reduces memory consumption and makes smaller object
sizes more practical, as there is less per-object overhead.
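
As a rough sketch of the idea (not part of the patch; the helper name is
illustrative, but it just mirrors struct kmlist and set_freepointer() in the
code below), freeing an object pushes it onto a singly-linked list by writing
the current head into the object's own memory:

/*
 * Sketch only: push a free object onto a kmlist. The "next" pointer is
 * stored inside the object itself, at s->offset, so no external array or
 * per-object metadata is needed.
 */
static void kmlist_push(struct kmem_cache *s, struct kmlist *list, void *object)
{
	*(void **)(object + s->offset) = list->head;
	list->head = object;
	if (!list->nr)
		list->tail = object;
	list->nr++;
}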

Using lists rather than arrays can reduce the cacheline footprint. When moving
objects around, SLQB can move a list of objects from one CPU to another by
simply manipulating a head pointer, whereas SLAB needs to memcpy arrays. Some
SLAB per-CPU arrays can be up to 1K in size, which means a lot of cachelines
can be touched during alloc/free. Newly freed objects tend to be cache hot,
and newly allocated ones tend to be touched soon anyway, so there is often
little cost to keeping the metadata within the objects themselves.
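
For comparison, moving a whole batch of objects between CPUs in SLQB is a
handful of pointer assignments regardless of the batch size. This is a
simplified version of what claim_remote_free_list() below does (locking and
statistics omitted; the function name here is illustrative):

/* Sketch only: splice a batch [head, tail] of nr objects onto a list's freelist. */
static void splice_batch(struct kmem_cache *s, struct kmem_cache_list *l,
			 void **head, void **tail, unsigned long nr)
{
	if (!l->freelist.nr)
		l->freelist.head = head;
	else
		set_freepointer(s, l->freelist.tail, head);
	l->freelist.tail = tail;
	l->freelist.nr += nr;
}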

SLQB has a per-CPU LIFO freelist of objects, like SLAB (but using lists rather
than arrays). Freed objects are returned to this freelist if they belong to
the node that the freeing CPU belongs to, so objects allocated on one CPU can
be added to the freelist of another CPU on the same node. When a LIFO freelist
needs to be refilled or trimmed, SLQB takes objects from, or returns them to,
a list of slabs.
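
In other words, the free fastpath looks roughly like this (a simplified form
of __slab_free() below, ignoring the remote-free case; the wrapper name is
illustrative):

/* Sketch only: push onto the per-CPU LIFO list, trimming past the watermark. */
static void free_fastpath(struct kmem_cache *s, struct kmem_cache_list *l,
			  void *object)
{
	set_freepointer(s, object, l->freelist.head);
	l->freelist.head = object;
	if (!l->freelist.nr)
		l->freelist.tail = object;
	l->freelist.nr++;

	if (l->freelist.nr > slab_hiwater(s))
		flush_free_list(s, l);	/* return a batch of objects to their slabs */
}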

SLQB has per-CPU lists of slabs (which use struct page as their metadata,
including the list head for this list). Each slab contains a singly-linked
list of the objects that are free in that slab (free, and not on a LIFO
freelist). Slabs are freed as soon as all their objects are freed, and new
slabs are only allocated when no slabs remain on the list. Slabs are taken off
the slab list when they have no free objects left, so the slab lists only ever
contain "partial" slabs: those which are neither completely full nor
completely empty. These per-CPU slab lists can be manipulated without locking,
unlike other allocators, which tend to use per-node locks. As the number of
threads per socket increases, this should help improve the scalability of
slab operations.
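
The state transitions for a slab page on object free are correspondingly
simple. This is a simplified sketch of free_object_to_page() in the patch
(the real code also skips the partial-list bookkeeping for single-object
caches; the function name here is illustrative):

/* Sketch only: return an object to its slab page and update the partial list. */
static void object_to_page(struct kmem_cache *s, struct kmem_cache_list *l,
			   struct slqb_page *page, void *object)
{
	set_freepointer(s, object, page->freelist);
	page->freelist = object;
	page->inuse--;

	if (!page->inuse) {
		/* last object returned: page leaves the partial list and is freed */
		l->nr_partial--;
		list_del(&page->lru);
		l->nr_slabs--;
		free_slab(s, page);
	} else if (page->inuse + 1 == s->objects) {
		/* page was full: it now has one free object, so it becomes partial */
		l->nr_partial++;
		list_add(&page->lru, &l->partial);
	}
}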

Freeing objects to remote slab lists first batches up the objects on the
freeing CPU, then moves them over at once to a list on the allocating CPU. The
allocating CPU will then notice those objects and pull them onto the end of
its freelist. This remote freeing scheme is designed to minimise the number of
cross-CPU cachelines touched, short of going to a "crossbar" arrangement like
SLAB has. SLAB's "crossbars" are NR_CPUS*MAX_NUMNODES arrays of object arrays,
which can become very bloated on huge systems (this could be hundreds of GB of
kmem caches on a 4096-CPU, 1024-node system).
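
The producer side of that scheme simply appends to a small per-CPU batch and
flushes it once it grows past the batch size. This is a simplified version of
slab_free_to_remote()/flush_remote_free_cache() below (the wrapper name is
illustrative):

/* Sketch only: queue a remotely-freed object on this CPU's rlist, in batches. */
static void remote_free(struct kmem_cache *s, struct slqb_page *page,
			void *object, struct kmem_cache_cpu *c)
{
	struct kmlist *r = &c->rlist;

	if (page->list != c->remote_cache_list) {
		flush_remote_free_cache(s, c);	/* batch must all target one list */
		c->remote_cache_list = page->list;
	}

	if (!r->head)
		r->head = object;
	else
		set_freepointer(s, r->tail, object);
	set_freepointer(s, object, NULL);
	r->tail = object;
	r->nr++;

	if (r->nr > slab_freebatch(s))
		flush_remote_free_cache(s, c);	/* splice batch onto the target's remote_free list */
}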

SLQB also has similar freelist and slab-list structures per node, which are
protected by a lock and usable by any CPU in order to perform node-specific
allocations. These allocations tend not to be too frequent (short-lived
allocations should be node-local, and long-lived allocations should not be
too frequent).
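
A node-specific allocation therefore looks like the per-CPU case with the
node's list_lock wrapped around it (compare __remote_slab_alloc_node() in the
patch; sketch only, illustrative name):

/* Sketch only: allocate from a specific node's list, serialised by its lock. */
static void *node_alloc(struct kmem_cache *s, struct kmem_cache_node *n)
{
	void *object;

	spin_lock(&n->list_lock);
	object = __cache_list_get_object(s, &n->list);
	spin_unlock(&n->list_lock);

	return object;	/* NULL means the caller must fall back to a slower path */
}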

There is a good overview and illustration of the design here:

http://lwn.net/Articles/311502/

By using LIFO freelists like SLAB, SLQB tries to be very page-size agnostic.
It tries very hard to use order-0 pages, which is good for both page allocator
fragmentation and slab fragmentation.

SLQB initialisation code attempts to be as simple and un-clever as possible.
There are no multiple phases where different things come up, and no weird
self-bootstrapping tricks. It just statically allocates the structures
required to create the slabs that allocate other slab structures.

SLQB reuses much of the debugging infrastructure and the fine-grained sysfs
statistics from SLUB. There is also a Documentation/vm/slqbinfo.c, derived
from slabinfo.c, which can query the sysfs data.

 Documentation/vm/slqbinfo.c | 1054 +++++++++++++
 arch/x86/include/asm/page.h |    1
 include/linux/mm.h          |    4
 include/linux/rcu_types.h   |   18
 include/linux/rcupdate.h    |   11
 include/linux/slab.h        |   10
 include/linux/slqb_def.h    |  295 +++
 init/Kconfig                |    9
 lib/Kconfig.debug           |   20
 mm/Makefile                 |    1
 mm/slqb.c                   | 3562 ++++++++++++++++++++++++++++++++++++++++++++
 11 files changed, 4971 insertions(+), 14 deletions(-)

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/linux/rcupdate.h
===================================================================
--- linux-2.6.orig/include/linux/rcupdate.h
+++ linux-2.6/include/linux/rcupdate.h
@@ -33,6 +33,7 @@
 #ifndef __LINUX_RCUPDATE_H
 #define __LINUX_RCUPDATE_H
 
+#include <linux/rcu_types.h>
 #include <linux/cache.h>
 #include <linux/spinlock.h>
 #include <linux/threads.h>
@@ -42,16 +43,6 @@
 #include <linux/lockdep.h>
 #include <linux/completion.h>
 
-/**
- * struct rcu_head - callback structure for use with RCU
- * @next: next update requests in a list
- * @func: actual update function to call after the grace period.
- */
-struct rcu_head {
-	struct rcu_head *next;
-	void (*func)(struct rcu_head *head);
-};
-
 #if defined(CONFIG_CLASSIC_RCU)
 #include <linux/rcuclassic.h>
 #elif defined(CONFIG_TREE_RCU)
Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/slqb_def.h
@@ -0,0 +1,295 @@
+#ifndef _LINUX_SLQB_DEF_H
+#define _LINUX_SLQB_DEF_H
+
+/*
+ * SLQB : A slab allocator with object queues.
+ *
+ * (C) 2008 Nick Piggin <npiggin@suse.de>
+ */
+#include <linux/types.h>
+#include <linux/gfp.h>
+#include <linux/workqueue.h>
+#include <linux/kobject.h>
+#include <linux/rcu_types.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+#include <linux/kobject.h>
+
+#define SLAB_NUMA		0x00000001UL    /* shortcut */
+
+enum stat_item {
+	ALLOC,			/* Allocation count */
+	ALLOC_SLAB_FILL,	/* Fill freelist from page list */
+	ALLOC_SLAB_NEW,		/* New slab acquired from page allocator */
+	FREE,			/* Free count */
+	FREE_REMOTE,		/* NUMA: freeing to remote list */
+	FLUSH_FREE_LIST,	/* Freelist flushed */
+	FLUSH_FREE_LIST_OBJECTS, /* Objects flushed from freelist */
+	FLUSH_FREE_LIST_REMOTE,	/* Objects flushed from freelist to remote */
+	FLUSH_SLAB_PARTIAL,	/* Freeing moves slab to partial list */
+	FLUSH_SLAB_FREE,	/* Slab freed to the page allocator */
+	FLUSH_RFREE_LIST,	/* Rfree list flushed */
+	FLUSH_RFREE_LIST_OBJECTS, /* Rfree objects flushed */
+	CLAIM_REMOTE_LIST,	/* Remote freed list claimed */
+	CLAIM_REMOTE_LIST_OBJECTS, /* Remote freed objects claimed */
+	NR_SLQB_STAT_ITEMS
+};
+
+/*
+ * Singly-linked list with head, tail, and nr
+ */
+struct kmlist {
+	unsigned long	nr;
+	void 		**head;
+	void		**tail;
+};
+
+/*
+ * Every kmem_cache_list has a kmem_cache_remote_free structure, by which
+ * objects can be returned to the kmem_cache_list from remote CPUs.
+ */
+struct kmem_cache_remote_free {
+	spinlock_t	lock;
+	struct kmlist	list;
+} ____cacheline_aligned;
+
+/*
+ * A kmem_cache_list manages all the slabs and objects allocated from a given
+ * source. Per-cpu kmem_cache_lists allow node-local allocations. Per-node
+ * kmem_cache_lists allow off-node allocations (but require locking).
+ */
+struct kmem_cache_list {
+				/* Fastpath LIFO freelist of objects */
+	struct kmlist		freelist;
+#ifdef CONFIG_SMP
+				/* remote_free has reached a watermark */
+	int			remote_free_check;
+#endif
+				/* kmem_cache corresponding to this list */
+	struct kmem_cache	*cache;
+
+				/* Number of partial slabs (pages) */
+	unsigned long		nr_partial;
+
+				/* Slabs which have some free objects */
+	struct list_head	partial;
+
+				/* Total number of slabs allocated */
+	unsigned long		nr_slabs;
+
+#ifdef CONFIG_SMP
+	/*
+	 * In the case of per-cpu lists, remote_free is for objects freed by
+	 * non-owner CPU back to its home list. For per-node lists, remote_free
+	 * is always used to free objects.
+	 */
+	struct kmem_cache_remote_free remote_free;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+	unsigned long		stats[NR_SLQB_STAT_ITEMS];
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Primary per-cpu, per-kmem_cache structure.
+ */
+struct kmem_cache_cpu {
+	struct kmem_cache_list	list;		/* List for node-local slabs */
+	unsigned int		colour_next;	/* Next colour offset to use */
+
+#ifdef CONFIG_SMP
+	/*
+	 * rlist is a list of objects that don't fit on list.freelist (ie.
+	 * wrong node). The objects all correspond to a given kmem_cache_list,
+	 * remote_cache_list. To free objects to another list, we must first
+	 * flush the existing objects, then switch remote_cache_list.
+	 *
+	 * An NR_CPUS or MAX_NUMNODES array would be nice here, but then we
+	 * get to O(NR_CPUS^2) memory consumption situation.
+	 */
+	struct kmlist		rlist;
+	struct kmem_cache_list	*remote_cache_list;
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Per-node, per-kmem_cache structure. Used for node-specific allocations.
+ */
+struct kmem_cache_node {
+	struct kmem_cache_list	list;
+	spinlock_t		list_lock;	/* protects access to list */
+} ____cacheline_aligned;
+
+/*
+ * Management object for a slab cache.
+ */
+struct kmem_cache {
+	unsigned long	flags;
+	int		hiwater;	/* LIFO list high watermark */
+	int		freebatch;	/* LIFO freelist batch flush size */
+	int		objsize;	/* Size of object without meta data */
+	int		offset;		/* Free pointer offset. */
+	int		objects;	/* Number of objects in slab */
+
+	int		size;		/* Size of object including meta data */
+	int		order;		/* Allocation order */
+	gfp_t		allocflags;	/* gfp flags to use on allocation */
+	unsigned int	colour_range;	/* range of colour counter */
+	unsigned int	colour_off;	/* offset per colour */
+	void		(*ctor)(void *);
+
+	const char	*name;		/* Name (only for display!) */
+	struct list_head list;		/* List of slab caches */
+
+	int		align;		/* Alignment */
+	int		inuse;		/* Offset to metadata */
+
+#ifdef CONFIG_SLQB_SYSFS
+	struct kobject	kobj;		/* For sysfs */
+#endif
+#ifdef CONFIG_NUMA
+	struct kmem_cache_node	*node[MAX_NUMNODES];
+#endif
+#ifdef CONFIG_SMP
+	struct kmem_cache_cpu	*cpu_slab[NR_CPUS];
+#else
+	struct kmem_cache_cpu	cpu_slab;
+#endif
+};
+
+/*
+ * Kmalloc subsystem.
+ */
+#if defined(ARCH_KMALLOC_MINALIGN) && ARCH_KMALLOC_MINALIGN > 8
+#define KMALLOC_MIN_SIZE ARCH_KMALLOC_MINALIGN
+#else
+#define KMALLOC_MIN_SIZE 8
+#endif
+
+#define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
+#define KMALLOC_SHIFT_SLQB_HIGH (PAGE_SHIFT + 9)
+
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1];
+extern struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1];
+
+/*
+ * Constant size allocations use this path to find index into kmalloc caches
+ * arrays. get_slab() function is used for non-constant sizes.
+ */
+static __always_inline int kmalloc_index(size_t size)
+{
+	if (unlikely(!size))
+		return 0;
+	if (unlikely(size > 1UL << KMALLOC_SHIFT_SLQB_HIGH))
+		return 0;
+
+	if (unlikely(size <= KMALLOC_MIN_SIZE))
+		return KMALLOC_SHIFT_LOW;
+
+#if L1_CACHE_BYTES < 64
+	if (size > 64 && size <= 96)
+		return 1;
+#endif
+#if L1_CACHE_BYTES < 128
+	if (size > 128 && size <= 192)
+		return 2;
+#endif
+	if (size <=	  8) return 3;
+	if (size <=	 16) return 4;
+	if (size <=	 32) return 5;
+	if (size <=	 64) return 6;
+	if (size <=	128) return 7;
+	if (size <=	256) return 8;
+	if (size <=	512) return 9;
+	if (size <=       1024) return 10;
+	if (size <=   2 * 1024) return 11;
+	if (size <=   4 * 1024) return 12;
+	if (size <=   8 * 1024) return 13;
+	if (size <=  16 * 1024) return 14;
+	if (size <=  32 * 1024) return 15;
+	if (size <=  64 * 1024) return 16;
+	if (size <= 128 * 1024) return 17;
+	if (size <= 256 * 1024) return 18;
+	if (size <= 512 * 1024) return 19;
+	if (size <= 1024 * 1024) return 20;
+	if (size <=  2 * 1024 * 1024) return 21;
+	return -1;
+}
+
+#ifdef CONFIG_ZONE_DMA
+#define SLQB_DMA __GFP_DMA
+#else
+/* Disable "DMA slabs" */
+#define SLQB_DMA (__force gfp_t)0
+#endif
+
+/*
+ * Find the kmalloc slab cache for a given combination of allocation flags and
+ * size. Should really only be used for constant 'size' arguments, due to
+ * bloat.
+ */
+static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
+{
+	int index;
+
+	BUILD_BUG_ON(!__builtin_constant_p(size));
+
+	index = kmalloc_index(size);
+	if (unlikely(index == 0))
+		return NULL;
+
+	if (likely(!(flags & SLQB_DMA)))
+		return &kmalloc_caches[index];
+	else
+		return &kmalloc_caches_dma[index];
+}
+
+void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
+void *__kmalloc(size_t size, gfp_t flags);
+
+#ifndef ARCH_KMALLOC_MINALIGN
+#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#ifndef ARCH_SLAB_MINALIGN
+#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ?	\
+				sizeof(void *) : ARCH_KMALLOC_MINALIGN)
+
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
+{
+	if (__builtin_constant_p(size)) {
+		struct kmem_cache *s;
+
+		s = kmalloc_slab(size, flags);
+		if (unlikely(ZERO_OR_NULL_PTR(s)))
+			return s;
+
+		return kmem_cache_alloc(s, flags);
+	}
+	return __kmalloc(size, flags);
+}
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node);
+void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
+
+static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+{
+	if (__builtin_constant_p(size)) {
+		struct kmem_cache *s;
+
+		s = kmalloc_slab(size, flags);
+		if (unlikely(ZERO_OR_NULL_PTR(s)))
+			return s;
+
+		return kmem_cache_alloc_node(s, flags, node);
+	}
+	return __kmalloc_node(size, flags, node);
+}
+#endif
+
+#endif /* _LINUX_SLQB_DEF_H */
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -806,7 +806,7 @@ config SLUB_DEBUG
 
 choice
 	prompt "Choose SLAB allocator"
-	default SLUB
+	default SLQB
 	help
 	   This option allows to select a slab allocator.
 
@@ -827,6 +827,11 @@ config SLUB
 	   and has enhanced diagnostics. SLUB is the default choice for
 	   a slab allocator.
 
+config SLQB
+	bool "SLQB (Queued allocator)"
+	help
+	  SLQB is a proposed new slab allocator.
+
 config SLOB
 	depends on EMBEDDED
 	bool "SLOB (Simple Allocator)"
@@ -868,7 +873,7 @@ config HAVE_GENERIC_DMA_COHERENT
 config SLABINFO
 	bool
 	depends on PROC_FS
-	depends on SLAB || SLUB_DEBUG
+	depends on SLAB || SLUB_DEBUG || SLQB
 	default y
 
 config RT_MUTEXES
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug
+++ linux-2.6/lib/Kconfig.debug
@@ -298,6 +298,26 @@ config SLUB_STATS
 	  out which slabs are relevant to a particular load.
 	  Try running: slabinfo -DA
 
+config SLQB_DEBUG
+	default y
+	bool "Enable SLQB debugging support"
+	depends on SLQB
+
+config SLQB_DEBUG_ON
+	default n
+	bool "SLQB debugging on by default"
+	depends on SLQB_DEBUG
+
+config SLQB_SYSFS
+	bool "Create SYSFS entries for slab caches"
+	default n
+	depends on SLQB
+
+config SLQB_STATS
+	bool "Enable SLQB performance statistics"
+	default n
+	depends on SLQB_SYSFS
+
 config DEBUG_PREEMPT
 	bool "Debug preemptible kernel"
 	depends on DEBUG_KERNEL && PREEMPT && (TRACE_IRQFLAGS_SUPPORT || PPC64)
Index: linux-2.6/mm/slqb.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/slqb.c
@@ -0,0 +1,3562 @@
+/*
+ * SLQB: A slab allocator that focuses on per-CPU scaling and good performance
+ * with order-0 allocations. Fastpath emphasis is placed on local allocation
+ * and freeing, but with a secondary goal of good remote freeing (freeing on
+ * another CPU from that which allocated).
+ *
+ * Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/interrupt.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/ctype.h>
+#include <linux/kallsyms.h>
+#include <linux/memory.h>
+
+/*
+ * TODO
+ * - fix up releasing of offlined data structures. Not a big deal because
+ *   they don't get cumulatively leaked with successive online/offline cycles
+ * - allow OOM conditions to flush back per-CPU pages to common lists to be
+ *   reused by other CPUs.
+ * - investigate performance with memoryless nodes. Perhaps CPUs can be given
+ *   a default closest home node via which they can use fastpath functions.
+ *   Perhaps it is not a big problem.
+ */
+
+/*
+ * slqb_page overloads struct page, and is used to manage some slab allocation
+ * aspects; however, to avoid the horrible mess in include/linux/mm_types.h,
+ * we'll just define our own struct slqb_page type variant here.
+ */
+struct slqb_page {
+	union {
+		struct {
+			unsigned long	flags;		/* mandatory */
+			atomic_t	_count;		/* mandatory */
+			unsigned int	inuse;		/* Nr of objects */
+			struct kmem_cache_list *list;	/* Pointer to list */
+			void		 **freelist;	/* LIFO freelist */
+			union {
+				struct list_head lru;	/* misc. list */
+				struct rcu_head rcu_head; /* for rcu freeing */
+			};
+		};
+		struct page page;
+	};
+};
+static inline void struct_slqb_page_wrong_size(void)
+{ BUILD_BUG_ON(sizeof(struct slqb_page) != sizeof(struct page)); }
+
+#define PG_SLQB_BIT (1 << PG_slab)
+
+static int kmem_size __read_mostly;
+#ifdef CONFIG_NUMA
+static inline int slab_numa(struct kmem_cache *s)
+{
+	return s->flags & SLAB_NUMA;
+}
+#else
+static inline int slab_numa(struct kmem_cache *s)
+{
+	return 0;
+}
+#endif
+
+static inline int slab_hiwater(struct kmem_cache *s)
+{
+	return s->hiwater;
+}
+
+static inline int slab_freebatch(struct kmem_cache *s)
+{
+	return s->freebatch;
+}
+
+/*
+ * Lock order:
+ * kmem_cache_node->list_lock
+ *   kmem_cache_remote_free->lock
+ *
+ * Data structures:
+ * SLQB is primarily per-cpu. For each kmem_cache, each CPU has:
+ *
+ * - A LIFO list of node-local objects. Allocation and freeing of node local
+ *   objects goes first to this list.
+ *
+ * - 2 Lists of slab pages, free and partial pages. If an allocation misses
+ *   the object list, it tries from the partial list, then the free list.
+ *   After freeing an object to the object list, if it is over a watermark,
+ *   some objects are freed back to pages. If an allocation misses these lists,
+ *   a new slab page is allocated from the page allocator. If the free list
+ *   reaches a watermark, some of its pages are returned to the page allocator.
+ *
+ * - A remote free queue, where objects freed that did not come from the local
+ *   node are queued to. When this reaches a watermark, the objects are
+ *   flushed.
+ *
+ * - A remotely freed queue, where objects allocated from this CPU are flushed
+ *   to from other CPUs' remote free queues. kmem_cache_remote_free->lock is
+ *   used to protect access to this queue.
+ *
+ *   When the remotely freed queue reaches a watermark, a flag is set to tell
+ *   the owner CPU to check it. The owner CPU will then check the queue on the
+ *   next allocation that misses the object list. It will move all objects from
+ *   this list onto the object list and then allocate one.
+ *
+ *   This system of remote queueing is intended to reduce lock and remote
+ *   cacheline acquisitions, and give a cooling off period for remotely freed
+ *   objects before they are re-allocated.
+ *
+ * node specific allocations from somewhere other than the local node are
+ * handled by a per-node list which is the same as the above per-CPU data
+ * structures except for the following differences:
+ *
+ * - kmem_cache_node->list_lock is used to protect access for multiple CPUs to
+ *   allocate from a given node.
+ *
+ * - There is no remote free queue. Nodes don't free objects, CPUs do.
+ */
+
+static inline void slqb_stat_inc(struct kmem_cache_list *list,
+				enum stat_item si)
+{
+#ifdef CONFIG_SLQB_STATS
+	list->stats[si]++;
+#endif
+}
+
+static inline void slqb_stat_add(struct kmem_cache_list *list,
+				enum stat_item si, unsigned long nr)
+{
+#ifdef CONFIG_SLQB_STATS
+	list->stats[si] += nr;
+#endif
+}
+
+static inline int slqb_page_to_nid(struct slqb_page *page)
+{
+	return page_to_nid(&page->page);
+}
+
+static inline void *slqb_page_address(struct slqb_page *page)
+{
+	return page_address(&page->page);
+}
+
+static inline struct zone *slqb_page_zone(struct slqb_page *page)
+{
+	return page_zone(&page->page);
+}
+
+static inline int virt_to_nid(const void *addr)
+{
+#ifdef virt_to_page_fast
+	return page_to_nid(virt_to_page_fast(addr));
+#else
+	return page_to_nid(virt_to_page(addr));
+#endif
+}
+
+static inline struct slqb_page *virt_to_head_slqb_page(const void *addr)
+{
+	struct page *p;
+
+	p = virt_to_head_page(addr);
+	return (struct slqb_page *)p;
+}
+
+static inline void __free_slqb_pages(struct slqb_page *page, unsigned int order)
+{
+	struct page *p = &page->page;
+
+	reset_page_mapcount(p);
+	p->mapping = NULL;
+	VM_BUG_ON(!(p->flags & PG_SLQB_BIT));
+	p->flags &= ~PG_SLQB_BIT;
+
+	__free_pages(p, order);
+}
+
+#ifdef CONFIG_SLQB_DEBUG
+static inline int slab_debug(struct kmem_cache *s)
+{
+	return (s->flags &
+			(SLAB_DEBUG_FREE |
+			 SLAB_RED_ZONE |
+			 SLAB_POISON |
+			 SLAB_STORE_USER |
+			 SLAB_TRACE));
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+	return s->flags & SLAB_POISON;
+}
+#else
+static inline int slab_debug(struct kmem_cache *s)
+{
+	return 0;
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+	return 0;
+}
+#endif
+
+#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
+				SLAB_POISON | SLAB_STORE_USER)
+
+/* Internal SLQB flags */
+#define __OBJECT_POISON		0x80000000 /* Poison object */
+
+/* Not all arches define cache_line_size */
+#ifndef cache_line_size
+#define cache_line_size()	L1_CACHE_BYTES
+#endif
+
+#ifdef CONFIG_SMP
+static struct notifier_block slab_notifier;
+#endif
+
+/*
+ * slqb_lock protects slab_caches list and serialises hotplug operations.
+ * hotplug operations take lock for write, other operations can hold off
+ * hotplug by taking it for read (or write).
+ */
+static DECLARE_RWSEM(slqb_lock);
+
+/*
+ * A list of all slab caches on the system
+ */
+static LIST_HEAD(slab_caches);
+
+/*
+ * Tracking user of a slab.
+ */
+struct track {
+	void *addr;		/* Called from address */
+	int cpu;		/* Was running on cpu */
+	int pid;		/* Pid context */
+	unsigned long when;	/* When did the operation occur */
+};
+
+enum track_item { TRACK_ALLOC, TRACK_FREE };
+
+static struct kmem_cache kmem_cache_cache;
+
+#ifdef CONFIG_SLQB_SYSFS
+static int sysfs_slab_add(struct kmem_cache *s);
+static void sysfs_slab_remove(struct kmem_cache *s);
+#else
+static inline int sysfs_slab_add(struct kmem_cache *s)
+{
+	return 0;
+}
+static inline void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kmem_cache_free(&kmem_cache_cache, s);
+}
+#endif
+
+/********************************************************************
+ * 			Core slab cache functions
+ *******************************************************************/
+
+static int __slab_is_available __read_mostly;
+int slab_is_available(void)
+{
+	return __slab_is_available;
+}
+
+static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
+{
+#ifdef CONFIG_SMP
+	VM_BUG_ON(!s->cpu_slab[cpu]);
+	return s->cpu_slab[cpu];
+#else
+	return &s->cpu_slab;
+#endif
+}
+
+static inline int check_valid_pointer(struct kmem_cache *s,
+				struct slqb_page *page, const void *object)
+{
+	void *base;
+
+	base = slqb_page_address(page);
+	if (object < base || object >= base + s->objects * s->size ||
+		(object - base) % s->size) {
+		return 0;
+	}
+
+	return 1;
+}
+
+static inline void *get_freepointer(struct kmem_cache *s, void *object)
+{
+	return *(void **)(object + s->offset);
+}
+
+static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
+{
+	*(void **)(object + s->offset) = fp;
+}
+
+/* Loop over all objects in a slab */
+#define for_each_object(__p, __s, __addr) \
+	for (__p = (__addr); __p < (__addr) + (__s)->objects * (__s)->size;\
+			__p += (__s)->size)
+
+/* Scan freelist */
+#define for_each_free_object(__p, __s, __free) \
+	for (__p = (__free); (__p) != NULL; __p = get_freepointer((__s),\
+		__p))
+
+#ifdef CONFIG_SLQB_DEBUG
+/*
+ * Debug settings:
+ */
+#ifdef CONFIG_SLQB_DEBUG_ON
+static int slqb_debug __read_mostly = DEBUG_DEFAULT_FLAGS;
+#else
+static int slqb_debug __read_mostly;
+#endif
+
+static char *slqb_debug_slabs;
+
+/*
+ * Object debugging
+ */
+static void print_section(char *text, u8 *addr, unsigned int length)
+{
+	int i, offset;
+	int newline = 1;
+	char ascii[17];
+
+	ascii[16] = 0;
+
+	for (i = 0; i < length; i++) {
+		if (newline) {
+			printk(KERN_ERR "%8s 0x%p: ", text, addr + i);
+			newline = 0;
+		}
+		printk(KERN_CONT " %02x", addr[i]);
+		offset = i % 16;
+		ascii[offset] = isgraph(addr[i]) ? addr[i] : '.';
+		if (offset == 15) {
+			printk(KERN_CONT " %s\n", ascii);
+			newline = 1;
+		}
+	}
+	if (!newline) {
+		i %= 16;
+		while (i < 16) {
+			printk(KERN_CONT "   ");
+			ascii[i] = ' ';
+			i++;
+		}
+		printk(KERN_CONT " %s\n", ascii);
+	}
+}
+
+static struct track *get_track(struct kmem_cache *s, void *object,
+	enum track_item alloc)
+{
+	struct track *p;
+
+	if (s->offset)
+		p = object + s->offset + sizeof(void *);
+	else
+		p = object + s->inuse;
+
+	return p + alloc;
+}
+
+static void set_track(struct kmem_cache *s, void *object,
+				enum track_item alloc, void *addr)
+{
+	struct track *p;
+
+	if (s->offset)
+		p = object + s->offset + sizeof(void *);
+	else
+		p = object + s->inuse;
+
+	p += alloc;
+	if (addr) {
+		p->addr = addr;
+		p->cpu = raw_smp_processor_id();
+		p->pid = current ? current->pid : -1;
+		p->when = jiffies;
+	} else
+		memset(p, 0, sizeof(struct track));
+}
+
+static void init_tracking(struct kmem_cache *s, void *object)
+{
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
+	set_track(s, object, TRACK_FREE, NULL);
+	set_track(s, object, TRACK_ALLOC, NULL);
+}
+
+static void print_track(const char *s, struct track *t)
+{
+	if (!t->addr)
+		return;
+
+	printk(KERN_ERR "INFO: %s in ", s);
+	__print_symbol("%s", (unsigned long)t->addr);
+	printk(" age=%lu cpu=%u pid=%d\n", jiffies - t->when, t->cpu, t->pid);
+}
+
+static void print_tracking(struct kmem_cache *s, void *object)
+{
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
+	print_track("Allocated", get_track(s, object, TRACK_ALLOC));
+	print_track("Freed", get_track(s, object, TRACK_FREE));
+}
+
+static void print_page_info(struct slqb_page *page)
+{
+	printk(KERN_ERR "INFO: Slab 0x%p used=%u fp=0x%p flags=0x%04lx\n",
+		page, page->inuse, page->freelist, page->flags);
+
+}
+
+#define MAX_ERR_STR 100
+static void slab_bug(struct kmem_cache *s, char *fmt, ...)
+{
+	va_list args;
+	char buf[MAX_ERR_STR];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printk(KERN_ERR "========================================"
+			"=====================================\n");
+	printk(KERN_ERR "BUG %s: %s\n", s->name, buf);
+	printk(KERN_ERR "----------------------------------------"
+			"-------------------------------------\n\n");
+}
+
+static void slab_fix(struct kmem_cache *s, char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printk(KERN_ERR "FIX %s: %s\n", s->name, buf);
+}
+
+static void print_trailer(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+	unsigned int off;	/* Offset of last byte */
+	u8 *addr = slqb_page_address(page);
+
+	print_tracking(s, p);
+
+	print_page_info(page);
+
+	printk(KERN_ERR "INFO: Object 0x%p @offset=%tu fp=0x%p\n\n",
+			p, p - addr, get_freepointer(s, p));
+
+	if (p > addr + 16)
+		print_section("Bytes b4", p - 16, 16);
+
+	print_section("Object", p, min(s->objsize, 128));
+
+	if (s->flags & SLAB_RED_ZONE)
+		print_section("Redzone", p + s->objsize, s->inuse - s->objsize);
+
+	if (s->offset)
+		off = s->offset + sizeof(void *);
+	else
+		off = s->inuse;
+
+	if (s->flags & SLAB_STORE_USER)
+		off += 2 * sizeof(struct track);
+
+	if (off != s->size) {
+		/* Beginning of the filler is the free pointer */
+		print_section("Padding", p + off, s->size - off);
+	}
+
+	dump_stack();
+}
+
+static void object_err(struct kmem_cache *s, struct slqb_page *page,
+			u8 *object, char *reason)
+{
+	slab_bug(s, reason);
+	print_trailer(s, page, object);
+}
+
+static void slab_err(struct kmem_cache *s, struct slqb_page *page,
+			char *fmt, ...)
+{
+	slab_bug(s, fmt);
+	print_page_info(page);
+	dump_stack();
+}
+
+static void init_object(struct kmem_cache *s, void *object, int active)
+{
+	u8 *p = object;
+
+	if (s->flags & __OBJECT_POISON) {
+		memset(p, POISON_FREE, s->objsize - 1);
+		p[s->objsize - 1] = POISON_END;
+	}
+
+	if (s->flags & SLAB_RED_ZONE) {
+		memset(p + s->objsize,
+			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
+			s->inuse - s->objsize);
+	}
+}
+
+static u8 *check_bytes(u8 *start, unsigned int value, unsigned int bytes)
+{
+	while (bytes) {
+		if (*start != (u8)value)
+			return start;
+		start++;
+		bytes--;
+	}
+	return NULL;
+}
+
+static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
+				void *from, void *to)
+{
+	slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
+	memset(from, data, to - from);
+}
+
+static int check_bytes_and_report(struct kmem_cache *s, struct slqb_page *page,
+			u8 *object, char *what,
+			u8 *start, unsigned int value, unsigned int bytes)
+{
+	u8 *fault;
+	u8 *end;
+
+	fault = check_bytes(start, value, bytes);
+	if (!fault)
+		return 1;
+
+	end = start + bytes;
+	while (end > fault && end[-1] == value)
+		end--;
+
+	slab_bug(s, "%s overwritten", what);
+	printk(KERN_ERR "INFO: 0x%p-0x%p. First byte 0x%x instead of 0x%x\n",
+					fault, end - 1, fault[0], value);
+	print_trailer(s, page, object);
+
+	restore_bytes(s, what, value, fault, end);
+	return 0;
+}
+
+/*
+ * Object layout:
+ *
+ * object address
+ * 	Bytes of the object to be managed.
+ * 	If the freepointer may overlay the object then the free
+ * 	pointer is the first word of the object.
+ *
+ * 	Poisoning uses 0x6b (POISON_FREE) and the last byte is
+ * 	0xa5 (POISON_END)
+ *
+ * object + s->objsize
+ * 	Padding to reach word boundary. This is also used for Redzoning.
+ * 	Padding is extended by another word if Redzoning is enabled and
+ * 	objsize == inuse.
+ *
+ * 	We fill with 0xbb (RED_INACTIVE) for inactive objects and with
+ * 	0xcc (RED_ACTIVE) for objects in use.
+ *
+ * object + s->inuse
+ * 	Meta data starts here.
+ *
+ * 	A. Free pointer (if we cannot overwrite object on free)
+ * 	B. Tracking data for SLAB_STORE_USER
+ * 	C. Padding to reach the required alignment boundary or at minimum
+ * 		one word if debugging is on, to be able to detect writes
+ * 		before the word boundary.
+ *
+ *	Padding is done using 0x5a (POISON_INUSE)
+ *
+ * object + s->size
+ * 	Nothing is used beyond s->size.
+ */
+
+static int check_pad_bytes(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+	unsigned long off = s->inuse;	/* The end of info */
+
+	if (s->offset) {
+		/* Freepointer is placed after the object. */
+		off += sizeof(void *);
+	}
+
+	if (s->flags & SLAB_STORE_USER) {
+		/* We also have user information there */
+		off += 2 * sizeof(struct track);
+	}
+
+	if (s->size == off)
+		return 1;
+
+	return check_bytes_and_report(s, page, p, "Object padding",
+				p + off, POISON_INUSE, s->size - off);
+}
+
+static int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+{
+	u8 *start;
+	u8 *fault;
+	u8 *end;
+	int length;
+	int remainder;
+
+	if (!(s->flags & SLAB_POISON))
+		return 1;
+
+	start = slqb_page_address(page);
+	end = start + (PAGE_SIZE << s->order);
+	length = s->objects * s->size;
+	remainder = end - (start + length);
+	if (!remainder)
+		return 1;
+
+	fault = check_bytes(start + length, POISON_INUSE, remainder);
+	if (!fault)
+		return 1;
+
+	while (end > fault && end[-1] == POISON_INUSE)
+		end--;
+
+	slab_err(s, page, "Padding overwritten. 0x%p-0x%p", fault, end - 1);
+	print_section("Padding", start, length);
+
+	restore_bytes(s, "slab padding", POISON_INUSE, start, end);
+	return 0;
+}
+
+static int check_object(struct kmem_cache *s, struct slqb_page *page,
+					void *object, int active)
+{
+	u8 *p = object;
+	u8 *endobject = object + s->objsize;
+
+	if (s->flags & SLAB_RED_ZONE) {
+		unsigned int red =
+			active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE;
+
+		if (!check_bytes_and_report(s, page, object, "Redzone",
+			endobject, red, s->inuse - s->objsize))
+			return 0;
+	} else {
+		if ((s->flags & SLAB_POISON) && s->objsize < s->inuse) {
+			check_bytes_and_report(s, page, p, "Alignment padding",
+				endobject, POISON_INUSE, s->inuse - s->objsize);
+		}
+	}
+
+	if (s->flags & SLAB_POISON) {
+		if (!active && (s->flags & __OBJECT_POISON)) {
+			if (!check_bytes_and_report(s, page, p, "Poison", p,
+					POISON_FREE, s->objsize - 1))
+				return 0;
+
+			if (!check_bytes_and_report(s, page, p, "Poison",
+					p + s->objsize - 1, POISON_END, 1))
+				return 0;
+		}
+
+		/*
+		 * check_pad_bytes cleans up on its own.
+		 */
+		check_pad_bytes(s, page, p);
+	}
+
+	return 1;
+}
+
+static int check_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	if (!(page->flags & PG_SLQB_BIT)) {
+		slab_err(s, page, "Not a valid slab page");
+		return 0;
+	}
+	if (page->inuse == 0) {
+		slab_err(s, page, "inuse before free / after alloc");
+		return 0;
+	}
+	if (page->inuse > s->objects) {
+		slab_err(s, page, "inuse %u > max %u",
+			page->inuse, s->objects);
+		return 0;
+	}
+	/* Slab_pad_check fixes things up after itself */
+	slab_pad_check(s, page);
+	return 1;
+}
+
+static void trace(struct kmem_cache *s, struct slqb_page *page,
+			void *object, int alloc)
+{
+	if (s->flags & SLAB_TRACE) {
+		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+			s->name,
+			alloc ? "alloc" : "free",
+			object, page->inuse,
+			page->freelist);
+
+		if (!alloc)
+			print_section("Object", (void *)object, s->objsize);
+
+		dump_stack();
+	}
+}
+
+static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
+				void *object)
+{
+	if (!slab_debug(s))
+		return;
+
+	if (!(s->flags & (SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON)))
+		return;
+
+	init_object(s, object, 0);
+	init_tracking(s, object);
+}
+
+static int alloc_debug_processing(struct kmem_cache *s,
+					void *object, void *addr)
+{
+	struct slqb_page *page;
+	page = virt_to_head_slqb_page(object);
+
+	if (!check_slab(s, page))
+		goto bad;
+
+	if (!check_valid_pointer(s, page, object)) {
+		object_err(s, page, object, "Freelist Pointer check fails");
+		goto bad;
+	}
+
+	if (object && !check_object(s, page, object, 0))
+		goto bad;
+
+	/* Success: perform special debug activities for allocs */
+	if (s->flags & SLAB_STORE_USER)
+		set_track(s, object, TRACK_ALLOC, addr);
+	trace(s, page, object, 1);
+	init_object(s, object, 1);
+	return 1;
+
+bad:
+	return 0;
+}
+
+static int free_debug_processing(struct kmem_cache *s,
+					void *object, void *addr)
+{
+	struct slqb_page *page;
+	page = virt_to_head_slqb_page(object);
+
+	if (!check_slab(s, page))
+		goto fail;
+
+	if (!check_valid_pointer(s, page, object)) {
+		slab_err(s, page, "Invalid object pointer 0x%p", object);
+		goto fail;
+	}
+
+	if (!check_object(s, page, object, 1))
+		return 0;
+
+	/* Special debug activities for freeing objects */
+	if (s->flags & SLAB_STORE_USER)
+		set_track(s, object, TRACK_FREE, addr);
+	trace(s, page, object, 0);
+	init_object(s, object, 0);
+	return 1;
+
+fail:
+	slab_fix(s, "Object at 0x%p not freed", object);
+	return 0;
+}
+
+static int __init setup_slqb_debug(char *str)
+{
+	slqb_debug = DEBUG_DEFAULT_FLAGS;
+	if (*str++ != '=' || !*str) {
+		/*
+		 * No options specified. Switch on full debugging.
+		 */
+		goto out;
+	}
+
+	if (*str == ',') {
+		/*
+		 * No options but restriction on slabs. This means full
+		 * debugging for slabs matching a pattern.
+		 */
+		goto check_slabs;
+	}
+
+	slqb_debug = 0;
+	if (*str == '-') {
+		/*
+		 * Switch off all debugging measures.
+		 */
+		goto out;
+	}
+
+	/*
+	 * Determine which debug features should be switched on
+	 */
+	for (; *str && *str != ','; str++) {
+		switch (tolower(*str)) {
+		case 'f':
+			slqb_debug |= SLAB_DEBUG_FREE;
+			break;
+		case 'z':
+			slqb_debug |= SLAB_RED_ZONE;
+			break;
+		case 'p':
+			slqb_debug |= SLAB_POISON;
+			break;
+		case 'u':
+			slqb_debug |= SLAB_STORE_USER;
+			break;
+		case 't':
+			slqb_debug |= SLAB_TRACE;
+			break;
+		default:
+			printk(KERN_ERR "slqb_debug option '%c' "
+				"unknown. skipped\n", *str);
+		}
+	}
+
+check_slabs:
+	if (*str == ',')
+		slqb_debug_slabs = str + 1;
+out:
+	return 1;
+}
+
+__setup("slqb_debug", setup_slqb_debug);
+
+static unsigned long kmem_cache_flags(unsigned long objsize,
+				unsigned long flags, const char *name,
+				void (*ctor)(void *))
+{
+	/*
+	 * Enable debugging if selected on the kernel commandline.
+	 */
+	if (slqb_debug && (!slqb_debug_slabs ||
+	    strncmp(slqb_debug_slabs, name,
+		strlen(slqb_debug_slabs)) == 0))
+			flags |= slqb_debug;
+
+	if (num_possible_nodes() > 1)
+		flags |= SLAB_NUMA;
+
+	return flags;
+}
+#else
+static inline void setup_object_debug(struct kmem_cache *s,
+			struct slqb_page *page, void *object)
+{
+}
+
+static inline int alloc_debug_processing(struct kmem_cache *s,
+			void *object, void *addr)
+{
+	return 0;
+}
+
+static inline int free_debug_processing(struct kmem_cache *s,
+			void *object, void *addr)
+{
+	return 0;
+}
+
+static inline int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+{
+	return 1;
+}
+
+static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
+			void *object, int active)
+{
+	return 1;
+}
+
+static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page)
+{
+}
+
+static inline unsigned long kmem_cache_flags(unsigned long objsize,
+	unsigned long flags, const char *name, void (*ctor)(void *))
+{
+	if (num_possible_nodes() > 1)
+		flags |= SLAB_NUMA;
+	return flags;
+}
+
+static const int slqb_debug = 0;
+#endif
+
+/*
+ * allocate a new slab (return its corresponding struct slqb_page)
+ */
+static struct slqb_page *allocate_slab(struct kmem_cache *s,
+					gfp_t flags, int node)
+{
+	struct slqb_page *page;
+	int pages = 1 << s->order;
+
+	flags |= s->allocflags;
+
+	page = (struct slqb_page *)alloc_pages_node(node, flags, s->order);
+	if (!page)
+		return NULL;
+
+	mod_zone_page_state(slqb_page_zone(page),
+		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+		pages);
+
+	return page;
+}
+
+/*
+ * Called once for each object on a new slab page
+ */
+static void setup_object(struct kmem_cache *s,
+				struct slqb_page *page, void *object)
+{
+	setup_object_debug(s, page, object);
+	if (unlikely(s->ctor))
+		s->ctor(object);
+}
+
+/*
+ * Allocate a new slab, set up its object list.
+ */
+static struct slqb_page *new_slab_page(struct kmem_cache *s,
+				gfp_t flags, int node, unsigned int colour)
+{
+	struct slqb_page *page;
+	void *start;
+	void *last;
+	void *p;
+
+	BUG_ON(flags & GFP_SLAB_BUG_MASK);
+
+	page = allocate_slab(s,
+		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
+	if (!page)
+		goto out;
+
+	page->flags |= PG_SLQB_BIT;
+
+	start = page_address(&page->page);
+
+	if (unlikely(slab_poison(s)))
+		memset(start, POISON_INUSE, PAGE_SIZE << s->order);
+
+	start += colour;
+
+	last = start;
+	for_each_object(p, s, start) {
+		setup_object(s, page, p);
+		set_freepointer(s, last, p);
+		last = p;
+	}
+	set_freepointer(s, last, NULL);
+
+	page->freelist = start;
+	page->inuse = 0;
+out:
+	return page;
+}
+
+/*
+ * Free a slab page back to the page allocator
+ */
+static void __free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	int pages = 1 << s->order;
+
+	if (unlikely(slab_debug(s))) {
+		void *p;
+
+		slab_pad_check(s, page);
+		for_each_free_object(p, s, page->freelist)
+			check_object(s, page, p, 0);
+	}
+
+	mod_zone_page_state(slqb_page_zone(page),
+		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+		-pages);
+
+	__free_slqb_pages(page, s->order);
+}
+
+static void rcu_free_slab(struct rcu_head *h)
+{
+	struct slqb_page *page;
+
+	page = container_of((struct list_head *)h, struct slqb_page, lru);
+	__free_slab(page->list->cache, page);
+}
+
+static void free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+	VM_BUG_ON(page->inuse);
+	if (unlikely(s->flags & SLAB_DESTROY_BY_RCU))
+		call_rcu(&page->rcu_head, rcu_free_slab);
+	else
+		__free_slab(s, page);
+}
+
+/*
+ * Return an object to its slab.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static int free_object_to_page(struct kmem_cache *s,
+			struct kmem_cache_list *l, struct slqb_page *page,
+			void *object)
+{
+	VM_BUG_ON(page->list != l);
+
+	set_freepointer(s, object, page->freelist);
+	page->freelist = object;
+	page->inuse--;
+
+	if (!page->inuse) {
+		if (likely(s->objects > 1)) {
+			l->nr_partial--;
+			list_del(&page->lru);
+		}
+		l->nr_slabs--;
+		free_slab(s, page);
+		slqb_stat_inc(l, FLUSH_SLAB_FREE);
+		return 1;
+
+	} else if (page->inuse + 1 == s->objects) {
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, FLUSH_SLAB_PARTIAL);
+		return 0;
+	}
+	return 0;
+}
+
+#ifdef CONFIG_SMP
+static void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page,
+				void *object, struct kmem_cache_cpu *c);
+#endif
+
+/*
+ * Flush the LIFO list of objects on a list. They are sent back to their pages
+ * in case the pages also belong to the list, or to our CPU's remote-free list
+ * in the case they do not.
+ *
+ * Doesn't flush the entire list. flush_free_list_all does.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void flush_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	struct kmem_cache_cpu *c;
+	void **head;
+	int nr;
+
+	nr = l->freelist.nr;
+	if (unlikely(!nr))
+		return;
+
+	nr = min(slab_freebatch(s), nr);
+
+	slqb_stat_inc(l, FLUSH_FREE_LIST);
+	slqb_stat_add(l, FLUSH_FREE_LIST_OBJECTS, nr);
+
+	c = get_cpu_slab(s, smp_processor_id());
+
+	l->freelist.nr -= nr;
+	head = l->freelist.head;
+
+	do {
+		struct slqb_page *page;
+		void **object;
+
+		object = head;
+		VM_BUG_ON(!object);
+		head = get_freepointer(s, object);
+		page = virt_to_head_slqb_page(object);
+
+#ifdef CONFIG_SMP
+		if (page->list != l) {
+			slab_free_to_remote(s, page, object, c);
+			slqb_stat_inc(l, FLUSH_FREE_LIST_REMOTE);
+		} else
+#endif
+			free_object_to_page(s, l, page, object);
+
+		nr--;
+	} while (nr);
+
+	l->freelist.head = head;
+	if (!l->freelist.nr)
+		l->freelist.tail = NULL;
+}
+
+static void flush_free_list_all(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+	while (l->freelist.nr)
+		flush_free_list(s, l);
+}
+
+#ifdef CONFIG_SMP
+/*
+ * If enough objects have been remotely freed back to this list,
+ * remote_free_check will be set. In which case, we'll eventually come here
+ * to take those objects off our remote_free list and onto our LIFO freelist.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void claim_remote_free_list(struct kmem_cache *s,
+					struct kmem_cache_list *l)
+{
+	void **head, **tail;
+	int nr;
+
+	if (!l->remote_free.list.nr)
+		return;
+
+	spin_lock(&l->remote_free.lock);
+
+	l->remote_free_check = 0;
+	head = l->remote_free.list.head;
+	l->remote_free.list.head = NULL;
+	tail = l->remote_free.list.tail;
+	l->remote_free.list.tail = NULL;
+	nr = l->remote_free.list.nr;
+	l->remote_free.list.nr = 0;
+
+	spin_unlock(&l->remote_free.lock);
+
+	VM_BUG_ON(!nr);
+
+	if (!l->freelist.nr) {
+		/* Get head hot for likely subsequent allocation or flush */
+		prefetchw(head);
+		l->freelist.head = head;
+	} else
+		set_freepointer(s, l->freelist.tail, head);
+	l->freelist.tail = tail;
+
+	l->freelist.nr += nr;
+
+	slqb_stat_inc(l, CLAIM_REMOTE_LIST);
+	slqb_stat_add(l, CLAIM_REMOTE_LIST_OBJECTS, nr);
+}
+#endif
+
+/*
+ * Allocation fastpath. Get an object from the list's LIFO freelist, or
+ * return NULL if it is empty.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static __always_inline void *__cache_list_get_object(struct kmem_cache *s,
+						struct kmem_cache_list *l)
+{
+	void *object;
+
+	object = l->freelist.head;
+	if (likely(object)) {
+		void *next = get_freepointer(s, object);
+
+		VM_BUG_ON(!l->freelist.nr);
+		l->freelist.nr--;
+		l->freelist.head = next;
+
+		return object;
+	}
+	VM_BUG_ON(l->freelist.nr);
+
+#ifdef CONFIG_SMP
+	if (unlikely(l->remote_free_check)) {
+		claim_remote_free_list(s, l);
+
+		if (l->freelist.nr > slab_hiwater(s))
+			flush_free_list(s, l);
+
+		/* repetition here helps gcc :( */
+		object = l->freelist.head;
+		if (likely(object)) {
+			void *next = get_freepointer(s, object);
+
+			VM_BUG_ON(!l->freelist.nr);
+			l->freelist.nr--;
+			l->freelist.head = next;
+
+			return object;
+		}
+		VM_BUG_ON(l->freelist.nr);
+	}
+#endif
+
+	return NULL;
+}
+
+/*
+ * Slow(er) path. Get a page from this list's existing pages. Will be a
+ * new empty page in the case that __slab_alloc_page has just been called
+ * (empty pages otherwise never get queued up on the lists), or a partial page
+ * already on the list.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static noinline void *__cache_list_get_page(struct kmem_cache *s,
+				struct kmem_cache_list *l)
+{
+	struct slqb_page *page;
+	void *object;
+
+	if (unlikely(!l->nr_partial))
+		return NULL;
+
+	page = list_first_entry(&l->partial, struct slqb_page, lru);
+	VM_BUG_ON(page->inuse == s->objects);
+	if (page->inuse + 1 == s->objects) {
+		l->nr_partial--;
+		list_del(&page->lru);
+	}
+
+	VM_BUG_ON(!page->freelist);
+
+	page->inuse++;
+
+	object = page->freelist;
+	page->freelist = get_freepointer(s, object);
+	if (page->freelist)
+		prefetchw(page->freelist);
+	VM_BUG_ON((page->inuse == s->objects) != (page->freelist == NULL));
+	slqb_stat_inc(l, ALLOC_SLAB_FILL);
+
+	return object;
+}
+
+/*
+ * Allocation slowpath. Allocate a new slab page from the page allocator, and
+ * put it on the list's partial list. Must be followed by an allocation so
+ * that we don't have dangling empty pages on the partial list.
+ *
+ * Returns 0 on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void *__slab_alloc_page(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	struct slqb_page *page;
+	struct kmem_cache_list *l;
+	struct kmem_cache_cpu *c;
+	unsigned int colour;
+	void *object;
+
+	c = get_cpu_slab(s, smp_processor_id());
+	colour = c->colour_next;
+	c->colour_next += s->colour_off;
+	if (c->colour_next >= s->colour_range)
+		c->colour_next = 0;
+
+	/* Caller handles __GFP_ZERO */
+	gfpflags &= ~__GFP_ZERO;
+
+	if (gfpflags & __GFP_WAIT)
+		local_irq_enable();
+	page = new_slab_page(s, gfpflags, node, colour);
+	if (gfpflags & __GFP_WAIT)
+		local_irq_disable();
+	if (unlikely(!page))
+		return page;
+
+	if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == numa_node_id())) {
+		struct kmem_cache_cpu *c;
+		int cpu = smp_processor_id();
+
+		c = get_cpu_slab(s, cpu);
+		l = &c->list;
+		page->list = l;
+
+		l->nr_slabs++;
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, ALLOC);
+		slqb_stat_inc(l, ALLOC_SLAB_NEW);
+		object = __cache_list_get_page(s, l);
+	} else {
+#ifdef CONFIG_NUMA
+		struct kmem_cache_node *n;
+
+		n = s->node[slqb_page_to_nid(page)];
+		l = &n->list;
+		page->list = l;
+
+		spin_lock(&n->list_lock);
+		l->nr_slabs++;
+		l->nr_partial++;
+		list_add(&page->lru, &l->partial);
+		slqb_stat_inc(l, ALLOC);
+		slqb_stat_inc(l, ALLOC_SLAB_NEW);
+		object = __cache_list_get_page(s, l);
+		spin_unlock(&n->list_lock);
+#endif
+	}
+	VM_BUG_ON(!object);
+	return object;
+}
+
+#ifdef CONFIG_NUMA
+static noinline int alternate_nid(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	if (in_interrupt() || (gfpflags & __GFP_THISNODE))
+		return node;
+	if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD))
+		return cpuset_mem_spread_node();
+	else if (current->mempolicy)
+		return slab_node(current->mempolicy);
+	return node;
+}
+
+/*
+ * Allocate an object from a remote node. Return NULL if none could be found
+ * (in which case, caller should allocate a new slab)
+ *
+ * Must be called with interrupts disabled.
+ */
+static void *__remote_slab_alloc_node(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	struct kmem_cache_node *n;
+	struct kmem_cache_list *l;
+	void *object;
+
+	n = s->node[node];
+	if (unlikely(!n)) /* node has no memory */
+		return NULL;
+	l = &n->list;
+
+	spin_lock(&n->list_lock);
+
+	object = __cache_list_get_object(s, l);
+	if (unlikely(!object)) {
+		object = __cache_list_get_page(s, l);
+		if (unlikely(!object)) {
+			spin_unlock(&n->list_lock);
+			return __slab_alloc_page(s, gfpflags, node);
+		}
+	}
+	if (likely(object))
+		slqb_stat_inc(l, ALLOC);
+	spin_unlock(&n->list_lock);
+	return object;
+}
+
+static noinline void *__remote_slab_alloc(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	void *object;
+	struct zonelist *zonelist;
+	struct zoneref *z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(gfpflags);
+
+	object = __remote_slab_alloc_node(s, gfpflags, node);
+	if (likely(object || (gfpflags & __GFP_THISNODE)))
+		return object;
+
+	zonelist = node_zonelist(slab_node(current->mempolicy), gfpflags);
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		if (!cpuset_zone_allowed_hardwall(zone, gfpflags))
+			continue;
+
+		node = zone_to_nid(zone);
+		object = __remote_slab_alloc_node(s, gfpflags, node);
+		if (likely(object))
+			return object;
+	}
+	return NULL;
+}
+#endif
+
+/*
+ * Main allocation path. Return an object, or NULL on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void *__slab_alloc(struct kmem_cache *s,
+				gfp_t gfpflags, int node)
+{
+	void *object;
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_list *l;
+
+#ifdef CONFIG_NUMA
+	if (unlikely(node != -1) && unlikely(node != numa_node_id())) {
+try_remote:
+		return __remote_slab_alloc(s, gfpflags, node);
+	}
+#endif
+
+	c = get_cpu_slab(s, smp_processor_id());
+	VM_BUG_ON(!c);
+	l = &c->list;
+	object = __cache_list_get_object(s, l);
+	if (unlikely(!object)) {
+		object = __cache_list_get_page(s, l);
+		if (unlikely(!object)) {
+			object = __slab_alloc_page(s, gfpflags, node);
+#ifdef CONFIG_NUMA
+			if (unlikely(!object)) {
+				node = numa_node_id();
+				goto try_remote;
+			}
+#endif
+			return object;
+		}
+	}
+	if (likely(object))
+		slqb_stat_inc(l, ALLOC);
+	return object;
+}
+
+/*
+ * Perform some interrupts-on processing around the main allocation path
+ * (debug checking and memset()ing).
+ */
+static __always_inline void *slab_alloc(struct kmem_cache *s,
+				gfp_t gfpflags, int node, void *addr)
+{
+	void *object;
+	unsigned long flags;
+
+again:
+	local_irq_save(flags);
+	object = __slab_alloc(s, gfpflags, node);
+	local_irq_restore(flags);
+
+	if (unlikely(slab_debug(s)) && likely(object)) {
+		if (unlikely(!alloc_debug_processing(s, object, addr)))
+			goto again;
+	}
+
+	if (unlikely(gfpflags & __GFP_ZERO) && likely(object))
+		memset(object, 0, s->objsize);
+
+	return object;
+}
+
+static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s,
+				gfp_t gfpflags, void *caller)
+{
+	int node = -1;
+#ifdef CONFIG_NUMA
+	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+		node = alternate_nid(s, gfpflags, node);
+#endif
+	return slab_alloc(s, gfpflags, node, caller);
+}
+
+void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+{
+	return __kmem_cache_alloc(s, gfpflags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc);
+
+#ifdef CONFIG_NUMA
+void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+	return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node);
+#endif
+
+#ifdef CONFIG_SMP
+/*
+ * Flush this CPU's remote free list of objects back to the list from where
+ * they originate. They end up on that list's remotely freed list, and
+ * eventually we set its remote_free_check if there are enough objects on it.
+ *
+ * This seems convoluted, but it keeps us from stomping on the target CPU's
+ * fastpath cachelines.
+ *
+ * Must be called with interrupts disabled.
+ */
+static void flush_remote_free_cache(struct kmem_cache *s,
+				struct kmem_cache_cpu *c)
+{
+	struct kmlist *src;
+	struct kmem_cache_list *dst;
+	unsigned int nr;
+	int set;
+
+	src = &c->rlist;
+	nr = src->nr;
+	if (unlikely(!nr))
+		return;
+
+#ifdef CONFIG_SLQB_STATS
+	{
+		struct kmem_cache_list *l = &c->list;
+
+		slqb_stat_inc(l, FLUSH_RFREE_LIST);
+		slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);
+	}
+#endif
+
+	dst = c->remote_cache_list;
+
+	spin_lock(&dst->remote_free.lock);
+
+	if (!dst->remote_free.list.head)
+		dst->remote_free.list.head = src->head;
+	else
+		set_freepointer(s, dst->remote_free.list.tail, src->head);
+	dst->remote_free.list.tail = src->tail;
+
+	src->head = NULL;
+	src->tail = NULL;
+	src->nr = 0;
+
+	if (dst->remote_free.list.nr < slab_freebatch(s))
+		set = 1;
+	else
+		set = 0;
+
+	dst->remote_free.list.nr += nr;
+
+	if (unlikely(dst->remote_free.list.nr >= slab_freebatch(s) && set))
+		dst->remote_free_check = 1;
+
+	spin_unlock(&dst->remote_free.lock);
+}
+
+/*
+ * Free an object to this CPU's remote free list.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void slab_free_to_remote(struct kmem_cache *s,
+				struct slqb_page *page, void *object,
+				struct kmem_cache_cpu *c)
+{
+	struct kmlist *r;
+
+	/*
+	 * Our remote free list corresponds to a different list. Must
+	 * flush it and switch.
+	 */
+	if (page->list != c->remote_cache_list) {
+		flush_remote_free_cache(s, c);
+		c->remote_cache_list = page->list;
+	}
+
+	r = &c->rlist;
+	if (!r->head)
+		r->head = object;
+	else
+		set_freepointer(s, r->tail, object);
+	set_freepointer(s, object, NULL);
+	r->tail = object;
+	r->nr++;
+
+	if (unlikely(r->nr > slab_freebatch(s)))
+		flush_remote_free_cache(s, c);
+}
+#endif
+
+/*
+ * Main freeing path. Frees the object to the local freelist, or queues it
+ * for remote freeing.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void __slab_free(struct kmem_cache *s,
+				struct slqb_page *page, void *object)
+{
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_list *l;
+	int thiscpu = smp_processor_id();
+
+	c = get_cpu_slab(s, thiscpu);
+	l = &c->list;
+
+	slqb_stat_inc(l, FREE);
+
+	if (!NUMA_BUILD || !slab_numa(s) ||
+			likely(slqb_page_to_nid(page) == numa_node_id())) {
+		/*
+		 * Freeing fastpath. Collects all local-node objects, not
+		 * just those allocated from our per-CPU list. This allows
+		 * fast transfer of objects from one CPU to another within
+		 * a given node.
+		 */
+		set_freepointer(s, object, l->freelist.head);
+		l->freelist.head = object;
+		if (!l->freelist.nr)
+			l->freelist.tail = object;
+		l->freelist.nr++;
+
+		if (unlikely(l->freelist.nr > slab_hiwater(s)))
+			flush_free_list(s, l);
+
+	} else {
+#ifdef CONFIG_NUMA
+		/*
+		 * Freeing an object that was allocated on a remote node.
+		 */
+		slab_free_to_remote(s, page, object, c);
+		slqb_stat_inc(l, FREE_REMOTE);
+#endif
+	}
+}
+
+/*
+ * Perform some interrupts-on processing around the main freeing path
+ * (debug checking).
+ */
+static __always_inline void slab_free(struct kmem_cache *s,
+				struct slqb_page *page, void *object)
+{
+	unsigned long flags;
+
+	prefetchw(object);
+
+	debug_check_no_locks_freed(object, s->objsize);
+	if (likely(object) && unlikely(slab_debug(s))) {
+		if (unlikely(!free_debug_processing(s, object, __builtin_return_address(0))))
+			return;
+	}
+
+	local_irq_save(flags);
+	__slab_free(s, page, object);
+	local_irq_restore(flags);
+}
+
+void kmem_cache_free(struct kmem_cache *s, void *object)
+{
+	struct slqb_page *page = NULL;
+
+	if (slab_numa(s))
+		page = virt_to_head_slqb_page(object);
+	slab_free(s, page, object);
+}
+EXPORT_SYMBOL(kmem_cache_free);
+
+/*
+ * Calculate the order of allocation given a slab object size.
+ *
+ * Order 0 allocations are preferred since order 0 does not cause fragmentation
+ * in the page allocator, and they have fastpaths in the page allocator. For
+ * large objects, however, higher orders are used to limit the space wasted at
+ * the end of each slab.
+ */
+static int slab_order(int size, int max_order, int frac)
+{
+	int order;
+
+	if (fls(size - 1) <= PAGE_SHIFT)
+		order = 0;
+	else
+		order = fls(size - 1) - PAGE_SHIFT;
+
+	while (order <= max_order) {
+		unsigned long slab_size = PAGE_SIZE << order;
+		unsigned long objects;
+		unsigned long waste;
+
+		objects = slab_size / size;
+		if (!objects) {
+			order++;
+			continue;
+		}
+
+		waste = slab_size - (objects * size);
+
+		if (waste * frac <= slab_size)
+			break;
+
+		order++;
+	}
+
+	return order;
+}
+
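+/*
+ * Worked example (assuming 4K pages): for a 1500-byte object, order 0 fits
+ * two objects and wastes 1096 bytes; 1096 * 4 > 4096, so slab_order(1500, 1, 4)
+ * steps up to order 1, where five objects fit and only 692 bytes are wasted.
+ */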
+static int calculate_order(int size)
+{
+	int order;
+
+	/*
+	 * Attempt to find best configuration for a slab. This
+	 * works by first attempting to generate a layout with
+	 * the best configuration and backing off gradually.
+	 */
+	order = slab_order(size, 1, 4);
+	if (order <= 1)
+		return order;
+
+	/*
+	 * This size cannot fit in order-1. Allow bigger orders, but
+	 * forget about trying to save space.
+	 */
+	order = slab_order(size, MAX_ORDER, 0);
+	if (order <= MAX_ORDER)
+		return order;
+
+	return -ENOSYS;
+}
+
+/*
+ * Figure out what the alignment of the objects will be.
+ */
+static unsigned long calculate_alignment(unsigned long flags,
+				unsigned long align, unsigned long size)
+{
+	/*
+	 * If the user wants hardware cache aligned objects then follow that
+	 * suggestion if the object is sufficiently large.
+	 *
+	 * The hardware cache alignment cannot override the specified
+	 * alignment though. If that is greater, use it.
+	 */
+	if (flags & SLAB_HWCACHE_ALIGN) {
+		unsigned long ralign = cache_line_size();
+
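+		/*
+		 * Halve the alignment while the object still fits; e.g. with
+		 * 64-byte cache lines, a 20-byte object gets 32-byte alignment.
+		 */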
+		while (size <= ralign / 2)
+			ralign /= 2;
+		align = max(align, ralign);
+	}
+
+	if (align < ARCH_SLAB_MINALIGN)
+		align = ARCH_SLAB_MINALIGN;
+
+	return ALIGN(align, sizeof(void *));
+}
+
+static void init_kmem_cache_list(struct kmem_cache *s,
+				struct kmem_cache_list *l)
+{
+	l->cache		= s;
+	l->freelist.nr		= 0;
+	l->freelist.head	= NULL;
+	l->freelist.tail	= NULL;
+	l->nr_partial		= 0;
+	l->nr_slabs		= 0;
+	INIT_LIST_HEAD(&l->partial);
+
+#ifdef CONFIG_SMP
+	l->remote_free_check	= 0;
+	spin_lock_init(&l->remote_free.lock);
+	l->remote_free.list.nr	= 0;
+	l->remote_free.list.head = NULL;
+	l->remote_free.list.tail = NULL;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+	memset(l->stats, 0, sizeof(l->stats));
+#endif
+}
+
+static void init_kmem_cache_cpu(struct kmem_cache *s,
+				struct kmem_cache_cpu *c)
+{
+	init_kmem_cache_list(s, &c->list);
+
+	c->colour_next		= 0;
+#ifdef CONFIG_SMP
+	c->rlist.nr		= 0;
+	c->rlist.head		= NULL;
+	c->rlist.tail		= NULL;
+	c->remote_cache_list	= NULL;
+#endif
+}
+
+#ifdef CONFIG_NUMA
+static void init_kmem_cache_node(struct kmem_cache *s,
+				struct kmem_cache_node *n)
+{
+	spin_lock_init(&n->list_lock);
+	init_kmem_cache_list(s, &n->list);
+}
+#endif
+
+/* Initial slabs. XXX: allocate dynamically (with bootmem maybe) */
+#ifdef CONFIG_SMP
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_cache_cpus);
+#endif
+#ifdef CONFIG_NUMA
+/*
+ * XXX: really need a DEFINE_PER_NODE for per-node data, but this is better
+ * than a static array
+ */
+static DEFINE_PER_CPU(struct kmem_cache_node, kmem_cache_nodes);
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache kmem_cpu_cache;
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_cpu_cpus);
+#ifdef CONFIG_NUMA
+static DEFINE_PER_CPU(struct kmem_cache_node, kmem_cpu_nodes); /* XXX per-nid */
+#endif
+#endif
+
+#ifdef CONFIG_NUMA
+static struct kmem_cache kmem_node_cache;
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_node_cpus);
+static DEFINE_PER_CPU(struct kmem_cache_node, kmem_node_nodes); /*XXX per-nid */
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
+				int cpu)
+{
+	struct kmem_cache_cpu *c;
+
+	c = kmem_cache_alloc_node(&kmem_cpu_cache, GFP_KERNEL, cpu_to_node(cpu));
+	if (!c)
+		return NULL;
+
+	init_kmem_cache_cpu(s, c);
+	return c;
+}
+
+static void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+
+		c = s->cpu_slab[cpu];
+		if (c) {
+			kmem_cache_free(&kmem_cpu_cache, c);
+			s->cpu_slab[cpu] = NULL;
+		}
+	}
+}
+
+static int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+
+		c = s->cpu_slab[cpu];
+		if (c)
+			continue;
+
+		c = alloc_kmem_cache_cpu(s, cpu);
+		if (!c) {
+			free_kmem_cache_cpus(s);
+			return 0;
+		}
+		s->cpu_slab[cpu] = c;
+	}
+	return 1;
+}
+
+#else
+static inline void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+	init_kmem_cache_cpu(s, &s->cpu_slab);
+	return 1;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = s->node[node];
+		if (n) {
+			kmem_cache_free(&kmem_node_cache, n);
+			s->node[node] = NULL;
+		}
+	}
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = kmem_cache_alloc_node(&kmem_node_cache, GFP_KERNEL, node);
+		if (!n) {
+			free_kmem_cache_nodes(s);
+			return 0;
+		}
+		init_kmem_cache_node(s, n);
+		s->node[node] = n;
+	}
+	return 1;
+}
+#else
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+	return 1;
+}
+#endif
+
+/*
+ * calculate_sizes() determines the order and the distribution of data within
+ * a slab object.
+ */
+static int calculate_sizes(struct kmem_cache *s)
+{
+	unsigned long flags = s->flags;
+	unsigned long size = s->objsize;
+	unsigned long align = s->align;
+
+	/*
+	 * Determine if we can poison the object itself. If the user of
+	 * the slab may touch the object after free or before allocation
+	 * then we should never poison the object itself.
+	 */
+	if (slab_poison(s) && !(flags & SLAB_DESTROY_BY_RCU) && !s->ctor)
+		s->flags |= __OBJECT_POISON;
+	else
+		s->flags &= ~__OBJECT_POISON;
+
+	/*
+	 * Round up object size to the next word boundary. We can only
+	 * place the free pointer at word boundaries and this determines
+	 * the possible location of the free pointer.
+	 */
+	size = ALIGN(size, sizeof(void *));
+
+#ifdef CONFIG_SLQB_DEBUG
+	/*
+	 * If we are Redzoning then check if there is some space between the
+	 * end of the object and the free pointer. If not then add an
+	 * additional word to have some bytes to store Redzone information.
+	 */
+	if ((flags & SLAB_RED_ZONE) && size == s->objsize)
+		size += sizeof(void *);
+#endif
+
+	/*
+	 * With that we have determined the number of bytes in actual use
+	 * by the object. This is the potential offset to the free pointer.
+	 */
+	s->inuse = size;
+
+	if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) || s->ctor)) {
+		/*
+		 * Relocate free pointer after the object if it is not
+		 * permitted to overwrite the first word of the object on
+		 * kmem_cache_free.
+		 *
+		 * This is the case if we use RCU, have a constructor or
+		 * are poisoning the objects.
+		 */
+		s->offset = size;
+		size += sizeof(void *);
+	}
+
+#ifdef CONFIG_SLQB_DEBUG
+	if (flags & SLAB_STORE_USER) {
+		/*
+		 * Need to store information about allocs and frees after
+		 * the object.
+		 */
+		size += 2 * sizeof(struct track);
+	}
+
+	if (flags & SLAB_RED_ZONE) {
+		/*
+		 * Add some empty padding so that we can catch
+		 * overwrites from earlier objects rather than let
+		 * tracking information or the free pointer be
+		 * corrupted if a user writes before the start
+		 * of the object.
+		 */
+		size += sizeof(void *);
+	}
+#endif
+
+	/*
+	 * Determine the alignment based on various parameters that the
+	 * user specified and the dynamic determination of cache line size
+	 * on bootup.
+	 */
+	align = calculate_alignment(flags, align, s->objsize);
+
+	/*
+	 * SLQB stores one object immediately after another beginning from
+	 * offset 0. In order to align the objects we have to simply size
+	 * each object to conform to the alignment.
+	 */
+	size = ALIGN(size, align);
+	s->size = size;
+	s->order = calculate_order(size);
+
+	if (s->order < 0)
+		return 0;
+
+	s->allocflags = 0;
+	if (s->order)
+		s->allocflags |= __GFP_COMP;
+
+	if (s->flags & SLAB_CACHE_DMA)
+		s->allocflags |= SLQB_DMA;
+
+	if (s->flags & SLAB_RECLAIM_ACCOUNT)
+		s->allocflags |= __GFP_RECLAIMABLE;
+
+	/*
+	 * Determine the number of objects per slab
+	 */
+	s->objects = (PAGE_SIZE << s->order) / size;
+
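+	/*
+	 * Batch and watermark scale with object size; e.g. with 4K pages this
+	 * gives freebatch/hiwater of 512/2048 for 32-byte objects and 64/256
+	 * for page-sized objects.
+	 */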
+	s->freebatch = max(4UL*PAGE_SIZE / size,
+				min(256UL, 64*PAGE_SIZE / size));
+	if (!s->freebatch)
+		s->freebatch = 1;
+	s->hiwater = s->freebatch << 2;
+
+	return !!s->objects;
+}
+
+static int kmem_cache_open(struct kmem_cache *s,
+			const char *name, size_t size, size_t align,
+			unsigned long flags, void (*ctor)(void *), int alloc)
+{
+	unsigned int left_over;
+
+	memset(s, 0, kmem_size);
+	s->name = name;
+	s->ctor = ctor;
+	s->objsize = size;
+	s->align = align;
+	s->flags = kmem_cache_flags(size, flags, name, ctor);
+
+	if (!calculate_sizes(s))
+		goto error;
+
+	if (!slab_debug(s)) {
+		left_over = (PAGE_SIZE << s->order) - (s->objects * s->size);
+		s->colour_off = max(cache_line_size(), s->align);
+		s->colour_range = left_over;
+	} else {
+		s->colour_off = 0;
+		s->colour_range = 0;
+	}
+
+	down_write(&slqb_lock);
+	if (likely(alloc)) {
+		if (!alloc_kmem_cache_nodes(s))
+			goto error_lock;
+
+		if (!alloc_kmem_cache_cpus(s))
+			goto error_nodes;
+	}
+
+	sysfs_slab_add(s);
+	list_add(&s->list, &slab_caches);
+	up_write(&slqb_lock);
+
+	return 1;
+
+error_nodes:
+	free_kmem_cache_nodes(s);
+error_lock:
+	up_write(&slqb_lock);
+error:
+	if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+	return 0;
+}
+
+/**
+ * kmem_ptr_validate - check if an untrusted pointer might be a slab entry.
+ * @s: the cache we're checking against
+ * @ptr: pointer to validate
+ *
+ * This verifies that the untrusted pointer looks sane;
+ * it is _not_ a guarantee that the pointer is actually
+ * part of the slab cache in question, but it at least
+ * validates that the pointer can be dereferenced and
+ * looks half-way sane.
+ *
+ * Currently only used for dentry validation.
+ */
+int kmem_ptr_validate(struct kmem_cache *s, const void *ptr)
+{
+	unsigned long addr = (unsigned long)ptr;
+	struct slqb_page *page;
+
+	if (unlikely(addr < PAGE_OFFSET))
+		goto out;
+	if (unlikely(addr > (unsigned long)high_memory - s->size))
+		goto out;
+	if (unlikely(!IS_ALIGNED(addr, s->align)))
+		goto out;
+	if (unlikely(!kern_addr_valid(addr)))
+		goto out;
+	if (unlikely(!kern_addr_valid(addr + s->size - 1)))
+		goto out;
+	if (unlikely(!pfn_valid(addr >> PAGE_SHIFT)))
+		goto out;
+	page = virt_to_head_slqb_page(ptr);
+	if (unlikely(!(page->flags & PG_SLQB_BIT)))
+		goto out;
+	if (unlikely(page->list->cache != s))
+		goto out;
+	return 1;
+out:
+	return 0;
+}
+EXPORT_SYMBOL(kmem_ptr_validate);
+
+/*
+ * Determine the size of a slab object
+ */
+unsigned int kmem_cache_size(struct kmem_cache *s)
+{
+	return s->objsize;
+}
+EXPORT_SYMBOL(kmem_cache_size);
+
+const char *kmem_cache_name(struct kmem_cache *s)
+{
+	return s->name;
+}
+EXPORT_SYMBOL(kmem_cache_name);
+
+/*
+ * Release all resources used by a slab cache. No more concurrency on the
+ * slab, so we can touch remote kmem_cache_cpu structures.
+ */
+void kmem_cache_destroy(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+	int cpu;
+
+	down_write(&slqb_lock);
+	list_del(&s->list);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		flush_free_list_all(s, l);
+		flush_remote_free_cache(s, c);
+	}
+#endif
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+		claim_remote_free_list(s, l);
+#endif
+		flush_free_list_all(s, l);
+
+		WARN_ON(l->freelist.nr);
+		WARN_ON(l->nr_slabs);
+		WARN_ON(l->nr_partial);
+	}
+
+	free_kmem_cache_cpus(s);
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+		struct kmem_cache_list *l;
+
+		n = s->node[node];
+		if (!n)
+			continue;
+		l = &n->list;
+
+		claim_remote_free_list(s, l);
+		flush_free_list_all(s, l);
+
+		WARN_ON(l->freelist.nr);
+		WARN_ON(l->nr_slabs);
+		WARN_ON(l->nr_partial);
+	}
+
+	free_kmem_cache_nodes(s);
+#endif
+
+	up_write(&slqb_lock);
+
+	sysfs_slab_remove(s);
+}
+EXPORT_SYMBOL(kmem_cache_destroy);
+
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches);
+
+#ifdef CONFIG_ZONE_DMA
+struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches_dma);
+#endif
+
+#ifndef ARCH_KMALLOC_FLAGS
+#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
+#endif
+
+static struct kmem_cache *open_kmalloc_cache(struct kmem_cache *s,
+				const char *name, int size, gfp_t gfp_flags)
+{
+	unsigned int flags = ARCH_KMALLOC_FLAGS | SLAB_PANIC;
+
+	if (gfp_flags & SLQB_DMA)
+		flags |= SLAB_CACHE_DMA;
+
+	kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN, flags, NULL, 1);
+
+	return s;
+}
+
+/*
+ * Conversion table for small slab sizes / 8 to the index in the
+ * kmalloc array. This is necessary for slabs < 192 since we have non power
+ * of two cache sizes there. The size of larger slabs can be determined using
+ * fls.
+ */
+static s8 size_index[24] __cacheline_aligned = {
+	3,	/* 8 */
+	4,	/* 16 */
+	5,	/* 24 */
+	5,	/* 32 */
+	6,	/* 40 */
+	6,	/* 48 */
+	6,	/* 56 */
+	6,	/* 64 */
+#if L1_CACHE_BYTES < 64
+	1,	/* 72 */
+	1,	/* 80 */
+	1,	/* 88 */
+	1,	/* 96 */
+#else
+	7,
+	7,
+	7,
+	7,
+#endif
+	7,	/* 104 */
+	7,	/* 112 */
+	7,	/* 120 */
+	7,	/* 128 */
+#if L1_CACHE_BYTES < 128
+	2,	/* 136 */
+	2,	/* 144 */
+	2,	/* 152 */
+	2,	/* 160 */
+	2,	/* 168 */
+	2,	/* 176 */
+	2,	/* 184 */
+	2	/* 192 */
+#else
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1,
+	-1
+#endif
+};
+
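+/*
+ * For example, kmalloc(100) maps via size_index[(100 - 1) / 8] == 7 to the
+ * 128-byte cache, while kmalloc(1000) uses fls(999) == 10, i.e. kmalloc-1024.
+ */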
+static struct kmem_cache *get_slab(size_t size, gfp_t flags)
+{
+	int index;
+
+#if L1_CACHE_BYTES >= 128
+	if (size <= 128) {
+#else
+	if (size <= 192) {
+#endif
+		if (unlikely(!size))
+			return ZERO_SIZE_PTR;
+
+		index = size_index[(size - 1) / 8];
+	} else
+		index = fls(size - 1);
+
+	if (unlikely((flags & SLQB_DMA)))
+		return &kmalloc_caches_dma[index];
+	else
+		return &kmalloc_caches[index];
+}
+
+void *__kmalloc(size_t size, gfp_t flags)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return __kmem_cache_alloc(s, flags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(__kmalloc);
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return kmem_cache_alloc_node(s, flags, node);
+}
+EXPORT_SYMBOL(__kmalloc_node);
+#endif
+
+size_t ksize(const void *object)
+{
+	struct slqb_page *page;
+	struct kmem_cache *s;
+
+	BUG_ON(!object);
+	if (unlikely(object == ZERO_SIZE_PTR))
+		return 0;
+
+	page = virt_to_head_slqb_page(object);
+	BUG_ON(!(page->flags & PG_SLQB_BIT));
+
+	s = page->list->cache;
+
+	/*
+	 * Debugging requires use of the padding between object
+	 * and whatever may come after it.
+	 */
+	if (s->flags & (SLAB_RED_ZONE | SLAB_POISON))
+		return s->objsize;
+
+	/*
+	 * If we have the need to store the freelist pointer
+	 * back there or track user information then we can
+	 * only use the space before that information.
+	 */
+	if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
+		return s->inuse;
+
+	/*
+	 * Else we can use all the padding etc for the allocation
+	 */
+	return s->size;
+}
+EXPORT_SYMBOL(ksize);
+
+void kfree(const void *object)
+{
+	struct kmem_cache *s;
+	struct slqb_page *page;
+
+	if (unlikely(ZERO_OR_NULL_PTR(object)))
+		return;
+
+	page = virt_to_head_slqb_page(object);
+	s = page->list->cache;
+
+	slab_free(s, page, (void *)object);
+}
+EXPORT_SYMBOL(kfree);
+
+static void kmem_cache_trim_percpu(void *arg)
+{
+	int cpu = smp_processor_id();
+	struct kmem_cache *s = arg;
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+	claim_remote_free_list(s, l);
+#endif
+	flush_free_list(s, l);
+#ifdef CONFIG_SMP
+	flush_remote_free_cache(s, c);
+#endif
+}
+
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+
+	on_each_cpu(kmem_cache_trim_percpu, s, 1);
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+		struct kmem_cache_list *l;
+
+		n = s->node[node];
+		if (!n)
+			continue;
+		l = &n->list;
+
+		spin_lock_irq(&n->list_lock);
+		claim_remote_free_list(s, l);
+		flush_free_list(s, l);
+		spin_unlock_irq(&n->list_lock);
+	}
+#endif
+
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_shrink);
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
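+/*
+ * Reaping runs in two phases: phase 0 flushes each CPU's local freelist and
+ * pushes its batched remote frees out to their home lists; phase 1 then
+ * claims whatever other CPUs pushed to us in phase 0 and flushes that too.
+ */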
+static void kmem_cache_reap_percpu(void *arg)
+{
+	int cpu = smp_processor_id();
+	struct kmem_cache *s;
+	long phase = (long)arg;
+
+	list_for_each_entry(s, &slab_caches, list) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		if (phase == 0) {
+			flush_free_list_all(s, l);
+			flush_remote_free_cache(s, c);
+		}
+
+		if (phase == 1) {
+			claim_remote_free_list(s, l);
+			flush_free_list_all(s, l);
+		}
+	}
+}
+
+static void kmem_cache_reap(void)
+{
+	struct kmem_cache *s;
+	int node;
+
+	down_read(&slqb_lock);
+	on_each_cpu(kmem_cache_reap_percpu, (void *)0, 1);
+	on_each_cpu(kmem_cache_reap_percpu, (void *)1, 1);
+
+	list_for_each_entry(s, &slab_caches, list) {
+		for_each_node_state(node, N_NORMAL_MEMORY) {
+			struct kmem_cache_node *n;
+			struct kmem_cache_list *l;
+
+			n = s->node[node];
+			if (!n)
+				continue;
+			l = &n->list;
+
+			spin_lock_irq(&n->list_lock);
+			claim_remote_free_list(s, l);
+			flush_free_list_all(s, l);
+			spin_unlock_irq(&n->list_lock);
+		}
+	}
+	up_read(&slqb_lock);
+}
+#endif
+
+static void cache_trim_worker(struct work_struct *w)
+{
+	struct delayed_work *work =
+		container_of(w, struct delayed_work, work);
+	struct kmem_cache *s;
+
+	if (!down_read_trylock(&slqb_lock))
+		goto out;
+
+	list_for_each_entry(s, &slab_caches, list) {
+#ifdef CONFIG_NUMA
+		int node = numa_node_id();
+		struct kmem_cache_node *n = s->node[node];
+
+		if (n) {
+			struct kmem_cache_list *l = &n->list;
+
+			spin_lock_irq(&n->list_lock);
+			claim_remote_free_list(s, l);
+			flush_free_list(s, l);
+			spin_unlock_irq(&n->list_lock);
+		}
+#endif
+
+		local_irq_disable();
+		kmem_cache_trim_percpu(s);
+		local_irq_enable();
+	}
+
+	up_read(&slqb_lock);
+out:
+	schedule_delayed_work(work, round_jiffies_relative(3*HZ));
+}
+
+static DEFINE_PER_CPU(struct delayed_work, cache_trim_work);
+
+static void __cpuinit start_cpu_timer(int cpu)
+{
+	struct delayed_work *cache_trim_work = &per_cpu(cache_trim_work, cpu);
+
+	/*
+	 * When this gets called from do_initcalls via cpucache_init(),
+	 * init_workqueues() has already run, so keventd will be set up
+	 * by that time.
+	 */
+	if (keventd_up() && cache_trim_work->work.func == NULL) {
+		INIT_DELAYED_WORK(cache_trim_work, cache_trim_worker);
+		schedule_delayed_work_on(cpu, cache_trim_work,
+					__round_jiffies_relative(HZ, cpu));
+	}
+}
+
+static int __init cpucache_init(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		start_cpu_timer(cpu);
+
+	return 0;
+}
+device_initcall(cpucache_init);
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void slab_mem_going_offline_callback(void *arg)
+{
+	kmem_cache_reap();
+}
+
+static void slab_mem_offline_callback(void *arg)
+{
+	/* XXX: should release structures, see CPU offline comment */
+}
+
+static int slab_mem_going_online_callback(void *arg)
+{
+	struct kmem_cache *s;
+	struct kmem_cache_node *n;
+	struct memory_notify *marg = arg;
+	int nid = marg->status_change_nid;
+	int ret = 0;
+
+	/*
+	 * If the node's memory is already available, then kmem_cache_node is
+	 * already created. Nothing to do.
+	 */
+	if (nid < 0)
+		return 0;
+
+	/*
+	 * We are bringing a node online. No memory is available yet. We must
+	 * allocate a kmem_cache_node structure in order to bring the node
+	 * online.
+	 */
+	down_write(&slqb_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		/*
+		 * XXX: kmem_cache_alloc_node will fall back to other nodes
+		 *      since memory is not yet available from the node that
+		 *      is brought up.
+		 */
+		if (s->node[nid]) /* could be leftover from last online */
+			continue;
+		n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
+		if (!n) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		init_kmem_cache_node(s, n);
+		s->node[nid] = n;
+	}
+out:
+	up_write(&slqb_lock);
+	return ret;
+}
+
+static int slab_memory_callback(struct notifier_block *self,
+				unsigned long action, void *arg)
+{
+	int ret = 0;
+
+	switch (action) {
+	case MEM_GOING_ONLINE:
+		ret = slab_mem_going_online_callback(arg);
+		break;
+	case MEM_GOING_OFFLINE:
+		slab_mem_going_offline_callback(arg);
+		break;
+	case MEM_OFFLINE:
+	case MEM_CANCEL_ONLINE:
+		slab_mem_offline_callback(arg);
+		break;
+	case MEM_ONLINE:
+	case MEM_CANCEL_OFFLINE:
+		break;
+	}
+
+	ret = notifier_from_errno(ret);
+	return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+/********************************************************************
+ *			Basic setup of slabs
+ *******************************************************************/
+
+void __init kmem_cache_init(void)
+{
+	int i;
+	unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
+
+	/*
+	 * All the ifdefs are rather ugly here, but it's just the setup code,
+	 * so it doesn't have to be too readable :)
+	 */
+#ifdef CONFIG_SMP
+	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
+				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#else
+	kmem_size = sizeof(struct kmem_cache);
+#endif
+
+	kmem_cache_open(&kmem_cache_cache, "kmem_cache",
+			kmem_size, 0, flags, NULL, 0);
+#ifdef CONFIG_SMP
+	kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu",
+			sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_cache_open(&kmem_node_cache, "kmem_cache_node",
+			sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
+#endif
+
+#ifdef CONFIG_SMP
+	for_each_possible_cpu(i) {
+		struct kmem_cache_cpu *c;
+
+		c = &per_cpu(kmem_cache_cpus, i);
+		init_kmem_cache_cpu(&kmem_cache_cache, c);
+		kmem_cache_cache.cpu_slab[i] = c;
+
+		c = &per_cpu(kmem_cpu_cpus, i);
+		init_kmem_cache_cpu(&kmem_cpu_cache, c);
+		kmem_cpu_cache.cpu_slab[i] = c;
+
+#ifdef CONFIG_NUMA
+		c = &per_cpu(kmem_node_cpus, i);
+		init_kmem_cache_cpu(&kmem_node_cache, c);
+		kmem_node_cache.cpu_slab[i] = c;
+#endif
+	}
+#else
+	init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cache.cpu_slab);
+#endif
+
+#ifdef CONFIG_NUMA
+	for_each_node_state(i, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+
+		n = &per_cpu(kmem_cache_nodes, i);
+		init_kmem_cache_node(&kmem_cache_cache, n);
+		kmem_cache_cache.node[i] = n;
+
+		n = &per_cpu(kmem_cpu_nodes, i);
+		init_kmem_cache_node(&kmem_cpu_cache, n);
+		kmem_cpu_cache.node[i] = n;
+
+		n = &per_cpu(kmem_node_nodes, i);
+		init_kmem_cache_node(&kmem_node_cache, n);
+		kmem_node_cache.node[i] = n;
+	}
+#endif
+
+	/* Caches that are not of the two-to-the-power-of size */
+	if (L1_CACHE_BYTES < 64 && KMALLOC_MIN_SIZE <= 64) {
+		open_kmalloc_cache(&kmalloc_caches[1],
+				"kmalloc-96", 96, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[1],
+				"kmalloc_dma-96", 96, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+	if (L1_CACHE_BYTES < 128 && KMALLOC_MIN_SIZE <= 128) {
+		open_kmalloc_cache(&kmalloc_caches[2],
+				"kmalloc-192", 192, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[2],
+				"kmalloc_dma-192", 192, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+		open_kmalloc_cache(&kmalloc_caches[i],
+				"kmalloc", 1 << i, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+		open_kmalloc_cache(&kmalloc_caches_dma[i],
+				"kmalloc_dma", 1 << i, GFP_KERNEL|SLQB_DMA);
+#endif
+	}
+
+	/*
+	 * Patch up the size_index table if we have strange large alignment
+	 * requirements for the kmalloc array. This is only the case for
+	 * mips it seems. The standard arches will not generate any code here.
+	 *
+	 * Largest permitted alignment is 256 bytes due to the way we
+	 * handle the index determination for the smaller caches.
+	 *
+	 * Make sure that nothing crazy happens if someone starts tinkering
+	 * around with ARCH_KMALLOC_MINALIGN
+	 */
+	BUILD_BUG_ON(KMALLOC_MIN_SIZE > 256 ||
+		(KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
+
+	for (i = 8; i < KMALLOC_MIN_SIZE; i += 8)
+		size_index[(i - 1) / 8] = KMALLOC_SHIFT_LOW;
+
+	/* Provide the correct kmalloc names now that the caches are up */
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+		kmalloc_caches[i].name =
+			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
+#ifdef CONFIG_ZONE_DMA
+		kmalloc_caches_dma[i].name =
+			kasprintf(GFP_KERNEL, "kmalloc_dma-%d", 1 << i);
+#endif
+	}
+
+#ifdef CONFIG_SMP
+	register_cpu_notifier(&slab_notifier);
+#endif
+#ifdef CONFIG_NUMA
+	hotplug_memory_notifier(slab_memory_callback, 1);
+#endif
+	/*
+	 * smp_init() has not yet been called, so no worries about memory
+	 * ordering here (eg. slab_is_available vs numa_platform)
+	 */
+	__slab_is_available = 1;
+}
+
+/*
+ * Some basic slab creation sanity checks
+ */
+static int kmem_cache_create_ok(const char *name, size_t size,
+		size_t align, unsigned long flags)
+{
+	struct kmem_cache *tmp;
+
+	/*
+	 * Sanity checks... these are all serious usage bugs.
+	 */
+	if (!name || in_interrupt() || (size < sizeof(void *))) {
+		printk(KERN_ERR "kmem_cache_create(): early error in slab %s\n",
+				name);
+		dump_stack();
+
+		return 0;
+	}
+
+	down_read(&slqb_lock);
+
+	list_for_each_entry(tmp, &slab_caches, list) {
+		char x;
+		int res;
+
+		/*
+		 * This happens when the module gets unloaded and doesn't
+		 * destroy its slab cache and no-one else reuses the vmalloc
+		 * area of the module.  Print a warning.
+		 */
+		res = probe_kernel_address(tmp->name, x);
+		if (res) {
+			printk(KERN_ERR
+			       "SLAB: cache with size %d has lost its name\n",
+			       tmp->size);
+			continue;
+		}
+
+		if (!strcmp(tmp->name, name)) {
+			printk(KERN_ERR
+			       "kmem_cache_create(): duplicate cache %s\n", name);
+			dump_stack();
+			up_read(&slqb_lock);
+
+			return 0;
+		}
+	}
+
+	up_read(&slqb_lock);
+
+	WARN_ON(strchr(name, ' '));	/* It confuses parsers */
+	if (flags & SLAB_DESTROY_BY_RCU)
+		WARN_ON(flags & SLAB_POISON);
+
+	return 1;
+}
+
+struct kmem_cache *kmem_cache_create(const char *name, size_t size,
+		size_t align, unsigned long flags, void (*ctor)(void *))
+{
+	struct kmem_cache *s;
+
+	if (!kmem_cache_create_ok(name, size, align, flags))
+		goto err;
+
+	s = kmem_cache_alloc(&kmem_cache_cache, GFP_KERNEL);
+	if (!s)
+		goto err;
+
+	if (kmem_cache_open(s, name, size, align, flags, ctor, 1))
+		return s;
+
+	kmem_cache_free(&kmem_cache_cache, s);
+
+err:
+	if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+
+	return NULL;
+}
+EXPORT_SYMBOL(kmem_cache_create);
+
+#ifdef CONFIG_SMP
+/*
+ * Use the cpu notifier to ensure that the cpu slabs are flushed when
+ * necessary.
+ */
+static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
+				unsigned long action, void *hcpu)
+{
+	long cpu = (long)hcpu;
+	struct kmem_cache *s;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		down_write(&slqb_lock);
+		list_for_each_entry(s, &slab_caches, list) {
+			if (s->cpu_slab[cpu]) /* could be leftover from last online */
+				continue;
+			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu);
+			if (!s->cpu_slab[cpu]) {
+				up_write(&slqb_lock);
+				return NOTIFY_BAD;
+			}
+		}
+		up_write(&slqb_lock);
+		break;
+
+	case CPU_ONLINE:
+	case CPU_ONLINE_FROZEN:
+	case CPU_DOWN_FAILED:
+	case CPU_DOWN_FAILED_FROZEN:
+		start_cpu_timer(cpu);
+		break;
+
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		cancel_rearming_delayed_work(&per_cpu(cache_trim_work, cpu));
+		per_cpu(cache_trim_work, cpu).work.func = NULL;
+		break;
+
+	case CPU_UP_CANCELED:
+	case CPU_UP_CANCELED_FROZEN:
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+		/*
+		 * XXX: Freeing here doesn't work because objects can still be
+		 * on this CPU's list. periodic timer needs to check if a CPU
+		 * is offline and then try to cleanup from there. Same for node
+		 * offline.
+		 */
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata slab_notifier = {
+	.notifier_call = slab_cpuup_callback
+};
+
+#endif
+
+#ifdef CONFIG_SLQB_DEBUG
+void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
+{
+	struct kmem_cache *s;
+	int node = -1;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+#ifdef CONFIG_NUMA
+	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+		node = alternate_nid(s, flags, node);
+#endif
+	return slab_alloc(s, flags, node, (void *)caller);
+}
+
+void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
+				unsigned long caller)
+{
+	struct kmem_cache *s;
+
+	s = get_slab(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(s)))
+		return s;
+
+	return slab_alloc(s, flags, node, (void *)caller);
+}
+#endif
+
+#if defined(CONFIG_SLQB_SYSFS) || defined(CONFIG_SLABINFO)
+struct stats_gather {
+	struct kmem_cache *s;
+	spinlock_t lock;
+	unsigned long nr_slabs;
+	unsigned long nr_partial;
+	unsigned long nr_inuse;
+	unsigned long nr_objects;
+
+#ifdef CONFIG_SLQB_STATS
+	unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+};
+
+static void __gather_stats(void *arg)
+{
+	unsigned long nr_slabs;
+	unsigned long nr_partial;
+	unsigned long nr_inuse;
+	struct stats_gather *gather = arg;
+	int cpu = smp_processor_id();
+	struct kmem_cache *s = gather->s;
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_list *l = &c->list;
+	struct slqb_page *page;
+#ifdef CONFIG_SLQB_STATS
+	int i;
+#endif
+
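+	/* full slabs are entirely in use; partial slabs report per-page counts */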
+	nr_slabs = l->nr_slabs;
+	nr_partial = l->nr_partial;
+	nr_inuse = (nr_slabs - nr_partial) * s->objects;
+
+	list_for_each_entry(page, &l->partial, lru) {
+		nr_inuse += page->inuse;
+	}
+
+	spin_lock(&gather->lock);
+	gather->nr_slabs += nr_slabs;
+	gather->nr_partial += nr_partial;
+	gather->nr_inuse += nr_inuse;
+#ifdef CONFIG_SLQB_STATS
+	for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
+		gather->stats[i] += l->stats[i];
+#endif
+	spin_unlock(&gather->lock);
+}
+
+static void gather_stats(struct kmem_cache *s, struct stats_gather *stats)
+{
+#ifdef CONFIG_NUMA
+	int node;
+#endif
+
+	memset(stats, 0, sizeof(struct stats_gather));
+	stats->s = s;
+	spin_lock_init(&stats->lock);
+
+	down_read(&slqb_lock); /* hold off hotplug */
+
+	on_each_cpu(__gather_stats, stats, 1);
+
+#ifdef CONFIG_NUMA
+	for_each_online_node(node) {
+		struct kmem_cache_node *n = s->node[node];
+		struct kmem_cache_list *l;
+		struct slqb_page *page;
+		unsigned long flags;
+#ifdef CONFIG_SLQB_STATS
+		int i;
+#endif
+
+		/* memoryless nodes have no kmem_cache_node structure */
+		if (!n)
+			continue;
+		l = &n->list;
+
+		spin_lock_irqsave(&n->list_lock, flags);
+#ifdef CONFIG_SLQB_STATS
+		for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
+			stats->stats[i] += l->stats[i];
+#endif
+		stats->nr_slabs += l->nr_slabs;
+		stats->nr_partial += l->nr_partial;
+		stats->nr_inuse += (l->nr_slabs - l->nr_partial) * s->objects;
+
+		list_for_each_entry(page, &l->partial, lru) {
+			stats->nr_inuse += page->inuse;
+		}
+		spin_unlock_irqrestore(&n->list_lock, flags);
+	}
+#endif
+
+	up_read(&slqb_lock);
+
+	stats->nr_objects = stats->nr_slabs * s->objects;
+}
+#endif
+
+/*
+ * The /proc/slabinfo ABI
+ */
+#ifdef CONFIG_SLABINFO
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+ssize_t slabinfo_write(struct file *file, const char __user * buffer,
+		       size_t count, loff_t *ppos)
+{
+	return -EINVAL;
+}
+
+static void print_slabinfo_header(struct seq_file *m)
+{
+	seq_puts(m, "slabinfo - version: 2.1\n");
+	seq_puts(m, "# name	    <active_objs> <num_objs> <objsize> "
+		 "<objperslab> <pagesperslab>");
+	seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
+	seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
+	seq_putc(m, '\n');
+}
+
+static void *s_start(struct seq_file *m, loff_t *pos)
+{
+	loff_t n = *pos;
+
+	down_read(&slqb_lock);
+	if (!n)
+		print_slabinfo_header(m);
+
+	return seq_list_start(&slab_caches, *pos);
+}
+
+static void *s_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	return seq_list_next(p, &slab_caches, pos);
+}
+
+static void s_stop(struct seq_file *m, void *p)
+{
+	up_read(&slqb_lock);
+}
+
+static int s_show(struct seq_file *m, void *p)
+{
+	struct stats_gather stats;
+	struct kmem_cache *s;
+
+	s = list_entry(p, struct kmem_cache, list);
+
+	gather_stats(s, &stats);
+
+	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, stats.nr_inuse,
+			stats.nr_objects, s->size, s->objects, (1 << s->order));
+	seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s),
+			slab_freebatch(s), 0);
+	seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs,
+			stats.nr_slabs, 0UL);
+	seq_putc(m, '\n');
+	return 0;
+}
+
+static const struct seq_operations slabinfo_op = {
+	.start = s_start,
+	.next = s_next,
+	.stop = s_stop,
+	.show = s_show,
+};
+
+static int slabinfo_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &slabinfo_op);
+}
+
+static const struct file_operations proc_slabinfo_operations = {
+	.open		= slabinfo_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static int __init slab_proc_init(void)
+{
+	proc_create("slabinfo", S_IWUSR|S_IRUGO, NULL,
+			&proc_slabinfo_operations);
+	return 0;
+}
+module_init(slab_proc_init);
+#endif /* CONFIG_SLABINFO */
+
+#ifdef CONFIG_SLQB_SYSFS
+/*
+ * sysfs API
+ */
+#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
+#define to_slab(n) container_of(n, struct kmem_cache, kobj)
+
+struct slab_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct kmem_cache *s, char *buf);
+	ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
+};
+
+#define SLAB_ATTR_RO(_name) \
+	static struct slab_attribute _name##_attr = __ATTR_RO(_name)
+
+#define SLAB_ATTR(_name) \
+	static struct slab_attribute _name##_attr =  \
+	__ATTR(_name, 0644, _name##_show, _name##_store)
+
+static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->size);
+}
+SLAB_ATTR_RO(slab_size);
+
+static ssize_t align_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->align);
+}
+SLAB_ATTR_RO(align);
+
+static ssize_t object_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->objsize);
+}
+SLAB_ATTR_RO(object_size);
+
+static ssize_t objs_per_slab_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->objects);
+}
+SLAB_ATTR_RO(objs_per_slab);
+
+static ssize_t order_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->order);
+}
+SLAB_ATTR_RO(order);
+
+static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+{
+	if (s->ctor) {
+		int n = sprint_symbol(buf, (unsigned long)s->ctor);
+
+		return n + sprintf(buf + n, "\n");
+	}
+	return 0;
+}
+SLAB_ATTR_RO(ctor);
+
+static ssize_t slabs_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+
+	gather_stats(s, &stats);
+
+	return sprintf(buf, "%lu\n", stats.nr_slabs);
+}
+SLAB_ATTR_RO(slabs);
+
+static ssize_t objects_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+
+	gather_stats(s, &stats);
+
+	return sprintf(buf, "%lu\n", stats.nr_inuse);
+}
+SLAB_ATTR_RO(objects);
+
+static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
+{
+	struct stats_gather stats;
+
+	gather_stats(s, &stats);
+
+	return sprintf(buf, "%lu\n", stats.nr_objects);
+}
+SLAB_ATTR_RO(total_objects);
+
+static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
+}
+SLAB_ATTR_RO(reclaim_account);
+
+static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_HWCACHE_ALIGN));
+}
+SLAB_ATTR_RO(hwcache_align);
+
+#ifdef CONFIG_ZONE_DMA
+static ssize_t cache_dma_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA));
+}
+SLAB_ATTR_RO(cache_dma);
+#endif
+
+static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
+}
+SLAB_ATTR_RO(destroy_by_rcu);
+
+static ssize_t red_zone_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE));
+}
+SLAB_ATTR_RO(red_zone);
+
+static ssize_t poison_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_POISON));
+}
+SLAB_ATTR_RO(poison);
+
+static ssize_t store_user_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", !!(s->flags & SLAB_STORE_USER));
+}
+SLAB_ATTR_RO(store_user);
+
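+/*
+ * hiwater and freebatch are runtime tunables, e.g. (assuming sysfs is
+ * mounted at /sys):
+ *   echo 4096 > /sys/kernel/slab/kmalloc-64/hiwater
+ *   echo 512  > /sys/kernel/slab/kmalloc-64/freebatch
+ */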
+static ssize_t hiwater_store(struct kmem_cache *s,
+				const char *buf, size_t length)
+{
+	long hiwater;
+	int err;
+
+	err = strict_strtol(buf, 10, &hiwater);
+	if (err)
+		return err;
+
+	if (hiwater < 0)
+		return -EINVAL;
+
+	s->hiwater = hiwater;
+
+	return length;
+}
+
+static ssize_t hiwater_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", slab_hiwater(s));
+}
+SLAB_ATTR(hiwater);
+
+static ssize_t freebatch_store(struct kmem_cache *s,
+				const char *buf, size_t length)
+{
+	long freebatch;
+	int err;
+
+	err = strict_strtol(buf, 10, &freebatch);
+	if (err)
+		return err;
+
+	if (freebatch <= 0 || freebatch - 1 > s->hiwater)
+		return -EINVAL;
+
+	s->freebatch = freebatch;
+
+	return length;
+}
+
+static ssize_t freebatch_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", slab_freebatch(s));
+}
+SLAB_ATTR(freebatch);
+
+#ifdef CONFIG_SLQB_STATS
+static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
+{
+	struct stats_gather stats;
+	int len;
+#ifdef CONFIG_SMP
+	int cpu;
+#endif
+
+	gather_stats(s, &stats);
+
+	len = sprintf(buf, "%lu", stats.stats[si]);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+		struct kmem_cache_list *l = &c->list;
+
+		if (len < PAGE_SIZE - 20)
+			len += sprintf(buf+len, " C%d=%lu", cpu, l->stats[si]);
+	}
+#endif
+	return len + sprintf(buf + len, "\n");
+}
+
+#define STAT_ATTR(si, text) 					\
+static ssize_t text##_show(struct kmem_cache *s, char *buf)	\
+{								\
+	return show_stat(s, buf, si);				\
+}								\
+SLAB_ATTR_RO(text);
+
+STAT_ATTR(ALLOC, alloc);
+STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
+STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
+STAT_ATTR(FREE, free);
+STAT_ATTR(FREE_REMOTE, free_remote);
+STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
+STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
+STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
+STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
+STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
+STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
+STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
+STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
+STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
+#endif
+
+static struct attribute *slab_attrs[] = {
+	&slab_size_attr.attr,
+	&object_size_attr.attr,
+	&objs_per_slab_attr.attr,
+	&order_attr.attr,
+	&objects_attr.attr,
+	&total_objects_attr.attr,
+	&slabs_attr.attr,
+	&ctor_attr.attr,
+	&align_attr.attr,
+	&hwcache_align_attr.attr,
+	&reclaim_account_attr.attr,
+	&destroy_by_rcu_attr.attr,
+	&red_zone_attr.attr,
+	&poison_attr.attr,
+	&store_user_attr.attr,
+	&hiwater_attr.attr,
+	&freebatch_attr.attr,
+#ifdef CONFIG_ZONE_DMA
+	&cache_dma_attr.attr,
+#endif
+#ifdef CONFIG_SLQB_STATS
+	&alloc_attr.attr,
+	&alloc_slab_fill_attr.attr,
+	&alloc_slab_new_attr.attr,
+	&free_attr.attr,
+	&free_remote_attr.attr,
+	&flush_free_list_attr.attr,
+	&flush_free_list_objects_attr.attr,
+	&flush_free_list_remote_attr.attr,
+	&flush_slab_partial_attr.attr,
+	&flush_slab_free_attr.attr,
+	&flush_rfree_list_attr.attr,
+	&flush_rfree_list_objects_attr.attr,
+	&claim_remote_list_attr.attr,
+	&claim_remote_list_objects_attr.attr,
+#endif
+	NULL
+};
+
+static struct attribute_group slab_attr_group = {
+	.attrs = slab_attrs,
+};
+
+static ssize_t slab_attr_show(struct kobject *kobj,
+				struct attribute *attr, char *buf)
+{
+	struct slab_attribute *attribute;
+	struct kmem_cache *s;
+	int err;
+
+	attribute = to_slab_attr(attr);
+	s = to_slab(kobj);
+
+	if (!attribute->show)
+		return -EIO;
+
+	err = attribute->show(s, buf);
+
+	return err;
+}
+
+static ssize_t slab_attr_store(struct kobject *kobj,
+			struct attribute *attr, const char *buf, size_t len)
+{
+	struct slab_attribute *attribute;
+	struct kmem_cache *s;
+	int err;
+
+	attribute = to_slab_attr(attr);
+	s = to_slab(kobj);
+
+	if (!attribute->store)
+		return -EIO;
+
+	err = attribute->store(s, buf, len);
+
+	return err;
+}
+
+static void kmem_cache_release(struct kobject *kobj)
+{
+	struct kmem_cache *s = to_slab(kobj);
+
+	kmem_cache_free(&kmem_cache_cache, s);
+}
+
+static struct sysfs_ops slab_sysfs_ops = {
+	.show = slab_attr_show,
+	.store = slab_attr_store,
+};
+
+static struct kobj_type slab_ktype = {
+	.sysfs_ops = &slab_sysfs_ops,
+	.release = kmem_cache_release
+};
+
+static int uevent_filter(struct kset *kset, struct kobject *kobj)
+{
+	struct kobj_type *ktype = get_ktype(kobj);
+
+	if (ktype == &slab_ktype)
+		return 1;
+	return 0;
+}
+
+static struct kset_uevent_ops slab_uevent_ops = {
+	.filter = uevent_filter,
+};
+
+static struct kset *slab_kset;
+
+static int sysfs_available __read_mostly = 0;
+
+static int sysfs_slab_add(struct kmem_cache *s)
+{
+	int err;
+
+	if (!sysfs_available)
+		return 0;
+
+	s->kobj.kset = slab_kset;
+	err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", s->name);
+	if (err) {
+		kobject_put(&s->kobj);
+		return err;
+	}
+
+	err = sysfs_create_group(&s->kobj, &slab_attr_group);
+	if (err)
+		return err;
+
+	kobject_uevent(&s->kobj, KOBJ_ADD);
+
+	return 0;
+}
+
+static void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kobject_uevent(&s->kobj, KOBJ_REMOVE);
+	kobject_del(&s->kobj);
+	kobject_put(&s->kobj);
+}
+
+static int __init slab_sysfs_init(void)
+{
+	struct kmem_cache *s;
+	int err;
+
+	slab_kset = kset_create_and_add("slab", &slab_uevent_ops, kernel_kobj);
+	if (!slab_kset) {
+		printk(KERN_ERR "Cannot register slab subsystem.\n");
+		return -ENOSYS;
+	}
+
+	down_write(&slqb_lock);
+
+	sysfs_available = 1;
+
+	list_for_each_entry(s, &slab_caches, list) {
+		err = sysfs_slab_add(s);
+		if (err)
+			printk(KERN_ERR "SLQB: Unable to add boot slab %s"
+						" to sysfs\n", s->name);
+	}
+
+	up_write(&slqb_lock);
+
+	return 0;
+}
+device_initcall(slab_sysfs_init);
+
+#endif
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -65,6 +65,10 @@
 /* The following flags affect the page allocator grouping pages by mobility */
 #define SLAB_RECLAIM_ACCOUNT	0x00020000UL		/* Objects are reclaimable */
 #define SLAB_TEMPORARY		SLAB_RECLAIM_ACCOUNT	/* Objects are short-lived */
+
+/* Following flags should only be used by allocator specific flags */
+#define SLAB_ALLOC_PRIVATE	0x000000ffUL
+
 /*
  * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
  *
@@ -150,6 +154,8 @@ size_t ksize(const void *);
  */
 #ifdef CONFIG_SLUB
 #include <linux/slub_def.h>
+#elif defined(CONFIG_SLQB)
+#include <linux/slqb_def.h>
 #elif defined(CONFIG_SLOB)
 #include <linux/slob_def.h>
 #else
@@ -252,7 +258,7 @@ static inline void *kmem_cache_alloc_nod
  * allocator where we care about the real place the memory allocation
  * request comes from.
  */
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
 extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
 #define kmalloc_track_caller(size, flags) \
 	__kmalloc_track_caller(size, flags, _RET_IP_)
@@ -270,7 +276,7 @@ extern void *__kmalloc_track_caller(size
  * standard allocator where we care about the real place the memory
  * allocation request comes from.
  */
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
 extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long);
 #define kmalloc_node_track_caller(size, flags, node) \
 	__kmalloc_node_track_caller(size, flags, node, \
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
+obj-$(CONFIG_SLQB) += slqb.o
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
Index: linux-2.6/include/linux/rcu_types.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rcu_types.h
@@ -0,0 +1,18 @@
+#ifndef __LINUX_RCU_TYPES_H
+#define __LINUX_RCU_TYPES_H
+
+#ifdef __KERNEL__
+
+/**
+ * struct rcu_head - callback structure for use with RCU
+ * @next: next update requests in a list
+ * @func: actual update function to call after the grace period.
+ */
+struct rcu_head {
+	struct rcu_head *next;
+	void (*func)(struct rcu_head *head);
+};
+
+#endif
+
+#endif
Index: linux-2.6/arch/x86/include/asm/page.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/page.h
+++ linux-2.6/arch/x86/include/asm/page.h
@@ -194,6 +194,7 @@ static inline pteval_t native_pte_flags(
  * virt_addr_valid(kaddr) returns true.
  */
 #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
+#define virt_to_page_fast(kaddr) pfn_to_page(((unsigned long)(kaddr) - PAGE_OFFSET) >> PAGE_SHIFT)
 #define pfn_to_kaddr(pfn)      __va((pfn) << PAGE_SHIFT)
 extern bool __virt_addr_valid(unsigned long kaddr);
 #define virt_addr_valid(kaddr)	__virt_addr_valid((unsigned long) (kaddr))
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -305,7 +305,11 @@ static inline void get_page(struct page
 
 static inline struct page *virt_to_head_page(const void *x)
 {
+#ifdef virt_to_page_fast
+	struct page *page = virt_to_page_fast(x);
+#else
 	struct page *page = virt_to_page(x);
+#endif
 	return compound_head(page);
 }
 
Index: linux-2.6/Documentation/vm/slqbinfo.c
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/vm/slqbinfo.c
@@ -0,0 +1,1054 @@
+/*
+ * Slabinfo: Tool to get reports about slabs
+ *
+ * (C) 2007 sgi, Christoph Lameter
+ *
+ * Reworked by Lin Ming <ming.m.lin@intel.com> for SLQB
+ *
+ * Compile by:
+ *
+ * gcc -o slqbinfo slqbinfo.c
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <strings.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <getopt.h>
+#include <regex.h>
+#include <errno.h>
+
+#define MAX_SLABS 500
+#define MAX_ALIASES 500
+#define MAX_NODES 1024
+
+struct slabinfo {
+	char *name;
+	int align, cache_dma, destroy_by_rcu;
+	int hwcache_align, object_size, objs_per_slab;
+	int slab_size, store_user;
+	int order, poison, reclaim_account, red_zone;
+	int batch;
+	unsigned long objects, slabs, total_objects;
+	unsigned long alloc, alloc_slab_fill, alloc_slab_new;
+	unsigned long free, free_remote;
+	unsigned long claim_remote_list, claim_remote_list_objects;
+	unsigned long flush_free_list, flush_free_list_objects, flush_free_list_remote;
+	unsigned long flush_rfree_list, flush_rfree_list_objects;
+	unsigned long flush_slab_free, flush_slab_partial;
+	int numa[MAX_NODES];
+	int numa_partial[MAX_NODES];
+} slabinfo[MAX_SLABS];
+
+int slabs = 0;
+int actual_slabs = 0;
+int highest_node = 0;
+
+char buffer[4096];
+
+int show_empty = 0;
+int show_report = 0;
+int show_slab = 0;
+int skip_zero = 1;
+int show_numa = 0;
+int show_track = 0;
+int validate = 0;
+int shrink = 0;
+int show_inverted = 0;
+int show_totals = 0;
+int sort_size = 0;
+int sort_active = 0;
+int set_debug = 0;
+int show_ops = 0;
+int show_activity = 0;
+
+/* Debug options */
+int sanity = 0;
+int redzone = 0;
+int poison = 0;
+int tracking = 0;
+int tracing = 0;
+
+int page_size;
+
+regex_t pattern;
+
+void fatal(const char *x, ...)
+{
+	va_list ap;
+
+	va_start(ap, x);
+	vfprintf(stderr, x, ap);
+	va_end(ap);
+	exit(EXIT_FAILURE);
+}
+
+void usage(void)
+{
+	printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
+		"slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+		"-A|--activity          Most active slabs first\n"
+		"-d<options>|--debug=<options> Set/Clear Debug options\n"
+		"-D|--display-active    Switch line format to activity\n"
+		"-e|--empty             Show empty slabs\n"
+		"-h|--help              Show usage information\n"
+		"-i|--inverted          Inverted list\n"
+		"-l|--slabs             Show slabs\n"
+		"-n|--numa              Show NUMA information\n"
+		"-o|--ops		Show kmem_cache_ops\n"
+		"-s|--shrink            Shrink slabs\n"
+		"-r|--report		Detailed report on single slabs\n"
+		"-S|--Size              Sort by size\n"
+		"-t|--tracking          Show alloc/free information\n"
+		"-T|--Totals            Show summary information\n"
+		"-v|--validate          Validate slabs\n"
+		"-z|--zero              Include empty slabs\n"
+		"\nValid debug options (FZPUT may be combined)\n"
+		"a / A          Switch on all debug options (=FZUP)\n"
+		"-              Switch off all debug options\n"
+		"f / F          Sanity Checks (SLAB_DEBUG_FREE)\n"
+		"z / Z          Redzoning\n"
+		"p / P          Poisoning\n"
+		"u / U          Tracking\n"
+		"t / T          Tracing\n"
+	);
+}
+
+unsigned long read_obj(const char *name)
+{
+	FILE *f = fopen(name, "r");
+
+	if (!f)
+		buffer[0] = 0;
+	else {
+		if (!fgets(buffer, sizeof(buffer), f))
+			buffer[0] = 0;
+		fclose(f);
+		if (buffer[0] && buffer[strlen(buffer) - 1] == '\n')
+			buffer[strlen(buffer) - 1] = 0;
+	}
+	return strlen(buffer);
+}
+
+
+/*
+ * Get the contents of an attribute
+ */
+unsigned long get_obj(const char *name)
+{
+	if (!read_obj(name))
+		return 0;
+
+	return atol(buffer);
+}
+
+unsigned long get_obj_and_str(const char *name, char **x)
+{
+	unsigned long result = 0;
+	char *p;
+
+	*x = NULL;
+
+	if (!read_obj(name)) {
+		x = NULL;
+		return 0;
+	}
+	result = strtoul(buffer, &p, 10);
+	while (*p == ' ')
+		p++;
+	if (*p)
+		*x = strdup(p);
+	return result;
+}
+
+void set_obj(struct slabinfo *s, const char *name, int n)
+{
+	char x[100];
+	FILE *f;
+
+	snprintf(x, 100, "%s/%s", s->name, name);
+	f = fopen(x, "w");
+	if (!f)
+		fatal("Cannot write to %s\n", x);
+
+	fprintf(f, "%d\n", n);
+	fclose(f);
+}
+
+unsigned long read_slab_obj(struct slabinfo *s, const char *name)
+{
+	char x[100];
+	FILE *f;
+	size_t l;
+
+	snprintf(x, 100, "%s/%s", s->name, name);
+	f = fopen(x, "r");
+	if (!f) {
+		buffer[0] = 0;
+		l = 0;
+	} else {
+		l = fread(buffer, 1, sizeof(buffer), f);
+		buffer[l] = 0;
+		fclose(f);
+	}
+	return l;
+}
+
+
+/*
+ * Put a size string together
+ */
+int store_size(char *buffer, unsigned long value)
+{
+	unsigned long divisor = 1;
+	char trailer = 0;
+	int n;
+
+	if (value > 1000000000UL) {
+		divisor = 100000000UL;
+		trailer = 'G';
+	} else if (value > 1000000UL) {
+		divisor = 100000UL;
+		trailer = 'M';
+	} else if (value > 1000UL) {
+		divisor = 100;
+		trailer = 'K';
+	}
+
+	value /= divisor;
+	n = sprintf(buffer, "%ld",value);
+	if (trailer) {
+		buffer[n] = trailer;
+		n++;
+		buffer[n] = 0;
+	}
+	if (divisor != 1) {
+		memmove(buffer + n - 2, buffer + n - 3, 4);
+		buffer[n-2] = '.';
+		n++;
+	}
+	return n;
+}
+
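+/* Parse a sysfs NUMA list such as "N0=128 N1=64" into a per-node array */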
+void decode_numa_list(int *numa, char *t)
+{
+	int node;
+	int nr;
+
+	memset(numa, 0, MAX_NODES * sizeof(int));
+
+	if (!t)
+		return;
+
+	while (*t == 'N') {
+		t++;
+		node = strtoul(t, &t, 10);
+		if (*t == '=') {
+			t++;
+			nr = strtoul(t, &t, 10);
+			numa[node] = nr;
+			if (node > highest_node)
+				highest_node = node;
+		}
+		while (*t == ' ')
+			t++;
+	}
+}
+
+void slab_validate(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	set_obj(s, "validate", 1);
+}
+
+void slab_shrink(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	set_obj(s, "shrink", 1);
+}
+
+int line = 0;
+
+void first_line(void)
+{
+	if (show_activity)
+		printf("Name                   Objects      Alloc       Free   %%Fill %%New  "
+			"FlushR %%FlushR FlushR_Objs O\n");
+	else
+		printf("Name                   Objects Objsize    Space "
+			" O/S O %%Ef Batch Flg\n");
+}
+
+unsigned long slab_size(struct slabinfo *s)
+{
+	return 	s->slabs * (page_size << s->order);
+}
+
+unsigned long slab_activity(struct slabinfo *s)
+{
+	return 	s->alloc + s->free;
+}
+
+void slab_numa(struct slabinfo *s, int mode)
+{
+	int node;
+
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (!highest_node) {
+		printf("\n%s: No NUMA information available.\n", s->name);
+		return;
+	}
+
+	if (skip_zero && !s->slabs)
+		return;
+
+	if (!line) {
+		printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+		for(node = 0; node <= highest_node; node++)
+			printf(" %4d", node);
+		printf("\n----------------------");
+		for(node = 0; node <= highest_node; node++)
+			printf("-----");
+		printf("\n");
+	}
+	printf("%-21s ", mode ? "All slabs" : s->name);
+	for(node = 0; node <= highest_node; node++) {
+		char b[20];
+
+		store_size(b, s->numa[node]);
+		printf(" %4s", b);
+	}
+	printf("\n");
+	if (mode) {
+		printf("%-21s ", "Partial slabs");
+		for(node = 0; node <= highest_node; node++) {
+			char b[20];
+
+			store_size(b, s->numa_partial[node]);
+			printf(" %4s", b);
+		}
+		printf("\n");
+	}
+	line++;
+}
+
+void show_tracking(struct slabinfo *s)
+{
+	printf("\n%s: Kernel object allocation\n", s->name);
+	printf("-----------------------------------------------------------------------\n");
+	if (read_slab_obj(s, "alloc_calls"))
+		printf(buffer);
+	else
+		printf("No Data\n");
+
+	printf("\n%s: Kernel object freeing\n", s->name);
+	printf("------------------------------------------------------------------------\n");
+	if (read_slab_obj(s, "free_calls"))
+		printf(buffer);
+	else
+		printf("No Data\n");
+
+}
+
+void ops(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (read_slab_obj(s, "ops")) {
+		printf("\n%s: kmem_cache operations\n", s->name);
+		printf("--------------------------------------------\n");
+		printf(buffer);
+	} else
+		printf("\n%s has no kmem_cache operations\n", s->name);
+}
+
+const char *onoff(int x)
+{
+	if (x)
+		return "On ";
+	return "Off";
+}
+
+void slab_stats(struct slabinfo *s)
+{
+	unsigned long total_alloc;
+	unsigned long total_free;
+	unsigned long total;
+
+	total_alloc = s->alloc;
+	total_free = s->free;
+
+	if (!total_alloc)
+		return;
+
+	printf("\n");
+	printf("Slab Perf Counter\n");
+	printf("------------------------------------------------------------------------\n");
+	printf("Alloc: %8lu, partial %8lu, page allocator %8lu\n",
+		total_alloc,
+		s->alloc_slab_fill, s->alloc_slab_new);
+	printf("Free:  %8lu, partial %8lu, page allocator %8lu, remote %5lu\n",
+		total_free,
+		s->flush_slab_partial,
+		s->flush_slab_free,
+		s->free_remote);
+	printf("Claim: %8lu, objects %8lu\n",
+		s->claim_remote_list,
+		s->claim_remote_list_objects);
+	printf("Flush: %8lu, objects %8lu, remote: %8lu\n",
+		s->flush_free_list,
+		s->flush_free_list_objects,
+		s->flush_free_list_remote);
+	printf("FlushR:%8lu, objects %8lu\n",
+		s->flush_rfree_list,
+		s->flush_rfree_list_objects);
+}
+
+void report(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	printf("\nSlabcache: %-20s  Order : %2d Objects: %lu\n",
+		s->name, s->order, s->objects);
+	if (s->hwcache_align)
+		printf("** Hardware cacheline aligned\n");
+	if (s->cache_dma)
+		printf("** Memory is allocated in a special DMA zone\n");
+	if (s->destroy_by_rcu)
+		printf("** Slabs are destroyed via RCU\n");
+	if (s->reclaim_account)
+		printf("** Reclaim accounting active\n");
+
+	printf("\nSizes (bytes)     Slabs              Debug                Memory\n");
+	printf("------------------------------------------------------------------------\n");
+	printf("Object : %7d  Total  : %7ld   Sanity Checks : %s  Total: %7ld\n",
+			s->object_size, s->slabs, "N/A",
+			s->slabs * (page_size << s->order));
+	printf("SlabObj: %7d  Full   : %7s   Redzoning     : %s  Used : %7ld\n",
+			s->slab_size, "N/A",
+			onoff(s->red_zone), s->objects * s->object_size);
+	printf("SlabSiz: %7d  Partial: %7s   Poisoning     : %s  Loss : %7ld\n",
+			page_size << s->order, "N/A", onoff(s->poison),
+			s->slabs * (page_size << s->order) - s->objects * s->object_size);
+	printf("Loss   : %7d  CpuSlab: %7s   Tracking      : %s  Lalig: %7ld\n",
+			s->slab_size - s->object_size, "N/A", onoff(s->store_user),
+			(s->slab_size - s->object_size) * s->objects);
+	printf("Align  : %7d  Objects: %7d   Tracing       : %s  Lpadd: %7ld\n",
+			s->align, s->objs_per_slab, "N/A",
+			((page_size << s->order) - s->objs_per_slab * s->slab_size) *
+			s->slabs);
+
+	ops(s);
+	show_tracking(s);
+	slab_numa(s, 1);
+	slab_stats(s);
+}
+
+void slabcache(struct slabinfo *s)
+{
+	char size_str[20];
+	char flags[20];
+	char *p = flags;
+
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (actual_slabs == 1) {
+		report(s);
+		return;
+	}
+
+	if (skip_zero && !show_empty && !s->slabs)
+		return;
+
+	if (show_empty && s->slabs)
+		return;
+
+	store_size(size_str, slab_size(s));
+
+	if (!line++)
+		first_line();
+
+	if (s->cache_dma)
+		*p++ = 'd';
+	if (s->hwcache_align)
+		*p++ = 'A';
+	if (s->poison)
+		*p++ = 'P';
+	if (s->reclaim_account)
+		*p++ = 'a';
+	if (s->red_zone)
+		*p++ = 'Z';
+	if (s->store_user)
+		*p++ = 'U';
+
+	*p = 0;
+	if (show_activity) {
+		unsigned long total_alloc;
+		unsigned long total_free;
+
+		total_alloc = s->alloc;
+		total_free = s->free;
+
+		printf("%-21s %8ld %10ld %10ld %5ld %5ld %7ld %5d %7ld %8d\n",
+			s->name, s->objects,
+			total_alloc, total_free,
+			total_alloc ? (s->alloc_slab_fill * 100 / total_alloc) : 0,
+			total_alloc ? (s->alloc_slab_new * 100 / total_alloc) : 0,
+			s->flush_rfree_list,
+			(total_alloc + total_free) ?
+				(s->flush_rfree_list * 100 / (total_alloc + total_free)) : 0,
+			s->flush_rfree_list_objects,
+			s->order);
+	}
+	else
+		printf("%-21s %8ld %7d %8s %4d %1d %3ld %4ld %s\n",
+			s->name, s->objects, s->object_size, size_str,
+			s->objs_per_slab, s->order,
+			s->slabs ? (s->objects * s->object_size * 100) /
+				(s->slabs * (page_size << s->order)) : 100,
+			s->batch, flags);
+}
+
+/*
+ * Analyze debug options. Return false if something is amiss.
+ */
+int debug_opt_scan(char *opt)
+{
+	if (!opt || !opt[0] || strcmp(opt, "-") == 0)
+		return 1;
+
+	if (strcasecmp(opt, "a") == 0) {
+		sanity = 1;
+		poison = 1;
+		redzone = 1;
+		tracking = 1;
+		return 1;
+	}
+
+	for ( ; *opt; opt++)
+	 	switch (*opt) {
+		case 'F' : case 'f':
+			if (sanity)
+				return 0;
+			sanity = 1;
+			break;
+		case 'P' : case 'p':
+			if (poison)
+				return 0;
+			poison = 1;
+			break;
+
+		case 'Z' : case 'z':
+			if (redzone)
+				return 0;
+			redzone = 1;
+			break;
+
+		case 'U' : case 'u':
+			if (tracking)
+				return 0;
+			tracking = 1;
+			break;
+
+		case 'T' : case 't':
+			if (tracing)
+				return 0;
+			tracing = 1;
+			break;
+		default:
+			return 0;
+		}
+	return 1;
+}
+
+int slab_empty(struct slabinfo *s)
+{
+	if (s->objects > 0)
+		return 0;
+
+	/*
+	 * We may still have slabs even if there are no objects. Shrinking will
+	 * remove them.
+	 */
+	if (s->slabs != 0)
+		set_obj(s, "shrink", 1);
+
+	return 1;
+}
+
+void slab_debug(struct slabinfo *s)
+{
+	if (strcmp(s->name, "*") == 0)
+		return;
+
+	if (redzone && !s->red_zone) {
+		if (slab_empty(s))
+			set_obj(s, "red_zone", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
+	}
+	if (!redzone && s->red_zone) {
+		if (slab_empty(s))
+			set_obj(s, "red_zone", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
+	}
+	if (poison && !s->poison) {
+		if (slab_empty(s))
+			set_obj(s, "poison", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
+	}
+	if (!poison && s->poison) {
+		if (slab_empty(s))
+			set_obj(s, "poison", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
+	}
+	if (tracking && !s->store_user) {
+		if (slab_empty(s))
+			set_obj(s, "store_user", 1);
+		else
+			fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
+	}
+	if (!tracking && s->store_user) {
+		if (slab_empty(s))
+			set_obj(s, "store_user", 0);
+		else
+			fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
+	}
+}
+
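+/*
+ * Aggregate object, slab, memory and waste statistics across all caches
+ * and print the min/max/average summary produced by the -T option.
+ */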
+void totals(void)
+{
+	struct slabinfo *s;
+
+	int used_slabs = 0;
+	char b1[20], b2[20], b3[20], b4[20];
+	unsigned long long max = 1ULL << 63;
+
+	/* Object size */
+	unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
+
+	/* Number of partial slabs in a slabcache */
+	unsigned long long min_partial = max, max_partial = 0,
+				avg_partial, total_partial = 0;
+
+	/* Number of slabs in a slab cache */
+	unsigned long long min_slabs = max, max_slabs = 0,
+				avg_slabs, total_slabs = 0;
+
+	/* Size of the whole slab */
+	unsigned long long min_size = max, max_size = 0,
+				avg_size, total_size = 0;
+
+	/* Bytes used for object storage in a slab */
+	unsigned long long min_used = max, max_used = 0,
+				avg_used, total_used = 0;
+
+	/* Waste: Bytes used for alignment and padding */
+	unsigned long long min_waste = max, max_waste = 0,
+				avg_waste, total_waste = 0;
+	/* Number of objects in a slab */
+	unsigned long long min_objects = max, max_objects = 0,
+				avg_objects, total_objects = 0;
+	/* Waste per object */
+	unsigned long long min_objwaste = max,
+				max_objwaste = 0, avg_objwaste,
+				total_objwaste = 0;
+
+	/* Memory per object */
+	unsigned long long min_memobj = max,
+				max_memobj = 0, avg_memobj,
+				total_objsize = 0;
+
+	for (s = slabinfo; s < slabinfo + slabs; s++) {
+		unsigned long long size;
+		unsigned long used;
+		unsigned long long wasted;
+		unsigned long long objwaste;
+
+		if (!s->slabs || !s->objects)
+			continue;
+
+		used_slabs++;
+
+		size = slab_size(s);
+		used = s->objects * s->object_size;
+		wasted = size - used;
+		objwaste = s->slab_size - s->object_size;
+
+		if (s->object_size < min_objsize)
+			min_objsize = s->object_size;
+		if (s->slabs < min_slabs)
+			min_slabs = s->slabs;
+		if (size < min_size)
+			min_size = size;
+		if (wasted < min_waste)
+			min_waste = wasted;
+		if (objwaste < min_objwaste)
+			min_objwaste = objwaste;
+		if (s->objects < min_objects)
+			min_objects = s->objects;
+		if (used < min_used)
+			min_used = used;
+		if (s->slab_size < min_memobj)
+			min_memobj = s->slab_size;
+
+		if (s->object_size > max_objsize)
+			max_objsize = s->object_size;
+		if (s->slabs > max_slabs)
+			max_slabs = s->slabs;
+		if (size > max_size)
+			max_size = size;
+		if (wasted > max_waste)
+			max_waste = wasted;
+		if (objwaste > max_objwaste)
+			max_objwaste = objwaste;
+		if (s->objects > max_objects)
+			max_objects = s->objects;
+		if (used > max_used)
+			max_used = used;
+		if (s->slab_size > max_memobj)
+			max_memobj = s->slab_size;
+
+		total_slabs += s->slabs;
+		total_size += size;
+		total_waste += wasted;
+
+		total_objects += s->objects;
+		total_used += used;
+
+		total_objwaste += s->objects * objwaste;
+		total_objsize += s->objects * s->slab_size;
+	}
+
+	if (!total_objects) {
+		printf("No objects\n");
+		return;
+	}
+	if (!used_slabs) {
+		printf("No slabs\n");
+		return;
+	}
+
+	/* Per slab averages */
+	avg_slabs = total_slabs / used_slabs;
+	avg_size = total_size / used_slabs;
+	avg_waste = total_waste / used_slabs;
+
+	avg_objects = total_objects / used_slabs;
+	avg_used = total_used / used_slabs;
+
+	/* Per object object sizes */
+	avg_objsize = total_used / total_objects;
+	avg_objwaste = total_objwaste / total_objects;
+	avg_memobj = total_objsize / total_objects;
+
+	printf("Slabcache Totals\n");
+	printf("----------------\n");
+	printf("Slabcaches : %3d      Active: %3d\n",
+			slabs, used_slabs);
+
+	store_size(b1, total_size);store_size(b2, total_waste);
+	store_size(b3, total_waste * 100 / total_used);
+	printf("Memory used: %6s   # Loss   : %6s   MRatio:%6s%%\n", b1, b2, b3);
+
+	store_size(b1, total_objects);
+	printf("# Objects  : %6s\n", b1);
+
+	printf("\n");
+	printf("Per Cache    Average         Min         Max       Total\n");
+	printf("---------------------------------------------------------\n");
+
+	store_size(b1, avg_objects);store_size(b2, min_objects);
+	store_size(b3, max_objects);store_size(b4, total_objects);
+	printf("#Objects  %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_slabs);store_size(b2, min_slabs);
+	store_size(b3, max_slabs);store_size(b4, total_slabs);
+	printf("#Slabs    %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_size);store_size(b2, min_size);
+	store_size(b3, max_size);store_size(b4, total_size);
+	printf("Memory    %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_used);store_size(b2, min_used);
+	store_size(b3, max_used);store_size(b4, total_used);
+	printf("Used      %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	store_size(b1, avg_waste);store_size(b2, min_waste);
+	store_size(b3, max_waste);store_size(b4, total_waste);
+	printf("Loss      %10s  %10s  %10s  %10s\n",
+			b1,	b2,	b3,	b4);
+
+	printf("\n");
+	printf("Per Object   Average         Min         Max\n");
+	printf("---------------------------------------------\n");
+
+	store_size(b1, avg_memobj);store_size(b2, min_memobj);
+	store_size(b3, max_memobj);
+	printf("Memory    %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+	store_size(b1, avg_objsize);store_size(b2, min_objsize);
+	store_size(b3, max_objsize);
+	printf("User      %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+
+	store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
+	store_size(b3, max_objwaste);
+	printf("Loss      %10s  %10s  %10s\n",
+			b1,	b2,	b3);
+}
+
+void sort_slabs(void)
+{
+	struct slabinfo *s1,*s2;
+
+	for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
+		for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
+			int result;
+
+			if (sort_size)
+				result = slab_size(s1) < slab_size(s2);
+			else if (sort_active)
+				result = slab_activity(s1) < slab_activity(s2);
+			else
+				result = strcasecmp(s1->name, s2->name);
+
+			if (show_inverted)
+				result = -result;
+
+			if (result > 0) {
+				struct slabinfo t;
+
+				memcpy(&t, s1, sizeof(struct slabinfo));
+				memcpy(s1, s2, sizeof(struct slabinfo));
+				memcpy(s2, &t, sizeof(struct slabinfo));
+			}
+		}
+	}
+}
+
+int slab_mismatch(char *slab)
+{
+	return regexec(&pattern, slab, 0, NULL, 0);
+}
+
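+/*
+ * Walk /sys/kernel/slab (falling back to /sys/slab) and fill the slabinfo
+ * array with the attributes and statistics of each cache.
+ */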
+void read_slab_dir(void)
+{
+	DIR *dir;
+	struct dirent *de;
+	struct slabinfo *slab = slabinfo;
+	char *p;
+	char *t;
+	int count;
+
+	if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
+		fatal("SYSFS support for SLUB not active\n");
+
+	dir = opendir(".");
+	while ((de = readdir(dir))) {
+		if (de->d_name[0] == '.' ||
+			(de->d_name[0] != ':' && slab_mismatch(de->d_name)))
+				continue;
+		switch (de->d_type) {
+		   case DT_DIR:
+			if (slab >= slabinfo + MAX_SLABS)
+				fatal("Too many slabs\n");
+			if (chdir(de->d_name))
+				fatal("Unable to access slab %s\n", de->d_name);
+		   	slab->name = strdup(de->d_name);
+			slab->align = get_obj("align");
+			slab->cache_dma = get_obj("cache_dma");
+			slab->destroy_by_rcu = get_obj("destroy_by_rcu");
+			slab->hwcache_align = get_obj("hwcache_align");
+			slab->object_size = get_obj("object_size");
+			slab->objects = get_obj("objects");
+			slab->total_objects = get_obj("total_objects");
+			slab->objs_per_slab = get_obj("objs_per_slab");
+			slab->order = get_obj("order");
+			slab->poison = get_obj("poison");
+			slab->reclaim_account = get_obj("reclaim_account");
+			slab->red_zone = get_obj("red_zone");
+			slab->slab_size = get_obj("slab_size");
+			slab->slabs = get_obj_and_str("slabs", &t);
+			decode_numa_list(slab->numa, t);
+			free(t);
+			slab->store_user = get_obj("store_user");
+			slab->batch = get_obj("batch");
+			slab->alloc = get_obj("alloc");
+			slab->alloc_slab_fill = get_obj("alloc_slab_fill");
+			slab->alloc_slab_new = get_obj("alloc_slab_new");
+			slab->free = get_obj("free");
+			slab->free_remote = get_obj("free_remote");
+			slab->claim_remote_list = get_obj("claim_remote_list");
+			slab->claim_remote_list_objects = get_obj("claim_remote_list_objects");
+			slab->flush_free_list = get_obj("flush_free_list");
+			slab->flush_free_list_objects = get_obj("flush_free_list_objects");
+			slab->flush_free_list_remote = get_obj("flush_free_list_remote");
+			slab->flush_rfree_list = get_obj("flush_rfree_list");
+			slab->flush_rfree_list_objects = get_obj("flush_rfree_list_objects");
+			slab->flush_slab_free = get_obj("flush_slab_free");
+			slab->flush_slab_partial = get_obj("flush_slab_partial");
+
+			chdir("..");
+			slab++;
+			break;
+		   default :
+			fatal("Unknown file type %x\n", de->d_type);
+		}
+	}
+	closedir(dir);
+	slabs = slab - slabinfo;
+	actual_slabs = slabs;
+	if (slabs > MAX_SLABS)
+		fatal("Too many slabs\n");
+}
+
+void output_slabs(void)
+{
+	struct slabinfo *slab;
+
+	for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
+
+		if (show_numa)
+			slab_numa(slab, 0);
+		else if (show_track)
+			show_tracking(slab);
+		else if (validate)
+			slab_validate(slab);
+		else if (shrink)
+			slab_shrink(slab);
+		else if (set_debug)
+			slab_debug(slab);
+		else if (show_ops)
+			ops(slab);
+		else if (show_slab)
+			slabcache(slab);
+		else if (show_report)
+			report(slab);
+	}
+}
+
+struct option opts[] = {
+	{ "activity", 0, NULL, 'A' },
+	{ "debug", 2, NULL, 'd' },
+	{ "display-activity", 0, NULL, 'D' },
+	{ "empty", 0, NULL, 'e' },
+	{ "help", 0, NULL, 'h' },
+	{ "inverted", 0, NULL, 'i'},
+	{ "numa", 0, NULL, 'n' },
+	{ "ops", 0, NULL, 'o' },
+	{ "report", 0, NULL, 'r' },
+	{ "shrink", 0, NULL, 's' },
+	{ "slabs", 0, NULL, 'l' },
+	{ "track", 0, NULL, 't'},
+	{ "validate", 0, NULL, 'v' },
+	{ "zero", 0, NULL, 'z' },
+	{ "1ref", 0, NULL, '1'},
+	{ NULL, 0, NULL, 0 }
+};
+
+int main(int argc, char *argv[])
+{
+	int c;
+	int err;
+	char *pattern_source;
+
+	page_size = getpagesize();
+
+	while ((c = getopt_long(argc, argv, "Ad::Dehil1noprstvzTS",
+						opts, NULL)) != -1)
+		switch (c) {
+		case 'A':
+			sort_active = 1;
+			break;
+		case 'd':
+			set_debug = 1;
+			if (!debug_opt_scan(optarg))
+				fatal("Invalid debug option '%s'\n", optarg);
+			break;
+		case 'D':
+			show_activity = 1;
+			break;
+		case 'e':
+			show_empty = 1;
+			break;
+		case 'h':
+			usage();
+			return 0;
+		case 'i':
+			show_inverted = 1;
+			break;
+		case 'n':
+			show_numa = 1;
+			break;
+		case 'o':
+			show_ops = 1;
+			break;
+		case 'r':
+			show_report = 1;
+			break;
+		case 's':
+			shrink = 1;
+			break;
+		case 'l':
+			show_slab = 1;
+			break;
+		case 't':
+			show_track = 1;
+			break;
+		case 'v':
+			validate = 1;
+			break;
+		case 'z':
+			skip_zero = 0;
+			break;
+		case 'T':
+			show_totals = 1;
+			break;
+		case 'S':
+			sort_size = 1;
+			break;
+
+		default:
+			fatal("%s: Invalid option '%c'\n", argv[0], optopt);
+
+	}
+
+	if (!show_slab && !show_track && !show_report
+		&& !validate && !shrink && !set_debug && !show_ops)
+			show_slab = 1;
+
+	if (argc > optind)
+		pattern_source = argv[optind];
+	else
+		pattern_source = ".*";
+
+	err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
+	if (err)
+		fatal("%s: Invalid pattern '%s' code %d\n",
+			argv[0], pattern_source, err);
+	read_slab_dir();
+	if (show_totals)
+		totals();
+	else {
+		sort_slabs();
+		output_slabs();
+	}
+	return 0;
+}

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-01-23 15:46 [patch] SLQB slab allocator (try 2) Nick Piggin
@ 2009-01-24  2:38 ` Zhang, Yanmin
  2009-01-26  8:48 ` Pekka Enberg
  1 sibling, 0 replies; 55+ messages in thread
From: Zhang, Yanmin @ 2009-01-24  2:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming,
	Christoph Lameter

On Fri, 2009-01-23 at 16:46 +0100, Nick Piggin wrote:
> Hi,
> 
> Since last time, fixed bugs pointed out by Hugh and Andi, cleaned up the
> code suggested by Ingo (haven't yet incorporated Ingo's last patch).
> 
> Should have fixed the crash reported by Yanmin (I was able to reproduce it
> on an ia64 system and fix it).
> 
> Significantly reduced static footprint of init arrays, thanks to Andi's
> suggestion.
> 
> Please consider for trial merge for linux-next.
When applying the patch to 2.6.29-rc2, I got:
[ymzhang@lkp-h01 linux-2.6.29-rc2_slqb0123]$ patch -p1<../patch-slqb0123
patching file include/linux/rcupdate.h
patching file include/linux/slqb_def.h
patching file init/Kconfig
patching file lib/Kconfig.debug
patching file mm/slqb.c
patch: **** malformed patch at line 4042: Index: linux-2.6/include/linux/slab.h



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-01-23 15:46 [patch] SLQB slab allocator (try 2) Nick Piggin
  2009-01-24  2:38 ` Zhang, Yanmin
@ 2009-01-26  8:48 ` Pekka Enberg
  2009-01-26  9:07   ` Peter Zijlstra
  2009-02-03 10:12   ` Mel Gorman
  1 sibling, 2 replies; 55+ messages in thread
From: Pekka Enberg @ 2009-01-26  8:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming, Zhang, Yanmin, Christoph Lameter

Hi Nick,

On Fri, 2009-01-23 at 16:46 +0100, Nick Piggin wrote:
> Since last time, fixed bugs pointed out by Hugh and Andi, cleaned up the
> code suggested by Ingo (haven't yet incorporated Ingo's last patch).
> 
> Should have fixed the crash reported by Yanmin (I was able to reproduce it
> on an ia64 system and fix it).
> 
> Significantly reduced static footprint of init arrays, thanks to Andi's
> suggestion.
> 
> Please consider for trial merge for linux-next.

I merged the one you resent privately as this one didn't apply at all.
The code is in topic/slqb/core branch of slab.git and should appear in
linux-next tomorrow.

Testing and especially performance testing is welcome. If any of the HPC
people are reading this, please do give SLQB a good beating as Nick's
plan is to replace both, SLAB and SLUB, with it in the long run. As
Christoph has expressed concerns over latency issues of SLQB, I suppose
it would be interesting to hear if it makes any difference to the
real-time folks.

		Pekka


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-01-26  8:48 ` Pekka Enberg
@ 2009-01-26  9:07   ` Peter Zijlstra
  2009-01-26  9:10     ` Peter Zijlstra
  2009-01-26 17:22     ` Christoph Lameter
  2009-02-03 10:12   ` Mel Gorman
  1 sibling, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-01-26  9:07 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Mon, 2009-01-26 at 10:48 +0200, Pekka Enberg wrote:
> Christoph has expressed concerns over latency issues of SLQB, I suppose
> it would be interesting to hear if it makes any difference to the
> real-time folks.

I'll 'soon' take a stab at converting SLQB for -rt. Currently -rt is
SLAB only.

Then again, anything that does allocation is per definition not bounded
and not something we can have on latency critical paths -- so on that
respect its not interesting.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-01-26  9:07   ` Peter Zijlstra
@ 2009-01-26  9:10     ` Peter Zijlstra
  2009-01-26 17:22     ` Christoph Lameter
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-01-26  9:10 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Mon, 2009-01-26 at 10:07 +0100, Peter Zijlstra wrote:
> On Mon, 2009-01-26 at 10:48 +0200, Pekka Enberg wrote:
> > Christoph has expressed concerns over latency issues of SLQB, I suppose
> > it would be interesting to hear if it makes any difference to the
> > real-time folks.
> 
> I'll 'soon' take a stab at converting SLQB for -rt. Currently -rt is
> SLAB only.
> 
> Then again, anything that does allocation is per definition not bounded
> and not something we can have on latency critical paths -- so on that
> respect its not interesting.

Before someone pipes up, _yes_ I do know about RT allocators and such.

No we don't do that in-kernel, other than through reservation mechanisms
like mempool -- and I'd rather extend that than try and get page reclaim
bounded.

Yes, I also know about folks doing RT paging, and no, I'm not wanting to
hear about that either ;-)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-01-26  9:07   ` Peter Zijlstra
  2009-01-26  9:10     ` Peter Zijlstra
@ 2009-01-26 17:22     ` Christoph Lameter
  2009-01-27  9:07       ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Christoph Lameter @ 2009-01-26 17:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

On Mon, 26 Jan 2009, Peter Zijlstra wrote:

> Then again, anything that does allocation is per definition not bounded
> and not something we can have on latency critical paths -- so on that
> respect its not interesting.

Well there is the problem in SLAB and SLQB that they *continue* to do
processing after an allocation. They defer queue cleaning. So your latency
critical paths are interrupted by the deferred queue processing. SLAB has
the awful habit of gradually pushing objects out of its queues (trying to
approximate the loss of cpu cache hotness over time). So for a while you
get hit every 2 seconds with some free operations to the page allocator on
each cpu. If you have a lot of cpus then this may become an ongoing
operation. The slab pages end up in the page allocator queues which are
then occasionally pushed back to the buddy lists. Another relatively high
spike there.
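
Schematically, the deferral pattern is something like this minimal
user-space sketch (illustrative only, not the actual SLAB code, and the
constants are made up):

#include <stdlib.h>

#define QUEUE_MAX	120	/* per-cpu queue capacity */
#define REAP_BATCH	30	/* portion drained per periodic pass */

static void *queue[QUEUE_MAX];
static int queued;

/* The "free" the caller sees: cheap, just parks the object. */
static void cached_free(void *obj)
{
	if (queued < QUEUE_MAX)
		queue[queued++] = obj;
	else
		free(obj);
}

/* Imagine this firing every couple of seconds on each cpu: the real
 * freeing work happens here, long after the callers' frees returned. */
static void periodic_reap(void)
{
	int n = queued < REAP_BATCH ? queued : REAP_BATCH;

	while (n--)
		free(queue[--queued]);
}

int main(void)
{
	int tick, i;

	for (tick = 0; tick < 10; tick++) {
		for (i = 0; i < 50; i++)
			cached_free(malloc(64));
		periodic_reap();
	}
	while (queued)
		free(queue[--queued]);
	return 0;
}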


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-01-26 17:22     ` Christoph Lameter
@ 2009-01-27  9:07       ` Peter Zijlstra
  2009-01-27 20:21         ` Christoph Lameter
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-01-27  9:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

On Mon, 2009-01-26 at 12:22 -0500, Christoph Lameter wrote:
> On Mon, 26 Jan 2009, Peter Zijlstra wrote:
> 
> > Then again, anything that does allocation is per definition not bounded
> > and not something we can have on latency critical paths -- so on that
> > respect its not interesting.
> 
> Well there is the problem in SLAB and SLQB that they *continue* to do
> processing after an allocation. They defer queue cleaning. So your latency
> critical paths are interrupted by the deferred queue processing.

No they're not -- well, only if you let them that is, and then it's your
own fault.

Remember, -rt is about being able to preempt pretty much everything. If
the userspace task has a higher priority than the timer interrupt, the
timer interrupt just gets to wait.

Yes there is a very small hardirq window where the actual interrupt
triggers, but all that that does is a wakeup and then its gone again.

>  SLAB has
> the awful habit of gradually pushing objects out of its queued (tried to
> approximate the loss of cpu cache hotness over time). So for awhile you
> get hit every 2 seconds with some free operations to the page allocator on
> each cpu. If you have a lot of cpus then this may become an ongoing
> operation. The slab pages end up in the page allocator queues which is
> then occasionally pushed back to the buddy lists. Another relatively high
> spike there.

Like Nick has been asking, can you give a solid test case that
demonstrates this issue?

I'm thinking getting rid of those cross-bar queues hugely reduces that
problem.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-01-27  9:07       ` Peter Zijlstra
@ 2009-01-27 20:21         ` Christoph Lameter
  2009-02-03  2:04           ` Nick Piggin
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Lameter @ 2009-01-27 20:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

On Tue, 27 Jan 2009, Peter Zijlstra wrote:

> > Well there is the problem in SLAB and SLQB that they *continue* to do
> > processing after an allocation. They defer queue cleaning. So your latency
> > critical paths are interrupted by the deferred queue processing.
>
> No they're not -- well, only if you let them that is, and then its your
> own fault.

So you can have priority over kernel threads.... Sounds very dangerous.

> Remember, -rt is about being able to preempt pretty much everything. If
> the userspace task has a higher priority than the timer interrupt, the
> timer interrupt just gets to wait.
>
> Yes there is a very small hardirq window where the actual interrupt
> triggers, but all that that does is a wakeup and then its gone again.

Never used -rt. This is an issue seen in regular kernels.

> >  SLAB has
> > the awful habit of gradually pushing objects out of its queued (tried to
> > approximate the loss of cpu cache hotness over time). So for awhile you
> > get hit every 2 seconds with some free operations to the page allocator on
> > each cpu. If you have a lot of cpus then this may become an ongoing
> > operation. The slab pages end up in the page allocator queues which is
> > then occasionally pushed back to the buddy lists. Another relatively high
> > spike there.
>
> Like Nick has been asking, can you give a solid test case that
> demonstrates this issue?

Run a loop reading tsc and see the variances?
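
A minimal sketch of such a loop (user space, x86, assuming a constant-rate
TSC; the iteration count is arbitrary):

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	uint64_t prev = rdtsc(), max_gap = 0;
	long i;

	/* Track the largest gap between successive reads; periodic kernel
	 * work shows up as spikes well above the typical gap. */
	for (i = 0; i < 100000000L; i++) {
		uint64_t now = rdtsc();

		if (now - prev > max_gap)
			max_gap = now - prev;
		prev = now;
	}
	printf("largest gap: %llu cycles\n", (unsigned long long)max_gap);
	return 0;
}

Pinning that to one cpu and running it under each allocator would show
whether the deferred processing is visible as jitter.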

In HPC apps a series of processors have to sync repeatedly in order to
complete operations. An event like cache cleaning can cause a disturbance
in one processor that delays this sync in the system as a whole. And
having it run at offsets separately on all processors causes the
disturbance to happen on one processor after another. In extreme cases all
syncs are delayed. We have seen this effect cause major delays in HPC app
performance.

Note that SLAB scans through all slab caches in the system and expires
queues that are active. The more slab caches there are and the more data
is in queues the longer the process takes.

> I'm thinking getting git of those cross-bar queues hugely reduces that
> problem.

The cross-bar queues are a significant problem because they mean operating
on objects that are relatively far away. So the time spent in cache
cleaning increases significantly. But as far as I can see SLQB also has
cross-bar queues like SLAB. SLUB does all necessary actions during the
actual allocation or free so there is no need to run cache cleaning.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-01-27 20:21         ` Christoph Lameter
@ 2009-02-03  2:04           ` Nick Piggin
  0 siblings, 0 replies; 55+ messages in thread
From: Nick Piggin @ 2009-02-03  2:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Pekka Enberg, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming, Zhang, Yanmin

On Wednesday 28 January 2009 07:21:58 Christoph Lameter wrote:
> On Tue, 27 Jan 2009, Peter Zijlstra wrote:
> > > Well there is the problem in SLAB and SLQB that they *continue* to do
> > > processing after an allocation. They defer queue cleaning. So your
> > > latency critical paths are interrupted by the deferred queue
> > > processing.
> >
> > No they're not -- well, only if you let them that is, and then its your
> > own fault.
>
> So you can have priority over kernel threads.... Sounds very dangerous.

Groan.


> > Like Nick has been asking, can you give a solid test case that
> > demonstrates this issue?
>
> Run a loop reading tsc and see the variances?
>
> In HPC apps a series of processors have to sync repeatedly in order to
> complete operations. An event like cache cleaning can cause a disturbance
> in one processor that delays this sync in the system as a whole. And
> having it run at offsets separately on all processor causes the
> disturbance to happen on one processor after another. In extreme cases all
> syncs are delayed. We have seen this effect have a major delay on HPC app
> performance.

Now we are starting to get somewhere slightly useful. Can we have more
details about this workload please? Was there a test program coded up
to run the same sequence of MPI operations? Or can they at least be
described?


> Note that SLAB scans through all slab caches in the system and expires
> queues that are active. The more slab caches there are and the more data
> is in queues the longer the process takes.

And the larger the number of nodes and CPUs, because SLAB can have so
many queues. This is not an issue with SLQB, so I don't think it will
be subject to the same magnitude of problem on your large machines.

Periodic cleaning in SLAB was never shown to be a problem with large
HPC clusters in the past, so that points to SGI's problem as being due
to explosion of queues in big machines rather than the whole concept
of periodic cleaning.

And we have lots of periodic things going on. Periodic journal flushing,
periodic dirty watermark checking, periodic timers, multiprocessor CPU
scheduler balancing etc etc. So no, I totally reject the assertion that
periodic slab cleaning is a showstopper. Without actual numbers or test
cases, I don't need to hear any more assertions in this vein.

(But note, numbers and/or test cases etc would be very very welcome
because I would like to tune SLQB performance on HPC as much as possible
and as I have already said, there are ways we can improve or mitigate
periodic trimming overheads).


> > I'm thinking getting git of those cross-bar queues hugely reduces that
> > problem.
>
> The cross-bar queues are a significant problem because they mean operation
> on objects that are relatively far away. So the time spend in cache
> cleaning increases significantly. But as far as I can see SLQB also has
> cross-bar queues like SLAB.

Well it doesn't.


> SLUB does all necessary actions during the
> actual allocation or free so there is no need to run cache cleaning.

And SLUB can actually leave free pages lying around that never get cleaned
up because of this. As I said, I have seen SLQB use less memory than SLUB
in some situations I assume because of this (although now that I think
about it, perhaps it was due to increased internal fragmentation in bigger
pages).


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-01-26  8:48 ` Pekka Enberg
  2009-01-26  9:07   ` Peter Zijlstra
@ 2009-02-03 10:12   ` Mel Gorman
  2009-02-03 10:36     ` Nick Piggin
  2009-02-03 18:58     ` Pekka Enberg
  1 sibling, 2 replies; 55+ messages in thread
From: Mel Gorman @ 2009-02-03 10:12 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Mon, Jan 26, 2009 at 10:48:26AM +0200, Pekka Enberg wrote:
> Hi Nick,
> 
> On Fri, 2009-01-23 at 16:46 +0100, Nick Piggin wrote:
> > Since last time, fixed bugs pointed out by Hugh and Andi, cleaned up the
> > code suggested by Ingo (haven't yet incorporated Ingo's last patch).
> > 
> > Should have fixed the crash reported by Yanmin (I was able to reproduce it
> > on an ia64 system and fix it).
> > 
> > Significantly reduced static footprint of init arrays, thanks to Andi's
> > suggestion.
> > 
> > Please consider for trial merge for linux-next.
> 
> I merged a the one you resent privately as this one didn't apply at all.
> The code is in topic/slqb/core branch of slab.git and should appear in
> linux-next tomorrow.
> 
> Testing and especially performance testing is welcome. If any of the HPC
> people are reading this, please do give SLQB a good beating as Nick's
> plan is to replace both, SLAB and SLUB, with it in the long run.As
> Christoph has expressed concerns over latency issues of SLQB, I suppose
> it would be interesting to hear if it makes any difference to the
> real-time folks.
> 

The HPC folks care about a few different workloads but speccpu is one that
shows up. I was in the position to run tests because I had put together
the test harness for a paper I spent the last month writing. This mail
shows a comparison between slab, slub and slqb for speccpu2006 running a
single thread and sysbench ranging clients from 1 to 4*num_online_cpus()
(16 in both cases). Additional tests were not run because just these two
take one day per kernel to complete. Results are ratios to the SLAB figures
and based on an x86-64 and ppc64 machine.

X86-64 Test machine
        CPU		AMD Phenom 9950 Quad-Core
        CPU Frequency   1.3GHz
        Physical CPUs	1 (4 cores)
        L1 Cache        64K Data, 64K Instruction per core
        L2 Cache        512K Unified per core
        L3 Cache        2048K Unified Shared per chip
        Main Memory     8 GB
        Mainboard       Gigabyte GA-MA78GM-S2H
        Machine Model   Custom built from parts

SPEC CPU 2006
-------------
Integer tests
SPEC test       slab         slub       slqb
400.perlbench   1.0000     1.0016     1.0064
401.bzip2       1.0000     0.9804     1.0011
403.gcc         1.0000     1.0023     0.9965
429.mcf         1.0000     1.0022     0.9963
445.gobmk       1.0000     0.9944     0.9986
456.hmmer       1.0000     0.9792     0.9701
458.sjeng       1.0000     0.9989     1.0133
462.libquantum  1.0000     0.9905     0.9981
464.h264ref     1.0000     0.9877     1.0058
471.omnetpp     1.0000     0.9893     1.0993
473.astar       1.0000     0.9542     0.9596
483.xalancbmk   1.0000     0.9547     0.9982
---------------
specint geomean 1.0000     0.9862     1.0031

Floating Point Tests
SPEC test       slab         slub       slqb
410.bwaves      1.0000     0.9939     1.0005
416.gamess      1.0000     1.0040     0.9984
433.milc        1.0000     0.9865     0.9865
434.zeusmp      1.0000     0.9810     0.9879
435.gromacs     1.0000     0.9854     1.0125
436.cactusADM   1.0000     1.0467     1.0294
437.leslie3d    1.0000     0.9846     0.9963
444.namd        1.0000     1.0000     1.0000
447.dealII      1.0000     0.9913     0.9957
450.soplex      1.0000     0.9940     1.0015
453.povray      1.0000     0.9904     1.0197
454.calculix    1.0000     0.9937     1.0000
459.GemsFDTD    1.0000     1.0061     1.0000
465.tonto       1.0000     0.9979     0.9989
470.lbm         1.0000     1.0099     1.0212
481.wrf         1.0000     1.0000     1.0045
482.sphinx3     1.0000     1.0047     1.0068
---------------
specfp geomean  1.0000     0.9981     1.0035

Sysbench - Postgres
-------------------
Client            slab       slub       slqb
     1          1.0000     0.9484     0.9804
     2          1.0000     1.0069     0.9994
     3          1.0000     1.0064     0.9994
     4          1.0000     0.9900     0.9904
     5          1.0000     1.0023     0.9869
     6          1.0000     1.0139     1.0069
     7          1.0000     0.9973     0.9991
     8          1.0000     1.0206     1.0197
     9          1.0000     0.9884     0.9817
    10          1.0000     0.9980     1.0135
    11          1.0000     0.9959     1.0164
    12          1.0000     0.9978     0.9953
    13          1.0000     1.0024     0.9942
    14          1.0000     0.9975     0.9808
    15          1.0000     0.9914     0.9933
    16          1.0000     0.9767     0.9726
--------------
Geometric mean  1.0000     0.9957     0.9955

On this particular x86-64, slab is on average faster for sysbench but
by a very small margin, less than 0.5%. I wasn't doing multiple runs for
each client number to see if this is within the noise but generally these
figures are quite stable. SPEC CPU is more interesting. Both SLUB and SLQB
regress on a number of the benchmarks although on average SLQB is very
marginally faster than SLAB (approx 0.3% faster) where SLUB is between
around 1% slower. Both SLUB and SLQB show big regressions on some tests:
hmmer, astar. omnetpp is also interesting in that SLUB regresses a little
and SLQB gains considerably. This is likely due to luck in cache placement.

Overall, while the regressions where they exist are troublesome, they are
also small and I strongly suspect there are far greater variances between
kernel releases due to changes other than the allocators. SLQB is the
winner, but by a minimal margin.

PPC64 Test Machine
        CPU              PPC970MP, altivec supported
        CPU Frequency    2.5GHz
        Physical CPUs 2 x dual core (4 cores in all)
        L1 Cache         32K Data, 64K Instruction per core
        L2 Cache         1024K Unified per core
        L3 Cache         N/a
        Main Memory      10GB
        Mainboard        Specific to the machine model

SPEC CPU 2006
-------------
Integer tests
SPEC test       slab         slub       slqb
400.perlbench   1.0000     1.0497     1.0497
401.bzip2       1.0000     1.0496     1.0489
403.gcc         1.0000     1.0509     1.0509
429.mcf         1.0000     1.0554     1.0549
445.gobmk       1.0000     1.0535     1.0556
456.hmmer       1.0000     1.0651     1.0566
458.sjeng       1.0000     1.0612     1.0564
462.libquantum  1.0000     1.0389     1.0396
464.h264ref     1.0000     1.0517     1.0503
471.omnetpp     1.0000     1.0555     1.0574
473.astar       1.0000     1.0508     1.0521
483.xalancbmk   1.0000     1.0594     1.0584
---------------
specint geomean 1.0000     1.0534     1.0525

Floating Point Tests
SPEC test       slab         slub       slqb
410.bwaves      1.0000     1.0381     1.0367
416.gamess      1.0000     1.0550     1.0550
433.milc        1.0000     1.0464     1.0450
434.zeusmp      1.0000     1.0510     1.0528
435.gromacs     1.0000     1.0461     1.0445
436.cactusADM   1.0000     1.0457     1.0450
437.leslie3d    1.0000     1.0437     1.0428
444.namd        1.0000     1.0482     1.0496
447.dealII      1.0000     1.0505     1.0505
450.soplex      1.0000     1.0522     1.0499
453.povray      1.0000     1.0513     1.0534
454.calculix    1.0000     1.0374     1.0357
459.GemsFDTD    1.0000     1.0465     1.0465
465.tonto       1.0000     1.0488     1.0456
470.lbm         1.0000     1.0438     1.0452
481.wrf         1.0000     1.0423     1.0429
482.sphinx3     1.0000     1.0464     1.0479
---------------
specfp geomean  1.0000     1.0467     1.0464

Sysbench - Postgres
-------------------
Client            slab       slub       slqb
     1          1.0000     1.0153     1.0051
     2          1.0000     1.0273     1.0269
     3          1.0000     1.0299     1.0234
     4          1.0000     1.0159     1.0146
     5          1.0000     1.0232     1.0264
     6          1.0000     1.0238     1.0088
     7          1.0000     1.0240     1.0076
     8          1.0000     1.0134     1.0024
     9          1.0000     1.0154     1.0077
    10          1.0000     1.0126     1.0009
    11          1.0000     1.0100     0.9933
    12          1.0000     1.0112     0.9993
    13          1.0000     1.0131     1.0035
    14          1.0000     1.0237     1.0071
    15          1.0000     1.0098     0.9997
    16          1.0000     1.0110     0.9994
Geometric mean  1.0000     1.0175     1.0078

Unlike x86-64, ppc64 sees a consistent gain with either SLUB or SLQB, and
the difference between SLUB and SLQB is negligible on average with speccpu.
However, it is very noticeable in sysbench, where SLUB is generally a win in
the 1% range over SLQB and SLQB regressed very marginally in a few instances.
The three benchmarks showing odd behaviour on x86-64, hmmer, astar and
omnetpp, do not show similar weirdness on ppc64.

Overall on ppc64, SLUB is the winner by a clearer margin.

Summary
-------
The decision on whether to use SLUB, SLAB or SLQB for either speccpu 2006
or sysbench is not very clearcut. The win on ppc64 would imply SLUB is the
way to go but it is also clear that different architectures will produce
different results, which needs to be taken into account when trying to
reproduce figures from other people.  I strongly suspect variations of the
same architecture will also show different results. Things like omnetpp on
x86-64 imply that how the cache is used is a factor, but it's down to luck
what the result will be.

Downsides
---------
The SPEC CPU tests were not parallelised so there is no indication as to which
allocator might scale better to the number of CPUs. The test machines were
not NUMA so there is also no indication on which might be better there. There
wasn't a means of measuring allocator jitter. Queue cleaning means that some
allocations might stall, something that SLUB is expected to be immune to;
there is no clear way of measuring this, and other allocators could
potentially "cheat" by postponing cleaning to a time when performance
is not being measured.  I wasn't tracking which allocator consumed the least
memory so we don't know which is more memory efficient either.

The OLTP workload results could indicate a downside with using sysbench
although it could also be hardware. The reports from the Intel guys have been
pretty clear-cut that SLUB is a loser but sysbench-postgres on these test
machines at least does not agree. Of course their results are perfectly valid
but the discrepancy needs to be explained or there will be a disconnect
between developers and the performance people.  Something important is
missing that means sysbench-postgres *may* not be a reliable indicator of
TPC-C performance.  It could easily be down to the hardware as their tests
are on a mega-large machine with oodles of disks and probably NUMA, whereas
the test machine used for this is a lot less respectable.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-03 10:12   ` Mel Gorman
@ 2009-02-03 10:36     ` Nick Piggin
  2009-02-03 11:22       ` Mel Gorman
  2009-02-03 11:28       ` Mel Gorman
  2009-02-03 18:58     ` Pekka Enberg
  1 sibling, 2 replies; 55+ messages in thread
From: Nick Piggin @ 2009-02-03 10:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Tuesday 03 February 2009 21:12:06 Mel Gorman wrote:
> On Mon, Jan 26, 2009 at 10:48:26AM +0200, Pekka Enberg wrote:
> > Hi Nick,
> >
> > On Fri, 2009-01-23 at 16:46 +0100, Nick Piggin wrote:
> > > Since last time, fixed bugs pointed out by Hugh and Andi, cleaned up
> > > the code suggested by Ingo (haven't yet incorporated Ingo's last
> > > patch).
> > >
> > > Should have fixed the crash reported by Yanmin (I was able to reproduce
> > > it on an ia64 system and fix it).
> > >
> > > Significantly reduced static footprint of init arrays, thanks to Andi's
> > > suggestion.
> > >
> > > Please consider for trial merge for linux-next.
> >
> > I merged a the one you resent privately as this one didn't apply at all.
> > The code is in topic/slqb/core branch of slab.git and should appear in
> > linux-next tomorrow.
> >
> > Testing and especially performance testing is welcome. If any of the HPC
> > people are reading this, please do give SLQB a good beating as Nick's
> > plan is to replace both, SLAB and SLUB, with it in the long run.As
> > Christoph has expressed concerns over latency issues of SLQB, I suppose
> > it would be interesting to hear if it makes any difference to the
> > real-time folks.
>
> The HPC folks care about a few different workloads but speccpu is one that
> shows up. I was in the position to run tests because I had put together
> the test harness for a paper I spent the last month writing. This mail
> shows a comparison between slab, slub and slqb for speccpu2006 running a
> single thread and sysbench ranging clients from 1 to 4*num_online_cpus()
> (16 in both cases). Additional tests were not run because just these two
> take one day per kernel to complete. Results are ratios to the SLAB figures
> and based on an x86-64 and ppc64 machine.

Hi Mel,

This is very nice, thanks for testing. SLQB and SLUB are quite similar
in a lot of cases, which indeed could be explained by cacheline placement
(both of these can allocate down to much smaller sizes, and both of them
also put metadata directly in free object memory rather than external
locations).
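
For reference, putting the freelist link inside the object itself amounts
to something like this (minimal user-space sketch, names illustrative):

#include <stdlib.h>

static void *freelist_head;

/* A free object's first word is reused as the link to the next free
 * object, so no separate metadata array is needed. */
static void freelist_push(void *obj)
{
	*(void **)obj = freelist_head;
	freelist_head = obj;
}

static void *freelist_pop(void)
{
	void *obj = freelist_head;

	if (obj)
		freelist_head = *(void **)obj;
	return obj;
}

int main(void)
{
	void *obj;
	int i;

	for (i = 0; i < 4; i++)
		freelist_push(malloc(64));

	while ((obj = freelist_pop()) != NULL)
		free(obj);
	return 0;
}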

But it will be interesting to try looking at some of the tests where
SLQB has larger regressions, so that might give me something to go on
if I can lay my hands on speccpu2006...

I'd be interested to see how slub performs if booted with slub_min_objects=1
(which should give similar order pages to SLAB and SLQB).



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-03 10:36     ` Nick Piggin
@ 2009-02-03 11:22       ` Mel Gorman
  2009-02-03 11:26         ` Mel Gorman
  2009-02-04  6:48         ` Nick Piggin
  2009-02-03 11:28       ` Mel Gorman
  1 sibling, 2 replies; 55+ messages in thread
From: Mel Gorman @ 2009-02-03 11:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Tue, Feb 03, 2009 at 09:36:24PM +1100, Nick Piggin wrote:
> On Tuesday 03 February 2009 21:12:06 Mel Gorman wrote:
> > On Mon, Jan 26, 2009 at 10:48:26AM +0200, Pekka Enberg wrote:
> > > Hi Nick,
> > >
> > > On Fri, 2009-01-23 at 16:46 +0100, Nick Piggin wrote:
> > > > Since last time, fixed bugs pointed out by Hugh and Andi, cleaned up
> > > > the code suggested by Ingo (haven't yet incorporated Ingo's last
> > > > patch).
> > > >
> > > > Should have fixed the crash reported by Yanmin (I was able to reproduce
> > > > it on an ia64 system and fix it).
> > > >
> > > > Significantly reduced static footprint of init arrays, thanks to Andi's
> > > > suggestion.
> > > >
> > > > Please consider for trial merge for linux-next.
> > >
> > > I merged a the one you resent privately as this one didn't apply at all.
> > > The code is in topic/slqb/core branch of slab.git and should appear in
> > > linux-next tomorrow.
> > >
> > > Testing and especially performance testing is welcome. If any of the HPC
> > > people are reading this, please do give SLQB a good beating as Nick's
> > > plan is to replace both, SLAB and SLUB, with it in the long run.As
> > > Christoph has expressed concerns over latency issues of SLQB, I suppose
> > > it would be interesting to hear if it makes any difference to the
> > > real-time folks.
> >
> > The HPC folks care about a few different workloads but speccpu is one that
> > shows up. I was in the position to run tests because I had put together
> > the test harness for a paper I spent the last month writing. This mail
> > shows a comparison between slab, slub and slqb for speccpu2006 running a
> > single thread and sysbench ranging clients from 1 to 4*num_online_cpus()
> > (16 in both cases). Additional tests were not run because just these two
> > take one day per kernel to complete. Results are ratios to the SLAB figures
> > and based on an x86-64 and ppc64 machine.
> 
> Hi Mel,
> 
> This is very nice, thanks for testing.

Sure. It's been on my TODO list for long enough :). I should have been
clear that the ratios are performance improvements based on wall time.
A result of 0.9862 implies a performance regression of 1.38% in comparison
to SLAB. 1.0031 implies a performance gain of 0.31% etc.

> SLQB and SLUB are quite similar
> in a lot of cases, which indeed could be explained by cacheline placement
> (both of these can allocate down to much smaller sizes, and both of them
> also put metadata directly in free object memory rather than external
> locations).
> 

Indeed. I know from other tests that poor cacheline placement can crucify
performance. My current understanding is we don't notice as data and metadata
are effectively using random cache lines.

> But it will be interesting to try looking at some of the tests where
> SLQB has larger regressions, so that might give me something to go on
> if I can lay my hands on speccpu2006...
> 

I can generate profile runs although it'll take 3 days to gather it all
together unless I target specific tests (the worst ones to start with
obviously). The suite has a handy feature called monitor hooks that allows
a pre and post script to run for each test, which I use to start/stop
oprofile and gather one report per benchmark. I didn't use it for this run
as profiling affects the outcome (7-9% overhead).

I do have detailed profile data available for sysbench, both per thread run
and the entire run but with the instruction-level included, it's a lot of
data to upload. If you still want it, I'll start it going and it'll get up
there eventually.

> I'd be interested to see how slub performs if booted with slub_min_objects=1
> (which should give similar order pages to SLAB and SLQB).
> 

I'll do this before profiling as only one run is required and should
only take a day.

Making spec actually build is tricky so I've included a sample config for
x86-64 below that uses gcc and the monitor hooks in case someone else is in
the position to repeat the results.

===== Begin sample spec config file =====
# Autogenerated by generate-speccpu.sh

## Base configuration
ignore_errors      = no
tune               = base
ext                = x86_64-m64-gcc42
output_format      = asc, pdf, Screen
reportable         = 1
teeout             = yes
teerunout          = yes
hw_avail           = September 2008
license_num        = 
test_sponsor       = 
prepared_by        = Mel Gorman
tester             = Mel Gorman
test_date          = Dec 2008

## Compiler
CC                 = gcc-4.2
CXX                = g++-4.2
FC                 = gfortran-4.2

## HW config
hw_model           = Gigabyte Technology Co., Ltd. GA-MA78GM-S2H
hw_cpu_name        = AMD Phenom(tm) 9950 Quad-Core Processor
hw_cpu_char        = 
hw_cpu_mhz         = 1300.000
hw_fpu             = Integrated
hw_nchips          = 1
hw_ncores          = 4
hw_ncoresperchip   = 4
hw_nthreadspercore = 1
hw_ncpuorder       = 
hw_pcache          = L1 64K Data, 64K Instruction per core
hw_scache          = L2 512K Unified per core
hw_tcache          = L3 2048K Unified Shared per chip
hw_ocache          = 
hw_memory          = 4594MB
hw_disk            = SATA WD5000AAKS-00A7B0
hw_vendor          = Komplett.ie

## SW config
sw_os              = Debian Lenny Beta for x86_64
sw_file            = ext3
sw_state           = Runlevel [2]
sw_compiler        = gcc, g++ & gfortran 4.2 for x86_64
sw_avail           = Dec 2008
sw_other           = None
sw_auto_parallel   = No
sw_base_ptrsize    = 64-bit
sw_peak_ptrsize    = Not Applicable

## Monitor hooks
monitor_pre_bench = /home/mel/git-public/vmregress/bin/oprofile_start.sh  --event timer --event dtlb_miss; echo iter >> /tmp/OPiter.${lognum}.${size_class}.${benchmark}
monitor_post_bench = opcontrol --stop ; /home/mel/git-public/vmregress/bin/oprofile_report.sh > `dirname ${logname}`/OP.${lognum}.${size_class}.iter`cat /tmp/OPiter.${lognum}.${size_class}.${benchmark} | wc -l`.${benchmark}.txt

## Optimisation
makeflags          = -j4
COPTIMIZE          = -O2 -m64
CXXOPTIMIZE        = -O2 -m64
FOPTIMIZE          = -O2 -m64

notes0100= C base flags: $[COPTIMIZE]
notes0110= C++ base flags: $[CXXOPTIMIZE]
notes0120= Fortran base flags: $[FOPTIMIZE]

## Portability flags - all
default=base=default=default:
notes35            = PORTABILITY=-DSPEC_CPU_LP64 is applied to all benchmarks
PORTABILITY        = -DSPEC_CPU_LP64

## Portability flags - int
400.perlbench=default=default=default:
CPORTABILITY       = -DSPEC_CPU_LINUX_X64
notes35            = 400.perlbench: -DSPEC_CPU_LINUX_X64

462.libquantum=default=default=default:
CPORTABILITY       = -DSPEC_CPU_LINUX
notes60            = 462.libquantum: -DSPEC_CPU_LINUX

483.xalancbmk=default=default=default:
CXXPORTABILITY       = -DSPEC_CPU_LINUX

## Portability flags - flt
481.wrf=default=default=default:
CPORTABILITY      = -DSPEC_CPU_CASE_FLAG -DSPEC_CPU_LINUX

__MD5__

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-03 11:22       ` Mel Gorman
@ 2009-02-03 11:26         ` Mel Gorman
  2009-02-04  6:48         ` Nick Piggin
  1 sibling, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2009-02-03 11:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Tue, Feb 03, 2009 at 11:22:26AM +0000, Mel Gorman wrote:
> > <SNIP>
> > This is very nice, thanks for testing.
> 
> Sure. It's been on my TODO list for long enough :). I should have been
> clear that the ratios are performance improvements based on wall time.

/me slaps self

For SPEC, it's performance improvements based on wall time as measured
by the suite. For sysbench, it's performance improvements based on operations
per second.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-03 10:36     ` Nick Piggin
  2009-02-03 11:22       ` Mel Gorman
@ 2009-02-03 11:28       ` Mel Gorman
  2009-02-03 11:50         ` Nick Piggin
  1 sibling, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2009-02-03 11:28 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Tue, Feb 03, 2009 at 09:36:24PM +1100, Nick Piggin wrote:
> On Tuesday 03 February 2009 21:12:06 Mel Gorman wrote:
> > On Mon, Jan 26, 2009 at 10:48:26AM +0200, Pekka Enberg wrote:
> > > Hi Nick,
> > >
> > > On Fri, 2009-01-23 at 16:46 +0100, Nick Piggin wrote:
> > > > Since last time, fixed bugs pointed out by Hugh and Andi, cleaned up
> > > > the code suggested by Ingo (haven't yet incorporated Ingo's last
> > > > patch).
> > > >
> > > > Should have fixed the crash reported by Yanmin (I was able to reproduce
> > > > it on an ia64 system and fix it).
> > > >
> > > > Significantly reduced static footprint of init arrays, thanks to Andi's
> > > > suggestion.
> > > >
> > > > Please consider for trial merge for linux-next.
> > >
> > > I merged a the one you resent privately as this one didn't apply at all.
> > > The code is in topic/slqb/core branch of slab.git and should appear in
> > > linux-next tomorrow.
> > >
> > > Testing and especially performance testing is welcome. If any of the HPC
> > > people are reading this, please do give SLQB a good beating as Nick's
> > > plan is to replace both, SLAB and SLUB, with it in the long run.As
> > > Christoph has expressed concerns over latency issues of SLQB, I suppose
> > > it would be interesting to hear if it makes any difference to the
> > > real-time folks.
> >
> > The HPC folks care about a few different workloads but speccpu is one that
> > shows up. I was in the position to run tests because I had put together
> > the test harness for a paper I spent the last month writing. This mail
> > shows a comparison between slab, slub and slqb for speccpu2006 running a
> > single thread and sysbench ranging clients from 1 to 4*num_online_cpus()
> > (16 in both cases). Additional tests were not run because just these two
> > take one day per kernel to complete. Results are ratios to the SLAB figures
> > and based on an x86-64 and ppc64 machine.
> 
> Hi Mel,
> 
> This is very nice, thanks for testing. SLQB and SLUB are quite similar
> in a lot of cases, which indeed could be explained by cacheline placement
> (both of these can allocate down to much smaller sizes, and both of them
> also put metadata directly in free object memory rather than external
> locations).
> 
> But it will be interesting to try looking at some of the tests where
> SLQB has larger regressions, so that might give me something to go on
> if I can lay my hands on speccpu2006...
> 
> I'd be interested to see how slub performs if booted with slub_min_objects=1
> (which should give similar order pages to SLAB and SLQB).
> 

Just to clarify on this last point, do you mean slub_max_order=0 to
force order-0 allocations in SLUB?


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-03 11:28       ` Mel Gorman
@ 2009-02-03 11:50         ` Nick Piggin
  2009-02-03 12:01           ` Mel Gorman
  2009-02-04 15:48           ` Christoph Lameter
  0 siblings, 2 replies; 55+ messages in thread
From: Nick Piggin @ 2009-02-03 11:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Tuesday 03 February 2009 22:28:52 Mel Gorman wrote:
> On Tue, Feb 03, 2009 at 09:36:24PM +1100, Nick Piggin wrote:

> > I'd be interested to see how slub performs if booted with
> > slub_min_objects=1 (which should give similar order pages to SLAB and
> > SLQB).
>
> Just to clarify on this last point, do you mean slub_max_order=0 to
> force order-0 allocations in SLUB?

Hmm... I think slub_min_objects=1 should also do basically the same.
Actually slub_min_objects=1 and slub_max_order=1 should get closest I
think.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-03 11:50         ` Nick Piggin
@ 2009-02-03 12:01           ` Mel Gorman
  2009-02-03 12:07             ` Nick Piggin
  2009-02-04 15:48           ` Christoph Lameter
  1 sibling, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2009-02-03 12:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Tue, Feb 03, 2009 at 10:50:54PM +1100, Nick Piggin wrote:
> On Tuesday 03 February 2009 22:28:52 Mel Gorman wrote:
> > On Tue, Feb 03, 2009 at 09:36:24PM +1100, Nick Piggin wrote:
> 
> > > I'd be interested to see how slub performs if booted with
> > > slub_min_objects=1 (which should give similar order pages to SLAB and
> > > SLQB).
> >
> > Just to clarify on this last point, do you mean slub_max_order=0 to
> > force order-0 allocations in SLUB?
> 
> Hmm... I think slub_min_objects=1 should also do basically the same.
> Actually slub_min_objects=1 and slub_max_order=1 should get closest I
> think.
> 

I'm going with slub_min_objects=1 and slub_max_order=0. A quick glance
of the source shows the calculation as

        for (order = max(min_order,
                                fls(min_objects * size - 1) - PAGE_SHIFT);
                        order <= max_order; order++) {

so the max_order is inclusive not exclusive. This will force the order-0
allocations I think you are looking for.
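
To make the inclusive bound concrete, here is a minimal userspace sketch of
that sort of calculation (it is not mm/slub.c itself; the helper names and the
leftover heuristic are simplified guesses for illustration):

/*
 * Minimal userspace sketch of a slab order calculation, loosely modelled
 * on the loop quoted above.  It is NOT mm/slub.c: the fract_leftover
 * heuristic and the helper names are made up for illustration.  The point
 * is that max_order is the last order tried, i.e. an inclusive bound.
 */
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* find last (most significant) bit set, 1-based; fls(0) == 0 */
static int fls_ul(unsigned long x)
{
	int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

static int pick_order(unsigned long size, unsigned long min_objects,
		      int min_order, int max_order, int fract_leftover)
{
	int start = fls_ul(min_objects * size - 1) - PAGE_SHIFT;
	int order;

	if (start < min_order)
		start = min_order;

	for (order = start; order <= max_order; order++) {
		unsigned long slab_size = PAGE_SIZE << order;
		unsigned long rem = slab_size % size;

		/* accept this order once the per-slab waste is small enough */
		if (rem <= slab_size / fract_leftover)
			return order;
	}
	return max_order;
}

int main(void)
{
	/* min_objects=1, max_order=0 forces order-0 for this object size */
	printf("700B objects, min_objects=1, max_order=0 -> order %d\n",
	       pick_order(700, 1, 0, 0, 16));
	/* a larger min_objects raises the starting order */
	printf("700B objects, min_objects=16, max_order=3 -> order %d\n",
	       pick_order(700, 16, 0, 3, 16));
	return 0;
}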

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-03 12:01           ` Mel Gorman
@ 2009-02-03 12:07             ` Nick Piggin
  2009-02-03 12:26               ` Mel Gorman
  2009-02-04 15:49               ` Christoph Lameter
  0 siblings, 2 replies; 55+ messages in thread
From: Nick Piggin @ 2009-02-03 12:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Tuesday 03 February 2009 23:01:39 Mel Gorman wrote:
> On Tue, Feb 03, 2009 at 10:50:54PM +1100, Nick Piggin wrote:
> > On Tuesday 03 February 2009 22:28:52 Mel Gorman wrote:
> > > On Tue, Feb 03, 2009 at 09:36:24PM +1100, Nick Piggin wrote:
> > > > I'd be interested to see how slub performs if booted with
> > > > slub_min_objects=1 (which should give similar order pages to SLAB and
> > > > SLQB).
> > >
> > > Just to clarify on this last point, do you mean slub_max_order=0 to
> > > force order-0 allocations in SLUB?
> >
> > Hmm... I think slub_min_objects=1 should also do basically the same.
> > Actually slub_min_objects=1 and slub_max_order=1 should get closest I
> > think.
>
> I'm going with slub_min_objects=1 and slub_max_order=0. A quick glance
> of the source shows the calculation as
>
>         for (order = max(min_order,
>                                 fls(min_objects * size - 1) - PAGE_SHIFT);
>                         order <= max_order; order++) {
>
> so the max_order is inclusive not exclusive. This will force the order-0
> allocations I think you are looking for.

Well, but in the case of really bad internal fragmentation in the page,
SLAB will do order-1 allocations even if it doesn't strictly need to.
Probably this isn't a huge deal, but I think if we do slub_min_objects=1,
then SLUB won't care about number of objects per page, and slub_max_order=1
will mean it stops caring about fragmentation after order-1. I think. Which
would be pretty close to SLAB (depending on exactly how much fragmentation
it cares about).


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-03 12:07             ` Nick Piggin
@ 2009-02-03 12:26               ` Mel Gorman
  2009-02-04 15:49               ` Christoph Lameter
  1 sibling, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2009-02-03 12:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Tue, Feb 03, 2009 at 11:07:07PM +1100, Nick Piggin wrote:
> On Tuesday 03 February 2009 23:01:39 Mel Gorman wrote:
> > On Tue, Feb 03, 2009 at 10:50:54PM +1100, Nick Piggin wrote:
> > > On Tuesday 03 February 2009 22:28:52 Mel Gorman wrote:
> > > > On Tue, Feb 03, 2009 at 09:36:24PM +1100, Nick Piggin wrote:
> > > > > I'd be interested to see how slub performs if booted with
> > > > > slub_min_objects=1 (which should give similar order pages to SLAB and
> > > > > SLQB).
> > > >
> > > > Just to clarify on this last point, do you mean slub_max_order=0 to
> > > > force order-0 allocations in SLUB?
> > >
> > > Hmm... I think slub_min_objects=1 should also do basically the same.
> > > Actually slub_min_objects=1 and slub_max_order=1 should get closest I
> > > think.
> >
> > I'm going with slub_min_objects=1 and slub_max_order=0. A quick glance
> > of the source shows the calculation as
> >
> >         for (order = max(min_order,
> >                                 fls(min_objects * size - 1) - PAGE_SHIFT);
> >                         order <= max_order; order++) {
> >
> > so the max_order is inclusive not exclusive. This will force the order-0
> > allocations I think you are looking for.
> 
> Well, but in the case of really bad internal fragmentation in the page,
> SLAB will do order-1 allocations even if it doesn't strictly need to.
> Probably this isn't a huge deal, but I think if we do slub_min_objects=1,
> then SLUB won't care about number of objects per page, and slub_max_order=1
> will mean it stops caring about fragmentation after order-1. I think. Which
> would be pretty close to SLAB (depending on exactly how much fragmentation
> it cares about).
> 

Ok, very good point and I agree with your assessment. Tests are restarted with
slub_min_objects=1 slub_max_order=1. The dmesg line related to SLUB looks like

SLUB: Genslabs=13, HWalign=64, Order=0-1, MinObjects=1, CPUs=4, Nodes=1

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-03 10:12   ` Mel Gorman
  2009-02-03 10:36     ` Nick Piggin
@ 2009-02-03 18:58     ` Pekka Enberg
  2009-02-04 16:06       ` Christoph Lameter
  1 sibling, 1 reply; 55+ messages in thread
From: Pekka Enberg @ 2009-02-03 18:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

Hi Mel,

Mel Gorman wrote:
> The OLTP workload results could indicate a downside with using sysbench
> although it could also be hardware. The reports from the Intel guys have been
> pretty clear-cut that SLUB is a loser but sysbench-postgres on these test
> machines at least does not agree. Of course their results are perfectly valid
> but the discrepancy needs to be explained or there will be a disconnect
> between developers and the performance people.  Something important is
> missing that means sysbench-postgres *may* not be a reliable indicator of
> TPC-C performance.  It could easily be down to the hardware as their tests
> are on a mega-large machine with oodles of disks and probably NUMA, whereas
> the test machine used for this is a lot less respectable.

Yup. That's more or less what I've been saying for a long time now. The 
OLTP regression is not at all obvious and while there has been plenty of 
talk about it (cache line ping-pong due to lack of queues, high order 
pages), I've yet to see a detailed analysis on it.

It would be interesting to know what drivers the Intel setup uses. One 
thing I speculated with Christoph at OLS is that the regression could be 
due to bad interaction with the SCSI subsystem, for example. That would 
explain why the regression doesn't show up in typical setups which have ATA.

Anyway, even if we did end up going forward with SLQB, it would sure as 
hell be less painful if we understood the reasons behind it.

			Pekka

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-03 11:22       ` Mel Gorman
  2009-02-03 11:26         ` Mel Gorman
@ 2009-02-04  6:48         ` Nick Piggin
  2009-02-04 15:27           ` Mel Gorman
  1 sibling, 1 reply; 55+ messages in thread
From: Nick Piggin @ 2009-02-04  6:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Tuesday 03 February 2009 22:22:26 Mel Gorman wrote:
> On Tue, Feb 03, 2009 at 09:36:24PM +1100, Nick Piggin wrote:

> > But it will be interesting to try looking at some of the tests where
> > SLQB has larger regressions, so that might give me something to go on
> > if I can lay my hands on speccpu2006...
>
> I can generate profile runs although it'll take 3 days to gather it all
> together unless I target specific tests (the worst ones to start with
> obviously). The suite has a handy feature called monitor hooks that allows
> a pre and post script to run for each test which I use it to start/stop
> oprofile and gather one report per benchmark. I didn't use it for this run
> as profiling affects the outcome (7-9% overhead).
>
> I do have detailed profile data available for sysbench, both per thread run
> and the entire run but with the instruction-level included, it's a lot of
> data to upload. If you still want it, I'll start it going and it'll get up
> there eventually.

It couldn't hurt, but it's usually tricky to read anything out of these from
CPU cycle profiles. Especially if they are due to cache or tlb effects (which
tend to just get spread out all over the profile).

slabinfo (for SLUB) and slqbinfo (for SLQB) activity data could be interesting
(invoke with -AD).


> > I'd be interested to see how slub performs if booted with
> > slub_min_objects=1 (which should give similar order pages to SLAB and
> > SLQB).
>
> I'll do this before profiling as only one run is required and should
> only take a day.
>
> Making spec actually build is tricky so I've included a sample config for
> x86-64 below that uses gcc and the monitor hooks in case someone else is in
> the position to repeat the results.

Thanks. I don't know if we have a copy of spec 2006 I can use, but I'll ask
around.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-04  6:48         ` Nick Piggin
@ 2009-02-04 15:27           ` Mel Gorman
  2009-02-05  3:59             ` Nick Piggin
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2009-02-04 15:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Wed, Feb 04, 2009 at 05:48:40PM +1100, Nick Piggin wrote:
> On Tuesday 03 February 2009 22:22:26 Mel Gorman wrote:
> > On Tue, Feb 03, 2009 at 09:36:24PM +1100, Nick Piggin wrote:
> 
> > > But it will be interesting to try looking at some of the tests where
> > > SLQB has larger regressions, so that might give me something to go on
> > > if I can lay my hands on speccpu2006...
> >
> > I can generate profile runs although it'll take 3 days to gather it all
> > together unless I target specific tests (the worst ones to start with
> > obviously). The suite has a handy feature called monitor hooks that allows
> > a pre and post script to run for each test which I use it to start/stop
> > oprofile and gather one report per benchmark. I didn't use it for this run
> > as profiling affects the outcome (7-9% overhead).
> >
> > I do have detailed profile data available for sysbench, both per thread run
> > and the entire run but with the instruction-level included, it's a lot of
> > data to upload. If you still want it, I'll start it going and it'll get up
> > there eventually.
> 
> It couldn't hurt, but it's usually tricky to read anything out of these from
> CPU cycle profiles. Especially if they are due to cache or tlb effects (which
> tend to just get spread out all over the profile).
> 

Indeed. To date, I've used them for comparing relative counts of things like
TLB and cache misses on the basis "relatively more misses running test X is
bad" or working out things like tlb-misses-per-instructions but it's a bit
vague. We might notice if one of the allocators is being particularly cache
unfriendly due to a spike in cache misses.

> slabinfo (for SLUB) and slqbinfo (for SLQB) activity data could be interesting
> (invoke with -AD).
> 

Ok, I butchered Ingo's proc monitoring script to gather /proc/slabinfo,
slabinfo -AD and slqbinfo -AD before and after each speccpu subtest.  The tests
with profiling just started but it will take a few days to complete and thats
assuming I made no mistakes in the automation. I'll be at FOSDEM from Friday
till Monday so may not be able to collect the results until Monday.
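
In case anyone wants to reproduce the slabinfo side without the full script, a
trivial snapshot helper is enough. A sketch only (this is not Ingo's script,
and the output file naming below is arbitrary):

/*
 * Trivial sketch of a pre/post hook helper: snapshot /proc/slabinfo to a
 * tagged file so before/after allocator state can be diffed per subtest.
 * Usage: ./snap-slabinfo <benchmark-name> <before|after>
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	char out[256];
	FILE *src, *dst;
	int c;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <tag> <before|after>\n", argv[0]);
		return 1;
	}

	snprintf(out, sizeof(out), "slabinfo-%s-%s.txt", argv[2], argv[1]);

	src = fopen("/proc/slabinfo", "r");
	dst = fopen(out, "w");
	if (!src || !dst) {
		perror("fopen");
		return 1;
	}

	while ((c = fgetc(src)) != EOF)
		fputc(c, dst);

	fclose(src);
	fclose(dst);
	return 0;
}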

> 
> > > I'd be interested to see how slub performs if booted with
> > > slub_min_objects=1 (which should give similar order pages to SLAB and
> > > SLQB).
> >
> > I'll do this before profiling as only one run is required and should
> > only take a day.
> >
> > Making spec actually build is tricky so I've included a sample config for
> > x86-64 below that uses gcc and the monitor hooks in case someone else is in
> > the position to repeat the results.
> 
> Thanks. I don't know if we have a copy of spec 2006 I can use, but I'll ask
> around.
> 

In the meantime, here are the results I have with slub configured  to
use small orders.

X86-64 Test machine
        CPU             AMD Phenom 9950 Quad-Core
        CPU Frequency   1.3GHz
        Physical CPUs   1 (4 cores)
        L1 Cache        64K Data, 64K Instruction per core
        L2 Cache        512K Unified per core
        L3 Cache        2048K Unified Shared per chip
        Main Memory     8 GB
        Mainboard       Gigabyte GA-MA78GM-S2H
        Machine Model   Custom built from parts

SPEC CPU 2006
-------------
Integer tests
SPEC test       slab                 slub  slub-minorder           slqb
400.perlbench   1.0000             1.0016         0.9921         1.0064
401.bzip2       1.0000             0.9804         0.9858         1.0011
403.gcc         1.0000             1.0023         0.9977         0.9965
429.mcf         1.0000             1.0022         0.9847         0.9963
445.gobmk       1.0000             0.9944         0.9958         0.9986
456.hmmer       1.0000             0.9792         0.9874         0.9701
458.sjeng       1.0000             0.9989         1.0144         1.0133
462.libquantum  1.0000             0.9905         0.9943         0.9981
464.h264ref     1.0000             0.9877         0.9926         1.0058
471.omnetpp     1.0000             0.9893         1.0896         1.0993
473.astar       1.0000             0.9542         0.9930         0.9596
483.xalancbmk   1.0000             0.9547         0.9928         0.9982
---------------
specint geomean 1.0000             0.9862         1.0013         1.0031

Floating Point Tests
SPEC test       slab                 slub  slub-minorder           slqb
410.bwaves      1.0000             0.9939         1.0000         1.0005
416.gamess      1.0000             1.0040         1.0032         0.9984
433.milc        1.0000             0.9865         0.9986         0.9865
434.zeusmp      1.0000             0.9810         0.9980         0.9879
435.gromacs     1.0000             0.9854         1.0100         1.0125
436.cactusADM   1.0000             1.0467         0.9904         1.0294
437.leslie3d    1.0000             0.9846         0.9970         0.9963
444.namd        1.0000             1.0000         0.9986         1.0000
447.dealII      1.0000             0.9913         0.9957         0.9957
450.soplex      1.0000             0.9940         0.9955         1.0015
453.povray      1.0000             0.9904         1.0097         1.0197
454.calculix    1.0000             0.9937         0.9975         1.0000
459.GemsFDTD    1.0000             1.0061         0.9902         1.0000
465.tonto       1.0000             0.9979         1.0000         0.9989
470.lbm         1.0000             1.0099         0.9924         1.0212
481.wrf         1.0000             1.0000         1.0045         1.0045
482.sphinx3     1.0000             1.0047         1.0000         1.0068
---------------
specfp geomean  1.0000             0.9981         0.9989         1.0035

Sysbench-Postgres
-----------------
Client           slab  slub-default  slub-minorder            slqb
     1         1.0000        0.9484         0.9699          0.9804
     2         1.0000        1.0069         1.0036          0.9994
     3         1.0000        1.0064         1.0080          0.9994
     4         1.0000        0.9900         1.0049          0.9904
     5         1.0000        1.0023         1.0144          0.9869
     6         1.0000        1.0139         1.0215          1.0069
     7         1.0000        0.9973         0.9966          0.9991
     8         1.0000        1.0206         1.0223          1.0197
     9         1.0000        0.9884         1.0167          0.9817
    10         1.0000        0.9980         0.9842          1.0135
    11         1.0000        0.9959         1.0036          1.0164
    12         1.0000        0.9978         1.0032          0.9953
    13         1.0000        1.0024         1.0022          0.9942
    14         1.0000        0.9975         1.0064          0.9808
    15         1.0000        0.9914         0.9949          0.9933
    16         1.0000        0.9767         0.9692          0.9726
Geo. mean      1.0000        0.9957         1.0012          0.9955

PPC64 Test Machine
        CPU              PPC970MP, altivec supported
        CPU Frequency    2.5GHz
        Physical CPUs 2 x dual core (4 cores in all)
        L1 Cache         32K Data, 64K Instruction per core
        L2 Cache         1024K Unified per core
        L3 Cache         N/a
        Main Memory      10GB
        Mainboard        Specific to the machine model

SPEC CPU 2006
-------------
Integer tests
SPEC test       slab                 slub  slub-minorder           slqb
400.perlbench   1.0000             1.0497         1.0515         1.0497
401.bzip2       1.0000             1.0496         1.0496         1.0489
403.gcc         1.0000             1.0509         1.0509         1.0509
429.mcf         1.0000             1.0554         1.0549         1.0549
445.gobmk       1.0000             1.0535         1.0545         1.0556
456.hmmer       1.0000             1.0651         1.0636         1.0566
458.sjeng       1.0000             1.0612         1.0612         1.0564
462.libquantum  1.0000             1.0389         1.0403         1.0396
464.h264ref     1.0000             1.0517         1.0496         1.0503
471.omnetpp     1.0000             1.0555         1.0574         1.0574
473.astar       1.0000             1.0508         1.0514         1.0521
483.xalancbmk   1.0000             1.0594         1.0584         1.0584
---------------
specint geomean 1.0000             1.0534         1.0536         1.0525

Floating Point Tests
SPEC test       slab                 slub  slub-minorder           slqb
410.bwaves      1.0000             1.0381         1.0381         1.0367
416.gamess      1.0000             1.0550         1.0539         1.0550
433.milc        1.0000             1.0464         1.0457         1.0450
434.zeusmp      1.0000             1.0510         1.0482         1.0528
435.gromacs     1.0000             1.0461         1.0437         1.0445
436.cactusADM   1.0000             1.0457         1.0463         1.0450
437.leslie3d    1.0000             1.0437         1.0437         1.0428
444.namd        1.0000             1.0482         1.0482         1.0496
447.dealII      1.0000             1.0505         1.0495         1.0505
450.soplex      1.0000             1.0522         1.0511         1.0499
453.povray      1.0000             1.0513         1.0534         1.0534
454.calculix    1.0000             1.0374         1.0370         1.0357
459.GemsFDTD    1.0000             1.0465         1.0465         1.0465
465.tonto       1.0000             1.0488         1.0494         1.0456
470.lbm         1.0000             1.0438         1.0438         1.0452
481.wrf         1.0000             1.0423         1.0423         1.0429
482.sphinx3     1.0000             1.0464         1.0479         1.0479
---------------
specfp geomean  1.0000             1.0467         1.0464         1.0464

Sysbench-Postgres
-----------------
Client           slab  slub-default  slub-minorder            slqb
     1         1.0000        1.0153         1.0179          1.0051
     2         1.0000        1.0273         1.0181          1.0269
     3         1.0000        1.0299         1.0195          1.0234
     4         1.0000        1.0159         1.0130          1.0146
     5         1.0000        1.0232         1.0192          1.0264
     6         1.0000        1.0238         1.0142          1.0088
     7         1.0000        1.0240         1.0063          1.0076
     8         1.0000        1.0134         0.9842          1.0024
     9         1.0000        1.0154         1.0152          1.0077
    10         1.0000        1.0126         1.0018          1.0009
    11         1.0000        1.0100         0.9971          0.9933
    12         1.0000        1.0112         0.9985          0.9993
    13         1.0000        1.0131         1.0060          1.0035
    14         1.0000        1.0237         1.0074          1.0071
    15         1.0000        1.0098         0.9997          0.9997
    16         1.0000        1.0110         0.9899          0.9994
Geo. mean      1.0000        1.0175         1.0067          1.0078

The order SLUB uses does not make much of a difference to SPEC CPU on
either test machine or sysbench on x86-64. However, on the ppc64 machine, the
performance advantage SLUB has over SLAB appears to be eliminated if high-order
pages are not used. I think I might run SLUB again in case the higher average
performance was a coincidence due to lucky cache layout. Otherwise, Christoph
can probably put together a plausible theory on this result faster than I can.

On the TLB front, it is perfectly possible that the workloads on x86-64 are
not allocator or memory intensive enough to take advantage of fewer calls to
the page allocator or potentially reduced TLB pressure. As the kernel portion
of the address space already uses huge pages slab objects may have to occupy
a very large percentage of memory before TLB pressure became an issue. The L1
TLBs on both test machines are fully associative making testing reduced TLB
pressure practically impossible. For bonus points, 1G pages are being used on
the x86-64 so I have nowhere near enough memory to put that under TLB pressure.

Measuring reduced metadata overhead is more plausible.
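
As a footnote on the summary rows: the geomean lines are geometric means of the
per-test ratios. A tiny standalone sketch of that computation (not the actual
reporting scripts), fed with the slqb column of the x86-64 integer table above,
reproduces its 1.0031 specint geomean:

/*
 * Geometric mean of per-test ratios: the exponential of the mean of the
 * logs.  The input is the slqb column of the x86-64 speccpu integer table
 * above; the output matches the 1.0031 specint geomean row.  Sketch only.
 */
#include <math.h>
#include <stdio.h>

static double geomean(const double *v, int n)
{
	double sum = 0.0;

	for (int i = 0; i < n; i++)
		sum += log(v[i]);
	return exp(sum / n);
}

int main(void)
{
	double slqb[] = { 1.0064, 1.0011, 0.9965, 0.9963, 0.9986, 0.9701,
			  1.0133, 0.9981, 1.0058, 1.0993, 0.9596, 0.9982 };

	printf("geomean = %.4f\n",
	       geomean(slqb, sizeof(slqb) / sizeof(slqb[0])));
	return 0;
}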

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-03 11:50         ` Nick Piggin
  2009-02-03 12:01           ` Mel Gorman
@ 2009-02-04 15:48           ` Christoph Lameter
  1 sibling, 0 replies; 55+ messages in thread
From: Christoph Lameter @ 2009-02-04 15:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mel Gorman, Pekka Enberg, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming, Zhang, Yanmin

On Tue, 3 Feb 2009, Nick Piggin wrote:

> > Just to clarify on this last point, do you mean slub_max_order=0 to
> > force order-0 allocations in SLUB?
>
> Hmm... I think slub_min_objects=1 should also do basically the same.
> Actually slub_min_objects=1 and slub_max_order=1 should get closest I
> think.

slub_max_order=0 would be sufficient.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-03 12:07             ` Nick Piggin
  2009-02-03 12:26               ` Mel Gorman
@ 2009-02-04 15:49               ` Christoph Lameter
  1 sibling, 0 replies; 55+ messages in thread
From: Christoph Lameter @ 2009-02-04 15:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mel Gorman, Pekka Enberg, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming, Zhang, Yanmin

On Tue, 3 Feb 2009, Nick Piggin wrote:

> > so the max_order is inclusive not exclusive. This will force the order-0
> > allocations I think you are looking for.
>
> Well, but in the case of really bad internal fragmentation in the page,
> SLAB will do order-1 allocations even if it doesn't strictly need to.
> Probably this isn't a huge deal, but I think if we do slub_min_objects=1,
> then SLUB won't care about number of objects per page, and slub_max_order=1
> will mean it stops caring about fragmentation after order-1. I think. Which
> would be pretty close to SLAB (depending on exactly how much fragmentation
> it cares about).

slub_max_order=0 will force all possible slabs to order 0. This means that
some slabs that SLAB will run as order 1 will be order 0 under SLUB.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-03 18:58     ` Pekka Enberg
@ 2009-02-04 16:06       ` Christoph Lameter
  0 siblings, 0 replies; 55+ messages in thread
From: Christoph Lameter @ 2009-02-04 16:06 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Mel Gorman, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

On Tue, 3 Feb 2009, Pekka Enberg wrote:

> Anyway, even if we did end up going forward with SLQB, it would sure as hell
> be less painful if we understood the reasons behind it.

The reasons may depend on hardware contingencies like TLB handling
overhead and various inefficiencies that depend on the exact processor
model. Also the type of applications you want to run. Some of the IA64
heritage of SLUB may be seen in the results of these tests. Note that PPC
and IA64 have larger page sizes (which results in SLUB being able to put
more objects into an order 0 page) and higher penalties for TLB handling.
The initial justification for SLUB was Mel's results on IA64 that showed
a 5-10% increase in performance through SLUB.

In my current position we need to run extremely low latency code in user
space and want to avoid any disturbance by kernel code interrupting user
space. My main concern for my current work context is that switching to
SLQB will bring back the old cache cleaning problems and introduce
latencies for our user space applications. Otherwise I am on x86 now so
the TLB issues are less of a concern for me now.

In general it may be better to have a larger selection of slab allocators.
I think this is no problem as long as we have motivated people that
maintain these. Nick seems to be very motivated at this point. So let's
merge SLQB as soon as we can and expose it to a wider audience so that it
can mature. And people can have more fun running one against the other
refining these more and more.

There are still two major things that I hope will happen soon to clean up stuff in
the slab allocators:

1. The introduction of a per cpu allocator.

This is important to optimize the fastpaths. The cpu allocator will allow
us to get rid of the arrays indexed by NR_CPUS and allow operations that
are atomic wrt. interrupts. The lookup of the kmem_cache_cpu struct
address will no longer be necessary.

2. Alloc/free without disabling interrupts.

Mathieu has written an early implementation of allocation functions that
do not require interrupt disable/enable. It seems that these are right now
the major cause of latency in the fast paths. Andi has stated that the
interrupt enable/disable has been optimized in recent releases of new
processors. The overhead may be due to the flags being pushed onto the
stack and retrieved later. Mathieu's implementation can be made more
elegant if atomic per cpu ops are available. This could significantly
increase the speed of the fast paths in the allocators (may be a challenge
to SLAB and SLQB since they need to update a counter and a pointer but it's
straightforward in SLUB).
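
As a rough illustration of the direction (a userspace analogue only, not
Mathieu's patch nor any allocator's real fastpath; a kernel version would use a
per-CPU cmpxchg that only has to be atomic against local interrupts, and would
also have to deal with ABA), the idea is that a single compare-and-exchange on
the freelist head replaces the interrupt disable/enable pair:

/*
 * Userspace analogue of an interrupt-free allocation fastpath: pop/push
 * the head of a singly linked freelist with one compare-and-exchange
 * instead of bracketing the update with irq disable/enable.  ABA and the
 * per-CPU aspect are ignored here; hypothetical names throughout.
 */
#include <stdio.h>

struct object {
	struct object *next;		/* freelist link kept in the object */
	char payload[56];
};

static struct object *freelist;		/* stands in for a per-CPU head */

static struct object *fastpath_alloc(void)
{
	struct object *head = __atomic_load_n(&freelist, __ATOMIC_ACQUIRE);

	while (head) {
		/* on failure, head is refreshed with the current value */
		if (__atomic_compare_exchange_n(&freelist, &head, head->next,
						0, __ATOMIC_ACQ_REL,
						__ATOMIC_ACQUIRE))
			return head;
	}
	return NULL;			/* empty: fall back to the slow path */
}

static void fastpath_free(struct object *obj)
{
	struct object *head = __atomic_load_n(&freelist, __ATOMIC_ACQUIRE);

	do {
		obj->next = head;
	} while (!__atomic_compare_exchange_n(&freelist, &head, obj,
					      0, __ATOMIC_ACQ_REL,
					      __ATOMIC_ACQUIRE));
}

int main(void)
{
	static struct object pool[4];

	for (int i = 0; i < 4; i++)
		fastpath_free(&pool[i]);
	printf("got %p\n", (void *)fastpath_alloc());
	return 0;
}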


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-04 15:27           ` Mel Gorman
@ 2009-02-05  3:59             ` Nick Piggin
  2009-02-05 13:49               ` Mel Gorman
  2009-02-16 18:42               ` Mel Gorman
  0 siblings, 2 replies; 55+ messages in thread
From: Nick Piggin @ 2009-02-05  3:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Thursday 05 February 2009 02:27:10 Mel Gorman wrote:
> On Wed, Feb 04, 2009 at 05:48:40PM +1100, Nick Piggin wrote:

> > It couldn't hurt, but it's usually tricky to read anything out of these
> > from CPU cycle profiles. Especially if they are due to cache or tlb
> > effects (which tend to just get spread out all over the profile).
>
> Indeed. To date, I've used them for comparing relative counts of things
> like TLB and cache misses on the basis "relatively more misses running test
> X is bad" or working out things like tlb-misses-per-instructions but it's a
> bit vague. We might notice if one of the allocators is being particularly
> cache unfriendly due to a spike in cache misses.

Very true. Total counts of TLB and cache misses could show some insight.


> PPC64 Test Machine
> Sysbench-Postgres
> -----------------
> Client           slab  slub-default  slub-minorder            slqb
>      1         1.0000        1.0153         1.0179          1.0051
>      2         1.0000        1.0273         1.0181          1.0269
>      3         1.0000        1.0299         1.0195          1.0234
>      4         1.0000        1.0159         1.0130          1.0146
>      5         1.0000        1.0232         1.0192          1.0264
>      6         1.0000        1.0238         1.0142          1.0088
>      7         1.0000        1.0240         1.0063          1.0076
>      8         1.0000        1.0134         0.9842          1.0024
>      9         1.0000        1.0154         1.0152          1.0077
>     10         1.0000        1.0126         1.0018          1.0009
>     11         1.0000        1.0100         0.9971          0.9933
>     12         1.0000        1.0112         0.9985          0.9993
>     13         1.0000        1.0131         1.0060          1.0035
>     14         1.0000        1.0237         1.0074          1.0071
>     15         1.0000        1.0098         0.9997          0.9997
>     16         1.0000        1.0110         0.9899          0.9994
> Geo. mean      1.0000        1.0175         1.0067          1.0078
>
> The order SLUB uses does not make much of a difference to SPEC CPU on
> > either test machine or sysbench on x86-64. However, on the ppc64 machine,
> > the performance advantage SLUB has over SLAB appears to be eliminated if
> > high-order pages are not used. I think I might run SLUB again in case the
> > higher average performance was a coincidence due to lucky cache layout.
> Otherwise, Christoph can probably put together a plausible theory on this
> result faster than I can.

It's interesting, thanks. It's a good result for SLQB I guess. 1% is fairly
large here (if it is statistically significant), but I don't think the
drawbacks of using higher order pages warrant changing anything by default
in SLQB. It does encourage me to add a boot or runtime parameter, though
(even if just for testing purposes).


> On the TLB front, it is perfectly possible that the workloads on x86-64 are
> not allocator or memory intensive enough to take advantage of fewer calls
> to the page allocator or potentially reduced TLB pressure. As the kernel
> portion of the address space already uses huge pages slab objects may have
> to occupy a very large percentage of memory before TLB pressure became an
> issue. The L1 TLBs on both test machines are fully associative making
> testing reduced TLB pressure practically impossible. For bonus points, 1G
> pages are being used on the x86-64 so I have nowhere near enough memory to
> put that under TLB pressure.

TLB pressure... I would be interested in. I'm not exactly sold on the idea
that higher order allocations will give a significant TLB improvement.
Although for benchmark runs, maybe it is more likely (ie. if memory hasn't
been too fragmented).

Suppose you have a million slab objects scattered all over memory, the fact
you might have them clumped into 64K regions rather than 4K regions... is
it going to be significant? How many access patterns are likely to soon touch
exactly those objects that are in the same page?

Sure it is possible to come up with a scenario where it does help. But also
others where it will not.

OTOH, if it is a win on ppc but not x86-64, then that may point to TLB...


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-05  3:59             ` Nick Piggin
@ 2009-02-05 13:49               ` Mel Gorman
  2009-02-16 18:42               ` Mel Gorman
  1 sibling, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2009-02-05 13:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Thu, Feb 05, 2009 at 02:59:29PM +1100, Nick Piggin wrote:
> On Thursday 05 February 2009 02:27:10 Mel Gorman wrote:
> > On Wed, Feb 04, 2009 at 05:48:40PM +1100, Nick Piggin wrote:
> 
> > > It couldn't hurt, but it's usually tricky to read anything out of these
> > > from CPU cycle profiles. Especially if they are due to cache or tlb
> > > effects (which tend to just get spread out all over the profile).
> >
> > Indeed. To date, I've used them for comparing relative counts of things
> > like TLB and cache misses on the basis "relatively more misses running test
> > X is bad" or working out things like tlb-misses-per-instructions but it's a
> > bit vague. We might notice if one of the allocators is being particularly
> > cache unfriendly due to a spike in cache misses.
> 
> Very true. Total counts of TLB and cache misses could show some insight.
> 

Agreed. I'm collecting just cache misses in this run. Due to limitations
on the ppc970 PMU, I can't collect TLB and cache at the same time. I
can't remember if it can or not on the x86-64 but I went with the lowest
common denominator in any case.

> 
> > PPC64 Test Machine
> > Sysbench-Postgres
> > -----------------
> > Client           slab  slub-default  slub-minorder            slqb
> >      1         1.0000        1.0153         1.0179          1.0051
> >      2         1.0000        1.0273         1.0181          1.0269
> >      3         1.0000        1.0299         1.0195          1.0234
> >      4         1.0000        1.0159         1.0130          1.0146
> >      5         1.0000        1.0232         1.0192          1.0264
> >      6         1.0000        1.0238         1.0142          1.0088
> >      7         1.0000        1.0240         1.0063          1.0076
> >      8         1.0000        1.0134         0.9842          1.0024
> >      9         1.0000        1.0154         1.0152          1.0077
> >     10         1.0000        1.0126         1.0018          1.0009
> >     11         1.0000        1.0100         0.9971          0.9933
> >     12         1.0000        1.0112         0.9985          0.9993
> >     13         1.0000        1.0131         1.0060          1.0035
> >     14         1.0000        1.0237         1.0074          1.0071
> >     15         1.0000        1.0098         0.9997          0.9997
> >     16         1.0000        1.0110         0.9899          0.9994
> > Geo. mean      1.0000        1.0175         1.0067          1.0078
> >
> > The order SLUB uses does not make much of a difference to SPEC CPU on
> > either test machine or sysbench on x86-64. However, on the ppc64 machine,
> > the performance advantage SLUB has over SLAB appears to be eliminated if
> > high-order pages are not used. I think I might run SLUB again in case the
> > higher average performance was a coincidence due to lucky cache layout.
> > Otherwise, Christoph can probably put together a plausible theory on this
> > result faster than I can.
> 
> It's interesting, thanks. It's a good result for SLQB I guess. 1% is fairly
> large here (if it is statistically significant),

I believe it is. I don't recall the figures deviating much for sysbench but
I have to alter the scripts to do multiple runs just in case.

> but I don't think the
> drawbacks of using higher order pages warrant changing anything by default
> in SLQB. It does encourage me to add a boot or runtime parameter, though
> (even if just for testing purposes).
> 

Based on this test-machine, it's not justified by default but as the tests are
not allocator intensive it doesn't say much. tbench (or netperf or anything
that is more slab intensive) might reveal something but it'll be Monday at
the earliest before I can find out.

> > On the TLB front, it is perfectly possible that the workloads on x86-64 are
> > not allocator or memory intensive enough to take advantage of fewer calls
> > to the page allocator or potentially reduced TLB pressure. As the kernel
> > portion of the address space already uses huge pages slab objects may have
> > to occupy a very large percentage of memory before TLB pressure became an
> > issue. The L1 TLBs on both test machines are fully associative making
> > testing reduced TLB pressure practically impossible. For bonus points, 1G
> > pages are being used on the x86-64 so I have nowhere near enough memory to
> > put that under TLB pressure.
> 
> TLB pressure... I would be interested in. I'm not exactly sold on the idea
> that higher order allocations will give a significant TLB improvement.

Currently, I suspect the machine has to be running a very long time and the
memory footprint used by SLAB has to be significant before a large enough
number of TLB entries are being used. A side-effect of anti-fragmentation
is that kernel allocations get grouped into hugepages as much as possible
without using high-order pages. This makes it even harder to cause TLB
pressure within the kernel (a good thing in general).

> Although for benchmark runs, maybe it is more likely (ie. if memory hasn't
> been too fragmented).
> 
> Suppose you have a million slab objects scattered all over memory, the fact
> you might have them clumped into 64K regions rather than 4K regions... is
> it going to be significant?

I doubt it's significant from a TLB perspective. Anti-frag will be clumping
the 4K pages together in 2MB (on x86-64) and 16MB (on ppc64) already. It would
actually be pretty tricky to form an allocation pattern from userspace that
would force use of multiple huge pages. While it is possible the situation
does occur, I don't think a realistic test-case can be put together that
demonstrates it.

What 64K regions will do is reduce the amount of management data needed by
slab and reduce the number of calls to the page allocator. This is likely
to be much more significant in general than TLBs.

> How many access patterns are likely to soon touch
> exactly those objects that are in the same page?
> 
> Sure it is possible to come up with a scenario where it does help. But also
> others where it will not.
> 
> OTOH, if it is a win on ppc but not x86-64, then that may point to TLB...
> 

I'm not sure what it is, but I'm not convinced right now that TLB could make
a 1% difference. Maybe the cache miss figures will show something up and
if not, and there is nothing obvious in the profiles, I'll rerun with
TLB profiling and see what pops up.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-05  3:59             ` Nick Piggin
  2009-02-05 13:49               ` Mel Gorman
@ 2009-02-16 18:42               ` Mel Gorman
  2009-02-16 19:17                 ` Pekka Enberg
  2009-02-16 19:25                 ` Pekka Enberg
  1 sibling, 2 replies; 55+ messages in thread
From: Mel Gorman @ 2009-02-16 18:42 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter


Slightly later than hoped for, but here are the results of the profile
run between the different slab allocators. It also includes information on
the performance on SLUB with the allocator pass-thru logic reverted by commit
http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=97a4871761e735b6f1acd3bc7c3bac30dae3eab9
.

All the profiles are in a tar stored at
http://www.csn.ul.ie/~mel/postings/slXb-20090213/ . The layout of it is as
follows

o hydra is the x86-64 machine
o powyah is the ppc64 machine
o the next level of directories is for each benchmark
o speccpu06/profile contains spec config files. gen-m64base.cfg was used
o The result/ directory contains three types of files
	procinfo-before-*	Proc files and slabinfo before a benchmark
	procinfo-after-*	Proc files after
	OP.*			Oprofile output
  The files are indexed 001, 002 etc and are as follows
  	001 slab
	002 slub-default
	003 slub-minorder (max_order=1, min_objects=1)
	004 slqb
	005 slub-revert-passthru
o The sysbench-run2 directories are named based on the allocator tested
o The sysbench directories contain noprofile, profile and fine-profile directories
  fine-profile is probably of the most interest. There are files named
  oprofile-THREAD_COUNT-report.txt where THREAD_COUNT is in the range 1-16.

I haven't done much digging in here yet. Between the large page bug and
other patches in my inbox, I haven't had the chance yet but that doesn't
stop anyone else taking a look.

I'm reposting the performance as SLUB and SLQB had to be rebuilt with profile
stats enabled which potentially changed the result slightly.

x86-64 speccpu performance ratios
=================================
Integer tests
SPEC test       slab       slub  slub-min slub-rvrt      slqb
400.perlbench   1.0000   1.0472    1.0118    1.0393    1.0000
401.bzip2       1.0000   1.0071    0.9970    1.0041    1.0000
403.gcc         1.0000   0.9579    0.9952    0.9034    0.9588
429.mcf         1.0000   1.0089    0.9933    0.9376    0.9959
445.gobmk       1.0000   0.9881    1.0040    0.9894    0.9947
456.hmmer       1.0000   0.9911    1.1039    0.9911    0.9940
458.sjeng       1.0000   0.9958    0.9622    0.9691    0.9632
462.libquantum  1.0000   1.0576    0.9974    1.0817    1.0547
464.h264ref     1.0000   1.0230    1.0295    1.0165    1.0279
471.omnetpp     1.0000   1.0038    0.9511    0.9626    0.9557
473.astar       1.0000   0.9811    0.9957    0.9989    0.9957
483.xalancbmk   1.0000   1.0889    1.0135    1.0060    1.0715
---------------
specint geomean 1.0000   1.0119    1.0039    0.9907    1.0004

Floating Point Tests
SPEC test       slab       slub  slub-min slub-rvrt      slqb
410.bwaves      1.0000   1.0047    0.8369    0.8366    0.8366
416.gamess      1.0000   1.0038    1.0015    0.9977    1.0046
433.milc        1.0000   1.1416    1.1187    1.0182    1.1329
434.zeusmp      1.0000   1.0847    0.9991    1.0009    0.9974
435.gromacs     1.0000   1.0652    1.0000    1.0652    0.9989
436.cactusADM   1.0000   1.0057    0.9974    1.0979    1.0021
437.leslie3d    1.0000   1.0438    1.0387    1.0034    0.9919
444.namd        1.0000   1.0000    0.9713    0.9947    1.0027
447.dealII      1.0000   1.0000    0.9986    1.0071    0.9986
450.soplex      1.0000   0.9627    0.8958    1.0169    1.0028
453.povray      1.0000   0.9971    0.9457    0.9508    1.0029
454.calculix    1.0000   0.9994    0.9612    1.0012    0.9994
459.GemsFDTD    1.0000   1.0035    0.9094    1.0057    1.0014
465.tonto       1.0000   1.0155    1.0010    1.0000    1.0010
470.lbm         1.0000   1.2551    1.2406    1.2613    1.0008
481.wrf         1.0000   0.9971    0.9514    0.9527    0.9501
482.sphinx3     1.0000   1.0045    0.9994    1.0083    0.9994
---------------
specfp geomean  1.0000   1.0323    0.9886    1.0098    0.9941

x86-64 sysbench performance ratios
==================================
Client      slab     slub  slub-min  slub-rvrt       slqb 
     1    1.0000   1.0390    0.9698     1.0396     0.9803
     2    1.0000   1.0080    1.0008     0.9986     0.9967
     3    1.0000   1.0132    1.0032     0.9904     0.9947
     4    1.0000   1.0222    1.0059     0.9898     0.9914
     5    1.0000   1.0025    1.0144     0.9929     0.9869
     6    1.0000   0.9959    1.0118     1.0082     0.9974
     7    1.0000   1.0008    0.9805     0.9676     0.9829
     8    1.0000   0.9878    0.9875     0.9702     0.9850
     9    1.0000   1.0126    1.0322     0.9894     0.9966
    10    1.0000   0.9984    0.9968     0.9947     1.0265
    11    1.0000   1.0028    1.0086     0.9922     1.0215
    12    1.0000   1.0044    1.0044     0.9910     0.9965
    13    1.0000   0.9940    0.9929     0.9854     0.9849
    14    1.0000   0.9997    1.0127     0.9892     0.9870
    15    1.0000   0.9984    1.0044     0.9905     1.0029
    16    1.0000   0.9912    0.9878     1.0034     0.9912
Geo. mean 1.0000   1.0044    1.0007     0.9932     0.9951

ppc64 speccpu performance ratios
================================
Integer tests
SPEC test       slab       slub  slub-mi  slub-rvrt      slqb
400.perlbench   1.0000   1.0008    0.9954    1.0008    1.0008
401.bzip2       1.0000   0.9993    0.9980    0.9973    1.0000
403.gcc         1.0000   0.9983    0.9991    0.9975    0.9983
429.mcf         1.0000   1.0004    0.9954    1.0000    1.0004
445.gobmk       1.0000   1.0009    1.0028    1.0009    1.0018
456.hmmer       1.0000   1.0054    1.0040    0.9993    0.9987
458.sjeng       1.0000   0.9976    1.0006    1.0006    0.9976
462.libquantum  1.0000   1.0039    1.0051    1.0032    1.0039
464.h264ref     1.0000   0.9958    0.9988    0.9988    0.9994
471.omnetpp     1.0000   0.9935    0.9859    0.9886    0.9913
473.astar       1.0000   0.9978    1.0000    0.9978    1.0022
483.xalancbmk   1.0000   1.0008    1.0017    1.0000    1.0017
---------------
specint geomean 1.0000   0.9995    0.9989    0.9987    0.9997

Floating Point Tests
SPEC test       slab       slub  slub-min slub-rvrt      slqb
410.bwaves      1.0000   0.9135    1.0016    0.9851    0.9439
416.gamess      1.0000   0.9990    0.9956    0.9922    0.9942
433.milc        1.0000   1.0018    1.0030    1.0036    1.0036
434.zeusmp      1.0000   1.0000    1.0008    0.9984    1.0008
435.gromacs     1.0000   0.9992    0.9962    0.9970    0.9985
436.cactusADM   1.0000   0.9914    1.0029    0.9931    0.9931
437.leslie3d    1.0000   1.0016    1.0027    1.0000    1.0019
444.namd        1.0000   0.9834    0.9976    0.9976    0.9904
447.dealII      1.0000   1.0045    1.0018    1.0018    1.0027
450.soplex      1.0000   0.9981    0.9953    0.9943    0.9981
453.povray      1.0000   0.9963    1.0037    0.9927    0.9083
454.calculix    1.0000   0.9980    0.9992    0.9968    0.9988
459.GemsFDTD    1.0000   0.9991    0.9981    0.9994    0.9981
465.tonto       1.0000   1.0024    0.9923    0.9976    1.0000
470.lbm         1.0000   1.0000    0.9988    0.9988    1.0000
481.wrf         1.0000   0.9981    1.0005    0.9971    0.9971
482.sphinx3     1.0000   0.9983    0.9970    0.9970    1.0000
---------------
specfp geomean  1.0000   0.9930    0.9992    0.9966    0.9897

ppc64 sysbench performance ratios
==================================
Client      slab     slub  slub-min  slub-rvrt       slqb 
     1    1.0000   0.9723    0.9876     0.9882     0.9675
     2    1.0000   0.9878    1.0010     0.9901     0.9586
     3    1.0000   0.9732    1.0025     0.9915     0.9492
     4    1.0000   0.9680    1.0021     1.0023     0.9803
     5    1.0000   0.9762    0.9945     0.9861     0.9780
     6    1.0000   0.9773    1.0039     0.9976     0.9774
     7    1.0000   0.9699    1.0051     0.9895     0.9708
     8    1.0000   0.9789    1.0041     0.9864     0.9734
     9    1.0000   0.9622    0.9951     0.9790     0.9627
    10    1.0000   0.9688    1.0024     0.9621     0.9708
    11    1.0000   0.9701    1.0033     0.9872     0.9706
    12    1.0000   0.9698    0.9999     0.9871     0.9728
    13    1.0000   0.9677    0.9978     0.9816     0.9695
    14    1.0000   0.9729    1.0067     0.9903     0.9726
    15    1.0000   0.9756    1.0027     0.9906     0.9730
    16    1.0000   0.9655    0.9975     0.9804     0.9668
Geo. mean 1.0000   0.9722    1.0004     0.9868     0.9696


Cache misses
============

Based on the profiles, here are the cache profiles of speccpu at least. I
ran out of time for writing a reporting script for sysbench but all the
necessary data is in the tar. Remember that the ratios are of improvements
so a ratio of 1.0463 implies 4.63% fewer cache misses than SLAB.

x86-64 speccpu cache-miss improvements
======================================
SPEC test         slab     slub slub-min slub-rvrt     slqb
perlbench       1.0000   1.0463   0.9861    1.0539   1.0147
bzip2           1.0000   1.0091   1.0051    1.0110   0.9925
gcc             1.0000   0.9579   0.9610    0.8922   0.9959
mcf             1.0000   1.0069   0.9970    0.9470   0.9786
gobmk           1.0000   0.9873   0.9942    0.9953   1.0032
hmmer           1.0000   0.7456   0.9739    1.0048   0.9373
sjeng           1.0000   1.0289   1.0154    1.1512   0.9695
libquantum      1.0000   1.0348   1.0010    1.0508   0.9971
h264ref         1.0000   1.0600   1.1158    1.1002   1.1486
omnetpp         1.0000   1.0014   0.9650    0.9687   0.9566
astar           1.0000   0.9867   1.0017    1.0045   1.0016
xalancbmk       1.0000   1.0935   1.0834    1.0090   1.0361
---------
specint geomean 1.0000   0.9925   1.0074    1.0136   1.0014

milc            1.0000   1.1239   1.1935    1.1025   1.1002
lbm             1.0000   1.2181   1.0002    1.2219   1.1871
sphinx3         1.0000   1.2743   1.0039    1.0107   1.2692
bwaves          1.0000   1.0145   0.8063    0.8042   0.8041
gamess          1.0000   1.0120   0.9974    0.9685   0.9914
zeusmp          1.0000   1.0769   0.9998    1.0013   1.0032
leslie3d        1.0000   1.0276   0.9558    1.0032   0.9901
GemsFDTD        1.0000   1.0052   1.0044    1.0039   0.9076
tonto           1.0000   0.9967   0.9778    0.9856   0.9887
gromacs         1.0000   1.0570   1.0017    1.0563   1.0008
cactusADM       1.0000   1.0117   1.0060    1.0786   0.9999
calculix        1.0000   1.0049   1.0022    1.0003   0.9469
wrf             1.0000   0.8324   0.9552    0.9675   0.9646
namd            1.0000   0.9892   1.0467    0.9985   0.9930
dealII          1.0000   1.0240   1.0105    1.0268   1.0097
soplex          1.0000   0.9731   1.0088    1.0131   0.9065
povray          1.0000   1.0100   1.0080    0.9757   0.9532
---------
specfp geomean  1.0000   1.0341   0.9962    1.0097   0.9959

ppc64 speccpu cache-miss improvements
=====================================
SPEC test         slab     slub slub-min slub-rvrt     slqb
perlbench       1.0000   1.0168   1.0065    1.0210   0.9777
bzip2           1.0000   1.0053   1.0304    0.9894   0.9885
gcc             1.0000   1.0008   1.0051    0.9974   1.0040
mcf             1.0000   0.9783   1.0045    0.9856   0.9717
gobmk           1.0000   1.0123   1.0197    1.0256   1.0274
hmmer           1.0000   0.9936   0.9741    0.9829   0.9961
sjeng           1.0000   0.9980   0.9839    1.0066   1.0197
libquantum      1.0000   1.0199   1.0020    0.9916   0.9752
h264ref         1.0000   1.0177   1.0064    1.0167   1.0258
omnetpp         1.0000   0.9904   0.9940    1.0002   0.9572
astar           1.0000   0.9926   1.0115    0.9946   0.9900
xalancbmk       1.0000   1.0131   1.0133    1.0090   1.0244
---------
specint geomean 1.0000   1.0032   1.0042    1.0016   0.9962

milc            1.0000   1.0140   1.0307    1.0317   1.0141
lbm             1.0000   0.9966   0.9971    1.0201   0.9811
sphinx3         1.0000   0.9904   0.9844    0.9871   0.9982
bwaves          1.0000   1.0106   1.0380    1.0071   1.0203
gamess          1.0000   1.0475   1.0286    1.0136   1.0194
zeusmp          1.0000   1.0274   1.0152    1.0284   1.0214
leslie3d        1.0000   0.9788   0.9640    1.0118   0.9583
GemsFDTD        1.0000   1.0236   1.0110    1.0201   0.9936
tonto           1.0000   1.0591   0.9458    0.9342   1.0341
gromacs         1.0000   1.0159   1.0037    0.9749   0.9904
cactusADM       1.0000   0.9946   1.0000    0.9914   1.0201
calculix        1.0000   1.0206   1.0313    1.0363   1.0331
wrf             1.0000   1.0222   1.0096    1.0606   1.0253
namd            1.0000   0.9862   0.9773    0.9557   0.9744
dealII          1.0000   1.0135   0.9904    1.0333   1.0205
soplex          1.0000   1.0213   1.0081    0.9998   0.9922
povray          1.0000   1.0201   1.0016    1.0419   1.0575
--------
specfp geomean  1.0000   1.0140   1.0019    1.0082   1.0088

Glancing through, it would appear that slub with default settings is often the
cache-friendliest, but it suffered badly on hmmer on the x86-64, probably an
accident of layout. While its cache usage for lbm and sphinx was drastically
improved for slub, it didn't translate into significantly better performance.
It's difficult to conclude anything from the cache figures other than no
allocator is obviously far worse than slab in terms of cache usage.


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-16 18:42               ` Mel Gorman
@ 2009-02-16 19:17                 ` Pekka Enberg
  2009-02-16 19:41                   ` Mel Gorman
                                     ` (2 more replies)
  2009-02-16 19:25                 ` Pekka Enberg
  1 sibling, 3 replies; 55+ messages in thread
From: Pekka Enberg @ 2009-02-16 19:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

Hi Mel,

Mel Gorman wrote:
> I haven't done much digging in here yet. Between the large page bug and
> other patches in my inbox, I haven't had the chance yet but that doesn't
> stop anyone else taking a look.

So how big does an improvement/regression have to be not to be 
considered within noise? I mean, I randomly picked one of the results 
("x86-64 speccpu integer tests") and ran it through my "summarize" 
script and got the following results:

		min      max      mean     std_dev
   slub		0.96     1.09     1.01     0.04
   slub-min	0.95     1.10     1.00     0.04
   slub-rvrt	0.90     1.08     0.99     0.05
   slqb		0.96     1.07     1.00     0.04

Apart from slub-rvrt (which seems to be regressing, interesting) all the 
allocators seem to perform equally well. Hmm?
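
A rough C sketch of what such a summary boils down to (the actual
"summarize" script isn't shown in this thread, so treat this purely as an
approximation):

#include <math.h>
#include <stddef.h>

/* Min, max, mean and (population) standard deviation of n > 0 ratios. */
static void summarize(const double *r, size_t n, double *min, double *max,
		      double *mean, double *std_dev)
{
	double sum = 0.0, var = 0.0;
	size_t i;

	*min = *max = r[0];
	for (i = 0; i < n; i++) {
		if (r[i] < *min)
			*min = r[i];
		if (r[i] > *max)
			*max = r[i];
		sum += r[i];
	}
	*mean = sum / n;
	for (i = 0; i < n; i++)
		var += (r[i] - *mean) * (r[i] - *mean);
	*std_dev = sqrt(var / n);
}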

Btw, Yanmin, do you have access to the tests Mel is running (especially 
the ones where slub-rvrt seems to do worse)? Can you see this kind of 
regression? The results make me wonder whether we should avoid reverting 
all of the page allocator pass-through and just add a kmalloc cache for 
8K allocations. Or not address the netperf regression at all. Double-hmm.

			Pekka

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-16 18:42               ` Mel Gorman
  2009-02-16 19:17                 ` Pekka Enberg
@ 2009-02-16 19:25                 ` Pekka Enberg
  2009-02-16 19:44                   ` Mel Gorman
  1 sibling, 1 reply; 55+ messages in thread
From: Pekka Enberg @ 2009-02-16 19:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

Hi Mel,

On Mon, Feb 16, 2009 at 8:42 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> Slightly later than hoped for, but here are the results of the profile
> run between the different slab allocators. It also includes information on
> the performance on SLUB with the allocator pass-thru logic reverted by commit
> http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=97a4871761e735b6f1acd3bc7c3bac30dae3eab9

Did you just cherry-pick the patch or did you run it with the
topic/slub/perf branch? There's a follow-up patch from Yanmin which
will make a difference for large allocations when page-allocator
pass-through is reverted:

http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=79b350ab63458ef1d11747b4f119baea96771a6e

                            Pekka

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-16 19:17                 ` Pekka Enberg
@ 2009-02-16 19:41                   ` Mel Gorman
  2009-02-16 19:43                     ` Pekka Enberg
  2009-02-17  1:06                   ` Zhang, Yanmin
  2009-02-17 16:20                   ` Christoph Lameter
  2 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2009-02-16 19:41 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Mon, Feb 16, 2009 at 09:17:58PM +0200, Pekka Enberg wrote:
> Hi Mel,
>
> Mel Gorman wrote:
>> I haven't done much digging in here yet. Between the large page bug and
>> other patches in my inbox, I haven't had the chance yet but that doesn't
>> stop anyone else taking a look.
>
> So how big does an improvement/regression have to be not to be  
> considered within noise? I mean, I randomly picked one of the results  
> ("x86-64 speccpu integer tests") and ran it through my "summarize"  
> script and got the following results:
>
> 		min      max      mean     std_dev
>   slub		0.96     1.09     1.01     0.04
>   slub-min	0.95     1.10     1.00     0.04
>   slub-rvrt	0.90     1.08     0.99     0.05
>   slqb		0.96     1.07     1.00     0.04
>

Well, it doesn't make a whole pile of sense to get the average of these ratios
or the deviation between them. Each of the tests behaves very differently. I'd
consider anything over 0.5% significant but I also have to admit I wasn't
doing multiple runs this time due to the length of time it takes. In a
previous test, I ran them 3 times each and didn't spot large deviations.

> Apart from slub-rvrt (which seems to be regressing, interesting) all the  
> allocators seem to perform equally well. Hmm?
>

For this stuff, they are reasonably close, but I don't believe they are
allocator intensive either. SPEC CPU was brought up as a workload HPC people
would care about. Bear in mind it's also not testing NUMA or CPU scalability
really well. It's one data point. netperf is a much more allocator intensive
workload.
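
For reference, the netperf UDP runs referred to elsewhere in this thread
("UDP-U-4k" and friends) are of roughly this form; the exact options used
are not given in the thread, so the invocation below is only an assumed
example:

  netserver
  netperf -t UDP_STREAM -H 127.0.0.1 -l 60 -- -m 4096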

> Btw, Yanmin, do you have access to the tests Mel is running (especially  
> the ones where slub-rvrt seems to do worse)? Can you see this kind of  
> regression? The results make me wonder whether we should avoid reverting 
> all of the page allocator pass-through and just add a kmalloc cache for  
> 8K allocations. Or not address the netperf regression at all. Double-hmm.
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-16 19:44                   ` Mel Gorman
@ 2009-02-16 19:42                     ` Pekka Enberg
  0 siblings, 0 replies; 55+ messages in thread
From: Pekka Enberg @ 2009-02-16 19:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

Mel Gorman wrote:
>> There's a follow-up patch from Yanmin which
>> will make a difference for large allocations when page-allocator
>> pass-through is reverted:
>>
>> http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=79b350ab63458ef1d11747b4f119baea96771a6e
> 
> Is this expected to make a difference to workloads that are not that
> allocator intensive? I doubt it'll make much difference to speccpu but
> conceivably it makes a difference to sysbench.

I doubt that too but I fail to see why it's regressing with the revert 
in the first place for speccpu. Maybe it's cache effects, dunno.

			Pekka

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-16 19:41                   ` Mel Gorman
@ 2009-02-16 19:43                     ` Pekka Enberg
  0 siblings, 0 replies; 55+ messages in thread
From: Pekka Enberg @ 2009-02-16 19:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

Mel Gorman wrote:
> On Mon, Feb 16, 2009 at 09:17:58PM +0200, Pekka Enberg wrote:
>> Hi Mel,
>>
>> Mel Gorman wrote:
>>> I haven't done much digging in here yet. Between the large page bug and
>>> other patches in my inbox, I haven't had the chance yet but that doesn't
>>> stop anyone else taking a look.
>> So how big does an improvement/regression have to be not to be  
>> considered within noise? I mean, I randomly picked one of the results  
>> ("x86-64 speccpu integer tests") and ran it through my "summarize"  
>> script and got the following results:
>>
>> 		min      max      mean     std_dev
>>   slub		0.96     1.09     1.01     0.04
>>   slub-min	0.95     1.10     1.00     0.04
>>   slub-rvrt	0.90     1.08     0.99     0.05
>>   slqb		0.96     1.07     1.00     0.04
>>
> 
> Well, it doesn't make a whole pile of sense to get the average of these ratios
> or the deviation between them. Each of the tests behaves very differently.

Uhm, yes. I need to learn to read one of these days.

			Pekka

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-16 19:25                 ` Pekka Enberg
@ 2009-02-16 19:44                   ` Mel Gorman
  2009-02-16 19:42                     ` Pekka Enberg
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2009-02-16 19:44 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin, Christoph Lameter

On Mon, Feb 16, 2009 at 09:25:35PM +0200, Pekka Enberg wrote:
> Hi Mel,
> 
> On Mon, Feb 16, 2009 at 8:42 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > Slightly later than hoped for, but here are the results of the profile
> > run between the different slab allocators. It also includes information on
> > the performance on SLUB with the allocator pass-thru logic reverted by commit
> > http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=97a4871761e735b6f1acd3bc7c3bac30dae3eab9
> 
> Did you just cherry-pick the patch or did you run it with the
> topic/slub/perf branch?

Cherry picked to minimise the number of factors involved.

> There's a follow-up patch from Yanmin which
> will make a difference for large allocations when page-allocator
> pass-through is reverted:
> 
> http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=79b350ab63458ef1d11747b4f119baea96771a6e
> 

Is this expected to make a difference to workloads that are not that
allocator intensive? I doubt it'll make much difference to speccpu but
conceivably it makes a difference to sysbench.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-16 19:17                 ` Pekka Enberg
  2009-02-16 19:41                   ` Mel Gorman
@ 2009-02-17  1:06                   ` Zhang, Yanmin
  2009-02-17 16:20                   ` Christoph Lameter
  2 siblings, 0 replies; 55+ messages in thread
From: Zhang, Yanmin @ 2009-02-17  1:06 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Mel Gorman, Nick Piggin, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming, Christoph Lameter

On Mon, 2009-02-16 at 21:17 +0200, Pekka Enberg wrote:
> Hi Mel,
> 
> Mel Gorman wrote:
> > I haven't done much digging in here yet. Between the large page bug and
> > other patches in my inbox, I haven't had the chance yet but that doesn't
> > stop anyone else taking a look.
> 
> So how big does an improvement/regression have to be not to be 
> considered within noise? I mean, I randomly picked one of the results 
> ("x86-64 speccpu integer tests") and ran it through my "summarize" 
> script and got the following results:
> 
> 		min      max      mean     std_dev
>    slub		0.96     1.09     1.01     0.04
>    slub-min	0.95     1.10     1.00     0.04
>    slub-rvrt	0.90     1.08     0.99     0.05
>    slqb		0.96     1.07     1.00     0.04
> 
> Apart from slub-rvrt (which seems to be regressing, interesting) all the 
> allocators seem to perform equally well. Hmm?
I wonder if a different kernel compilation might cause different cache
alignment, which could have a significant impact on such small result
differences.

If a workload isn't slab-allocation intensive, perhaps the impact caused by
a different compilation is a little bigger.


> 
> Btw, Yanmin, do you have access to the tests Mel is running (especially 
> the ones where slub-rvrt seems to do worse)?
As it takes a long time (more than 20 hours) to run cpu2006, I run cpu2000
instead. We are now trying to integrate cpu2006 into our testing
infrastructure; let me check that first.

>  Can you see this kind of 
> regression? The results make me wonder whether we should avoid reverting 
> all of the page allocator pass-through and just add a kmalloc cache for 
> 8K allocations. Or not address the netperf regression at all. Double-hmm.
> 
> 			Pekka


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-16 19:17                 ` Pekka Enberg
  2009-02-16 19:41                   ` Mel Gorman
  2009-02-17  1:06                   ` Zhang, Yanmin
@ 2009-02-17 16:20                   ` Christoph Lameter
  2009-02-17 17:01                     ` Pekka Enberg
  2 siblings, 1 reply; 55+ messages in thread
From: Christoph Lameter @ 2009-02-17 16:20 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Mel Gorman, Nick Piggin, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming, Zhang, Yanmin

On Mon, 16 Feb 2009, Pekka Enberg wrote:

> Btw, Yanmin, do you have access to the tests Mel is running (especially the
> ones where slub-rvrt seems to do worse)? Can you see this kind of regression?
> The results make me wonder whether we should avoid reverting all of the page
> allocator pass-through and just add a kmalloc cache for 8K allocations. Or not
> address the netperf regression at all. Double-hmm.


Going to 8k for the limit beyond which we pass through to the page allocator
may be the simplest and best solution. Someone please work on the page
allocator...


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-17 16:20                   ` Christoph Lameter
@ 2009-02-17 17:01                     ` Pekka Enberg
  2009-02-17 17:05                       ` Christoph Lameter
  0 siblings, 1 reply; 55+ messages in thread
From: Pekka Enberg @ 2009-02-17 17:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Nick Piggin, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming, Zhang, Yanmin

Hi Christoph,

On Mon, 16 Feb 2009, Pekka Enberg wrote:
> > Btw, Yanmin, do you have access to the tests Mel is running (especially the
> > ones where slub-rvrt seems to do worse)? Can you see this kind of regression?
> > The results make me wonder whether we should avoid reverting all of the page
> > allocator pass-through and just add a kmalloc cache for 8K allocations. Or not
> > address the netperf regression at all. Double-hmm.

On Tue, 2009-02-17 at 11:20 -0500, Christoph Lameter wrote:
> Going to 8k for the limit beyond which we pass through to the page allocator may
> be the simplest and best solution. Someone please work on the page
> allocator...

Yeah. Something like this totally untested patch, perhaps?

			Pekka

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 2f5c16b..e93cb3d 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -201,6 +201,13 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
 #define SLUB_DMA (__force gfp_t)0
 #endif
 
+/*
+ * The maximum allocation size that will be satisfied by the slab allocator for
+ * kmalloc(). Requests that exceed this limit are passed directly to the page
+ * allocator.
+ */
+#define SLAB_LIMIT (8 * 1024)
+
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 void *__kmalloc(size_t size, gfp_t flags);
 
@@ -212,7 +219,7 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
 	if (__builtin_constant_p(size)) {
-		if (size > PAGE_SIZE)
+		if (size > SLAB_LIMIT)
 			return kmalloc_large(size, flags);
 
 		if (!(flags & SLUB_DMA)) {
@@ -234,7 +241,7 @@ void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
 static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	if (__builtin_constant_p(size) &&
-		size <= PAGE_SIZE && !(flags & SLUB_DMA)) {
+		size <= SLAB_LIMIT && !(flags & SLUB_DMA)) {
 			struct kmem_cache *s = kmalloc_slab(size);
 
 		if (!s)
diff --git a/mm/slub.c b/mm/slub.c
index 0280eee..a324188 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2658,7 +2658,7 @@ void *__kmalloc(size_t size, gfp_t flags)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
+	if (unlikely(size > SLAB_LIMIT))
 		return kmalloc_large(size, flags);
 
 	s = get_slab(size, flags);
@@ -2686,7 +2686,7 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
+	if (unlikely(size > SLAB_LIMIT))
 		return kmalloc_large_node(size, flags, node);
 
 	s = get_slab(size, flags);
@@ -3223,7 +3223,7 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
+	if (unlikely(size > SLAB_LIMIT))
 		return kmalloc_large(size, gfpflags);
 
 	s = get_slab(size, gfpflags);
@@ -3239,7 +3239,7 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
+	if (unlikely(size > SLAB_LIMIT))
 		return kmalloc_large_node(size, gfpflags, node);
 
 	s = get_slab(size, gfpflags);



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-17 17:01                     ` Pekka Enberg
@ 2009-02-17 17:05                       ` Christoph Lameter
  2009-02-17 17:24                         ` Pekka Enberg
                                           ` (3 more replies)
  0 siblings, 4 replies; 55+ messages in thread
From: Christoph Lameter @ 2009-02-17 17:05 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Mel Gorman, Nick Piggin, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming, Zhang, Yanmin

Well yes you missed two locations (kmalloc_caches array has to be
redimensioned) and I also was writing the same patch...
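
The arithmetic behind the redimensioning, assuming 4K pages (PAGE_SHIFT = 12)
purely for illustration:

  kmalloc_caches[i] backs kmalloc() objects of size 2^i
  largest cache needed for 8k objects:  2^13 = 2^(PAGE_SHIFT + 1)
  highest index needed:                 PAGE_SHIFT + 1 = 13
  array entries needed:                 PAGE_SHIFT + 2 = 14 (indices 0..13)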

Here is mine:

Subject: SLUB: Do not pass 8k objects through to the page allocator

Increase the maximum object size in SLUB so that 8k objects are not
passed through to the page allocator anymore. The network stack uses 8k
objects for performance critical operations.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-02-17 10:45:51.000000000 -0600
+++ linux-2.6/include/linux/slub_def.h	2009-02-17 11:06:53.000000000 -0600
@@ -121,10 +121,21 @@
 #define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)

 /*
+ * Maximum kmalloc object size handled by SLUB. Larger object allocations
+ * are passed through to the page allocator. The page allocator "fastpath"
+ * is relatively slow so we need this value sufficiently high so that
+ * performance critical objects are allocated through the SLUB fastpath.
+ *
+ * This should be dropped to PAGE_SIZE / 2 once the page allocator
+ * "fastpath" becomes competitive with the slab allocator fastpaths.
+ */
+#define SLUB_MAX_SIZE (2 * PAGE_SIZE)
+
+/*
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1];
+extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 2];

 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -212,7 +223,7 @@
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
 	if (__builtin_constant_p(size)) {
-		if (size > PAGE_SIZE)
+		if (size > SLUB_MAX_SIZE)
 			return kmalloc_large(size, flags);

 		if (!(flags & SLUB_DMA)) {
@@ -234,7 +245,7 @@
 static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	if (__builtin_constant_p(size) &&
-		size <= PAGE_SIZE && !(flags & SLUB_DMA)) {
+		size <= SLUB_MAX_SIZE && !(flags & SLUB_DMA)) {
 			struct kmem_cache *s = kmalloc_slab(size);

 		if (!s)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-02-17 10:49:47.000000000 -0600
+++ linux-2.6/mm/slub.c	2009-02-17 10:58:14.000000000 -0600
@@ -2475,7 +2475,7 @@
  *		Kmalloc subsystem
  *******************************************************************/

-struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned;
+struct kmem_cache kmalloc_caches[PAGE_SHIFT + 2] __cacheline_aligned;
 EXPORT_SYMBOL(kmalloc_caches);

 static int __init setup_slub_min_order(char *str)
@@ -2658,7 +2658,7 @@
 {
 	struct kmem_cache *s;

-	if (unlikely(size > PAGE_SIZE))
+	if (unlikely(size > SLUB_MAX_SIZE))
 		return kmalloc_large(size, flags);

 	s = get_slab(size, flags);
@@ -2686,7 +2686,7 @@
 {
 	struct kmem_cache *s;

-	if (unlikely(size > PAGE_SIZE))
+	if (unlikely(size > SLUB_MAX_SIZE))
 		return kmalloc_large_node(size, flags, node);

 	s = get_slab(size, flags);
@@ -3223,7 +3223,7 @@
 {
 	struct kmem_cache *s;

-	if (unlikely(size > PAGE_SIZE))
+	if (unlikely(size > SLUB_MAX_SIZE))
 		return kmalloc_large(size, gfpflags);

 	s = get_slab(size, gfpflags);
@@ -3239,7 +3239,7 @@
 {
 	struct kmem_cache *s;

-	if (unlikely(size > PAGE_SIZE))
+	if (unlikely(size > SLUB_MAX_SIZE))
 		return kmalloc_large_node(size, gfpflags, node);

 	s = get_slab(size, gfpflags);

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-17 17:05                       ` Christoph Lameter
@ 2009-02-17 17:24                         ` Pekka Enberg
  2009-02-17 18:11                         ` Johannes Weiner
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 55+ messages in thread
From: Pekka Enberg @ 2009-02-17 17:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Nick Piggin, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming, Zhang, Yanmin

On Tue, 2009-02-17 at 12:05 -0500, Christoph Lameter wrote:
> Well yes you missed two locations (kmalloc_caches array has to be
> redimensioned) and I also was writing the same patch...

:-)

On Tue, 2009-02-17 at 12:05 -0500, Christoph Lameter wrote:
> Subject: SLUB: Do not pass 8k objects through to the page allocator
> 
> Increase the maximum object size in SLUB so that 8k objects are not
> passed through to the page allocator anymore. The network stack uses 8k
> objects for performance critical operations.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Looks good to me. Yanmin, please retest netperf with this one instead if
you have the time. I'll replace the revert with this patch but keep your
default order tweak patch.

			Pekka


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-17 17:05                       ` Christoph Lameter
  2009-02-17 17:24                         ` Pekka Enberg
@ 2009-02-17 18:11                         ` Johannes Weiner
  2009-02-17 19:43                           ` Pekka Enberg
  2009-02-18  1:05                         ` Zhang, Yanmin
  2009-02-19  8:40                         ` Pekka Enberg
  3 siblings, 1 reply; 55+ messages in thread
From: Johannes Weiner @ 2009-02-17 18:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Mel Gorman, Nick Piggin, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming, Zhang, Yanmin

On Tue, Feb 17, 2009 at 12:05:07PM -0500, Christoph Lameter wrote:
> Well yes you missed two locations (kmalloc_caches array has to be
> redimensioned) and I also was writing the same patch...
> 
> Here is mine:
> 
> Subject: SLUB: Do not pass 8k objects through to the page allocator
> 
> Increase the maximum object size in SLUB so that 8k objects are not
> passed through to the page allocator anymore. The network stack uses 8k
> objects for performance critical operations.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h	2009-02-17 10:45:51.000000000 -0600
> +++ linux-2.6/include/linux/slub_def.h	2009-02-17 11:06:53.000000000 -0600
> @@ -121,10 +121,21 @@
>  #define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
> 
>  /*
> + * Maximum kmalloc object size handled by SLUB. Larger object allocations
> + * are passed through to the page allocator. The page allocator "fastpath"
> + * is relatively slow so we need this value sufficiently high so that
> + * performance critical objects are allocated through the SLUB fastpath.
> + *
> + * This should be dropped to PAGE_SIZE / 2 once the page allocator
> + * "fastpath" becomes competitive with the slab allocator fastpaths.
> + */
> +#define SLUB_MAX_SIZE (2 * PAGE_SIZE)

This relies on PAGE_SIZE being 4k.  If you want 8k, why don't you say
so?  Pekka did this explicitly.

	Hannes

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-17 18:11                         ` Johannes Weiner
@ 2009-02-17 19:43                           ` Pekka Enberg
  2009-02-17 20:04                             ` Christoph Lameter
  0 siblings, 1 reply; 55+ messages in thread
From: Pekka Enberg @ 2009-02-17 19:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christoph Lameter, Mel Gorman, Nick Piggin, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming, Zhang, Yanmin

On Tue, Feb 17, 2009 at 12:05:07PM -0500, Christoph Lameter wrote:
>> Index: linux-2.6/include/linux/slub_def.h
>> ===================================================================
>> --- linux-2.6.orig/include/linux/slub_def.h   2009-02-17 10:45:51.000000000 -0600
>> +++ linux-2.6/include/linux/slub_def.h        2009-02-17 11:06:53.000000000 -0600
>> @@ -121,10 +121,21 @@
>>  #define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
>>
>>  /*
>> + * Maximum kmalloc object size handled by SLUB. Larger object allocations
>> + * are passed through to the page allocator. The page allocator "fastpath"
>> + * is relatively slow so we need this value sufficiently high so that
>> + * performance critical objects are allocated through the SLUB fastpath.
>> + *
>> + * This should be dropped to PAGE_SIZE / 2 once the page allocator
>> + * "fastpath" becomes competitive with the slab allocator fastpaths.
>> + */
>> +#define SLUB_MAX_SIZE (2 * PAGE_SIZE)

On Tue, Feb 17, 2009 at 8:11 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> This relies on PAGE_SIZE being 4k.  If you want 8k, why don't you say
> so?  Pekka did this explicitly.

That could be a problem, sure. Especially for architectures that have 64K pages.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-17 19:43                           ` Pekka Enberg
@ 2009-02-17 20:04                             ` Christoph Lameter
  2009-02-18  0:48                               ` KOSAKI Motohiro
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Lameter @ 2009-02-17 20:04 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Johannes Weiner, Mel Gorman, Nick Piggin, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming, Zhang, Yanmin

On Tue, 17 Feb 2009, Pekka Enberg wrote:

> >> +#define SLUB_MAX_SIZE (2 * PAGE_SIZE)
>
> On Tue, Feb 17, 2009 at 8:11 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > This relies on PAGE_SIZE being 4k.  If you want 8k, why don't you say
> > so?  Pekka did this explicitly.
>
> That could be a problem, sure. Especially for architectures that have 64K pages.

You could likely put a complicated formula in there instead. But 2 *
PAGE_SIZE is simple and will work on all platforms regardless of pagesize.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-17 20:04                             ` Christoph Lameter
@ 2009-02-18  0:48                               ` KOSAKI Motohiro
  2009-02-18  8:09                                 ` Pekka Enberg
  0 siblings, 1 reply; 55+ messages in thread
From: KOSAKI Motohiro @ 2009-02-18  0:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Pekka Enberg, Johannes Weiner, Mel Gorman,
	Nick Piggin, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

> On Tue, 17 Feb 2009, Pekka Enberg wrote:
> 
> > >> +#define SLUB_MAX_SIZE (2 * PAGE_SIZE)
> >
> > On Tue, Feb 17, 2009 at 8:11 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > This relies on PAGE_SIZE being 4k.  If you want 8k, why don't you say
> > > so?  Pekka did this explicitly.
> >
> > That could be a problem, sure. Especially for architectures that have 64K pages.
> 
> You could likely put a complicated formula in there instead. But 2 *
> PAGE_SIZE is simple and will work on all platforms regardless of pagesize.

I think 2 * PAGE_SIZE is best, but the patch description needs to change.
That's because almost all architectures use two pages for the stack, and the
current page allocator doesn't have a delayed consolidation mechanism for
order-1 pages.

In addition, if Pekka's patch (SLAB_LIMIT = 8K) runs on ia64, a 16K
allocation always falls back to the page allocator and uses 64K (4 times the
memory consumption!).
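
To make the page-size dependence concrete (default page sizes assumed here
purely for illustration):

  2 * PAGE_SIZE cut-off:  4K pages -> 8K,  16K pages -> 32K,  64K pages -> 128K
  fixed 8K cut-off on a 64K-page build (e.g. ia64 with 64K pages):
    kmalloc(16K) -> page allocator -> one order-0 page = 64K, i.e. 4x the request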

Am I misunderstanding anything?




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-17 17:05                       ` Christoph Lameter
  2009-02-17 17:24                         ` Pekka Enberg
  2009-02-17 18:11                         ` Johannes Weiner
@ 2009-02-18  1:05                         ` Zhang, Yanmin
  2009-02-18  7:48                           ` Pekka Enberg
  2009-02-19  8:40                         ` Pekka Enberg
  3 siblings, 1 reply; 55+ messages in thread
From: Zhang, Yanmin @ 2009-02-18  1:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Mel Gorman, Nick Piggin, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming

On Tue, 2009-02-17 at 12:05 -0500, Christoph Lameter wrote:
> Well yes you missed two locations (kmalloc_caches array has to be
> redimensioned) and I also was writing the same patch...
> 
> Here is mine:
> 
> Subject: SLUB: Do not pass 8k objects through to the page allocator
> 
> Increase the maximum object size in SLUB so that 8k objects are not
> passed through to the page allocator anymore. The network stack uses 8k
> objects for performance critical operations.
Kernel 2.6.29-rc2 panics with the patch.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8028fae3>] kmem_cache_alloc+0x43/0x97
PGD 0 
Oops: 0000 [#1] SMP 
last sysfs file: 
CPU 0 
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.29-rc2slubstat8k #1
RIP: 0010:[<ffffffff8028fae3>]  [<ffffffff8028fae3>] kmem_cache_alloc+0x43/0x97
RSP: 0018:ffff88022f865e20  EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000246 RCX: 0000000000000002
RDX: 0000000000000000 RSI: 000000000000063f RDI: ffffffff808096c7
RBP: 00000000000000d0 R08: 0000000000000004 R09: 000000000012e941
R10: 0000000000000002 R11: 0000000000000020 R12: ffffffff80991c48
R13: ffffffff809a9b43 R14: ffffffff809f8000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff80a13080(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff88022f864000, task ffff88022f868000)
Stack:
 ffffffff809f43e0 0000000000000020 ffffffff809aa469 0000000000000086
 ffffffff809f8000 ffffffff809a9b43 ffffffff80aaae80 ffffffff809f43e0
 0000000000000020 ffffffff809aa469 0000000000000000 ffffffff809d86a0
Call Trace:
 [<ffffffff809aa469>] ? populate_rootfs+0x0/0xdf
 [<ffffffff809a9b43>] ? unpack_to_rootfs+0x59/0x97f
 [<ffffffff809aa469>] ? populate_rootfs+0x0/0xdf
 [<ffffffff809aa481>] ? populate_rootfs+0x18/0xdf
 [<ffffffff80209051>] ? _stext+0x51/0x120
 [<ffffffff802d69b2>] ? create_proc_entry+0x73/0x8a
 [<ffffffff802619c0>] ? register_irq_proc+0x92/0xaa
 [<ffffffff809a4896>] ? kernel_init+0x12e/0x188
 [<ffffffff8020ce3a>] ? child_rip+0xa/0x20
 [<ffffffff809a4768>] ? kernel_init+0x0/0x188
 [<ffffffff8020ce30>] ? child_rip+0x0/0x20
Code: be 3f 06 00 00 48 c7 c7 c7 96 80 80 e8 b8 e2 f9 ff e8 c5 c2 45 00 9c 5b fa 65 8b 04 25 24 00 00 00 48 98 49 8b 94 c4 e8  
RIP  [<ffffffff8028fae3>] kmem_cache_alloc+0x43/0x97
 RSP <ffff88022f865e20>
CR2: 0000000000000000
---[ end trace a7919e7f17c0a725 ]---
swapper used greatest stack depth: 5376 bytes left
Kernel panic - not syncing: Attempted to kill init!

> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h	2009-02-17 10:45:51.000000000 -0600
> +++ linux-2.6/include/linux/slub_def.h	2009-02-17 11:06:53.000000000 -0600
> @@ -121,10 +121,21 @@
>  #define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
> 
>  /*
> + * Maximum kmalloc object size handled by SLUB. Larger object allocations
> + * are passed through to the page allocator. The page allocator "fastpath"
> + * is relatively slow so we need this value sufficiently high so that
> + * performance critical objects are allocated through the SLUB fastpath.
> + *
> + * This should be dropped to PAGE_SIZE / 2 once the page allocator
> + * "fastpath" becomes competitive with the slab allocator fastpaths.
> + */
> +#define SLUB_MAX_SIZE (2 * PAGE_SIZE)
> +
> +/*
>   * We keep the general caches in an array of slab caches that are used for
>   * 2^x bytes of allocations.
>   */
> -extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1];
> +extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 2];
> 
>  /*
>   * Sorry that the following has to be that ugly but some versions of GCC
> @@ -212,7 +223,7 @@
>  static __always_inline void *kmalloc(size_t size, gfp_t flags)
>  {
>  	if (__builtin_constant_p(size)) {
> -		if (size > PAGE_SIZE)
> +		if (size > SLUB_MAX_SIZE)
>  			return kmalloc_large(size, flags);
> 
>  		if (!(flags & SLUB_DMA)) {
> @@ -234,7 +245,7 @@
>  static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
>  {
>  	if (__builtin_constant_p(size) &&
> -		size <= PAGE_SIZE && !(flags & SLUB_DMA)) {
> +		size <= SLUB_MAX_SIZE && !(flags & SLUB_DMA)) {
>  			struct kmem_cache *s = kmalloc_slab(size);
> 
>  		if (!s)
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2009-02-17 10:49:47.000000000 -0600
> +++ linux-2.6/mm/slub.c	2009-02-17 10:58:14.000000000 -0600
> @@ -2475,7 +2475,7 @@
>   *		Kmalloc subsystem
>   *******************************************************************/
> 
> -struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned;
> +struct kmem_cache kmalloc_caches[PAGE_SHIFT + 2] __cacheline_aligned;
>  EXPORT_SYMBOL(kmalloc_caches);
> 
>  static int __init setup_slub_min_order(char *str)
> @@ -2658,7 +2658,7 @@
>  {
>  	struct kmem_cache *s;
> 
> -	if (unlikely(size > PAGE_SIZE))
> +	if (unlikely(size > SLUB_MAX_SIZE))
>  		return kmalloc_large(size, flags);
> 
>  	s = get_slab(size, flags);
> @@ -2686,7 +2686,7 @@
>  {
>  	struct kmem_cache *s;
> 
> -	if (unlikely(size > PAGE_SIZE))
> +	if (unlikely(size > SLUB_MAX_SIZE))
>  		return kmalloc_large_node(size, flags, node);
> 
>  	s = get_slab(size, flags);
> @@ -3223,7 +3223,7 @@
>  {
>  	struct kmem_cache *s;
> 
> -	if (unlikely(size > PAGE_SIZE))
> +	if (unlikely(size > SLUB_MAX_SIZE))
>  		return kmalloc_large(size, gfpflags);
> 
>  	s = get_slab(size, gfpflags);
> @@ -3239,7 +3239,7 @@
>  {
>  	struct kmem_cache *s;
> 
> -	if (unlikely(size > PAGE_SIZE))
> +	if (unlikely(size > SLUB_MAX_SIZE))
>  		return kmalloc_large_node(size, gfpflags, node);
> 
>  	s = get_slab(size, gfpflags);


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-18  1:05                         ` Zhang, Yanmin
@ 2009-02-18  7:48                           ` Pekka Enberg
  2009-02-18  8:43                             ` Zhang, Yanmin
  0 siblings, 1 reply; 55+ messages in thread
From: Pekka Enberg @ 2009-02-18  7:48 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Mel Gorman, Nick Piggin, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming

Hi Yanmin,

On Wed, 2009-02-18 at 09:05 +0800, Zhang, Yanmin wrote:
> On Tue, 2009-02-17 at 12:05 -0500, Christoph Lameter wrote:
> > Well yes you missed two locations (kmalloc_caches array has to be
> > redimensioned) and I also was writing the same patch...
> > 
> > Here is mine:
> > 
> > Subject: SLUB: Do not pass 8k objects through to the page allocator
> > 
> > Increase the maximum object size in SLUB so that 8k objects are not
> > passed through to the page allocator anymore. The network stack uses 8k
> > objects for performance critical operations.
> Kernel 2.6.29-rc2 panics with the patch.
> 
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: [<ffffffff8028fae3>] kmem_cache_alloc+0x43/0x97
> PGD 0 
> Oops: 0000 [#1] SMP 
> last sysfs file: 
> CPU 0 
> Modules linked in:
> Pid: 1, comm: swapper Not tainted 2.6.29-rc2slubstat8k #1
> RIP: 0010:[<ffffffff8028fae3>]  [<ffffffff8028fae3>] kmem_cache_alloc+0x43/0x97
> RSP: 0018:ffff88022f865e20  EFLAGS: 00010046
> RAX: 0000000000000000 RBX: 0000000000000246 RCX: 0000000000000002
> RDX: 0000000000000000 RSI: 000000000000063f RDI: ffffffff808096c7
> RBP: 00000000000000d0 R08: 0000000000000004 R09: 000000000012e941
> R10: 0000000000000002 R11: 0000000000000020 R12: ffffffff80991c48
> R13: ffffffff809a9b43 R14: ffffffff809f8000 R15: 0000000000000000
> FS:  0000000000000000(0000) GS:ffffffff80a13080(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 1, threadinfo ffff88022f864000, task ffff88022f868000)
> Stack:
>  ffffffff809f43e0 0000000000000020 ffffffff809aa469 0000000000000086
>  ffffffff809f8000 ffffffff809a9b43 ffffffff80aaae80 ffffffff809f43e0
>  0000000000000020 ffffffff809aa469 0000000000000000 ffffffff809d86a0
> Call Trace:
>  [<ffffffff809aa469>] ? populate_rootfs+0x0/0xdf
>  [<ffffffff809a9b43>] ? unpack_to_rootfs+0x59/0x97f
>  [<ffffffff809aa469>] ? populate_rootfs+0x0/0xdf
>  [<ffffffff809aa481>] ? populate_rootfs+0x18/0xdf
>  [<ffffffff80209051>] ? _stext+0x51/0x120
>  [<ffffffff802d69b2>] ? create_proc_entry+0x73/0x8a
>  [<ffffffff802619c0>] ? register_irq_proc+0x92/0xaa
>  [<ffffffff809a4896>] ? kernel_init+0x12e/0x188
>  [<ffffffff8020ce3a>] ? child_rip+0xa/0x20
>  [<ffffffff809a4768>] ? kernel_init+0x0/0x188
>  [<ffffffff8020ce30>] ? child_rip+0x0/0x20
> Code: be 3f 06 00 00 48 c7 c7 c7 96 80 80 e8 b8 e2 f9 ff e8 c5 c2 45 00 9c 5b fa 65 8b 04 25 24 00 00 00 48 98 49 8b 94 c4 e8  
> RIP  [<ffffffff8028fae3>] kmem_cache_alloc+0x43/0x97
>  RSP <ffff88022f865e20>
> CR2: 0000000000000000
> ---[ end trace a7919e7f17c0a725 ]---
> swapper used greatest stack depth: 5376 bytes left
> Kernel panic - not syncing: Attempted to kill init!

Aah, we need to fix up some more PAGE_SHIFTs in the code.

			Pekka

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 2f5c16b..e217a7a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -121,10 +121,23 @@ struct kmem_cache {
 #define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
 
 /*
+ * Maximum kmalloc object size handled by SLUB. Larger object allocations
+ * are passed through to the page allocator. The page allocator "fastpath"
+ * is relatively slow so we need this value sufficiently high so that
+ * performance critical objects are allocated through the SLUB fastpath.
+ *
+ * This should be dropped to PAGE_SIZE / 2 once the page allocator
+ * "fastpath" becomes competitive with the slab allocator fastpaths.
+ */
+#define SLUB_MAX_SIZE (2 * PAGE_SIZE)
+
+#define SLUB_PAGE_SHIFT (PAGE_SHIFT + 2)
+
+/*
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1];
+extern struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -212,7 +225,7 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
 	if (__builtin_constant_p(size)) {
-		if (size > PAGE_SIZE)
+		if (size > SLUB_MAX_SIZE)
 			return kmalloc_large(size, flags);
 
 		if (!(flags & SLUB_DMA)) {
@@ -234,7 +247,7 @@ void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
 static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	if (__builtin_constant_p(size) &&
-		size <= PAGE_SIZE && !(flags & SLUB_DMA)) {
+		size <= SLUB_MAX_SIZE && !(flags & SLUB_DMA)) {
 			struct kmem_cache *s = kmalloc_slab(size);
 
 		if (!s)
diff --git a/mm/slub.c b/mm/slub.c
index 0280eee..43a0c53 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2475,7 +2475,7 @@ EXPORT_SYMBOL(kmem_cache_destroy);
  *		Kmalloc subsystem
  *******************************************************************/
 
-struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned;
+struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT] __cacheline_aligned;
 EXPORT_SYMBOL(kmalloc_caches);
 
 static int __init setup_slub_min_order(char *str)
@@ -2537,7 +2537,7 @@ panic:
 }
 
 #ifdef CONFIG_ZONE_DMA
-static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1];
+static struct kmem_cache *kmalloc_caches_dma[SLUB_PAGE_SHIFT];
 
 static void sysfs_add_func(struct work_struct *w)
 {
@@ -2658,7 +2658,7 @@ void *__kmalloc(size_t size, gfp_t flags)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
+	if (unlikely(size > SLUB_MAX_SIZE))
 		return kmalloc_large(size, flags);
 
 	s = get_slab(size, flags);
@@ -2686,7 +2686,7 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
+	if (unlikely(size > SLUB_MAX_SIZE))
 		return kmalloc_large_node(size, flags, node);
 
 	s = get_slab(size, flags);
@@ -2986,7 +2986,7 @@ void __init kmem_cache_init(void)
 		caches++;
 	}
 
-	for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) {
+	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) {
 		create_kmalloc_cache(&kmalloc_caches[i],
 			"kmalloc", 1 << i, GFP_KERNEL);
 		caches++;
@@ -3023,7 +3023,7 @@ void __init kmem_cache_init(void)
 	slab_state = UP;
 
 	/* Provide the correct kmalloc names now that the caches are up */
-	for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++)
+	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++)
 		kmalloc_caches[i]. name =
 			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
 
@@ -3223,7 +3223,7 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
+	if (unlikely(size > SLUB_MAX_SIZE))
 		return kmalloc_large(size, gfpflags);
 
 	s = get_slab(size, gfpflags);
@@ -3239,7 +3239,7 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
+	if (unlikely(size > SLUB_MAX_SIZE))
 		return kmalloc_large_node(size, gfpflags, node);
 
 	s = get_slab(size, gfpflags);



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-18  0:48                               ` KOSAKI Motohiro
@ 2009-02-18  8:09                                 ` Pekka Enberg
  2009-02-19  0:05                                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 55+ messages in thread
From: Pekka Enberg @ 2009-02-18  8:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Christoph Lameter, Johannes Weiner, Mel Gorman, Nick Piggin,
	Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

Hi!

On Wed, 2009-02-18 at 09:48 +0900, KOSAKI Motohiro wrote:
> I think 2 * PAGE_SIZE is best and the patch description is needed change.
> it's because almost architecture use two pages for stack and current page
> allocator don't have delayed consolidation mechanism for order-1 page.

Do you mean alloc_thread_info()? Not all architectures use kmalloc() to
implement it so I'm not sure if that's relevant for this patch.

On Wed, 2009-02-18 at 09:48 +0900, KOSAKI Motohiro wrote:
> In addition, if pekka patch (SLAB_LIMIT = 8K) run on ia64, 16K allocation 
> always fallback to page allocator and using 64K (4 times memory consumption!).

Yes, correct, but SLUB does that already by passing all allocations over
4K to the page allocator.

I'm not totally against 2 * PAGE_SIZE but I just worry that as SLUB
performance will be bound to architecture page size, we will see skewed
results in performance tests without realizing it. That's why I'm in
favor of a fixed size that's unified across architectures.

			Pekka


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-18  7:48                           ` Pekka Enberg
@ 2009-02-18  8:43                             ` Zhang, Yanmin
  2009-02-18  9:01                               ` Pekka Enberg
  0 siblings, 1 reply; 55+ messages in thread
From: Zhang, Yanmin @ 2009-02-18  8:43 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Mel Gorman, Nick Piggin, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming

On Wed, 2009-02-18 at 09:48 +0200, Pekka Enberg wrote:
> Hi Yanmin,
> 
> On Wed, 2009-02-18 at 09:05 +0800, Zhang, Yanmin wrote:
> > On Tue, 2009-02-17 at 12:05 -0500, Christoph Lameter wrote:
> > > Well yes you missed two locations (kmalloc_caches array has to be
> > > redimensioned) and I also was writing the same patch...
> > > 
> > > Here is mine:
> > > 
> > > Subject: SLUB: Do not pass 8k objects through to the page allocator
> > > 
> > > Increase the maximum object size in SLUB so that 8k objects are not
> > > passed through to the page allocator anymore. The network stack uses 8k
> > > objects for performance critical operations.
> > Kernel 2.6.29-rc2 panics with the patch.
> > 
> > BUG: unable to handle kernel NULL pointer dereference at (null)
> > IP: [<ffffffff8028fae3>] kmem_cache_alloc+0x43/0x97
> > PGD 0 
> > Oops: 0000 [#1] SMP 
> > last sysfs file: 
> > CPU 0 
> > Modules linked in:
> > Pid: 1, comm: swapper Not tainted 2.6.29-rc2slubstat8k #1
> > RIP: 0010:[<ffffffff8028fae3>]  [<ffffffff8028fae3>] kmem_cache_alloc+0x43/0x97
> > RSP: 0018:ffff88022f865e20  EFLAGS: 00010046
> > RAX: 0000000000000000 RBX: 0000000000000246 RCX: 0000000000000002
> > RDX: 0000000000000000 RSI: 000000000000063f RDI: ffffffff808096c7
> > RBP: 00000000000000d0 R08: 0000000000000004 R09: 000000000012e941
> > R10: 0000000000000002 R11: 0000000000000020 R12: ffffffff80991c48
> > R13: ffffffff809a9b43 R14: ffffffff809f8000 R15: 0000000000000000
> > FS:  0000000000000000(0000) GS:ffffffff80a13080(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process swapper (pid: 1, threadinfo ffff88022f864000, task ffff88022f868000)
> > Stack:
> >  ffffffff809f43e0 0000000000000020 ffffffff809aa469 0000000000000086
> >  ffffffff809f8000 ffffffff809a9b43 ffffffff80aaae80 ffffffff809f43e0
> >  0000000000000020 ffffffff809aa469 0000000000000000 ffffffff809d86a0
> > Call Trace:
> >  [<ffffffff809aa469>] ? populate_rootfs+0x0/0xdf
> >  [<ffffffff809a9b43>] ? unpack_to_rootfs+0x59/0x97f
> >  [<ffffffff809aa469>] ? populate_rootfs+0x0/0xdf
> >  [<ffffffff809aa481>] ? populate_rootfs+0x18/0xdf
> >  [<ffffffff80209051>] ? _stext+0x51/0x120
> >  [<ffffffff802d69b2>] ? create_proc_entry+0x73/0x8a
> >  [<ffffffff802619c0>] ? register_irq_proc+0x92/0xaa
> >  [<ffffffff809a4896>] ? kernel_init+0x12e/0x188
> >  [<ffffffff8020ce3a>] ? child_rip+0xa/0x20
> >  [<ffffffff809a4768>] ? kernel_init+0x0/0x188
> >  [<ffffffff8020ce30>] ? child_rip+0x0/0x20
> > Code: be 3f 06 00 00 48 c7 c7 c7 96 80 80 e8 b8 e2 f9 ff e8 c5 c2 45 00 9c 5b fa 65 8b 04 25 24 00 00 00 48 98 49 8b 94 c4 e8  
> > RIP  [<ffffffff8028fae3>] kmem_cache_alloc+0x43/0x97
> >  RSP <ffff88022f865e20>
> > CR2: 0000000000000000
> > ---[ end trace a7919e7f17c0a725 ]---
> > swapper used greatest stack depth: 5376 bytes left
> > Kernel panic - not syncing: Attempted to kill init!
> 
> Aah, we need to fix up some more PAGE_SHIFTs in the code.
The new patch fixes the hang issue. The netperf UDP-U-4k result (starting CPU_NUM clients) is pretty good.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-18  8:43                             ` Zhang, Yanmin
@ 2009-02-18  9:01                               ` Pekka Enberg
  2009-02-18  9:19                                 ` Zhang, Yanmin
  0 siblings, 1 reply; 55+ messages in thread
From: Pekka Enberg @ 2009-02-18  9:01 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Mel Gorman, Nick Piggin, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming

On Wed, 2009-02-18 at 16:43 +0800, Zhang, Yanmin wrote:
> > > Code: be 3f 06 00 00 48 c7 c7 c7 96 80 80 e8 b8 e2 f9 ff e8 c5 c2
> 45 00 9c 5b fa 65 8b 04 25 24 00 00 00 48 98 49 8b 94 c4 e8  
> > > RIP  [<ffffffff8028fae3>] kmem_cache_alloc+0x43/0x97
> > >  RSP <ffff88022f865e20>
> > > CR2: 0000000000000000
> > > ---[ end trace a7919e7f17c0a725 ]---
> > > swapper used greatest stack depth: 5376 bytes left
> > > Kernel panic - not syncing: Attempted to kill init!
> > 
> > Aah, we need to fix up some more PAGE_SHIFTs in the code.
> The new patch fixes the hang issue. The netperf UDP-U-4k result (starting CPU_NUM clients) is pretty good.

Do you have your patch on top of it as well? Btw, can I add a Tested-by
tag from you to the patch?

			Pekka


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-18  9:01                               ` Pekka Enberg
@ 2009-02-18  9:19                                 ` Zhang, Yanmin
  0 siblings, 0 replies; 55+ messages in thread
From: Zhang, Yanmin @ 2009-02-18  9:19 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Mel Gorman, Nick Piggin, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming

On Wed, 2009-02-18 at 11:01 +0200, Pekka Enberg wrote:
> On Wed, 2009-02-18 at 16:43 +0800, Zhang, Yanmin wrote:
> > > > Code: be 3f 06 00 00 48 c7 c7 c7 96 80 80 e8 b8 e2 f9 ff e8 c5 c2
> > 45 00 9c 5b fa 65 8b 04 25 24 00 00 00 48 98 49 8b 94 c4 e8  
> > > > RIP  [<ffffffff8028fae3>] kmem_cache_alloc+0x43/0x97
> > > >  RSP <ffff88022f865e20>
> > > > CR2: 0000000000000000
> > > > ---[ end trace a7919e7f17c0a725 ]---
> > > > swapper used greatest stack depth: 5376 bytes left
> > > > Kernel panic - not syncing: Attempted to kill init!
> > > 
> > > Aah, we need to fix up some more PAGE_SHIFTs in the code.
> > The new patch fixes the hang issue. The netperf UDP-U-4k result (starting CPU_NUM clients) is pretty good.
> 
> Do you have your patch on top of it as well?
Yes.

>  Btw, can I add a Tested-by
> tag from you to the patch?
OK. Another test with UDP-U-4k (starting 1 client and binding the client and server to
different CPUs) shows an improved result, but it is not as good as SLQB's. However, we
can increase slub_max_order to get a result similar to SLQB's.
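
For reference, slub_max_order is a kernel boot parameter, so the tweak is
along these lines (the value 3 below is only an example, not necessarily
what was used here):

  slub_max_order=3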



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-18  8:09                                 ` Pekka Enberg
@ 2009-02-19  0:05                                   ` KOSAKI Motohiro
  2009-02-19  9:16                                     ` Pekka Enberg
  0 siblings, 1 reply; 55+ messages in thread
From: KOSAKI Motohiro @ 2009-02-19  0:05 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: kosaki.motohiro, Christoph Lameter, Johannes Weiner, Mel Gorman,
	Nick Piggin, Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

Hi Pekka,

> Hi!
> 
> On Wed, 2009-02-18 at 09:48 +0900, KOSAKI Motohiro wrote:
> > I think 2 * PAGE_SIZE is best and the patch description is needed change.
> > it's because almost architecture use two pages for stack and current page
> > allocator don't have delayed consolidation mechanism for order-1 page.
> 
> Do you mean alloc_thread_info()? Not all architectures use kmalloc() to
> implement it so I'm not sure if that's relevant for this patch.
> 
> On Wed, 2009-02-18 at 09:48 +0900, KOSAKI Motohiro wrote:
> > In addition, if pekka patch (SLAB_LIMIT = 8K) run on ia64, 16K allocation 
> > always fallback to page allocator and using 64K (4 times memory consumption!).
> 
> Yes, correct, but SLUB does that already by passing all allocations over
> 4K to the page allocator.

Hmm, OK, my mail was pointless.

But why? In my understanding, the slab framework mainly exists for efficient
sub-page allocation. Falling back to the page allocator for allocations over
4K on a 64K page-sized architecture seems inefficient.


> I'm not totally against 2 * PAGE_SIZE but I just worry that as SLUB
> performance will be bound to architecture page size, we will see skewed
> results in performance tests without realizing it. That's why I'm in
> favor of a fixed size that's unified across architectures.

fair point.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-17 17:05                       ` Christoph Lameter
                                           ` (2 preceding siblings ...)
  2009-02-18  1:05                         ` Zhang, Yanmin
@ 2009-02-19  8:40                         ` Pekka Enberg
  3 siblings, 0 replies; 55+ messages in thread
From: Pekka Enberg @ 2009-02-19  8:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Nick Piggin, Nick Piggin,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming, Zhang, Yanmin

On Tue, 2009-02-17 at 12:05 -0500, Christoph Lameter wrote:
> Well yes you missed two locations (kmalloc_caches array has to be
> redimensioned) and I also was writing the same patch...
> 
> Here is mine:
> 
> Subject: SLUB: Do not pass 8k objects through to the page allocator
> 
> Increase the maximum object size in SLUB so that 8k objects are not
> passed through to the page allocator anymore. The network stack uses 8k
> objects for performance critical operations.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

This is merged now with my fixlets:

http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=8573e12414365585bfd601dc8c093b3efbef8854


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-19  0:05                                   ` KOSAKI Motohiro
@ 2009-02-19  9:16                                     ` Pekka Enberg
  2009-02-19 12:51                                       ` KOSAKI Motohiro
  0 siblings, 1 reply; 55+ messages in thread
From: Pekka Enberg @ 2009-02-19  9:16 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Christoph Lameter, Johannes Weiner, Mel Gorman, Nick Piggin,
	Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

On Wed, 2009-02-18 at 09:48 +0900, KOSAKI Motohiro wrote:
> > > In addition, if pekka patch (SLAB_LIMIT = 8K) run on ia64, 16K allocation 
> > > always fallback to page allocator and using 64K (4 times memory consumption!).
> > 
> > Yes, correct, but SLUB does that already by passing all allocations over
> > 4K to the page allocator.
> 
> Hmm, OK, my earlier mail was beside the point then.
> 
> But why? In my understanding, the slab framework mainly exists for efficient
> sub-page allocation. Falling back to the page allocator for allocations over
> 4K on a 64K-page architecture seems inefficient.

I don't think any of the slab allocators are known for memory
efficiency. That said, the original patch description sums up the
rationale for page allocator pass-through:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=aadb4bc4a1f9108c1d0fbd121827c936c2ed4217

Interestingly enough, there seems to be some performance gain from it as
well, as seen in Mel Gorman's recent slab allocator benchmarks.

			Pekka


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-19  9:16                                     ` Pekka Enberg
@ 2009-02-19 12:51                                       ` KOSAKI Motohiro
  2009-02-19 13:15                                         ` Pekka Enberg
  0 siblings, 1 reply; 55+ messages in thread
From: KOSAKI Motohiro @ 2009-02-19 12:51 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Johannes Weiner, Mel Gorman, Nick Piggin,
	Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

2009/2/19 Pekka Enberg <penberg@cs.helsinki.fi>:
> On Wed, 2009-02-18 at 09:48 +0900, KOSAKI Motohiro wrote:
>> > > In addition, if Pekka's patch (SLAB_LIMIT = 8K) runs on ia64, a 16K allocation
>> > > always falls back to the page allocator and uses 64K (4 times the memory
>> > > consumption!).
>> >
>> > Yes, correct, but SLUB does that already by passing all allocations over
>> > 4K to the page allocator.
>>
>> Hmm, OK, my earlier mail was beside the point then.
>>
>> But why? In my understanding, the slab framework mainly exists for efficient
>> sub-page allocation. Falling back to the page allocator for allocations over
>> 4K on a 64K-page architecture seems inefficient.
>
> I don't think any of the slab allocators are known for memory
> efficiency. That said, the original patch description sums up the
> rationale for page allocator pass-through:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=aadb4bc4a1f9108c1d0fbd121827c936c2ed4217
>
> Interestingly enough, there seems to be some performance gain from it as
> well, as seen in Mel Gorman's recent slab allocator benchmarks.

Honestly, I'm a bit confused.
The patch at the above URL uses PAGE_SIZE, not 4K or any architecture-independent value.
Does your 4K mean PAGE_SIZE?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-19 12:51                                       ` KOSAKI Motohiro
@ 2009-02-19 13:15                                         ` Pekka Enberg
  2009-02-19 13:49                                           ` KOSAKI Motohiro
  0 siblings, 1 reply; 55+ messages in thread
From: Pekka Enberg @ 2009-02-19 13:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Christoph Lameter, Johannes Weiner, Mel Gorman, Nick Piggin,
	Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

On Thu, 2009-02-19 at 21:51 +0900, KOSAKI Motohiro wrote:
> 2009/2/19 Pekka Enberg <penberg@cs.helsinki.fi>:
> > On Wed, 2009-02-18 at 09:48 +0900, KOSAKI Motohiro wrote:
> >> > > In addition, if Pekka's patch (SLAB_LIMIT = 8K) runs on ia64, a 16K allocation
> >> > > always falls back to the page allocator and uses 64K (4 times the memory
> >> > > consumption!).
> >> >
> >> > Yes, correct, but SLUB does that already by passing all allocations over
> >> > 4K to the page allocator.
> >>
> >> Hmm, OK, my earlier mail was beside the point then.
> >>
> >> But why? In my understanding, the slab framework mainly exists for efficient
> >> sub-page allocation. Falling back to the page allocator for allocations over
> >> 4K on a 64K-page architecture seems inefficient.
> >
> > I don't think any of the slab allocators are known for memory
> > efficiency. That said, the original patch description sums up the
> > rationale for page allocator pass-through:
> >
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=aadb4bc4a1f9108c1d0fbd121827c936c2ed4217
> >
> > Interestingly enough, there seems to be some performance gain from it as
> > well, as seen in Mel Gorman's recent slab allocator benchmarks.
> 
> Honestly, I'm a bit confused.
> The patch at the above URL uses PAGE_SIZE, not 4K or any architecture-independent value.
> Does your 4K mean PAGE_SIZE?

Yes, I mean PAGE_SIZE. 4K page sizes are hard-wired into my brain,
sorry :-)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-19 13:15                                         ` Pekka Enberg
@ 2009-02-19 13:49                                           ` KOSAKI Motohiro
  2009-02-19 14:19                                             ` Christoph Lameter
  0 siblings, 1 reply; 55+ messages in thread
From: KOSAKI Motohiro @ 2009-02-19 13:49 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Johannes Weiner, Mel Gorman, Nick Piggin,
	Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

>> Honestly, I'm a bit confused.
>> The patch at the above URL uses PAGE_SIZE, not 4K or any architecture-independent value.
>> Does your 4K mean PAGE_SIZE?
>
> Yes, I mean PAGE_SIZE. 4K page sizes are hard-wired into my brain,
> sorry :-)

Thanks!
My confusion is cleared up now.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch] SLQB slab allocator (try 2)
  2009-02-19 13:49                                           ` KOSAKI Motohiro
@ 2009-02-19 14:19                                             ` Christoph Lameter
  0 siblings, 0 replies; 55+ messages in thread
From: Christoph Lameter @ 2009-02-19 14:19 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Pekka Enberg, Johannes Weiner, Mel Gorman, Nick Piggin,
	Nick Piggin, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Lin Ming, Zhang,
	Yanmin

What could be changed in the patch is to set SLUB_MAX_SIZE depending on
the page size of the underlying architecture.

#define SLUB_MAX_SIZE MAX(PAGE_SIZE, 8192)

So on 4k architectures SLUB_MAX_SIZE is set to 8192, and on 16k or 64k
arches it's set to PAGE_SIZE.

And then define

#define SLUB_MAX_KMALLOC_ORDER get_order(SLUB_MAX_SIZE)

which will be 1 on 4k arches and 0 on larger-page arches.

Then also the kmalloc array would need to be dimensioned using
SLUB_MAX_KMALLOC_ORDER.


The definition of SLUB_MAX_KMALLOC_ORDER could be a bit challenging for
the C compiler, since get_order() may not evaluate to a compile-time
constant that can be used to dimension an array.
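
One way around the compile-time-constant concern might be to keep everything
in the preprocessor. A standalone sketch of that idea (PAGE_SHIFT is faked
here so the snippet compiles on its own; in the kernel it comes from
asm/page.h, and this is not an actual patch):

#include <stdio.h>

#ifndef PAGE_SHIFT
#define PAGE_SHIFT 12			/* pretend 4k pages for this demo */
#endif
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

#if PAGE_SHIFT < 13			/* pages smaller than 8k */
#define SLUB_MAX_SIZE		8192	/* keep 8k network objects in the slab layer */
#define SLUB_MAX_KMALLOC_ORDER	1	/* 8192 bytes is an order-1 allocation here */
#else
#define SLUB_MAX_SIZE		PAGE_SIZE
#define SLUB_MAX_KMALLOC_ORDER	0	/* one page already covers 8192 bytes */
#endif

/* the point of the exercise: both values are plain compile-time constants,
 * so they can dimension arrays without troubling the compiler */
static int orders_seen[SLUB_MAX_KMALLOC_ORDER + 1];

int main(void)
{
	printf("SLUB_MAX_SIZE=%lu order=%d array slots=%zu\n",
	       (unsigned long)SLUB_MAX_SIZE, SLUB_MAX_KMALLOC_ORDER,
	       sizeof(orders_seen) / sizeof(orders_seen[0]));
	return 0;
}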


^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2009-02-19 14:28 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-01-23 15:46 [patch] SLQB slab allocator (try 2) Nick Piggin
2009-01-24  2:38 ` Zhang, Yanmin
2009-01-26  8:48 ` Pekka Enberg
2009-01-26  9:07   ` Peter Zijlstra
2009-01-26  9:10     ` Peter Zijlstra
2009-01-26 17:22     ` Christoph Lameter
2009-01-27  9:07       ` Peter Zijlstra
2009-01-27 20:21         ` Christoph Lameter
2009-02-03  2:04           ` Nick Piggin
2009-02-03 10:12   ` Mel Gorman
2009-02-03 10:36     ` Nick Piggin
2009-02-03 11:22       ` Mel Gorman
2009-02-03 11:26         ` Mel Gorman
2009-02-04  6:48         ` Nick Piggin
2009-02-04 15:27           ` Mel Gorman
2009-02-05  3:59             ` Nick Piggin
2009-02-05 13:49               ` Mel Gorman
2009-02-16 18:42               ` Mel Gorman
2009-02-16 19:17                 ` Pekka Enberg
2009-02-16 19:41                   ` Mel Gorman
2009-02-16 19:43                     ` Pekka Enberg
2009-02-17  1:06                   ` Zhang, Yanmin
2009-02-17 16:20                   ` Christoph Lameter
2009-02-17 17:01                     ` Pekka Enberg
2009-02-17 17:05                       ` Christoph Lameter
2009-02-17 17:24                         ` Pekka Enberg
2009-02-17 18:11                         ` Johannes Weiner
2009-02-17 19:43                           ` Pekka Enberg
2009-02-17 20:04                             ` Christoph Lameter
2009-02-18  0:48                               ` KOSAKI Motohiro
2009-02-18  8:09                                 ` Pekka Enberg
2009-02-19  0:05                                   ` KOSAKI Motohiro
2009-02-19  9:16                                     ` Pekka Enberg
2009-02-19 12:51                                       ` KOSAKI Motohiro
2009-02-19 13:15                                         ` Pekka Enberg
2009-02-19 13:49                                           ` KOSAKI Motohiro
2009-02-19 14:19                                             ` Christoph Lameter
2009-02-18  1:05                         ` Zhang, Yanmin
2009-02-18  7:48                           ` Pekka Enberg
2009-02-18  8:43                             ` Zhang, Yanmin
2009-02-18  9:01                               ` Pekka Enberg
2009-02-18  9:19                                 ` Zhang, Yanmin
2009-02-19  8:40                         ` Pekka Enberg
2009-02-16 19:25                 ` Pekka Enberg
2009-02-16 19:44                   ` Mel Gorman
2009-02-16 19:42                     ` Pekka Enberg
2009-02-03 11:28       ` Mel Gorman
2009-02-03 11:50         ` Nick Piggin
2009-02-03 12:01           ` Mel Gorman
2009-02-03 12:07             ` Nick Piggin
2009-02-03 12:26               ` Mel Gorman
2009-02-04 15:49               ` Christoph Lameter
2009-02-04 15:48           ` Christoph Lameter
2009-02-03 18:58     ` Pekka Enberg
2009-02-04 16:06       ` Christoph Lameter
