* [patch 00/21] Slab Fragmentation Reduction V12
@ 2008-05-10  3:08 Christoph Lameter
  2008-05-10  3:08 ` [patch 01/21] slub: Add defrag_ratio field and sysfs support Christoph Lameter
                   ` (20 more replies)
  0 siblings, 21 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel,
	Pekka Enberg, mpm

V11->V12:
- Pekka and I fixed various minor issues pointed out by Andrew.
- Split ext2/3/4 defrag support patches.
- Add more documentation.
- Revise the way that slab defrag is triggered from reclaim. No longer
  use a timeout but track the amount of slab reclaim done by the shrinkers.
  Add a field in /proc/sys/vm/slab_defrag_limit to control the threshold.
- Display current slab_defrag_counters in /proc/zoneinfo (for a zone) and
  /proc/sys/vm/slab_defrag_count (for global reclaim).
- Add new config value slab_defrag_limit in /proc/sys/vm/slab_defrag_limit
- Add a patch that obsoletes SLAB and explains why SLOB does not support
  defrag (Either of those could be theoretically equipped to support
  slab defrag in some way but it seems that Andrew/Linus want to reduce
  the number of slab allocators).

Note that I am off till next Wednesday... my responsiveness will be limited
till then.

Slab fragmentation is mainly an issue if Linux is used as a fileserver
and large amounts of dentries, inodes and buffer heads accumulate. In some
load situations the slabs become very sparsely populated so that a lot of
memory is wasted by slabs that only contain one or a few objects. In
extreme cases the machine becomes sluggish because reclaim is running
continually. Slab defragmentation adds the capability to recover the
memory that is wasted.

Memory reclaim for the following slab caches is possible:

1. dentry cache
2. inode cache (with a generic interface to allow easy setup of more
   filesystems than the currently supported ext2/3/4, reiserfs, XFS
   and proc)
3. buffer_heads

One typical mechanism that triggers slab defragmentation on my systems
is the daily run of

	updatedb

Updatedb scans all files on the system, which causes heavy inode and dentry
use. After updatedb is complete we go back to the regular use
patterns (typical on my machine: kernel compiles), which need the memory
for different purposes. The inodes and dentries used by updatedb will
gradually be aged by the dentry/inode reclaim algorithm, which frees
the dentries and inode entries randomly throughout the slabs that were
allocated. As a result the slabs become sparsely populated. If they
become empty then they can be freed, but a lot of them will remain sparsely
populated. That is where slab defrag comes in: it removes the objects from
the slabs with just a few entries, reclaiming more memory for other uses.
In the simplest case (as provided here) this is done by simply reclaiming
the objects.

However, if the logic in the kick() function is made more
sophisticated then we will be able to move the objects out of the slabs.
If slabs are fragmented then objects can be allocated without invoking
the page allocator, because a large number of free slots is available in
the existing partial slabs. Moving an object into such a slab reduces
fragmentation in the slab the object is moved to.
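
As a rough illustration, a minimal get()/kick() pair for a hypothetical
cache of refcounted objects could look like the sketch below. struct
my_obj and all my_* names are illustrative only, not part of this
patchset:

	#include <linux/slab.h>
	#include <asm/atomic.h>

	struct my_obj {
		atomic_t refcount;
	};

	/* Atomic context, slab lock held: only take references here. */
	static void *my_get(struct kmem_cache *s, int nr, void **v)
	{
		int i;

		for (i = 0; i < nr; i++) {
			struct my_obj *o = v[i];

			/* Object already being freed? Void the entry. */
			if (!atomic_inc_not_zero(&o->refcount))
				v[i] = NULL;
		}
		return NULL;		/* Private pointer for kick() */
	}

	/* No locks held: drop the references, freeing unused objects. */
	static void my_kick(struct kmem_cache *s, int nr, void **v,
							void *private)
	{
		int i;

		for (i = 0; i < nr; i++) {
			struct my_obj *o = v[i];

			if (o && atomic_dec_and_test(&o->refcount))
				kmem_cache_free(s, o);
		}
	}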

V10->V11
- Simplify determination when to reclaim: Just scan over all partials
  and check if they are sparsely populated.
- Add support for performance counters
- Rediff on top of current slab-mm.
- Reduce frequency of scanning. A look at the stats showed that we
  were calling into reclaim very frequently when the system was under
  memory pressure which slowed things down. Various measures to
  avoid scanning the partial list too frequently were added and the
  earlier (expensive) method of determining the defrag ratio of the slab
  cache as a whole was dropped. I think this addresses the issues that
  Mel saw with V10.

V9->V10
- Rediff against upstream

V8->V9
- Rediff against 2.6.24-rc6-mm1

V7->V8
- Rediff against 2.6.24-rc3-mm2

V6->V7
- Rediff against 2.6.24-rc2-mm1
- Remove lumpy reclaim support. No point anymore given that the antifrag
  handling in 2.6.24-rc2 puts reclaimable slabs into different sections.
  Targeted reclaim never triggers. This has to wait until we make
  slabs movable or we need to perform a special version of lumpy reclaim
  in SLUB while we scan the partial lists for slabs to kick out.
  Removal simplifies handling significantly since we
  get to slabs in a more controlled way via the partial lists.
  The patchset now provides pure reduction of fragmentation levels.
- SLAB/SLOB: Provide inlines that do nothing
- Fix various smaller issues that were brought up during review of V6.

V5->V6
- Rediff against 2.6.24-rc2 + mm slub patches.
- Add reviewed by lines.
- Take out the experimental code to make slab pages movable. That
  has to wait until this has been considered by Mel.

V4->V5:
- Support lumpy reclaim for slabs
- Support reclaim via slab_shrink()
- Add constructors to ensure a consistent object state at all times.

V3->V4:
- Optimize scan for slabs that need defragmentation
- Add /sys/slab/*/defrag_ratio to allow setting defrag limits
  per slab.
- Add support for buffer heads.
- Describe how the cleanup after the daily updatedb can be
  improved by slab defragmentation.

V2->V3
- Support directory reclaim
- Add infrastructure to trigger defragmentation after slab shrinking if we
  have slabs with a high degree of fragmentation.

V1->V2
- Clean up control flow using a state variable. Simplify API. Back to 2
  functions that now take arrays of objects.
- Inode defrag support for a set of filesystems
- Fix up dentry defrag support to work on negative dentries by adding
  a new dentry flag that indicates that a dentry is not in the process
  of being freed or allocated.

-- 


* [patch 01/21] slub: Add defrag_ratio field and sysfs support.
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 02/21] slub: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Pekka Enberg, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, mpm

[-- Attachment #1: 0001-SLUB-Add-defrag_ratio-field-and-sysfs-support.patch --]
[-- Type: text/plain, Size: 2664 bytes --]

The defrag_ratio is used to set the threshold at which defragmentation
should be attempted on a slab page.

The allocation ratio is measured by the percentage of the available slots
allocated.

Add a defrag_ratio field and set it to 30% by default. A limit of 30%
specifies that slab defragmentation runs only if less than 3 out of 10
available object slots are in use.
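
For example, a slab page that holds 10 objects becomes a reclaim
candidate only while fewer than 3 of them are in use. The comparison
that the defrag core (added later in this series) performs amounts to
the following (attempt_defrag() is an illustrative placeholder):

	/* Sparsely populated slab pages are candidates for reclaim */
	if (page->inuse * 100 < s->defrag_ratio * page->objects)
		attempt_defrag(page);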

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 include/linux/slub_def.h |    7 +++++++
 mm/slub.c                |   23 +++++++++++++++++++++++
 2 files changed, 30 insertions(+)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2008-05-05 18:27:21.212659137 -0700
+++ linux-2.6/include/linux/slub_def.h	2008-05-05 18:48:42.745160406 -0700
@@ -88,6 +88,13 @@ struct kmem_cache {
 	void (*ctor)(struct kmem_cache *, void *);
 	int inuse;		/* Offset to metadata */
 	int align;		/* Alignment */
+	int defrag_ratio;	/*
+				 * Ratio used to check the percentage of
+				 * objects allocated in a slab page.
+				 * If less than this ratio is allocated
+				 * then reclaim attempts are made.
+				 */
+
 	const char *name;	/* Name (only for display!) */
 	struct list_head list;	/* List of slab caches */
 #ifdef CONFIG_SLUB_DEBUG
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-05-05 18:47:18.872659344 -0700
+++ linux-2.6/mm/slub.c	2008-05-05 18:48:42.745160406 -0700
@@ -2337,6 +2337,7 @@ static int kmem_cache_open(struct kmem_c
 		goto error;
 
 	s->refcount = 1;
+	s->defrag_ratio = 30;
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 100;
 #endif
@@ -4060,6 +4061,27 @@ static ssize_t free_calls_show(struct km
 }
 SLAB_ATTR_RO(free_calls);
 
+static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->defrag_ratio);
+}
+
+static ssize_t defrag_ratio_store(struct kmem_cache *s,
+				const char *buf, size_t length)
+{
+	unsigned long ratio;
+	int err;
+
+	err = strict_strtoul(buf, 10, &ratio);
+	if (err)
+		return err;
+
+	if (ratio < 100)
+		s->defrag_ratio = ratio;
+	return length;
+}
+SLAB_ATTR(defrag_ratio);
+
 #ifdef CONFIG_NUMA
 static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
 {
@@ -4167,6 +4189,7 @@ static struct attribute *slab_attrs[] = 
 	&shrink_attr.attr,
 	&alloc_calls_attr.attr,
 	&free_calls_attr.attr,
+	&defrag_ratio_attr.attr,
 #ifdef CONFIG_ZONE_DMA
 	&cache_dma_attr.attr,
 #endif

-- 


* [patch 02/21] slub: Replace ctor field with ops field in /sys/slab/*
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
  2008-05-10  3:08 ` [patch 01/21] slub: Add defrag_ratio field and sysfs support Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 03/21] slub: Add get() and kick() methods Christoph Lameter
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Pekka Enberg, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, mpm

[-- Attachment #1: 0002-SLUB-Replace-ctor-field-with-ops-field-in-sys-slab.patch --]
[-- Type: text/plain, Size: 1485 bytes --]

Create an ops file in /sys/slab/* to contain all the operations defined
on a slab. This will be used to display the additional operations that will
be defined soon.
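
For a cache that has a constructor, reading the new file would produce
output of the following form (cache and symbol names are illustrative):

	# cat /sys/slab/my_cache/ops
	ctor : my_cache_ctor+0x0/0x10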

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 mm/slub.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-05-05 18:48:42.745160406 -0700
+++ linux-2.6/mm/slub.c	2008-05-05 18:48:49.821409748 -0700
@@ -3832,16 +3832,18 @@ static ssize_t order_show(struct kmem_ca
 }
 SLAB_ATTR(order);
 
-static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+static ssize_t ops_show(struct kmem_cache *s, char *buf)
 {
-	if (s->ctor) {
-		int n = sprint_symbol(buf, (unsigned long)s->ctor);
+	int x = 0;
 
-		return n + sprintf(buf + n, "\n");
+	if (s->ctor) {
+		x += sprintf(buf + x, "ctor : ");
+		x += sprint_symbol(buf + x, (unsigned long)s->ctor);
+		x += sprintf(buf + x, "\n");
 	}
-	return 0;
+	return x;
 }
-SLAB_ATTR_RO(ctor);
+SLAB_ATTR_RO(ops);
 
 static ssize_t aliases_show(struct kmem_cache *s, char *buf)
 {
@@ -4174,7 +4176,7 @@ static struct attribute *slab_attrs[] = 
 	&slabs_attr.attr,
 	&partial_attr.attr,
 	&cpu_slabs_attr.attr,
-	&ctor_attr.attr,
+	&ops_attr.attr,
 	&aliases_attr.attr,
 	&align_attr.attr,
 	&sanity_checks_attr.attr,

-- 


* [patch 03/21] slub: Add get() and kick() methods
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
  2008-05-10  3:08 ` [patch 01/21] slub: Add defrag_ratio field and sysfs support Christoph Lameter
  2008-05-10  3:08 ` [patch 02/21] slub: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 04/21] slub: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Pekka Enberg, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, mpm

[-- Attachment #1: 0003-SLUB-Add-get-and-kick-methods.patch --]
[-- Type: text/plain, Size: 5573 bytes --]

Add the two methods needed for defragmentation and add the display of the
methods via the sysfs ops file.

Add documentation explaining the use of these methods and add the prototypes
to slab.h. Add functions to set up the defrag methods for a slab cache.

Add empty functions for SLAB/SLOB. The API is generic so it
could be theoretically implemented for either allocator.
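
A minimal sketch of the registration for a hypothetical cache (my_*
names are illustrative; my_get()/my_kick() as in the cover letter
sketch; note that kmem_cache_setup_defrag() BUGs if the cache has no
constructor):

	#include <linux/init.h>
	#include <linux/slab.h>
	#include <asm/atomic.h>

	struct my_obj {
		atomic_t refcount;
	};

	static struct kmem_cache *my_cache;

	/* Freshly constructed objects carry no references yet */
	static void my_ctor(struct kmem_cache *s, void *p)
	{
		struct my_obj *o = p;

		atomic_set(&o->refcount, 0);
	}

	static int __init my_init(void)
	{
		my_cache = kmem_cache_create("my_cache",
				sizeof(struct my_obj), 0, 0, my_ctor);
		if (!my_cache)
			return -ENOMEM;
		kmem_cache_setup_defrag(my_cache, my_get, my_kick);
		return 0;
	}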

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 include/linux/slab.h     |   50 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/slub_def.h |    3 ++
 mm/slub.c                |   29 ++++++++++++++++++++++++++-
 3 files changed, 81 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2008-05-09 18:44:10.000000000 -0700
+++ linux-2.6/include/linux/slub_def.h	2008-05-09 18:44:13.000000000 -0700
@@ -86,6 +86,9 @@ struct kmem_cache {
 	gfp_t allocflags;	/* gfp flags to use on each alloc */
 	int refcount;		/* Refcount for slab cache destroy */
 	void (*ctor)(struct kmem_cache *, void *);
+	kmem_defrag_get_func *get;
+	kmem_defrag_kick_func *kick;
+
 	int inuse;		/* Offset to metadata */
 	int align;		/* Alignment */
 	int defrag_ratio;	/*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-05-09 18:44:11.000000000 -0700
+++ linux-2.6/mm/slub.c	2008-05-09 18:44:13.000000000 -0700
@@ -2772,6 +2772,19 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+void kmem_cache_setup_defrag(struct kmem_cache *s,
+	kmem_defrag_get_func get, kmem_defrag_kick_func kick)
+{
+	/*
+	 * Defragmentable slabs must have a ctor otherwise objects may be
+	 * in an undetermined state after they are allocated.
+	 */
+	BUG_ON(!s->ctor);
+	s->get = get;
+	s->kick = kick;
+}
+EXPORT_SYMBOL(kmem_cache_setup_defrag);
+
 /*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
@@ -3057,7 +3070,7 @@ static int slab_unmergeable(struct kmem_
 	if (slub_nomerge || (s->flags & SLUB_NEVER_MERGE))
 		return 1;
 
-	if (s->ctor)
+	if (s->ctor || s->kick || s->get)
 		return 1;
 
 	/*
@@ -3841,6 +3854,20 @@ static ssize_t ops_show(struct kmem_cach
 		x += sprint_symbol(buf + x, (unsigned long)s->ctor);
 		x += sprintf(buf + x, "\n");
 	}
+
+	if (s->get) {
+		x += sprintf(buf + x, "get : ");
+		x += sprint_symbol(buf + x,
+				(unsigned long)s->get);
+		x += sprintf(buf + x, "\n");
+	}
+
+	if (s->kick) {
+		x += sprintf(buf + x, "kick : ");
+		x += sprint_symbol(buf + x,
+				(unsigned long)s->kick);
+		x += sprintf(buf + x, "\n");
+	}
 	return x;
 }
 SLAB_ATTR_RO(ops);
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h	2008-05-09 18:41:42.000000000 -0700
+++ linux-2.6/include/linux/slab.h	2008-05-09 18:44:30.000000000 -0700
@@ -101,6 +101,56 @@ void kfree(const void *);
 size_t ksize(const void *);
 
 /*
+ * Function prototypes passed to kmem_cache_defrag() to enable defragmentation
+ * and targeted reclaim in slab caches.
+ */
+
+/*
+ * kmem_defrag_get_func() is called with locks held so that the slab
+ * objects cannot be freed. We are in an atomic context and no slab
+ * operations may be performed. The purpose of kmem_defrag_get_func()
+ * is to obtain a stable refcount on the objects, so that they cannot be
+ * removed until kmem_defrag_kick_func() has handled them.
+ *
+ * Parameters passed are the number of objects to process and an array of
+ * pointers to objects for which we need references.
+ *
+ * Returns a pointer that is passed to the kick function. If any objects
+ * cannot be moved then the pointer may indicate a failure and
+ * then kick() can simply drop the references that were already obtained.
+ *
+ * The object pointer array passed is also passed to kmem_defrag_kick_func().
+ * The function may remove objects from the array by setting pointers to
+ * NULL. This is useful if we can determine that an object is already about
+ * to be removed. In that case it is often impossible to obtain the necessary
+ * refcount.
+ */
+typedef void *kmem_defrag_get_func(struct kmem_cache *, int, void **);
+
+/*
+ * kmem_defrag_kick_func() is called with no locks held and interrupts
+ * enabled. Sleeping is possible. Any operation may be performed in kick().
+ * kick() should free all the objects in the pointer array.
+ *
+ * Parameters passed are the number of objects in the array, the array of
+ * pointers to the objects and the pointer returned by kmem_defrag_get_func().
+ *
+ * Success is checked by examining the number of remaining objects in the slab.
+ */
+typedef void kmem_defrag_kick_func(struct kmem_cache *, int, void **, void *);
+
+/*
+ * kmem_cache_setup_defrag() is used to setup callbacks for a slab cache.
+ */
+#ifdef CONFIG_SLUB
+void kmem_cache_setup_defrag(struct kmem_cache *, kmem_defrag_get_func,
+						kmem_defrag_kick_func);
+#else
+static inline void kmem_cache_setup_defrag(struct kmem_cache *s,
+	kmem_defrag_get_func get, kmem_defrag_kick_func kick) {}
+#endif
+
+/*
  * Allocator specific definitions. These are mainly used to establish optimized
  * ways to convert kmalloc() calls to kmem_cache_alloc() invocations by
  * selecting the appropriate general cache at compile time.

-- 


* [patch 04/21] slub: Sort slab cache list and establish maximum objects for defrag slabs
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (2 preceding siblings ...)
  2008-05-10  3:08 ` [patch 03/21] slub: Add get() and kick() methods Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 05/21] slub: Slab defrag core Christoph Lameter
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Pekka Enberg, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, mpm

[-- Attachment #1: 0004-SLUB-Sort-slab-cache-list-and-establish-maximum-obj.patch --]
[-- Type: text/plain, Size: 2727 bytes --]

When defragmenting slabs it is advantageous to have all
defragmentable slab caches together at the beginning of the list so that
there is no need to scan the complete list. Put defragmentable caches first
when adding a slab cache and the others last.

Determine the maximum number of objects in defragmentable slabs. This allows
us to later size the allocation of arrays holding refs to these objects.
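
For example, if the largest defragmentable cache holds 512 objects per
slab then on a 64-bit machine the scratch allocation done later for a
defrag pass is 512 * 8 bytes for the pointer array plus
BITS_TO_LONGS(512) * 8 = 64 bytes for the bitmap.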

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 mm/slub.c |   26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-05-05 18:48:53.261409952 -0700
+++ linux-2.6/mm/slub.c	2008-05-05 18:48:56.662660182 -0700
@@ -205,6 +205,9 @@ static enum {
 static DECLARE_RWSEM(slub_lock);
 static LIST_HEAD(slab_caches);
 
+/* Maximum objects in defragmentable slabs */
+static unsigned int max_defrag_slab_objects;
+
 /*
  * Tracking user of a slab.
  */
@@ -2544,7 +2547,7 @@ static struct kmem_cache *create_kmalloc
 								flags, NULL))
 		goto panic;
 
-	list_add(&s->list, &slab_caches);
+	list_add_tail(&s->list, &slab_caches);
 	up_write(&slub_lock);
 	if (sysfs_slab_add(s))
 		goto panic;
@@ -2772,9 +2775,23 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+/*
+ * Allocate a slab scratch space that is sufficient to keep at least
+ * max_defrag_slab_objects pointers to individual objects and also a bitmap
+ * for max_defrag_slab_objects.
+ */
+static inline void *alloc_scratch(void)
+{
+	return kmalloc(max_defrag_slab_objects * sizeof(void *) +
+		BITS_TO_LONGS(max_defrag_slab_objects) * sizeof(unsigned long),
+		GFP_KERNEL);
+}
+
 void kmem_cache_setup_defrag(struct kmem_cache *s,
 	kmem_defrag_get_func get, kmem_defrag_kick_func kick)
 {
+	int max_objects = oo_objects(s->max);
+
 	/*
 	 * Defragmentable slabs must have a ctor otherwise objects may be
 	 * in an undetermined state after they are allocated.
@@ -2782,6 +2799,11 @@ void kmem_cache_setup_defrag(struct kmem
 	BUG_ON(!s->ctor);
 	s->get = get;
 	s->kick = kick;
+	down_write(&slub_lock);
+	list_move(&s->list, &slab_caches);
+	if (max_objects > max_defrag_slab_objects)
+		max_defrag_slab_objects = max_objects;
+	up_write(&slub_lock);
 }
 EXPORT_SYMBOL(kmem_cache_setup_defrag);
 
@@ -3160,7 +3182,7 @@ struct kmem_cache *kmem_cache_create(con
 	if (s) {
 		if (kmem_cache_open(s, GFP_KERNEL, name,
 				size, align, flags, ctor)) {
-			list_add(&s->list, &slab_caches);
+			list_add_tail(&s->list, &slab_caches);
 			up_write(&slub_lock);
 			if (sysfs_slab_add(s))
 				goto err;

-- 


* [patch 05/21] slub: Slab defrag core
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (3 preceding siblings ...)
  2008-05-10  3:08 ` [patch 04/21] slub: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 06/21] slub: Add KICKABLE to avoid repeated kick() attempts Christoph Lameter
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Pekka Enberg, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, mpm

[-- Attachment #1: 0005-SLUB-Slab-defrag-core.patch --]
[-- Type: text/plain, Size: 13040 bytes --]

Slab defragmentation may occur:

1. Unconditionally when kmem_cache_shrink() is called on a slab cache
   by the kernel.

2. Through the use of the slabinfo command.

3. Per node defrag conditionally when kmem_cache_defrag(<node>) is called
   (can be called from reclaim code with a later patch).

   Defragmentation is only performed if the fragmentation of the slab
   is lower than the specified percentage. Fragmentation ratios are measured
   by calculating the percentage of objects in use compared to the total
   number of objects that the slab page can accommodate.

   The scanning of slab caches is optimized because the
   defragmentable slabs come first on the list. Thus we can terminate scans
   on the first slab encountered that does not support defragmentation.

   kmem_cache_defrag() takes a node parameter. This can either be -1 if
   defragmentation should be performed on all nodes, or a node number.

A couple of functions must be setup via a call to kmem_cache_setup_defrag()
in order for a slabcache to support defragmentation. These are

kmem_defrag_get_func (void *get(struct kmem_cache *s, int nr, void **objects))

	Must obtain a reference to the listed objects. SLUB guarantees that
	the objects are still allocated. However, other threads may be blocked
	in slab_free() attempting to free objects in the slab. These may succeed
	as soon as get() returns to the slab allocator. The function must
	be able to detect such situations and void the attempts to free such
	objects (by for example voiding the corresponding entry in the objects
	array).

	No slab operations may be performed in get(). Interrupts
	are disabled. What can be done is very limited. The slab lock
	for the page that contains the object is taken. Any attempt to perform
	a slab operation may lead to a deadlock.

	kmem_defrag_get_func returns a private pointer that is passed to
	kmem_defrag_kick_func(). Should we be unable to obtain all references
	then that pointer may indicate to the kick() function that it should
	not attempt any object removal or move but simply remove the
	reference counts.

kmem_defrag_kick_func (void kick(struct kmem_cache *, int nr, void **objects,
							void *get_result))

	After SLUB has established references to the objects in a
	slab it will then drop all locks and use kick() to move objects out
	of the slab. The existence of the object is guaranteed by virtue of
	the earlier obtained references via kmem_defrag_get_func(). The
	callback may perform any slab operation since no locks are held at
	the time of call.

	The callback should remove the object from the slab in some way. This
	may be accomplished by reclaiming the object and then running
	kmem_cache_free() or reallocating it and then running
	kmem_cache_free(). Reallocation is advantageous because the partial
	slabs were just sorted to put those with the most objects first.
	Reallocation is likely to fill up a slab in addition to freeing up
	one slab. A filled up slab can also be removed from the partial
	list. So there could be a double effect (see the sketch after this
	description).

	kmem_defrag_kick_func() does not return a result. SLUB will check
	the number of remaining objects in the slab. If all objects were
	removed then the operation was successful.
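
A sketch of the reallocation variant of kick(), building on the
hypothetical my_obj cache from the cover letter; my_relocate() is an
illustrative stand-in for whatever switches all users over to the new
copy (returning 0 on success) and is not a real function:

	/* No locks held: we may sleep and call into the slab allocator. */
	static void my_kick(struct kmem_cache *s, int nr, void **v,
							void *private)
	{
		int i;

		for (i = 0; i < nr; i++) {
			struct my_obj *old = v[i], *new;

			if (!old)
				continue;	/* get() voided this entry */

			/* New object comes from a well-filled partial slab */
			new = kmem_cache_alloc(s, GFP_KERNEL);
			if (new && my_relocate(old, new))
				/* Move failed: keep old, discard new */
				kmem_cache_free(s, new);

			/* Drop the get() reference; frees old if unused */
			if (atomic_dec_and_test(&old->refcount))
				kmem_cache_free(s, old);
		}
	}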

[penberg@cs.helsinki.fi: fix up locking in __kmem_cache_shrink()]
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 include/linux/slab.h |    3 
 mm/slub.c            |  265 ++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 215 insertions(+), 53 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-05-09 19:22:35.536024323 -0700
+++ linux-2.6/mm/slub.c	2008-05-09 19:59:51.444558176 -0700
@@ -159,10 +159,10 @@ static inline void ClearSlabDebug(struct
 
 /*
  * Maximum number of desirable partial slabs.
- * The existence of more partial slabs makes kmem_cache_shrink
- * sort the partial list by the number of objects in the.
+ * More slabs cause kmem_cache_shrink to sort the slabs by objects
+ * and trigger slab defragmentation.
  */
-#define MAX_PARTIAL 10
+#define MAX_PARTIAL 20
 
 #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
 				SLAB_POISON | SLAB_STORE_USER)
@@ -2808,76 +2808,235 @@ void kmem_cache_setup_defrag(struct kmem
 EXPORT_SYMBOL(kmem_cache_setup_defrag);
 
 /*
- * kmem_cache_shrink removes empty slabs from the partial lists and sorts
- * the remaining slabs by the number of items in use. The slabs with the
- * most items in use come first. New allocations will then fill those up
- * and thus they can be removed from the partial lists.
+ * Vacate all objects in the given slab.
  *
- * The slabs with the least items are placed last. This results in them
- * being allocated from last increasing the chance that the last objects
- * are freed in them.
+ * The scratch area passed to this function is sufficient to hold
+ * struct list_head times objects per slab. We use it to hold void ** times
+ * objects per slab plus a bitmap for each object.
  */
-int kmem_cache_shrink(struct kmem_cache *s)
+static int kmem_cache_vacate(struct page *page, void *scratch)
 {
-	int node;
-	int i;
-	struct kmem_cache_node *n;
-	struct page *page;
-	struct page *t;
-	int objects = oo_objects(s->max);
-	struct list_head *slabs_by_inuse =
-		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
+	void **vector = scratch;
+	void *p;
+	void *addr = page_address(page);
+	struct kmem_cache *s;
+	unsigned long *map;
+	int leftover;
+	int count;
+	void *private;
 	unsigned long flags;
+	unsigned long objects;
 
-	if (!slabs_by_inuse)
-		return -ENOMEM;
+	local_irq_save(flags);
+	slab_lock(page);
 
-	flush_all(s);
-	for_each_node_state(node, N_NORMAL_MEMORY) {
-		n = get_node(s, node);
+	BUG_ON(!PageSlab(page));	/* Must be a slab page */
+	BUG_ON(!SlabFrozen(page));	/* Slab must have been frozen earlier */
+
+	s = page->slab;
+	objects = page->objects;
+	map = scratch + objects * sizeof(void **);
+	if (!page->inuse || !s->kick)
+		goto out;
+
+	/* Determine used objects */
+	bitmap_fill(map, objects);
+	for_each_free_object(p, s, page->freelist)
+		__clear_bit(slab_index(p, s, addr), map);
+
+	/* Build vector of pointers to objects */
+	count = 0;
+	memset(vector, 0, objects * sizeof(void **));
+	for_each_object(p, s, addr, objects)
+		if (test_bit(slab_index(p, s, addr), map))
+			vector[count++] = p;
+
+	private = s->get(s, count, vector);
+
+	/*
+	 * Got references. Now we can drop the slab lock. The slab
+	 * is frozen so it cannot vanish from under us nor will
+	 * allocations be performed on the slab. However, unlocking the
+	 * slab will allow concurrent slab_frees to proceed.
+	 */
+	slab_unlock(page);
+	local_irq_restore(flags);
+
+	/*
+	 * Perform the KICK callbacks to remove the objects.
+	 */
+	s->kick(s, count, vector, private);
+
+	local_irq_save(flags);
+	slab_lock(page);
+out:
+	/*
+	 * Check the result and unfreeze the slab
+	 */
+	leftover = page->inuse;
+	unfreeze_slab(s, page, leftover > 0);
+	local_irq_restore(flags);
+	return leftover;
+}
+
+/*
+ * Remove objects from a list of slab pages that have been gathered.
+ * Must be called with slabs that have been isolated before.
+ *
+ * kmem_cache_reclaim() is never called from an atomic context. It
+ * allocates memory for temporary storage. We are holding the
+ * slub_lock semaphore which prevents another call into
+ * the defrag logic.
+ */
+int kmem_cache_reclaim(struct list_head *zaplist)
+{
+	int freed = 0;
+	void **scratch;
+	struct page *page;
+	struct page *page2;
+
+	if (list_empty(zaplist))
+		return 0;
+
+	scratch = alloc_scratch();
+	if (!scratch)
+		return 0;
+
+	list_for_each_entry_safe(page, page2, zaplist, lru) {
+		list_del(&page->lru);
+		if (kmem_cache_vacate(page, scratch) == 0)
+			freed++;
+	}
+	kfree(scratch);
+	return freed;
+}
+
+/*
+ * Shrink the slab cache on a particular node of the cache
+ * by releasing slabs with zero objects and trying to reclaim
+ * slabs with less than the configured percentage of objects allocated.
+ */
+static unsigned long __kmem_cache_shrink(struct kmem_cache *s, int node,
+							unsigned long limit)
+{
+	unsigned long flags;
+	struct page *page, *page2;
+	LIST_HEAD(zaplist);
+	int freed = 0;
+	struct kmem_cache_node *n = get_node(s, node);
 
-		if (!n->nr_partial)
+	if (n->nr_partial <= limit)
+		return 0;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	list_for_each_entry_safe(page, page2, &n->partial, lru) {
+		if (!slab_trylock(page))
+			/* Busy slab. Get out of the way */
 			continue;
 
-		for (i = 0; i < objects; i++)
-			INIT_LIST_HEAD(slabs_by_inuse + i);
+		if (page->inuse) {
+			if (page->inuse * 100 >=
+					s->defrag_ratio * page->objects) {
+				slab_unlock(page);
+				/* Slab contains enough objects */
+				continue;
+			}
 
-		spin_lock_irqsave(&n->list_lock, flags);
+			list_move(&page->lru, &zaplist);
+			if (s->kick) {
+				n->nr_partial--;
+				SetSlabFrozen(page);
+			}
+			slab_unlock(page);
+		} else {
+			/* Empty slab page */
+			list_del(&page->lru);
+			n->nr_partial--;
+			slab_unlock(page);
+			discard_slab(s, page);
+			freed++;
+		}
+	}
 
+	if (!s->kick)
 		/*
-		 * Build lists indexed by the items in use in each slab.
+		 * No defrag methods. By simply putting the zaplist at the
+		 * end of the partial list we can let them simmer longer
+		 * and thus increase the chance of all objects being
+		 * reclaimed.
 		 *
-		 * Note that concurrent frees may occur while we hold the
-		 * list_lock. page->inuse here is the upper limit.
+		 * We have effectively sorted the partial list and put
+		 * the slabs with more objects first. As soon as they
+		 * are allocated they are going to be removed from the
+		 * partial list.
 		 */
-		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			if (!page->inuse && slab_trylock(page)) {
-				/*
-				 * Must hold slab lock here because slab_free
-				 * may have freed the last object and be
-				 * waiting to release the slab.
-				 */
-				list_del(&page->lru);
-				n->nr_partial--;
-				slab_unlock(page);
-				discard_slab(s, page);
-			} else {
-				list_move(&page->lru,
-				slabs_by_inuse + page->inuse);
-			}
-		}
+		list_splice(&zaplist, n->partial.prev);
+
+
+	spin_unlock_irqrestore(&n->list_lock, flags);
+
+	if (s->kick)
+		freed += kmem_cache_reclaim(&zaplist);
+
+	return freed;
+}
+
+/*
+ * Defrag slabs conditional on the amount of fragmentation in a page.
+ */
+int kmem_cache_defrag(int node)
+{
+	struct kmem_cache *s;
+	unsigned long slabs = 0;
+
+	/*
+	 * kmem_cache_defrag may be called from the reclaim path which may be
+	 * called for any page allocator alloc. So there is the danger that we
+	 * get called in a situation where slub already acquired the slub_lock
+	 * for other purposes.
+	 */
+	if (!down_read_trylock(&slub_lock))
+		return 0;
+
+	list_for_each_entry(s, &slab_caches, list) {
+		unsigned long reclaimed;
 
 		/*
-		 * Rebuild the partial list with the slabs filled up most
-		 * first and the least used slabs at the end.
+		 * Defragmentable caches come first. If the slab cache is not
+		 * defragmentable then we can stop traversing the list.
 		 */
-		for (i = objects - 1; i >= 0; i--)
-			list_splice(slabs_by_inuse + i, n->partial.prev);
+		if (!s->kick)
+			break;
 
-		spin_unlock_irqrestore(&n->list_lock, flags);
+		if (node == -1) {
+			int nid;
+
+			for_each_node_state(nid, N_NORMAL_MEMORY)
+				reclaimed = __kmem_cache_shrink(s, nid,
+								MAX_PARTIAL);
+		} else
+			reclaimed = __kmem_cache_shrink(s, node, MAX_PARTIAL);
+
+		slabs += reclaimed;
 	}
+	up_read(&slub_lock);
+	return slabs;
+}
+EXPORT_SYMBOL(kmem_cache_defrag);
+
+/*
+ * kmem_cache_shrink removes empty slabs from the partial lists.
+ * If the slab cache supports defragmentation then objects are
+ * reclaimed.
+ */
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+	int node;
+
+	flush_all(s);
+	for_each_node_state(node, N_NORMAL_MEMORY)
+		__kmem_cache_shrink(s, node, 0);
 
-	kfree(slabs_by_inuse);
 	return 0;
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h	2008-05-09 19:22:35.520024066 -0700
+++ linux-2.6/include/linux/slab.h	2008-05-09 19:57:22.946115704 -0700
@@ -141,13 +141,16 @@ typedef void kmem_defrag_kick_func(struc
 
 /*
  * kmem_cache_setup_defrag() is used to setup callbacks for a slab cache.
+ * kmem_cache_defrag() performs the actual defragmentation.
  */
 #ifdef CONFIG_SLUB
 void kmem_cache_setup_defrag(struct kmem_cache *, kmem_defrag_get_func,
 						kmem_defrag_kick_func);
+int kmem_cache_defrag(int node);
 #else
 static inline void kmem_cache_setup_defrag(struct kmem_cache *s,
 	kmem_defrag_get_func get, kmem_defrag_kick_func kiok) {}
+static inline int kmem_cache_defrag(int node) { return 0; }
 #endif
 
 /*

-- 


* [patch 06/21] slub: Add KICKABLE to avoid repeated kick() attempts
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (4 preceding siblings ...)
  2008-05-10  3:08 ` [patch 05/21] slub: Slab defrag core Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 07/21] slub: Extend slabinfo to support -D and -F options Christoph Lameter
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Pekka Enberg, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, mpm

[-- Attachment #1: 0006-SLUB-Add-KICKABLE-to-avoid-repeated-kick-attempts.patch --]
[-- Type: text/plain, Size: 3168 bytes --]

Add a KICKABLE flag to be set on slabs with a defragmentation method.

Clear the flag if a kick action is not successful in reducing the
number of objects in a slab. This avoids future attempts to kick
objects out.

The KICKABLE flag is set again when all objects of the slab have been
allocated (this occurs during removal of a slab from the partial lists).

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 mm/slub.c |   35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-05-05 18:49:00.043910468 -0700
+++ linux-2.6/mm/slub.c	2008-05-05 18:49:10.851410275 -0700
@@ -103,6 +103,7 @@
  */
 
 #define FROZEN (1 << PG_active)
+#define KICKABLE (1 << PG_dirty)
 
 #ifdef CONFIG_SLUB_DEBUG
 #define SLABDEBUG (1 << PG_error)
@@ -140,6 +141,21 @@ static inline void ClearSlabDebug(struct
 	page->flags &= ~SLABDEBUG;
 }
 
+static inline int SlabKickable(struct page *page)
+{
+	return page->flags & KICKABLE;
+}
+
+static inline void SetSlabKickable(struct page *page)
+{
+	page->flags |= KICKABLE;
+}
+
+static inline void ClearSlabKickable(struct page *page)
+{
+	page->flags &= ~KICKABLE;
+}
+
 /*
  * Issues still to be resolved:
  *
@@ -1163,6 +1179,9 @@ static struct page *new_slab(struct kmem
 			SLAB_STORE_USER | SLAB_TRACE))
 		SetSlabDebug(page);
 
+	if (s->kick)
+		SetSlabKickable(page);
+
 	start = page_address(page);
 
 	if (unlikely(s->flags & SLAB_POISON))
@@ -1203,6 +1222,7 @@ static void __free_slab(struct kmem_cach
 		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
 		-pages);
 
+	ClearSlabKickable(page);
 	__ClearPageSlab(page);
 	reset_page_mapcount(page);
 	__free_pages(page, order);
@@ -1412,6 +1432,8 @@ static void unfreeze_slab(struct kmem_ca
 			stat(c, DEACTIVATE_FULL);
 			if (SlabDebug(page) && (s->flags & SLAB_STORE_USER))
 				add_full(n, page);
+			if (s->kick)
+				SetSlabKickable(page);
 		}
 		slab_unlock(page);
 	} else {
@@ -2836,7 +2858,7 @@ static int kmem_cache_vacate(struct page
 	s = page->slab;
 	objects = page->objects;
 	map = scratch + objects * sizeof(void **);
-	if (!page->inuse || !s->kick)
+	if (!page->inuse || !s->kick || !SlabKickable(page))
 		goto out;
 
 	/* Determine used objects */
@@ -2874,6 +2896,9 @@ out:
 	 * Check the result and unfreeze the slab
 	 */
 	leftover = page->inuse;
+	if (leftover)
+		/* Unsuccessful reclaim. Avoid future reclaim attempts. */
+		ClearSlabKickable(page);
 	unfreeze_slab(s, page, leftover > 0);
 	local_irq_restore(flags);
 	return leftover;
@@ -2930,10 +2955,14 @@ static unsigned long __kmem_cache_shrink
 			continue;
 
 		if (page->inuse) {
-			if (page->inuse * 100 >=
+			if (!SlabKickable(page) || page->inuse * 100 >=
 					s->defrag_ratio * page->objects) {
 				slab_unlock(page);
-				/* Slab contains enough objects */
+				/*
+				 * Slab contains enough objects
+				 * or we already tried reclaim before and
+				 * it failed. Skip this one.
+				 */
 				continue;
 			}
 

-- 


* [patch 07/21] slub: Extend slabinfo to support -D and -F options
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (5 preceding siblings ...)
  2008-05-10  3:08 ` [patch 06/21] slub: Add KICKABLE to avoid repeated kick() attempts Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 08/21] slub: add defrag statistics Christoph Lameter
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Pekka Enberg, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, mpm

[-- Attachment #1: 0007-SLUB-Extend-slabinfo-to-support-D-and-F-options.patch --]
[-- Type: text/plain, Size: 6085 bytes --]

-F lists caches that support defragmentation.

-C lists caches that use a ctor.

Change field names for defrag_ratio and remote_node_defrag_ratio.

Add determination of the allocation ratio for a slab. The allocation ratio
is the percentage of the available object slots that are in use.
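
With these options the defragmentation setup can be inspected from
userspace, e.g.:

	slabinfo -F		# show only defragmentable caches
	slabinfo -C		# show only caches with a constructor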

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 Documentation/vm/slabinfo.c |   48 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 43 insertions(+), 5 deletions(-)

Index: linux-2.6/Documentation/vm/slabinfo.c
===================================================================
--- linux-2.6.orig/Documentation/vm/slabinfo.c	2008-04-28 21:22:12.919899273 -0700
+++ linux-2.6/Documentation/vm/slabinfo.c	2008-04-28 21:24:22.889899117 -0700
@@ -31,6 +31,8 @@ struct slabinfo {
 	int hwcache_align, object_size, objs_per_slab;
 	int sanity_checks, slab_size, store_user, trace;
 	int order, poison, reclaim_account, red_zone;
+	int defrag, ctor;
+	int defrag_ratio, remote_node_defrag_ratio;
 	unsigned long partial, objects, slabs, objects_partial, objects_total;
 	unsigned long alloc_fastpath, alloc_slowpath;
 	unsigned long free_fastpath, free_slowpath;
@@ -64,6 +66,8 @@ int show_slab = 0;
 int skip_zero = 1;
 int show_numa = 0;
 int show_track = 0;
+int show_defrag = 0;
+int show_ctor = 0;
 int show_first_alias = 0;
 int validate = 0;
 int shrink = 0;
@@ -100,13 +104,15 @@ void fatal(const char *x, ...)
 void usage(void)
 {
 	printf("slabinfo 5/7/2007. (c) 2007 sgi. clameter@sgi.com\n\n"
-		"slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+		"slabinfo [-aCdDefFhnpvtsz] [-d debugopts] [slab-regexp]\n"
 		"-a|--aliases           Show aliases\n"
 		"-A|--activity          Most active slabs first\n"
+		"-C|--ctor              Show slabs with ctors\n"
 		"-d<options>|--debug=<options> Set/Clear Debug options\n"
 		"-D|--display-active    Switch line format to activity\n"
 		"-e|--empty             Show empty slabs\n"
 		"-f|--first-alias       Show first alias\n"
+		"-F|--defrag            Show defragmentable caches\n"
 		"-h|--help              Show usage information\n"
 		"-i|--inverted          Inverted list\n"
 		"-l|--slabs             Show slabs\n"
@@ -296,7 +302,7 @@ void first_line(void)
 		printf("Name                   Objects      Alloc       Free   %%Fast Fallb O\n");
 	else
 		printf("Name                   Objects Objsize    Space "
-			"Slabs/Part/Cpu  O/S O %%Fr %%Ef Flg\n");
+			"Slabs/Part/Cpu  O/S O %%Ra %%Ef Flg\n");
 }
 
 /*
@@ -345,7 +351,7 @@ void slab_numa(struct slabinfo *s, int m
 		return;
 
 	if (!line) {
-		printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+		printf("\n%-21s: Rto ", mode ? "NUMA nodes" : "Slab");
 		for(node = 0; node <= highest_node; node++)
 			printf(" %4d", node);
 		printf("\n----------------------");
@@ -354,6 +360,7 @@ void slab_numa(struct slabinfo *s, int m
 		printf("\n");
 	}
 	printf("%-21s ", mode ? "All slabs" : s->name);
+	printf("%3d ", s->remote_node_defrag_ratio);
 	for(node = 0; node <= highest_node; node++) {
 		char b[20];
 
@@ -492,6 +499,8 @@ void report(struct slabinfo *s)
 		printf("** Slabs are destroyed via RCU\n");
 	if (s->reclaim_account)
 		printf("** Reclaim accounting active\n");
+	if (s->defrag)
+		printf("** Defragmentation at %d%%\n", s->defrag_ratio);
 
 	printf("\nSizes (bytes)     Slabs              Debug                Memory\n");
 	printf("------------------------------------------------------------------------\n");
@@ -539,6 +548,12 @@ void slabcache(struct slabinfo *s)
 	if (show_empty && s->slabs)
 		return;
 
+	if (show_defrag && !s->defrag)
+		return;
+
+	if (show_ctor && !s->ctor)
+		return;
+
 	store_size(size_str, slab_size(s));
 	snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs - s->cpu_slabs,
 						s->partial, s->cpu_slabs);
@@ -550,6 +565,10 @@ void slabcache(struct slabinfo *s)
 		*p++ = '*';
 	if (s->cache_dma)
 		*p++ = 'd';
+	if (s->defrag)
+		*p++ = 'F';
+	if (s->ctor)
+		*p++ = 'C';
 	if (s->hwcache_align)
 		*p++ = 'A';
 	if (s->poison)
@@ -584,7 +603,8 @@ void slabcache(struct slabinfo *s)
 		printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
 			s->name, s->objects, s->object_size, size_str, dist_str,
 			s->objs_per_slab, s->order,
-			s->slabs ? (s->partial * 100) / s->slabs : 100,
+			s->slabs ? (s->partial * 100) /
+					(s->slabs * s->objs_per_slab) : 100,
 			s->slabs ? (s->objects * s->object_size * 100) /
 				(s->slabs * (page_size << s->order)) : 100,
 			flags);
@@ -1190,7 +1210,17 @@ void read_slab_dir(void)
 			slab->deactivate_to_tail = get_obj("deactivate_to_tail");
 			slab->deactivate_remote_frees = get_obj("deactivate_remote_frees");
 			slab->order_fallback = get_obj("order_fallback");
+			slab->defrag_ratio = get_obj("defrag_ratio");
+			slab->remote_node_defrag_ratio =
+					get_obj("remote_node_defrag_ratio");
 			chdir("..");
+			if (read_slab_obj(slab, "ops")) {
+				if (strstr(buffer, "ctor :"))
+					slab->ctor = 1;
+				if (strstr(buffer, "kick :"))
+					slab->defrag = 1;
+			}
+
 			if (slab->name[0] == ':')
 				alias_targets++;
 			slab++;
@@ -1241,10 +1271,12 @@ void output_slabs(void)
 struct option opts[] = {
 	{ "aliases", 0, NULL, 'a' },
 	{ "activity", 0, NULL, 'A' },
+	{ "ctor", 0, NULL, 'C' },
 	{ "debug", 2, NULL, 'd' },
 	{ "display-activity", 0, NULL, 'D' },
 	{ "empty", 0, NULL, 'e' },
 	{ "first-alias", 0, NULL, 'f' },
+	{ "defrag", 0, NULL, 'F' },
 	{ "help", 0, NULL, 'h' },
 	{ "inverted", 0, NULL, 'i'},
 	{ "numa", 0, NULL, 'n' },
@@ -1267,7 +1299,7 @@ int main(int argc, char *argv[])
 
 	page_size = getpagesize();
 
-	while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTS",
+	while ((c = getopt_long(argc, argv, "aACd::DefFhil1noprstvzTS",
 						opts, NULL)) != -1)
 		switch (c) {
 		case '1':
@@ -1323,6 +1355,12 @@ int main(int argc, char *argv[])
 		case 'z':
 			skip_zero = 0;
 			break;
+		case 'C':
+			show_ctor = 1;
+			break;
+		case 'F':
+			show_defrag = 1;
+			break;
 		case 'T':
 			show_totals = 1;
 			break;

-- 


* [patch 08/21] slub: add defrag statistics
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (6 preceding siblings ...)
  2008-05-10  3:08 ` [patch 07/21] slub: Extend slabinfo to support -D and -F options Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 09/21] slub: Trigger defragmentation from memory reclaim Christoph Lameter
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Pekka Enberg, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, mpm

[-- Attachment #1: 0008-slub-add-defrag-statistics.patch --]
[-- Type: text/plain, Size: 9685 bytes --]

Add statistics counters for slab defragmentation.
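
The counters can be displayed per cache with the -G option that this
patch adds to the slabinfo tool:

	slabinfo -G		# display defrag counters per cache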

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 Documentation/vm/slabinfo.c |   45 ++++++++++++++++++++++++++++++++++++--------
 include/linux/slub_def.h    |    6 +++++
 mm/slub.c                   |   29 ++++++++++++++++++++++++++--
 3 files changed, 70 insertions(+), 10 deletions(-)

Index: linux-2.6/Documentation/vm/slabinfo.c
===================================================================
--- linux-2.6.orig/Documentation/vm/slabinfo.c	2008-05-07 21:23:04.902660104 -0700
+++ linux-2.6/Documentation/vm/slabinfo.c	2008-05-07 21:23:05.432658246 -0700
@@ -41,6 +41,9 @@ struct slabinfo {
 	unsigned long cpuslab_flush, deactivate_full, deactivate_empty;
 	unsigned long deactivate_to_head, deactivate_to_tail;
 	unsigned long deactivate_remote_frees, order_fallback;
+	unsigned long shrink_calls, shrink_attempt_defrag, shrink_empty_slab;
+	unsigned long shrink_slab_skipped, shrink_slab_reclaimed;
+	unsigned long shrink_object_reclaim_failed;
 	int numa[MAX_NODES];
 	int numa_partial[MAX_NODES];
 } slabinfo[MAX_SLABS];
@@ -79,6 +82,7 @@ int sort_active = 0;
 int set_debug = 0;
 int show_ops = 0;
 int show_activity = 0;
+int show_defragcount = 0;
 
 /* Debug options */
 int sanity = 0;
@@ -113,6 +117,7 @@ void usage(void)
 		"-e|--empty             Show empty slabs\n"
 		"-f|--first-alias       Show first alias\n"
 		"-F|--defrag            Show defragmentable caches\n"
+		"-G|--display-defrag    Display defrag counters\n"
 		"-h|--help              Show usage information\n"
 		"-i|--inverted          Inverted list\n"
 		"-l|--slabs             Show slabs\n"
@@ -300,6 +305,8 @@ void first_line(void)
 {
 	if (show_activity)
 		printf("Name                   Objects      Alloc       Free   %%Fast Fallb O\n");
+	else if (show_defragcount)
+		printf("Name                   Objects DefragRQ  Slabs Success   Empty Skipped  Failed\n");
 	else
 		printf("Name                   Objects Objsize    Space "
 			"Slabs/Part/Cpu  O/S O %%Ra %%Ef Flg\n");
@@ -466,22 +473,28 @@ void slab_stats(struct slabinfo *s)
 
 	printf("Total                %8lu %8lu\n\n", total_alloc, total_free);
 
-	if (s->cpuslab_flush)
-		printf("Flushes %8lu\n", s->cpuslab_flush);
-
-	if (s->alloc_refill)
-		printf("Refill %8lu\n", s->alloc_refill);
+	if (s->cpuslab_flush || s->alloc_refill)
+		printf("CPU Slab  : Flushes=%lu Refills=%lu\n",
+			s->cpuslab_flush, s->alloc_refill);
 
 	total = s->deactivate_full + s->deactivate_empty +
 			s->deactivate_to_head + s->deactivate_to_tail;
 
 	if (total)
-		printf("Deactivate Full=%lu(%lu%%) Empty=%lu(%lu%%) "
+		printf("Deactivate: Full=%lu(%lu%%) Empty=%lu(%lu%%) "
 			"ToHead=%lu(%lu%%) ToTail=%lu(%lu%%)\n",
 			s->deactivate_full, (s->deactivate_full * 100) / total,
 			s->deactivate_empty, (s->deactivate_empty * 100) / total,
 			s->deactivate_to_head, (s->deactivate_to_head * 100) / total,
 			s->deactivate_to_tail, (s->deactivate_to_tail * 100) / total);
+
+	if (s->shrink_calls)
+		printf("Shrink    : Calls=%lu Attempts=%lu Empty=%lu Successful=%lu\n",
+			s->shrink_calls, s->shrink_attempt_defrag,
+			s->shrink_empty_slab, s->shrink_slab_reclaimed);
+	if (s->shrink_slab_skipped || s->shrink_object_reclaim_failed)
+		printf("Defrag    : Slabs skipped=%lu Object reclaim failed=%lu\n",
+		s->shrink_slab_skipped, s->shrink_object_reclaim_failed);
 }
 
 void report(struct slabinfo *s)
@@ -598,7 +611,12 @@ void slabcache(struct slabinfo *s)
 			total_alloc ? (s->alloc_fastpath * 100 / total_alloc) : 0,
 			total_free ? (s->free_fastpath * 100 / total_free) : 0,
 			s->order_fallback, s->order);
-	}
+	} else
+	if (show_defragcount)
+		printf("%-21s %8ld %7d %7d %7d %7d %7d %7d\n",
+			s->name, s->objects, s->shrink_calls, s->shrink_attempt_defrag,
+			s->shrink_slab_reclaimed, s->shrink_empty_slab,
+			s->shrink_slab_skipped, s->shrink_object_reclaim_failed);
 	else
 		printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
 			s->name, s->objects, s->object_size, size_str, dist_str,
@@ -1210,6 +1228,13 @@ void read_slab_dir(void)
 			slab->deactivate_to_tail = get_obj("deactivate_to_tail");
 			slab->deactivate_remote_frees = get_obj("deactivate_remote_frees");
 			slab->order_fallback = get_obj("order_fallback");
+			slab->shrink_calls = get_obj("shrink_calls");
+			slab->shrink_attempt_defrag = get_obj("shrink_attempt_defrag");
+			slab->shrink_empty_slab = get_obj("shrink_empty_slab");
+			slab->shrink_slab_skipped = get_obj("shrink_slab_skipped");
+			slab->shrink_slab_reclaimed = get_obj("shrink_slab_reclaimed");
+			slab->shrink_object_reclaim_failed =
+					get_obj("shrink_object_reclaim_failed");
 			slab->defrag_ratio = get_obj("defrag_ratio");
 			slab->remote_node_defrag_ratio =
 					get_obj("remote_node_defrag_ratio");
@@ -1274,6 +1299,7 @@ struct option opts[] = {
 	{ "ctor", 0, NULL, 'C' },
 	{ "debug", 2, NULL, 'd' },
 	{ "display-activity", 0, NULL, 'D' },
+	{ "display-defrag", 0, NULL, 'G' },
 	{ "empty", 0, NULL, 'e' },
 	{ "first-alias", 0, NULL, 'f' },
 	{ "defrag", 0, NULL, 'F' },
@@ -1299,7 +1325,7 @@ int main(int argc, char *argv[])
 
 	page_size = getpagesize();
 
-	while ((c = getopt_long(argc, argv, "aACd::DefFhil1noprstvzTS",
+	while ((c = getopt_long(argc, argv, "aACd::DefFGhil1noprstvzTS",
 						opts, NULL)) != -1)
 		switch (c) {
 		case '1':
@@ -1325,6 +1351,9 @@ int main(int argc, char *argv[])
 		case 'f':
 			show_first_alias = 1;
 			break;
+		case 'G':
+			show_defragcount = 1;
+			break;
 		case 'h':
 			usage();
 			return 0;
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2008-05-07 21:22:55.965159854 -0700
+++ linux-2.6/include/linux/slub_def.h	2008-05-07 21:23:05.432658246 -0700
@@ -30,6 +30,12 @@ enum stat_item {
 	DEACTIVATE_TO_TAIL,	/* Cpu slab was moved to the tail of partials */
 	DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
 	ORDER_FALLBACK,		/* Number of times fallback was necessary */
+	SHRINK_CALLS,		/* Number of invocations of kmem_cache_shrink */
+	SHRINK_ATTEMPT_DEFRAG,	/* Slabs that were attempted to be reclaimed */
+	SHRINK_EMPTY_SLAB,	/* Shrink encountered and freed empty slab */
+	SHRINK_SLAB_SKIPPED,	/* Slab reclaim skipped a slab (busy etc) */
+	SHRINK_SLAB_RECLAIMED,	/* Successfully reclaimed slabs */
+	SHRINK_OBJECT_RECLAIM_FAILED, /* Callbacks signaled busy objects */
 	NR_SLUB_STAT_ITEMS };
 
 struct kmem_cache_cpu {
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-05-07 21:23:04.371410484 -0700
+++ linux-2.6/mm/slub.c	2008-05-07 21:23:05.462668424 -0700
@@ -2848,6 +2848,7 @@ static int kmem_cache_vacate(struct page
 	void *private;
 	unsigned long flags;
 	unsigned long objects;
+	struct kmem_cache_cpu *c;
 
 	local_irq_save(flags);
 	slab_lock(page);
@@ -2858,8 +2859,11 @@ static int kmem_cache_vacate(struct page
 	s = page->slab;
 	objects = page->objects;
 	map = scratch + objects * sizeof(void **);
-	if (!page->inuse || !s->kick || !SlabKickable(page))
+	if (!page->inuse || !s->kick || !SlabKickable(page)) {
+		c = get_cpu_slab(s, smp_processor_id());
+		stat(c, SHRINK_SLAB_SKIPPED);
 		goto out;
+	}
 
 	/* Determine used objects */
 	bitmap_fill(map, objects);
@@ -2896,9 +2900,13 @@ out:
 	 * Check the result and unfreeze the slab
 	 */
 	leftover = page->inuse;
-	if (leftover)
+	c = get_cpu_slab(s, smp_processor_id());
+	if (leftover) {
 		/* Unsuccessful reclaim. Avoid future reclaim attempts. */
+		stat(c, SHRINK_OBJECT_RECLAIM_FAILED);
 		ClearSlabKickable(page);
+	} else
+		stat(c, SHRINK_SLAB_RECLAIMED);
 	unfreeze_slab(s, page, leftover > 0);
 	local_irq_restore(flags);
 	return leftover;
@@ -2949,11 +2957,14 @@ static unsigned long __kmem_cache_shrink
 	LIST_HEAD(zaplist);
 	int freed = 0;
 	struct kmem_cache_node *n = get_node(s, node);
+	struct kmem_cache_cpu *c;
 
 	if (n->nr_partial <= limit)
 		return 0;
 
 	spin_lock_irqsave(&n->list_lock, flags);
+	c = get_cpu_slab(s, smp_processor_id());
+	stat(c, SHRINK_CALLS);
 	list_for_each_entry_safe(page, page2, &n->partial, lru) {
 		if (!slab_trylock(page))
 			/* Busy slab. Get out of the way */
@@ -2973,12 +2984,14 @@ static unsigned long __kmem_cache_shrink
 
 			list_move(&page->lru, &zaplist);
 			if (s->kick) {
+				stat(c, SHRINK_ATTEMPT_DEFRAG);
 				n->nr_partial--;
 				SetSlabFrozen(page);
 			}
 			slab_unlock(page);
 		} else {
 			/* Empty slab page */
+			stat(c, SHRINK_EMPTY_SLAB);
 			list_del(&page->lru);
 			n->nr_partial--;
 			slab_unlock(page);
@@ -4400,6 +4413,12 @@ STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate
 STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail);
 STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
 STAT_ATTR(ORDER_FALLBACK, order_fallback);
+STAT_ATTR(SHRINK_CALLS, shrink_calls);
+STAT_ATTR(SHRINK_ATTEMPT_DEFRAG, shrink_attempt_defrag);
+STAT_ATTR(SHRINK_EMPTY_SLAB, shrink_empty_slab);
+STAT_ATTR(SHRINK_SLAB_SKIPPED, shrink_slab_skipped);
+STAT_ATTR(SHRINK_SLAB_RECLAIMED, shrink_slab_reclaimed);
+STAT_ATTR(SHRINK_OBJECT_RECLAIM_FAILED, shrink_object_reclaim_failed);
 #endif
 
 static struct attribute *slab_attrs[] = {
@@ -4454,6 +4473,12 @@ static struct attribute *slab_attrs[] = 
 	&deactivate_to_tail_attr.attr,
 	&deactivate_remote_frees_attr.attr,
 	&order_fallback_attr.attr,
+	&shrink_calls_attr.attr,
+	&shrink_attempt_defrag_attr.attr,
+	&shrink_empty_slab_attr.attr,
+	&shrink_slab_skipped_attr.attr,
+	&shrink_slab_reclaimed_attr.attr,
+	&shrink_object_reclaim_failed_attr.attr,
 #endif
 	NULL
 };

-- 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [patch 09/21] slub: Trigger defragmentation from memory reclaim
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (7 preceding siblings ...)
  2008-05-10  3:08 ` [patch 08/21] slub: add defrag statistics Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 10/21] buffer heads: Support slab defrag Christoph Lameter
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Pekka Enberg, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, mpm

[-- Attachment #1: 0009-SLUB-Trigger-defragmentation-from-memory-reclaim.patch --]
[-- Type: text/plain, Size: 10980 bytes --]

This patch triggers slab defragmentation from memory reclaim. The logical
point for this is after slab shrinking has been performed in vmscan.c. At that
point the fragmentation of a slab cache has increased because objects were
freed via the LRU lists maintained for the various slab caches.
So we call kmem_cache_defrag() from there.

shrink_slab() is called in some contexts to do global shrinking
of slabs and in others to do shrinking for a particular zone. Pass the zone to
shrink_slab() so that it can call kmem_cache_defrag() and restrict
the defragmentation to the node that is under memory pressure.

The frequency of callbacks from slab reclaim into slab defragmentation can
be controlled by a new tunable, /proc/sys/vm/slab_defrag_limit.
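
As a usage illustration (hypothetical userspace snippet, not part of this
patch), the threshold can be raised to make defrag passes less frequent:

	#include <stdio.h>

	int main(void)
	{
		/* Default limit is 1000; raise it to defer defrag passes */
		FILE *f = fopen("/proc/sys/vm/slab_defrag_limit", "w");

		if (!f)
			return 1;
		fprintf(f, "%d\n", 5000);
		return fclose(f) ? 1 : 0;
	}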

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 Documentation/sysctl/vm.txt |   12 ++++++++
 fs/drop_caches.c            |    2 -
 include/linux/mm.h          |    3 --
 include/linux/mmzone.h      |    1 
 include/linux/swap.h        |    3 ++
 kernel/sysctl.c             |   20 +++++++++++++
 mm/vmscan.c                 |   65 +++++++++++++++++++++++++++++++++++++++-----
 mm/vmstat.c                 |    2 +
 8 files changed, 98 insertions(+), 10 deletions(-)

Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c	2008-05-08 17:24:44.000000000 -0700
+++ linux-2.6/fs/drop_caches.c	2008-05-09 15:13:31.000000000 -0700
@@ -58,7 +58,7 @@ static void drop_slab(void)
 	int nr_objects;
 
 	do {
-		nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+		nr_objects = shrink_slab(1000, GFP_KERNEL, 1000, NULL);
 	} while (nr_objects > 10);
 }
 
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2008-05-08 17:27:30.000000000 -0700
+++ linux-2.6/include/linux/mm.h	2008-05-09 15:13:31.000000000 -0700
@@ -1242,8 +1242,7 @@ int in_gate_area_no_task(unsigned long a
 int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *,
 					void __user *, size_t *, loff_t *);
 unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
-			unsigned long lru_pages);
-
+				unsigned long lru_pages, struct zone *z);
 #ifndef CONFIG_MMU
 #define randomize_va_space 0
 #else
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2008-05-08 17:24:45.000000000 -0700
+++ linux-2.6/mm/vmscan.c	2008-05-09 15:13:32.000000000 -0700
@@ -149,6 +149,14 @@ void unregister_shrinker(struct shrinker
 EXPORT_SYMBOL(unregister_shrinker);
 
 #define SHRINK_BATCH 128
+
+/*
+ * Trigger a call into slab defrag if the sum of the returns from
+ * shrinkers cross this value.
+ */
+int slab_defrag_limit = 1000;
+int slab_defrag_counter;
+
 /*
  * Call the shrink functions to age shrinkable caches
  *
@@ -166,10 +174,18 @@ EXPORT_SYMBOL(unregister_shrinker);
  * are eligible for the caller's allocation attempt.  It is used for balancing
  * slab reclaim versus page reclaim.
  *
+ * zone is the zone for which we are shrinking the slabs. If the intent
+ * is to do a global shrink then zone may be NULL. Specification of a
+ * zone is currently only used to limit slab defragmentation to a NUMA node.
+ * The performance of shrink_slab would be better (in particular under NUMA)
+ * if it could be targeted as a whole to the zone that is under memory
+ * pressure but the VFS infrastructure does not allow that at the present
+ * time.
+ *
  * Returns the number of slab objects which we shrunk.
  */
 unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
-			unsigned long lru_pages)
+			unsigned long lru_pages, struct zone *zone)
 {
 	struct shrinker *shrinker;
 	unsigned long ret = 0;
@@ -226,6 +242,39 @@ unsigned long shrink_slab(unsigned long 
 		shrinker->nr += total_scan;
 	}
 	up_read(&shrinker_rwsem);
+
+
+	/* Avoid dirtying cachelines */
+	if (!ret)
+		return 0;
+
+	/*
+	 * "ret" doesn't really contain the freed object count. The shrinkers
+	 * fake it. We have to go with what we get, though.
+	 *
+	 * Handling of the defrag_counter is also racy. If we get the
+	 * wrong counts then we may unnecessarily do a defrag pass or defer
+	 * one. "ret" is already faked. So this is just increasing
+	 * the already existing fuzziness to get some notion as to when
+	 * to initiate slab defrag which will hopefully be okay.
+	 */
+	if (zone) {
+		/* balance_pgdat running on a zone so we only scan one node */
+		zone->slab_defrag_counter += ret;
+		if (zone->slab_defrag_counter > slab_defrag_limit &&
+						(gfp_mask & __GFP_FS)) {
+			zone->slab_defrag_counter = 0;
+			kmem_cache_defrag(zone_to_nid(zone));
+		}
+	} else {
+		/* Direct (and thus global) reclaim. Scan all nodes */
+		slab_defrag_counter += ret;
+		if (slab_defrag_counter > slab_defrag_limit &&
+						(gfp_mask & __GFP_FS)) {
+			slab_defrag_counter = 0;
+			kmem_cache_defrag(-1);
+		}
+	}
 	return ret;
 }
 
@@ -1342,7 +1391,7 @@ static unsigned long do_try_to_free_page
 		 * over limit cgroups
 		 */
 		if (scan_global_lru(sc)) {
-			shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
+			shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages, NULL);
 			if (reclaim_state) {
 				nr_reclaimed += reclaim_state->reclaimed_slab;
 				reclaim_state->reclaimed_slab = 0;
@@ -1567,7 +1616,7 @@ loop_again:
 				nr_reclaimed += shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
-						lru_pages);
+						lru_pages, zone);
 			nr_reclaimed += reclaim_state->reclaimed_slab;
 			total_scanned += sc.nr_scanned;
 			if (zone_is_all_unreclaimable(zone))
@@ -1806,7 +1855,7 @@ unsigned long shrink_all_memory(unsigned
 	/* If slab caches are huge, it's better to hit them first */
 	while (nr_slab >= lru_pages) {
 		reclaim_state.reclaimed_slab = 0;
-		shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
+		shrink_slab(nr_pages, sc.gfp_mask, lru_pages, NULL);
 		if (!reclaim_state.reclaimed_slab)
 			break;
 
@@ -1844,7 +1893,7 @@ unsigned long shrink_all_memory(unsigned
 
 			reclaim_state.reclaimed_slab = 0;
 			shrink_slab(sc.nr_scanned, sc.gfp_mask,
-					count_lru_pages());
+					count_lru_pages(), NULL);
 			ret += reclaim_state.reclaimed_slab;
 			if (ret >= nr_pages)
 				goto out;
@@ -1861,7 +1910,8 @@ unsigned long shrink_all_memory(unsigned
 	if (!ret) {
 		do {
 			reclaim_state.reclaimed_slab = 0;
-			shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages());
+			shrink_slab(nr_pages, sc.gfp_mask,
+					count_lru_pages(), NULL);
 			ret += reclaim_state.reclaimed_slab;
 		} while (ret < nr_pages && reclaim_state.reclaimed_slab > 0);
 	}
@@ -2023,7 +2073,8 @@ static int __zone_reclaim(struct zone *z
 		 * Note that shrink_slab will free memory on all zones and may
 		 * take a long time.
 		 */
-		while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&
+		while (shrink_slab(sc.nr_scanned, gfp_mask, order,
+						zone) &&
 			zone_page_state(zone, NR_SLAB_RECLAIMABLE) >
 				slab_reclaimable - nr_pages)
 			;
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2008-05-08 17:24:44.000000000 -0700
+++ linux-2.6/include/linux/mmzone.h	2008-05-09 15:13:32.000000000 -0700
@@ -256,6 +256,7 @@ struct zone {
 	unsigned long		nr_scan_active;
 	unsigned long		nr_scan_inactive;
 	unsigned long		pages_scanned;	   /* since last reclaim */
+	unsigned long		slab_defrag_counter; /* since last defrag */
 	unsigned long		flags;		   /* zone flags, see below */
 
 	/* Zone statistics */
Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h	2008-05-08 17:24:44.000000000 -0700
+++ linux-2.6/include/linux/swap.h	2008-05-09 15:13:32.000000000 -0700
@@ -188,6 +188,9 @@ extern unsigned long try_to_free_mem_cgr
 extern int __isolate_lru_page(struct page *page, int mode);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
+extern int slab_defrag_limit;
+extern int slab_defrag_counter;
+
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2008-05-08 17:24:45.000000000 -0700
+++ linux-2.6/kernel/sysctl.c	2008-05-09 15:13:32.000000000 -0700
@@ -1035,6 +1035,26 @@ static struct ctl_table vm_table[] = {
 		.strategy	= &sysctl_intvec,
 		.extra1		= &zero,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "slab_defrag_limit",
+		.data		= &slab_defrag_limit,
+		.maxlen		= sizeof(slab_defrag_limit),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &one_hundred,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "slab_defrag_count",
+		.data		= &slab_defrag_counter,
+		.maxlen		= sizeof(slab_defrag_counter),
+		.mode		= 0444,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
 #ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
 	{
 		.ctl_name	= VM_LEGACY_VA_LAYOUT,
Index: linux-2.6/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.orig/Documentation/sysctl/vm.txt	2008-05-08 17:24:44.000000000 -0700
+++ linux-2.6/Documentation/sysctl/vm.txt	2008-05-09 15:13:32.000000000 -0700
@@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/
 - numa_zonelist_order
 - nr_hugepages
 - nr_overcommit_hugepages
+- slab_defrag_limit
 
 ==============================================================
 
@@ -347,3 +348,14 @@ Change the maximum size of the hugepage 
 nr_hugepages + nr_overcommit_hugepages.
 
 See Documentation/vm/hugetlbpage.txt
+
+==============================================================
+
+slab_defrag_limit
+
+Determines the frequency of calls from reclaim into slab defragmentation.
+Slab defrag reclaims objects from sparsely populated slab pages.
+The default is 1000. Increase if slab defragmentation occurs
+too frequently. Decrease if more slab defragmentation passes
+are needed. The slabinfo tool can report on the frequency of the callbacks.
+
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2008-05-08 17:24:45.000000000 -0700
+++ linux-2.6/mm/vmstat.c	2008-05-09 15:15:10.000000000 -0700
@@ -711,9 +711,11 @@ static void zoneinfo_show_print(struct s
 #endif
 	}
 	seq_printf(m,
+		   "\n  slab_defrag_count: %lu"
 		   "\n  all_unreclaimable: %u"
 		   "\n  prev_priority:     %i"
 		   "\n  start_pfn:         %lu",
+			   zone->slab_defrag_counter,
 			   zone_is_all_unreclaimable(zone),
 		   zone->prev_priority,
 		   zone->zone_start_pfn);

-- 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [patch 10/21] buffer heads: Support slab defrag
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (8 preceding siblings ...)
  2008-05-10  3:08 ` [patch 09/21] slub: Trigger defragmentation from memory reclaim Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-12  0:24   ` David Chinner
  2008-05-10  3:08 ` [patch 11/21] inodes: Support generic defragmentation Christoph Lameter
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel,
	Pekka Enberg, mpm

[-- Attachment #1: 0024-Buffer-heads-Support-slab-defrag.patch --]
[-- Type: text/plain, Size: 3257 bytes --]

Defragmentation support for buffer heads. We convert the references to
buffers to struct page references and try to remove the buffers from
those pages. If the pages are dirty then trigger writeout so that the
buffer heads can be removed later.
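
For reference, the get()/kick() callback contract assumed here is roughly
the following (a simplified sketch, not the literal declarations; the
authoritative definition is the slab defrag API introduced earlier in this
series):

	/*
	 * get() is called while the slab allocator blocks all frees on
	 * the slab page. It must take a reference on each of the nr
	 * objects in v[] (or NULL out entries that cannot be handled)
	 * and may return a private pointer that is passed to kick().
	 */
	void *get(struct kmem_cache *s, int nr, void **v);

	/*
	 * kick() is called after the slab locks have been dropped. It
	 * must try to free the objects, dropping the references taken
	 * in get(), so that the slab page can be reclaimed.
	 */
	void kick(struct kmem_cache *s, int nr, void **v, void *private);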

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/buffer.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2008-05-07 20:27:15.182659486 -0700
+++ linux-2.6/fs/buffer.c	2008-05-07 20:29:13.052102980 -0700
@@ -3255,6 +3255,104 @@ int bh_submit_read(struct buffer_head *b
 }
 EXPORT_SYMBOL(bh_submit_read);
 
+/*
+ * Writeback a page to clean the dirty state
+ */
+static void trigger_write(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+	int rc;
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = 1,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.nonblocking = 1,
+		.for_reclaim = 0
+	};
+
+	if (!mapping->a_ops->writepage)
+		/* No write method for the address space */
+		return;
+
+	if (!clear_page_dirty_for_io(page))
+		/* Someone else already triggered a write */
+		return;
+
+	rc = mapping->a_ops->writepage(page, &wbc);
+	if (rc < 0)
+		/* I/O Error writing */
+		return;
+
+	if (rc == AOP_WRITEPAGE_ACTIVATE)
+		unlock_page(page);
+}
+
+/*
+ * Get references on buffers.
+ *
+ * We obtain references on the page that uses the buffer. v[i] will point to
+ * the corresponding page after get_buffers() is through.
+ *
+ * We are safe from the underlying page being removed simply by doing
+ * a get_page_unless_zero. The buffer head removal may race at will.
+ * try_to_free_buffers() will later take appropriate locks to remove the
+ * buffers if they are still there.
+ */
+static void *get_buffers(struct kmem_cache *s, int nr, void **v)
+{
+	struct page *page;
+	struct buffer_head *bh;
+	int i, j;
+	int n = 0;
+
+	for (i = 0; i < nr; i++) {
+		bh = v[i];
+		v[i] = NULL;
+
+		page = bh->b_page;
+		if (!page)
+			continue;
+
+		/* Only take one reference per page */
+		for (j = 0; j < n; j++)
+			if (page == v[j])
+				break;
+		if (j < n)
+			continue;
+
+		if (get_page_unless_zero(page))
+			v[n++] = page;
+	}
+	return NULL;
+}
+
+/*
+ * Despite its name, kick_buffers() operates on a list of pointers to
+ * page structs that was set up by get_buffers().
+ */
+static void kick_buffers(struct kmem_cache *s, int nr, void **v,
+							void *private)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		page = v[i];
+
+		if (!page || PageWriteback(page))
+			continue;
+
+		if (!TestSetPageLocked(page)) {
+			if (PageDirty(page))
+				trigger_write(page);
+			else {
+				if (PagePrivate(page))
+					try_to_free_buffers(page);
+				unlock_page(page);
+			}
+		}
+		put_page(page);
+	}
+}
+
 static void
 init_buffer_head(struct kmem_cache *cachep, void *data)
 {
@@ -3273,6 +3371,7 @@ void __init buffer_init(void)
 				(SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
 				SLAB_MEM_SPREAD),
 				init_buffer_head);
+	kmem_cache_setup_defrag(bh_cachep, get_buffers, kick_buffers);
 
 	/*
 	 * Limit the bh occupancy to 10% of ZONE_NORMAL

-- 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [patch 11/21] inodes: Support generic defragmentation
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (9 preceding siblings ...)
  2008-05-10  3:08 ` [patch 10/21] buffer heads: Support slab defrag Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 12/21] Filesystem: Ext2 filesystem defrag Christoph Lameter
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Alexander Viro, Christoph Hellwig, linux-fsdevel,
	Mel Gorman, andi, Rik van Riel, Pekka Enberg, mpm

[-- Attachment #1: 0025-inodes-Support-generic-defragmentation.patch --]
[-- Type: text/plain, Size: 5171 bytes --]

This implements the ability to remove inodes in a particular slab
from inode caches. In order to remove an inode we may have to write out
the inode's pages and the inode itself, and remove the dentries referring
to the inode.

Provide generic functionality that can be used by filesystems that have
their own inode caches to also tie into the defragmentation functions
that are made available here.
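
For illustration, a hypothetical filesystem "foofs" that embeds the
struct inode in its own foofs_inode_info would wire into defrag like this
(the ext2/3/4 patches later in this series follow exactly this pattern):

	static void *foofs_get_inodes(struct kmem_cache *s, int nr, void **v)
	{
		return fs_get_inodes(s, nr, v,
			offsetof(struct foofs_inode_info, vfs_inode));
	}

	/* ... and after creating foofs_inode_cachep: */
	kmem_cache_setup_defrag(foofs_inode_cachep, foofs_get_inodes,
				kick_inodes);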

Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/inode.c         |  123 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |    6 ++
 2 files changed, 129 insertions(+)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2008-05-07 20:27:10.563908386 -0700
+++ linux-2.6/fs/inode.c	2008-05-07 20:48:14.473081107 -0700
@@ -1363,6 +1363,128 @@ static int __init set_ihash_entries(char
 __setup("ihash_entries=", set_ihash_entries);
 
 /*
+ * Obtain a refcount on a list of struct inodes pointed to by v. If the
+ * inode is in the process of being freed then zap the v[] entry so that
+ * we skip the freeing attempts later.
+ *
+ * This is a generic function for the ->get slab defrag callback.
+ */
+void *get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	int i;
+
+	spin_lock(&inode_lock);
+	for (i = 0; i < nr; i++) {
+		struct inode *inode = v[i];
+
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+			v[i] = NULL;
+		else
+			__iget(inode);
+	}
+	spin_unlock(&inode_lock);
+	return NULL;
+}
+EXPORT_SYMBOL(get_inodes);
+
+/*
+ * Function for filesystems that embed struct inode into their own
+ * fs inode. The offset is the offset of the struct inode in the fs inode.
+ *
+ * The function adds to the pointers in v[] in order to make them point to
+ * struct inode. Then get_inodes() is used to get the refcount.
+ * The converted v[] pointers can then also be passed to the kick() callback
+ * without further processing.
+ */
+void *fs_get_inodes(struct kmem_cache *s, int nr, void **v,
+						unsigned long offset)
+{
+	int i;
+
+	for (i = 0; i < nr; i++)
+		v[i] += offset;
+
+	return get_inodes(s, nr, v);
+}
+EXPORT_SYMBOL(fs_get_inodes);
+
+/*
+ * Generic callback function slab defrag ->kick methods. Takes the
+ * array with inodes where we obtained refcounts using fs_get_inodes()
+ * or get_inodes() and tries to free them.
+ */
+void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
+{
+	struct inode *inode;
+	int i;
+	int abort = 0;
+	LIST_HEAD(freeable);
+	int active;
+
+	for (i = 0; i < nr; i++) {
+		inode = v[i];
+		if (!inode)
+			continue;
+
+		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+			if (remove_inode_buffers(inode))
+				/*
+				 * Should be really be doing this? Or
+				 * Should we really be doing this? Or
+				 *
+				 * Possibly an expensive operation but we
+				 * cannot reclaim the inode if the pages
+				 * are still present.
+				 */
+				invalidate_mapping_pages(&inode->i_data,
+								0, -1);
+		}
+
+		/* Invalidate children and dentry */
+		if (S_ISDIR(inode->i_mode)) {
+			struct dentry *d = d_find_alias(inode);
+
+			if (d) {
+				d_invalidate(d);
+				dput(d);
+			}
+		}
+
+		if (inode->i_state & I_DIRTY)
+			write_inode_now(inode, 1);
+
+		d_prune_aliases(inode);
+	}
+
+	mutex_lock(&iprune_mutex);
+	for (i = 0; i < nr; i++) {
+		inode = v[i];
+
+		if (!inode)
+			/* inode is already being freed */
+			continue;
+
+		active = inode->i_sb->s_flags & MS_ACTIVE;
+		iput(inode);
+		if (abort || !active)
+			continue;
+
+		spin_lock(&inode_lock);
+		abort = !can_unuse(inode);
+
+		if (!abort) {
+			list_move(&inode->i_list, &freeable);
+			inode->i_state |= I_FREEING;
+			inodes_stat.nr_unused--;
+		}
+		spin_unlock(&inode_lock);
+	}
+	dispose_list(&freeable);
+	mutex_unlock(&iprune_mutex);
+}
+EXPORT_SYMBOL(kick_inodes);
+
+/*
  * Initialize the waitqueues and inode hash table.
  */
 void __init inode_init_early(void)
@@ -1401,6 +1523,7 @@ void __init inode_init(void)
 					 SLAB_MEM_SPREAD),
 					 init_once);
 	register_shrinker(&icache_shrinker);
+	kmem_cache_setup_defrag(inode_cachep, get_inodes, kick_inodes);
 
 	/* Hash may have been set up in inode_init_early */
 	if (!hashdist)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2008-05-07 20:27:10.573910205 -0700
+++ linux-2.6/include/linux/fs.h	2008-05-07 20:30:15.153909886 -0700
@@ -1826,6 +1826,12 @@ static inline void insert_inode_hash(str
 	__insert_inode_hash(inode, inode->i_ino);
 }
 
+/* Helper functions for inode defragmentation support in filesystems */
+extern void kick_inodes(struct kmem_cache *, int, void **, void *);
+extern void *get_inodes(struct kmem_cache *, int nr, void **);
+extern void *fs_get_inodes(struct kmem_cache *, int nr, void **,
+						unsigned long offset);
+
 extern struct file * get_empty_filp(void);
 extern void file_move(struct file *f, struct list_head *list);
 extern void file_kill(struct file *f);

-- 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [patch 12/21] Filesystem: Ext2 filesystem defrag
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (10 preceding siblings ...)
  2008-05-10  3:08 ` [patch 11/21] inodes: Support generic defragmentation Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 13/21] Filesystem: Ext3 " Christoph Lameter
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel,
	Pekka Enberg, mpm

[-- Attachment #1: ext2-defrag --]
[-- Type: text/plain, Size: 1049 bytes --]

Support defragmentation for ext2 filesystem inodes

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/ext2/super.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c	2008-05-07 21:23:41.931409650 -0700
+++ linux-2.6/fs/ext2/super.c	2008-05-07 21:24:42.951410526 -0700
@@ -170,6 +170,12 @@ static void init_once(struct kmem_cache 
 	inode_init_once(&ei->vfs_inode);
 }
 
+static void *ext2_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+		offsetof(struct ext2_inode_info, vfs_inode));
+}
+
 static int init_inodecache(void)
 {
 	ext2_inode_cachep = kmem_cache_create("ext2_inode_cache",
@@ -179,6 +185,9 @@ static int init_inodecache(void)
 					     init_once);
 	if (ext2_inode_cachep == NULL)
 		return -ENOMEM;
+
+	kmem_cache_setup_defrag(ext2_inode_cachep,
+			ext2_get_inodes, kick_inodes);
 	return 0;
 }
 

-- 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [patch 13/21] Filesystem: Ext3 filesystem defrag
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (11 preceding siblings ...)
  2008-05-10  3:08 ` [patch 12/21] Filesystem: Ext2 filesystem defrag Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 14/21] Filesystem: Ext4 " Christoph Lameter
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel,
	Pekka Enberg, mpm

[-- Attachment #1: ext3-defrag --]
[-- Type: text/plain, Size: 1046 bytes --]

Support defragmentation for ext3 filesystem inodes

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/ext3/super.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-2.6/fs/ext3/super.c
===================================================================
--- linux-2.6.orig/fs/ext3/super.c	2008-05-07 21:23:41.941410081 -0700
+++ linux-2.6/fs/ext3/super.c	2008-05-07 21:25:55.453910361 -0700
@@ -484,6 +484,12 @@ static void init_once(struct kmem_cache 
 	inode_init_once(&ei->vfs_inode);
 }
 
+static void *ext3_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+		offsetof(struct ext3_inode_info, vfs_inode));
+}
+
 static int init_inodecache(void)
 {
 	ext3_inode_cachep = kmem_cache_create("ext3_inode_cache",
@@ -493,6 +499,8 @@ static int init_inodecache(void)
 					     init_once);
 	if (ext3_inode_cachep == NULL)
 		return -ENOMEM;
+	kmem_cache_setup_defrag(ext3_inode_cachep,
+			ext3_get_inodes, kick_inodes);
 	return 0;
 }
 

-- 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [patch 14/21] Filesystem: Ext4 filesystem defrag
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (12 preceding siblings ...)
  2008-05-10  3:08 ` [patch 13/21] Filesystem: Ext3 " Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 15/21] Filesystem: XFS slab defragmentation Christoph Lameter
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel,
	Pekka Enberg, mpm

[-- Attachment #1: ext4-defrag --]
[-- Type: text/plain, Size: 1046 bytes --]

Support defragmentation for ext4 filesystem inodes

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/ext4/super.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c	2008-05-07 21:23:41.961409593 -0700
+++ linux-2.6/fs/ext4/super.c	2008-05-07 21:27:18.215159859 -0700
@@ -599,6 +599,12 @@ static void init_once(struct kmem_cache 
 	inode_init_once(&ei->vfs_inode);
 }
 
+static void *ext4_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+		offsetof(struct ext4_inode_info, vfs_inode));
+}
+
 static int init_inodecache(void)
 {
 	ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
@@ -608,6 +614,8 @@ static int init_inodecache(void)
 					     init_once);
 	if (ext4_inode_cachep == NULL)
 		return -ENOMEM;
+	kmem_cache_setup_defrag(ext4_inode_cachep,
+			ext4_get_inodes, kick_inodes);
 	return 0;
 }
 

-- 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [patch 15/21] Filesystem: XFS slab defragmentation
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (13 preceding siblings ...)
  2008-05-10  3:08 ` [patch 14/21] Filesystem: Ext4 " Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  6:55   ` Christoph Hellwig
  2008-05-10  3:08 ` [patch 16/21] Filesystem: /proc filesystem support for slab defrag Christoph Lameter
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel,
	Pekka Enberg, mpm

[-- Attachment #1: 0027-FS-XFS-slab-defragmentation.patch --]
[-- Type: text/plain, Size: 746 bytes --]

Support inode defragmentation for xfs

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/xfs/linux-2.6/xfs_super.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 8831d95..555d84c 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -862,6 +862,7 @@ xfs_init_zones(void)
 	xfs_ioend_zone = kmem_zone_init(sizeof(xfs_ioend_t), "xfs_ioend");
 	if (!xfs_ioend_zone)
 		goto out_destroy_vnode_zone;
+	kmem_cache_setup_defrag(xfs_vnode_zone, get_inodes, kick_inodes);
 
 	xfs_ioend_pool = mempool_create_slab_pool(4 * MAX_BUF_PER_PAGE,
 						  xfs_ioend_zone);
-- 
1.5.4.4

-- 

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [patch 16/21] Filesystem: /proc filesystem support for slab defrag
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (14 preceding siblings ...)
  2008-05-10  3:08 ` [patch 15/21] Filesystem: XFS slab defragmentation Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 17/21] Filesystem: Slab defrag: Reiserfs support Christoph Lameter
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Alexey Dobriyan, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, Pekka Enberg, mpm

[-- Attachment #1: 0028-FS-Proc-filesystem-support-for-slab-defrag.patch --]
[-- Type: text/plain, Size: 1030 bytes --]

Support procfs inode defragmentation

Cc: Alexey Dobriyan <adobriyan@sw.ru>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/proc/inode.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 82b3a1b..5bc8d23 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -104,6 +104,12 @@ static void init_once(struct kmem_cache * cachep, void *foo)
 	inode_init_once(&ei->vfs_inode);
 }
 
+static void *proc_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+		offsetof(struct proc_inode, vfs_inode));
+}
+
 int __init proc_init_inodecache(void)
 {
 	proc_inode_cachep = kmem_cache_create("proc_inode_cache",
@@ -111,6 +117,8 @@ int __init proc_init_inodecache(void)
 					     0, (SLAB_RECLAIM_ACCOUNT|
 						SLAB_MEM_SPREAD|SLAB_PANIC),
 					     init_once);
+	kmem_cache_setup_defrag(proc_inode_cachep,
+				proc_get_inodes, kick_inodes);
 	return 0;
 }
 
-- 
1.5.4.4

-- 

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [patch 17/21] Filesystem: Slab defrag: Reiserfs support
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (15 preceding siblings ...)
  2008-05-10  3:08 ` [patch 16/21] Filesystem: /proc filesystem support for slab defrag Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 18/21] Filesystem: Socket inode defragmentation Christoph Lameter
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel,
	Pekka Enberg, mpm

[-- Attachment #1: 0029-FS-Slab-defrag-Reiserfs-support.patch --]
[-- Type: text/plain, Size: 1006 bytes --]

Slab defragmentation: Support reiserfs inode defragmentation.

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/reiserfs/super.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index 393cc22..69b4a86 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -532,6 +532,12 @@ static void init_once(struct kmem_cache * cachep, void *foo)
 #endif
 }
 
+static void *reiserfs_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+		offsetof(struct reiserfs_inode_info, vfs_inode));
+}
+
 static int init_inodecache(void)
 {
 	reiserfs_inode_cachep = kmem_cache_create("reiser_inode_cache",
@@ -542,6 +548,8 @@ static int init_inodecache(void)
 						  init_once);
 	if (reiserfs_inode_cachep == NULL)
 		return -ENOMEM;
+	kmem_cache_setup_defrag(reiserfs_inode_cachep,
+			reiserfs_get_inodes, kick_inodes);
 	return 0;
 }
 
-- 
1.5.4.4

-- 

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [patch 18/21] Filesystem: Socket inode defragmentation
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (16 preceding siblings ...)
  2008-05-10  3:08 ` [patch 17/21] Filesystem: Slab defrag: Reiserfs support Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-13 13:28   ` Evgeniy Polyakov
  2008-05-10  3:08 ` [patch 19/21] dentries: Add constructor Christoph Lameter
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, netdev, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, Pekka Enberg, mpm

[-- Attachment #1: 0030-FS-Socket-inode-defragmentation.patch --]
[-- Type: text/plain, Size: 978 bytes --]

Support inode defragmentation for sockets

Cc: netdev@vger.kernel.org
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 net/socket.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index 9d3fbfb..205f450 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -269,6 +269,12 @@ static void init_once(struct kmem_cache *cachep, void *foo)
 	inode_init_once(&ei->vfs_inode);
 }
 
+static void *sock_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+		offsetof(struct socket_alloc, vfs_inode));
+}
+
 static int init_inodecache(void)
 {
 	sock_inode_cachep = kmem_cache_create("sock_inode_cache",
@@ -280,6 +286,8 @@ static int init_inodecache(void)
 					      init_once);
 	if (sock_inode_cachep == NULL)
 		return -ENOMEM;
+	kmem_cache_setup_defrag(sock_inode_cachep,
+			sock_get_inodes, kick_inodes);
 	return 0;
 }
 
-- 
1.5.4.4

-- 

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [patch 19/21] dentries: Add constructor
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (17 preceding siblings ...)
  2008-05-10  3:08 ` [patch 18/21] Filesystem: Socket inode defragmentation Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 20/21] dentries: dentry defragmentation Christoph Lameter
  2008-05-10  3:08 ` [patch 21/21] slab defrag: Obsolete SLAB Christoph Lameter
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Alexander Viro, Christoph Hellwig, linux-fsdevel,
	Mel Gorman, andi, Rik van Riel, Pekka Enberg, mpm

[-- Attachment #1: 0031-dentries-Add-constructor.patch --]
[-- Type: text/plain, Size: 2278 bytes --]

In order to support defragmentation on the dentry cache we need the
objects to be in a defined state at all times. Without a constructor an
object would be in a random state after allocation.

So provide a constructor.

Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/dcache.c |   26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c	2008-05-07 19:59:03.441408469 -0700
+++ linux-2.6/fs/dcache.c	2008-05-07 21:22:07.891408627 -0700
@@ -870,6 +870,16 @@ static struct shrinker dcache_shrinker =
 	.seeks = DEFAULT_SEEKS,
 };
 
+static void dcache_ctor(struct kmem_cache *s, void *p)
+{
+	struct dentry *dentry = p;
+
+	spin_lock_init(&dentry->d_lock);
+	dentry->d_inode = NULL;
+	INIT_LIST_HEAD(&dentry->d_lru);
+	INIT_LIST_HEAD(&dentry->d_alias);
+}
+
 /**
  * d_alloc	-	allocate a dcache entry
  * @parent: parent of entry to allocate
@@ -907,8 +917,6 @@ struct dentry *d_alloc(struct dentry * p
 
 	atomic_set(&dentry->d_count, 1);
 	dentry->d_flags = DCACHE_UNHASHED;
-	spin_lock_init(&dentry->d_lock);
-	dentry->d_inode = NULL;
 	dentry->d_parent = NULL;
 	dentry->d_sb = NULL;
 	dentry->d_op = NULL;
@@ -918,9 +926,7 @@ struct dentry *d_alloc(struct dentry * p
 	dentry->d_cookie = NULL;
 #endif
 	INIT_HLIST_NODE(&dentry->d_hash);
-	INIT_LIST_HEAD(&dentry->d_lru);
 	INIT_LIST_HEAD(&dentry->d_subdirs);
-	INIT_LIST_HEAD(&dentry->d_alias);
 
 	if (parent) {
 		dentry->d_parent = dget(parent);
@@ -2148,14 +2154,10 @@ static void __init dcache_init(void)
 {
 	int loop;
 
-	/* 
-	 * A constructor could be added for stable state like the lists,
-	 * but it is probably not worth it because of the cache nature
-	 * of the dcache. 
-	 */
-	dentry_cache = KMEM_CACHE(dentry,
-		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
-	
+	dentry_cache = kmem_cache_create("dentry_cache", sizeof(struct dentry),
+		0, SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD,
+		dcache_ctor);
+
 	register_shrinker(&dcache_shrinker);
 
 	/* Hash may have been set up in dcache_init_early */

-- 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [patch 20/21] dentries: dentry defragmentation
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (18 preceding siblings ...)
  2008-05-10  3:08 ` [patch 19/21] dentries: Add constructor Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  3:08 ` [patch 21/21] slab defrag: Obsolete SLAB Christoph Lameter
  20 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, Alexander Viro, Christoph Hellwig, linux-fsdevel,
	Mel Gorman, andi, Rik van Riel, Pekka Enberg, mpm

[-- Attachment #1: 0032-dentries-dentry-defragmentation.patch --]
[-- Type: text/plain, Size: 4151 bytes --]

The dentry pruning for unused entries works in a straightforward way. It
could be made more aggressive by actually moving dentries instead
of just reclaiming them.

Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/dcache.c |  101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 100 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c	2008-04-29 14:11:14.601208368 -0700
+++ linux-2.6/fs/dcache.c	2008-04-29 14:11:14.621208582 -0700
@@ -31,6 +31,7 @@
 #include <linux/seqlock.h>
 #include <linux/swap.h>
 #include <linux/bootmem.h>
+#include <linux/backing-dev.h>
 #include "internal.h"
 
 
@@ -143,7 +144,10 @@ static struct dentry *d_kill(struct dent
 
 	list_del(&dentry->d_u.d_child);
 	dentry_stat.nr_dentry--;	/* For d_free, below */
-	/*drops the locks, at that point nobody can reach this dentry */
+	/*
+	 * drops the locks, at that point nobody (aside from defrag)
+	 * can reach this dentry
+	 */
 	dentry_iput(dentry);
 	parent = dentry->d_parent;
 	d_free(dentry);
@@ -2150,6 +2154,100 @@ static void __init dcache_init_early(voi
 		INIT_HLIST_HEAD(&dentry_hashtable[loop]);
 }
 
+/*
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *get_dentries(struct kmem_cache *s, int nr, void **v)
+{
+	struct dentry *dentry;
+	int i;
+
+	spin_lock(&dcache_lock);
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+
+		/*
+		 * Three sorts of dentries cannot be reclaimed:
+		 *
+		 * 1. dentries that are in the process of being allocated
+		 *    or being freed. In that case the dentry is neither
+		 *    on the LRU nor hashed.
+		 *
+		 * 2. Fake hashed entries as used for anonymous dentries
+		 *    and pipe I/O. The fake hashed entries have d_flags
+		 *    set to indicate a hashed entry. However, the
+		 *    d_hash field indicates that the entry is not hashed.
+		 *
+		 * 3. dentries that have a backing store that is not
+		 *    writable. This is true for tmpfs and other in-memory
+		 *    filesystems. Removing dentries from them would lose
+		 *    the dentries for good.
+		 */
+		if ((d_unhashed(dentry) && list_empty(&dentry->d_lru)) ||
+		   (!d_unhashed(dentry) && hlist_unhashed(&dentry->d_hash)) ||
+		   (dentry->d_inode &&
+		   !mapping_cap_writeback_dirty(dentry->d_inode->i_mapping)))
+			/* Ignore this dentry */
+			v[i] = NULL;
+		else
+			/* dget_locked will remove the dentry from the LRU */
+			dget_locked(dentry);
+	}
+	spin_unlock(&dcache_lock);
+	return NULL;
+}
+
+/*
+ * Slab has dropped all the locks. Get rid of the refcount obtained
+ * earlier and also free the object.
+ */
+static void kick_dentries(struct kmem_cache *s,
+				int nr, void **v, void *private)
+{
+	struct dentry *dentry;
+	int i;
+
+	/*
+	 * First invalidate the dentries without holding the dcache lock
+	 */
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+
+		if (dentry)
+			d_invalidate(dentry);
+	}
+
+	/*
+	 * If we are the last one holding a reference then the dentries can
+	 * be freed. We need the dcache_lock.
+	 */
+	spin_lock(&dcache_lock);
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+		if (!dentry)
+			continue;
+
+		spin_lock(&dentry->d_lock);
+		if (atomic_read(&dentry->d_count) > 1) {
+			spin_unlock(&dentry->d_lock);
+			spin_unlock(&dcache_lock);
+			dput(dentry);
+			spin_lock(&dcache_lock);
+			continue;
+		}
+
+		prune_one_dentry(dentry);
+	}
+	spin_unlock(&dcache_lock);
+
+	/*
+	 * dentries are freed using RCU so we need to wait until RCU
+	 * operations are complete.
+	 */
+	synchronize_rcu();
+}
+
 static void __init dcache_init(void)
 {
 	int loop;
@@ -2159,6 +2257,7 @@ static void __init dcache_init(void)
 		dcache_ctor);
 
 	register_shrinker(&dcache_shrinker);
+	kmem_cache_setup_defrag(dentry_cache, get_dentries, kick_dentries);
 
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)

-- 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
                   ` (19 preceding siblings ...)
  2008-05-10  3:08 ` [patch 20/21] dentries: dentry defragmentation Christoph Lameter
@ 2008-05-10  3:08 ` Christoph Lameter
  2008-05-10  9:53   ` Andi Kleen
  20 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-10  3:08 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-fsdevel, Mel Gorman, andi, Rik van Riel,
	Pekka Enberg, mpm

[-- Attachment #1: slab_experimental --]
[-- Type: text/plain, Size: 1954 bytes --]

Slab defragmentation introduces new functionality not supported by SLAB and
SLOB.

Make SLAB depend on EXPERIMENTAL, mark it as obsolete and note that
various functionality is not supported by SLAB.

Also update SLOB's description a bit to indicate that certain
functionality is limited by design.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 init/Kconfig |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig	2008-05-09 18:41:41.000000000 -0700
+++ linux-2.6/init/Kconfig	2008-05-09 18:46:13.000000000 -0700
@@ -749,12 +749,16 @@ choice
 	   This option allows to select a slab allocator.
 
 config SLAB
-	bool "SLAB"
+	bool "SLAB (Obsolete)"
+	depends on EXPERIMENTAL
 	help
-	  The regular slab allocator that is established and known to work
-	  well in all environments. It organizes cache hot objects in
-	  per cpu and per node queues. SLAB is the default choice for
-	  a slab allocator.
+	  The old slab allocator that is being replaced by SLUB.
+	  SLAB does not support slab defragmentation and has limited
+	  debugging support. There is no sysfs support for /sys/kernel/slab.
+	  SLAB requires order 1 allocations for some caches, which may fail
+	  under extreme circumstances. New general object debugging methods
+	  (such as kmemcheck) do not support SLAB. The code is complex,
+	  difficult to comprehend and has a history of subtle bugs.
 
 config SLUB
 	bool "SLUB (Unqueued Allocator)"
@@ -771,7 +775,10 @@ config SLOB
 	help
 	   SLOB replaces the stock allocator with a drastically simpler
 	   allocator. SLOB is generally more space efficient but
-	   does not perform as well on large systems.
+	   does not perform as well on large systems. SLOB's functionality
+	   is limited by design (no sysfs support, no defrag, no debugging,
+	   etc.).
+
 
 endchoice
 

-- 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 15/21] Filesystem: XFS slab defragmentation
  2008-05-10  3:08 ` [patch 15/21] Filesystem: XFS slab defragmentation Christoph Lameter
@ 2008-05-10  6:55   ` Christoph Hellwig
  0 siblings, 0 replies; 93+ messages in thread
From: Christoph Hellwig @ 2008-05-10  6:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, Pekka Enberg, mpm

On Fri, May 09, 2008 at 08:08:46PM -0700, Christoph Lameter wrote:
> +	kmem_cache_setup_defrag(xfs_vnode_zone, get_inodes, kick_inodes);

So you're exporting get_inodes and kick_inodes just to use them in always
the same way.  Much better to have a kmem_cache_set_inode_defrag helper
and keep them static in inode.c


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-10  3:08 ` [patch 21/21] slab defrag: Obsolete SLAB Christoph Lameter
@ 2008-05-10  9:53   ` Andi Kleen
  2008-05-11  2:15     ` Rik van Riel
  0 siblings, 1 reply; 93+ messages in thread
From: Andi Kleen @ 2008-05-10  9:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-fsdevel, Mel Gorman, Rik van Riel,
	Pekka Enberg, mpm

Christoph Lameter wrote:
> Slab defragmentation introduces new functionality not supported by SLAB and
> SLOB.
> 
> Make slab depend on EXPERIMENTAL and note its obsoleteness and that
> various functionality is not supported by SLAB.
> 
> Also update SLOB's description a bit to indicate that certain OS
> support is limited by design.

What about the TPC performance regressions? My understanding was that
slub still performed worse in the "object allocated on one CPU, freed on
the other CPU" type workloads due to less batching.

-Andi

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-10  9:53   ` Andi Kleen
@ 2008-05-11  2:15     ` Rik van Riel
  2008-05-12  7:38       ` KOSAKI Motohiro
  0 siblings, 1 reply; 93+ messages in thread
From: Rik van Riel @ 2008-05-11  2:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, akpm, linux-kernel, linux-fsdevel, Mel Gorman,
	Pekka Enberg, mpm

On Sat, 10 May 2008 11:53:30 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> Christoph Lameter wrote:
> > Slab defragmentation introduces new functionality not supported by SLAB and
> > SLOB.
> > 
> > Make slab depend on EXPERIMENTAL and note its obsoleteness and that
> > various functionality is not supported by SLAB.
> > 
> > Also update SLOB's description a bit to indicate that certain OS
> > support is limited by design.
> 
> What about the TPC performance regressions? My understanding was that
> slub still performed worse in the "object allocated on one CPU, freed on
> the other CPU" type workloads due to less batching.

Which can be the majority of object allocations and frees in some
workloads.  It definitely wants fixing.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-10  3:08 ` [patch 10/21] buffer heads: Support slab defrag Christoph Lameter
@ 2008-05-12  0:24   ` David Chinner
  2008-05-15 17:42     ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: David Chinner @ 2008-05-12  0:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, Pekka Enberg, mpm

On Fri, May 09, 2008 at 08:08:41PM -0700, Christoph Lameter wrote:
> Defragmentation support for buffer heads. We convert the references to
> buffers to struct page references and try to remove the buffers from
> those pages. If the pages are dirty then trigger writeout so that the
> buffer heads can be removed later.

Oh, no, please don't trigger more random single page writeback from
memory reclaim.  We should be killing the VM's use of ->writepage,
not encouraging it.

If you are going to clean bufferheads (or pages), please clean entire
mappings via ->writepages as it leads to far superior I/O patterns
and a far higher aggregate rate of page cleaning.....
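
Something like the following (untested sketch) would be a better model
than the single-page trigger_write():

	/* Push out the whole mapping instead of a single page */
	static void trigger_write_mapping(struct address_space *mapping)
	{
		struct writeback_control wbc = {
			.sync_mode	= WB_SYNC_NONE,
			.nr_to_write	= LONG_MAX,
			.range_start	= 0,
			.range_end	= LLONG_MAX,
		};

		do_writepages(mapping, &wbc);
	}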

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-11  2:15     ` Rik van Riel
@ 2008-05-12  7:38       ` KOSAKI Motohiro
  2008-05-12  7:54         ` Pekka Enberg
  0 siblings, 1 reply; 93+ messages in thread
From: KOSAKI Motohiro @ 2008-05-12  7:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andi Kleen, Christoph Lameter, akpm, linux-kernel, linux-fsdevel,
	Mel Gorman, Pekka Enberg, mpm

>> What about the TPC performance regressions? My understanding was that
>> slub still performed worse in the "object allocated on one CPU, freed on
>> the other CPU" type workloads due to less batching.
>
> Which can be the majority of object allocations and frees in some
> workloads.  It definitely wants fixing.

Agreed, that situation happens very frequently, IMHO.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-12  7:38       ` KOSAKI Motohiro
@ 2008-05-12  7:54         ` Pekka Enberg
  2008-05-12 10:08           ` Andi Kleen
  2008-05-14 17:29           ` Christoph Lameter
  0 siblings, 2 replies; 93+ messages in thread
From: Pekka Enberg @ 2008-05-12  7:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Andi Kleen, Christoph Lameter, akpm, linux-kernel,
	linux-fsdevel, Mel Gorman, mpm, Matthew Wilcox, Zhang, Yanmin

On Mon, May 12, 2008 at 10:38 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> > > What about the TPC performance regressions? My understanding was that
> > > slub still performed worse in the "object allocated on one CPU, freed on
> > > the other CPU" type workloads due to less batching.
> >
> > Which can be the majority of object allocations and frees in some
> > workloads.  It definitely wants fixing.
>
>  Agreed, that situation happens very frequently, IMHO.

Christoph fixed a tbench regression that was in the same ballpark as
the TPC regression reported by Matthew which is why we've asked the
Intel folks to re-test. But yeah, we're working on it.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-12  7:54         ` Pekka Enberg
@ 2008-05-12 10:08           ` Andi Kleen
  2008-05-12 10:23             ` Pekka Enberg
  2008-05-14 17:29           ` Christoph Lameter
  1 sibling, 1 reply; 93+ messages in thread
From: Andi Kleen @ 2008-05-12 10:08 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: KOSAKI Motohiro, Rik van Riel, Andi Kleen, Christoph Lameter,
	akpm, linux-kernel, linux-fsdevel, Mel Gorman, mpm,
	Matthew Wilcox, Zhang, Yanmin

> Christoph fixed a tbench regression that was in the same ballpark as
> the TPC regression reported by Matthew which is why we've asked the
> Intel folks to re-test. But yeah, we're working on it.

What are your plans to fix it?

-Andi

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-12 10:08           ` Andi Kleen
@ 2008-05-12 10:23             ` Pekka Enberg
  2008-05-14 17:30               ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Pekka Enberg @ 2008-05-12 10:23 UTC (permalink / raw)
  To: Andi Kleen
  Cc: KOSAKI Motohiro, Rik van Riel, Christoph Lameter, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, mpm, Matthew Wilcox,
	Zhang, Yanmin

On Mon, May 12, 2008 at 1:08 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > Christoph fixed a tbench regression that was in the same ballpark as
> > the TPC regression reported by Matthew which is why we've asked the
> > Intel folks to re-test. But yeah, we're working on it.
>
>  What are your plans to fix it?

I don't have a reproducible test and my boxes are so tiny the issues
probably won't show up anyway. So all I can do at this point is to
make sure Matthew et al can easily re-test whenever we fix some other
regression that potentially affects his workload.

I only recently started tracking this issue so I have no idea where
we're at with this. Christoph? Matthew?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 18/21] Filesystem: Socket inode defragmentation
  2008-05-10  3:08 ` [patch 18/21] Filesystem: Socket inode defragmentation Christoph Lameter
@ 2008-05-13 13:28   ` Evgeniy Polyakov
  2008-05-15 17:40     ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Evgeniy Polyakov @ 2008-05-13 13:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, netdev, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, Pekka Enberg, mpm

Hi Christoph.

On Fri, May 09, 2008 at 08:08:49PM -0700, Christoph Lameter (clameter@sgi.com) wrote:
> Support inode defragmentation for sockets

Out of curiosity, how can you drop a socket inode, since it is always
attached to a socket which is removed automatically when the connection
is closed? Any forced dropping of a socket inode can only result in a
connection drop, i.e. there are no inodes sitting in the cache, not yet
freed, that do not have a socket attached.

So question is how does it work for sockets?

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-12  7:54         ` Pekka Enberg
  2008-05-12 10:08           ` Andi Kleen
@ 2008-05-14 17:29           ` Christoph Lameter
  2008-05-14 17:49             ` Andi Kleen
  1 sibling, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-14 17:29 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: KOSAKI Motohiro, Rik van Riel, Andi Kleen, akpm, linux-kernel,
	linux-fsdevel, Mel Gorman, mpm, Matthew Wilcox, Zhang, Yanmin

On Mon, 12 May 2008, Pekka Enberg wrote:

> Christoph fixed a tbench regression that was in the same ballpark as
> the TPC regression reported by Matthew which is why we've asked the
> Intel folks to re-test. But yeah, we're working on it.

I suspect that the TPC regression was due to the page allocator order 0 
inefficiencies like the tbench regression but we have no data yet to 
establish that.

Fundamentally there is no way to avoid complex queueing on free() unless 
one directly frees the object. This is serialized in SLUB by taking a page 
lock. If we can establish that the object is from the current cpu slab 
then no lock is taken because the slab is reserved for the current 
processor. So the bad case is a free of an object with a long life span or
an object freed on a remote processor.

However, the "slow" case in SLUB is still much less complex
than comparable processing in SLAB. It is quite fast.

SLAB freeing can avoid taking a lock if

1. We can establish that the object is node local (trivial if !NUMA;
otherwise we need to get the node information from the page struct and
compare it to the current node).

2. There is space in the per cpu queue.

If the object is *not* node local then we have to take an alien lock for
the remote node in order to put the object in an alien queue. That is much
less efficient than the SLUB case. SLAB then needs to run the cache reaper
to expire these objects into the remote nodes' queues (later the cache
reaper may then actually free these objects). This management overhead
does not exist in SLUB. The cache reaper makes processors unavailable for
short time frames (the reaper scans through all slab caches!), which in
turn causes regressions in applications that need to respond in a short
time frame (HPC apps, network applications that are timing critical).
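
As a sketch of the SLAB side (again simplified; the queue helpers are
illustrative stand-ins, not the real mm/slab.c code):

	void slab_free_sketch(struct kmem_cache *cachep, void *objp)
	{
		int node = page_to_nid(virt_to_page(objp));

		if (node != numa_node_id()) {
			/* Off node: take the alien lock and park the object
			 * for the remote node; the cache reaper has to
			 * drain it later. */
			spin_lock(&alien_queue(cachep, node)->lock);
			alien_enqueue(cachep, node, objp);
			spin_unlock(&alien_queue(cachep, node)->lock);
			return;
		}

		/* Node local: lockless if the per cpu queue has room. */
		if (cpu_queue(cachep)->avail < cpu_queue(cachep)->limit)
			cpu_queue(cachep)->entry[cpu_queue(cachep)->avail++] = objp;
		else
			flush_cpu_queue(cachep);	/* takes the node's list_lock */
	}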

Note that the lock granularity in SLUB is finer than the locks in SLAB.
SLUB can, for example, concurrently free multiple objects to the same
remote node. If the objects belong to different slabs then there is no
dirtying of any shared cachelines.

The main issue for SLAB vs. SLUB on free is likely the !NUMA case in which 
SLAB can avoid the overhead of the node check (which does not exist in 
SLUB) and in which case we can always immediately batch the object (if 
there is space). The additional overhead in SLUB is mainly one 
atomic instruction over the SLAB fastpath.

So I think that the free path needs to stay as is. The disadvantages in
terms of the complexity of handling the objects and expiring them and the
issue of having to take per node locks in SLAB make it hard to justify
adding a queue for free in SLUB. Maybe someone has an inspiration on how
to do this effectively that is better than my attempts, which always
ultimately ended up implementing code that had the same issues that we
have in SLAB.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-12 10:23             ` Pekka Enberg
@ 2008-05-14 17:30               ` Christoph Lameter
  0 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-14 17:30 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andi Kleen, KOSAKI Motohiro, Rik van Riel, akpm, linux-kernel,
	linux-fsdevel, Mel Gorman, mpm, Matthew Wilcox, Zhang, Yanmin

On Mon, 12 May 2008, Pekka Enberg wrote:

> I only recently started tracking this issue so I have no idea where
> we're at with this. Christoph? Matthew?

I suspect that this is the same issue as tbench. I have explained the SLAB 
vs. SLUB free situation in another email in this thread.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 17:29           ` Christoph Lameter
@ 2008-05-14 17:49             ` Andi Kleen
  2008-05-14 18:03               ` Christoph Lameter
  2008-05-14 18:05               ` Christoph Lameter
  0 siblings, 2 replies; 93+ messages in thread
From: Andi Kleen @ 2008-05-14 17:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm, linux-kernel,
	linux-fsdevel, Mel Gorman, mpm, Matthew Wilcox, Zhang, Yanmin

Christoph Lameter wrote:

> Fundamentally there is no way to avoid complex queueing on free() unless 
> one directly frees the object. This is serialized in SLUB by taking a page 
> lock. 

iirc profiling analysis showed that the problem was the page lock
serialization (in particular the slab_lock() in __slab_free). That
was on 2.6.24.2

> However, the "slow" case in SLUB is still much less complex
> than comparable processing in SLAB. It is quite fast.

Well in the benchmark it is slower.


> SLAB freeing can avoid taking a lock if
> 
> 1. We can establish that the object is node local (trivial if !NUMA;
> otherwise we need to get the node information from the page struct and
> compare it to the current node).

Ignoring NUMA is not an option, unfortunately. And with integrated memory
controllers many of the remote CPU frees are off node.

> The main issue for SLAB vs. SLUB on free is likely the !NUMA case in which 
> SLAB can avoid the overhead of the node check (which does not exist in 
> SLUB) and in which case we can always immediately batch the object (if 
> there is space). The additional overhead in SLUB is mainly one 
> atomic instruction over the SLAB fastpath.

I think the problem is that this atomic operation thrashes cache lines
around. Really counting cycles on instructions is not that interesting,
but minimizing the cache thrashing is. And for that it looks like slub
is worse.

> So I think that the free path needs to stay as is. The disadvantages in
> terms of the complexity of handling the objects and expiring them and the
> issue of having to take per node locks in SLAB make it hard to justify
> adding a queue for free in SLUB. Maybe someone has an inspiration on how
> to do this effectively that is better than my attempts, which always
> ultimately ended up implementing code that had the same issues that we
> have in SLAB.

What is the big problem of having a batched free queue? If the expiry
is done at a good bounded time (e.g. on interrupt exit or similar)
locally on the CPU it shouldn't be a big issue, should it?

-Andi

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 17:49             ` Andi Kleen
@ 2008-05-14 18:03               ` Christoph Lameter
  2008-05-14 18:18                 ` Matt Mackall
  2008-05-15  3:26                 ` Zhang, Yanmin
  2008-05-14 18:05               ` Christoph Lameter
  1 sibling, 2 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-14 18:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm, linux-kernel,
	linux-fsdevel, Mel Gorman, mpm, Matthew Wilcox, Zhang, Yanmin

On Wed, 14 May 2008, Andi Kleen wrote:

> iirc profiling analysis showed that the problem was the page lock
> serialization (in particular the slab_lock() in __slab_free). That
> was on 2.6.24.2

Do you have a URL?

> I think the problem is that this atomic operation thrashes cache lines
> around. Really counting cycles on instructions is not that interesting,
> but minimizing the cache thrashing is. And for that it looks like slub
> is worse.

It can thrash cachelines if objects from the same slab page are freed 
simultaneously on multiple processors. That occurred in the hackbench 
regression that we addressed with the dynamic configuration of slab sizes.

However, typically long lived objects freed from multiple processors 
belong to different slab caches.

> > So I think that the free path needs to stay as is. The disadvantages in
> > terms of the complexity of handling the objects and expiring them and the
> > issue of having to take per node locks in SLAB make it hard to justify
> > adding a queue for free in SLUB. Maybe someone has an inspiration on how
> > to do this effectively that is better than my attempts, which always
> > ultimately ended up implementing code that had the same issues that we
> > have in SLAB.
> 
> What is the big problem of having a batched free queue? If the expiry
> is done at a good bounded time (e.g. on interrupt exit or similar)
> locally on the CPU it shouldn't be a big issue, should it?

Interrupt exit in general would have to inspect the per cpu structures of 
all slab caches on the system?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 17:49             ` Andi Kleen
  2008-05-14 18:03               ` Christoph Lameter
@ 2008-05-14 18:05               ` Christoph Lameter
  2008-05-14 20:46                 ` Christoph Lameter
  1 sibling, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-14 18:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm, linux-kernel,
	linux-fsdevel, Mel Gorman, mpm, Matthew Wilcox, Zhang, Yanmin

On Wed, 14 May 2008, Andi Kleen wrote:

> Ignoring NUMA is not an option, unfortunately. And with integrated memory
> controllers many of the remote CPU frees are off node.

The issue of object expiration holdoffs also affects applications 
running on pure SMP systems.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 18:03               ` Christoph Lameter
@ 2008-05-14 18:18                 ` Matt Mackall
  2008-05-14 19:21                   ` Christoph Lameter
  2008-05-15  3:26                 ` Zhang, Yanmin
  1 sibling, 1 reply; 93+ messages in thread
From: Matt Mackall @ 2008-05-14 18:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, Matthew Wilcox, Zhang,
	Yanmin


On Wed, 2008-05-14 at 11:03 -0700, Christoph Lameter wrote:
> > What is the big problem of having a batched free queue? If the expiry
> > is done at a good bounded time (e.g. on interrupt exit or similar)
> > locally on the CPU it shouldn't be a big issue, should it?
> 
> Interrupt exit in general would have to inspect the per cpu structures of 
> all slab caches on the system?

Why's that? When we're not under pressure (fast path), we can delay (and
batch) remote frees. When we are under pressure (slow path), we can do
everything immediately.

-- 
Mathematics is the supreme nostalgia of our time.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 18:18                 ` Matt Mackall
@ 2008-05-14 19:21                   ` Christoph Lameter
  2008-05-14 19:49                     ` Matt Mackall
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-14 19:21 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, Matthew Wilcox, Zhang,
	Yanmin

On Wed, 14 May 2008, Matt Mackall wrote:

> > Interrupt exit in general would have to inspect the per cpu structures of 
> > all slab caches on the system?
> 
> Why's that? When we're not under pressure (fast path), we can delay (and
> batch) remote frees. When we are under pressure (slow path), we can do
> everything immediately.

Fastpath is what? I guess slow path means we called into the page 
allocator?


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 19:21                   ` Christoph Lameter
@ 2008-05-14 19:49                     ` Matt Mackall
  2008-05-14 20:33                       ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Matt Mackall @ 2008-05-14 19:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, Matthew Wilcox, Zhang,
	Yanmin


On Wed, 2008-05-14 at 12:21 -0700, Christoph Lameter wrote:
> On Wed, 14 May 2008, Matt Mackall wrote:
> 
> > > Interrupt exit in general would have to inspect the per cpu structures of 
> > > all slab caches on the system?
> > 
> > Why's that? When we're not under pressure (fast path), we can delay (and
> > batch) remote frees. When we are under pressure (slow path), we can do
> > everything immediately.
> 
> Fastpath is what? I guess slow path means we called into the page 
> allocator?

No, slow path here means we're already under memory pressure, so we
don't care if something takes longer if it saves memory.

-- 
Mathematics is the supreme nostalgia of our time.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 19:49                     ` Matt Mackall
@ 2008-05-14 20:33                       ` Christoph Lameter
  2008-05-14 21:02                         ` Matt Mackall
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-14 20:33 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, Matthew Wilcox, Zhang,
	Yanmin

On Wed, 14 May 2008, Matt Mackall wrote:

> > allocator?
> 
> No, slow path here means we're already under memory pressure, so we
> don't care if something takes longer if it saves memory.

So expire the queues under memory pressure only? Trigger queue cleanup 
from reclaim?



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 18:05               ` Christoph Lameter
@ 2008-05-14 20:46                 ` Christoph Lameter
  2008-05-14 20:58                   ` Matthew Wilcox
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-14 20:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm, linux-kernel,
	linux-fsdevel, Mel Gorman, mpm, Matthew Wilcox, Zhang, Yanmin

Some more on SMP scaling:

There is also the issue in SLAB that global locks (SMP case) need to be 
taken for a pretty long timeframe. With a sufficiently high allocation 
frequency from multiple processors you can cause lock contention on the 
list_lock that will then degrade performance.

SLUB does not take global locks for continued allocations. Global locks 
are taken for a short time frame if the partial lists need to be updated 
(which is avoided as much as possible with various measures). This can 
yield orders of magnitude higher performance.

The above is possible because of locking at the page level. A queue must 
either be processor specific or global (or per node) and then would 
need locks.

Another issue is storage density. SLAB needs a metadata structure that is
placed either in the slab page itself or in a separate slab cache. In some
cases this is advantageous over SLUB (f.e. a series of pointers to objects
exists in a single cacheline, so allocation of objects that are not
immediately used could be faster). In other cases it is not: it increases
the cache footprint, requires touching two slab caches if the metadata is
off slab, makes alignment of objects in the slab pages difficult and
increases memory overhead. SLUB is generally faster if the object is/was
immediately used, since the freepointer overlays the data and thus the
cacheline is hot both on alloc and free.
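
The freepointer overlay can be pictured like this (a conceptual sketch in
the spirit of the mm/slub.c accessors, where s->offset is the cache's
freepointer offset):

	/* A free object's own memory holds the pointer to the next free
	 * object, so no external metadata structure is needed and the
	 * cacheline touched on free is the object itself. */
	static inline void *get_freepointer(struct kmem_cache *s, void *object)
	{
		return *(void **)((char *)object + s->offset);
	}

	static inline void set_freepointer(struct kmem_cache *s, void *object,
						void *fp)
	{
		*(void **)((char *)object + s->offset) = fp;
	}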

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 20:46                 ` Christoph Lameter
@ 2008-05-14 20:58                   ` Matthew Wilcox
  2008-05-14 21:00                     ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Matthew Wilcox @ 2008-05-14 20:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, mpm, Zhang, Yanmin

On Wed, May 14, 2008 at 01:46:52PM -0700, Christoph Lameter wrote:
> Some more on SMP scaling:

These are all great theories, and you mentioned that you'd fixed the
regressions with tbench, but did you fix the regression with the io-gen
program I sent you?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 20:58                   ` Matthew Wilcox
@ 2008-05-14 21:00                     ` Christoph Lameter
  2008-05-14 21:21                       ` Matthew Wilcox
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-14 21:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, mpm, Zhang, Yanmin

On Wed, 14 May 2008, Matthew Wilcox wrote:

> On Wed, May 14, 2008 at 01:46:52PM -0700, Christoph Lameter wrote:
> > Some more on SMP scaling:
> 
> These are all great theories, and you mentioned that you'd fixed the
> regressions with tbench, but did you fix the regression with the io-gen
> program I sent you?

No. I thought you were satisfied with the performance increase you saw 
when pinning the process to a single processor?


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 20:33                       ` Christoph Lameter
@ 2008-05-14 21:02                         ` Matt Mackall
  2008-05-14 21:26                           ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Matt Mackall @ 2008-05-14 21:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, Matthew Wilcox, Zhang,
	Yanmin


On Wed, 2008-05-14 at 13:33 -0700, Christoph Lameter wrote:
> On Wed, 14 May 2008, Matt Mackall wrote:
> 
> > > allocator?
> > 
> > No, slow path here means we're already under memory pressure, so we
> > don't care if something takes longer if it saves memory.
> 
> So expire the queues under memory pressure only? Trigger queue cleanup 
> from reclaim?

This wouldn't be my first thought. The batch size could be potentially
huge and we'd have to worry about latency issues.

But here are some other thoughts:

First, we should obviously always expire all queues when we hit low
water marks as it'll be cheaper/faster than other forms of reclaim.

Second, if our queues were per-slab (this might be hard, I realize), we
can sweep the queue at alloc time.

We can also sweep before falling back to the page allocator. That should
guarantee that delayed frees don't negatively impact fragmentation.

And lastly, we can always have a periodic thread/timer/workqueue
operation.

So far this is a bunch of hand-waving but I think this ends up basically
being an anti-magazine. A magazine puts a per-cpu queue on the alloc
side which costs on both the alloc and free side, regardless of whether
the workload demands it. This puts a per-cpu queue on the free side that
we can bypass in the cache-friendly case. I think that's a step in the
right direction.
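
To make the hand-waving a little more concrete, here is a minimal sketch
of such a free-side queue (object_is_local(), free_direct() and
drain_batch() are hypothetical helpers standing in for whatever the
allocator would actually use):

	#define FREE_QUEUE_SIZE 64

	struct free_queue {
		void *obj[FREE_QUEUE_SIZE];
		int count;
	};

	/* One queue per cpu. Remote frees are parked here and drained in
	 * batches (at alloc time, at low water marks, or before falling
	 * back to the page allocator) instead of paying a lock or atomic
	 * round trip on every single free. */
	void queued_free(struct free_queue *q, void *obj)
	{
		if (object_is_local(obj)) {
			/* cache-friendly case: bypass the queue entirely */
			free_direct(obj);
			return;
		}
		q->obj[q->count++] = obj;
		if (q->count == FREE_QUEUE_SIZE)
			drain_batch(q);	/* one round trip for the whole batch */
	}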

-- 
Mathematics is the supreme nostalgia of our time.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 21:00                     ` Christoph Lameter
@ 2008-05-14 21:21                       ` Matthew Wilcox
  2008-05-14 21:33                         ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Matthew Wilcox @ 2008-05-14 21:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, mpm, Zhang, Yanmin

On Wed, May 14, 2008 at 02:00:15PM -0700, Christoph Lameter wrote:
> On Wed, 14 May 2008, Matthew Wilcox wrote:
> 
> > On Wed, May 14, 2008 at 01:46:52PM -0700, Christoph Lameter wrote:
> > > Some more on SMP scaling:
> > 
> > These are all great theories, and you mentioned that you'd fixed the
> > regressions with tbench, but did you fix the regression with the io-gen
> > program I sent you?
> 
> No. I thought you were satisfied with the performance increase you saw 
> when pinning the process to a single processor?

Er, no.  That program emulates a TPC-C run from the point of view of
doing as much IO as possible from all CPUs.  Pinning the process to one
CPU would miss the point somewhat.

I seem to remember telling you that you might get more realistic
performance numbers by pinning the scsi_ram_0 kernel thread to a single
CPU (ie emulating an interrupt tied to one CPU rather than letting the
scheduler choose to run the thread on the 'best' CPU).

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 21:02                         ` Matt Mackall
@ 2008-05-14 21:26                           ` Christoph Lameter
  2008-05-14 21:54                             ` Matt Mackall
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-14 21:26 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, Matthew Wilcox, Zhang,
	Yanmin

On Wed, 14 May 2008, Matt Mackall wrote:

> First, we should obviously always expire all queues when we hit low
> water marks as it'll be cheaper/faster than other forms of reclaim.

Hmmm... I tried a scheme like that a while back but it did not improve
performance. The cost of queueing the object degraded the fast path (note
that SLUB object queueing is fundamentally different since there is no
in-slab metadata structure).

> Second, if our queues were per-slab (this might be hard, I realize), we
> can sweep the queue at alloc time.

In that case we dirty the same cacheline on which we also need to take the
page lock. I wonder if there would be any difference? The freelist is
essentially a kind of per-page queue (as pointed out by Ingo in the past).

> We can also sweep before falling back to the page allocator. That should
> guarantee that delayed frees don't negatively impact fragmentation.

That would introduce additional complexity for the NUMA case because now 
we would need to distinguish between the nodes that these objects came 
from. So we would have to scan the queue and classify the objects? Or 
determine the object node when queueing them and put them into a remote 
node queue? Sounds similar to all the trouble that we ended up with 
in SLAB.

> And lastly, we can always have a periodic thread/timer/workqueue
> operation.

I have had enough trouble in recent years with the 2-second hiccups that
come with SLAB, which affect timing-sensitive operations between
processors in an SMP configuration and also cause trouble for applications
that require low network latencies. I'd rather avoid that.
 
> So far this is a bunch of hand-waving but I think this ends up basically
> being an anti-magazine. A magazine puts a per-cpu queue on the alloc
> side which costs on both the alloc and free side, regardless of whether
> the workload demands it. This puts a per-cpu queue on the free side that
> we can bypass in the cache-friendly case. I think that's a step in the
> right direction.

I think that if you want queues for an SMP-only system, do not care too
much about memory use, don't do any frequent allocations on multicore
systems and can tolerate the hiccups because your application does not
care (most enterprise apps are constructed that way), or if you are
running benchmarks that only access a limited dataset that fits into
SLAB's queues and avoid touching the contents of objects, then the SLAB
concept is the right way to go.

If we stripped the NUMA stuff out and made it an SMP-only allocator for
enterprise apps then the code might become much smaller and simpler. I
guess Arjan suggested something similar in the past. But that would result
in SLAB no longer being a general allocator.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 21:21                       ` Matthew Wilcox
@ 2008-05-14 21:33                         ` Christoph Lameter
  2008-05-14 21:43                           ` Matthew Wilcox
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-14 21:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, mpm, Zhang, Yanmin

On Wed, 14 May 2008, Matthew Wilcox wrote:

> > No. I thought you were satisfied with the performance increase you saw 
> > when pinning the process to a single processor?
> 
> Er, no.  That program emulates a TPC-C run from the point of view of
> doing as much IO as possible from all CPUs.  Pinning the process to one
> CPU would miss the point somewhat.

Oh. The last message I got was an enthusiastic report on the performance
gains you saw by pinning the process after we looked at slub statistics
that showed that the behavior of the tests was different from your
expectations. I got messages here that indicated that this was a scsi
testing program that you had under development. And yes, we saw the remote
freeing degradations there.

> I seem to remember telling you that you might get more realistic
> performance numbers by pinning the scsi_ram_0 kernel thread to a single
> CPU (ie emulating an interrupt tied to one CPU rather than letting the
> scheduler choose to run the thread on the 'best' CPU).

If this is a stand-in for the TPC test then why did you not point that
out when Pekka and I recently asked you to retest some configurations?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 21:33                         ` Christoph Lameter
@ 2008-05-14 21:43                           ` Matthew Wilcox
  2008-05-14 21:53                             ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Matthew Wilcox @ 2008-05-14 21:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, mpm, Zhang, Yanmin

On Wed, May 14, 2008 at 02:33:11PM -0700, Christoph Lameter wrote:
> On Wed, 14 May 2008, Matthew Wilcox wrote:
> 
> > > No. I thought you were satisfied with the performance increase you saw 
> > > when pinning the process to a single processor?
> > 
> > Er, no.  That program emulates a TPC-C run from the point of view of
> > doing as much IO as possible from all CPUs.  Pinning the process to one
> > CPU would miss the point somewhat.
> 
> Oh. The last message I got was an enthusiastic report on the performance
> gains you saw by pinning the process after we looked at slub statistics
> that showed that the behavior of the tests was different from your
> expectations. I got messages here that indicated that this was a scsi
> testing program that you had under development. And yes, we saw the remote
> freeing degradations there.

What I said was:

: I've also been playing around with locking the scsi_ram_0 thread to
: one CPU and it has a huge effect on the numbers.

: So we can see that scsi_ram_0 is clearly wandering between the two
: CPUs normally; it takes up a significant (3 seconds ~= 7-8%) of the
: execution time, and that locking it to one CPU (which interrupts tend
: to be) improves the number of ops per second ... even of the CPU which
: is forced to take all the extra work of running it!

Note the complete lack of comparison between slub and slab here!  As far
as I know, slub still loses against slab by a few % -- but I haven't
finished running a comparison with -rc2 yet.

> > I seem to remember telling you that you might get more realistic
> > performance numbers by pinning the scsi_ram_0 kernel thread to a single
> > CPU (ie emulating an interrupt tied to one CPU rather than letting the
> > scheduler choose to run the thread on the 'best' CPU).
> 
> If this is a stand-in for the TPC test then why did you not point that
> out when Pekka and I recently asked you to retest some configurations?

I thought you'd already run this test and were asking for the results of
this to be validated against a real TPC run.

I'm rather annoyed by this.  You demand a test-case to reproduce the
problem and then when I come up with one, you ignore it!

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 21:43                           ` Matthew Wilcox
@ 2008-05-14 21:53                             ` Christoph Lameter
  2008-05-14 22:00                               ` Matthew Wilcox
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-14 21:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, mpm, Zhang, Yanmin

On Wed, 14 May 2008, Matthew Wilcox wrote:

> > Oh. The last message I got was an enthusiastic report on the performance
> > gains you saw by pinning the process after we looked at slub statistics
> > that showed that the behavior of the tests was different from your
> > expectations. I got messages here that indicated that this was a scsi
> > testing program that you had under development. And yes, we saw the remote
> > freeing degradations there.
> 
> What I said was:
> 
> : I've also been playing around with locking the scsi_ram_0 thread to
> : one CPU and it has a huge effect on the numbers.
> 
> : So we can see that scsi_ram_0 is clearly wandering between the two
> : CPUs normally; it takes up a significant (3 seconds ~= 7-8%) of the
> : execution time, and that locking it to one CPU (which interrupts tend
> : to be) improves the number of ops per second ... even of the CPU which
> : is forced to take all the extra work of running it!


The last message that I got on March 31st said:

> I have a version below which tries to start the tasks at a similar time
> by using pause() and then signalling to wake all the tasks up.  I don't
> know a better way to start threads simultaneously ... maybe MAP_SHARED a
> file and write to it in one task while spinning in the other tasks
> waiting for it to change value?

> I've also been playing around with locking the scsi_ram_0 thread to one
> CPU and it has a huge effect on the numbers.

This indicated to me that you were still developing a test here and 
discovered some startling things.

> Note the complete lack of comparison between slub and slab here!  As far
> as I know, slub still loses against slab by a few % -- but I haven't
> finished running a comparison with -rc2 yet.

Indeed remote frees are slightly slower in some situations. Don't really
dispute that. I am just not sure that the TPC test is really suffering
from that symptom. I thought for a long time that the tbench regression
was due to a similar effect too, until I got down to it.

> I thought you'd already run this test and were asking for the results of
> this to be validated against a real TPC run.

AFAICT the last state was that you were tinkering around with a test.

> I'm rather annoyed by this.  You demand a test-case to reproduce the
> problem and then when I come up with one, you ignore it!

Ignore it? That is a pretty strange statement given that I helped you
analyze the behavior of your test and understand what was going on in the
system.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 21:26                           ` Christoph Lameter
@ 2008-05-14 21:54                             ` Matt Mackall
  2008-05-15 17:15                               ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Matt Mackall @ 2008-05-14 21:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, Matthew Wilcox, Zhang,
	Yanmin


On Wed, 2008-05-14 at 14:26 -0700, Christoph Lameter wrote:
> > So far this is a bunch of hand-waving but I think this ends up basically
> > being an anti-magazine. A magazine puts a per-cpu queue on the alloc
> > side which costs on both the alloc and free side, regardless of whether
> > the workload demands it. This puts a per-cpu queue on the free side that
> > we can bypass in the cache-friendly case. I think that's a step in the
> > right direction.
> 
> I think that if you want queues for an SMP-only system, do not care too
> much about memory use, don't do any frequent allocations on multicore
> systems and can tolerate the hiccups because your application does not
> care (most enterprise apps are constructed that way), or if you are
> running benchmarks that only access a limited dataset that fits into
> SLAB's queues and avoid touching the contents of objects, then the SLAB
> concept is the right way to go.

> If we stripped the NUMA stuff out and made it an SMP-only allocator for
> enterprise apps then the code might become much smaller and simpler. I
> guess Arjan suggested something similar in the past. But that would result
> in SLAB no longer being a general allocator.

What does this have to do with anything? I'm not talking about going
back to SLAB. I'm talking about plugging the use cases where SLUB
currently loses to SLAB. That's what has to happen before SLAB can be
obsoleted.

I'll certainly grant you that queueing might not break even.

-- 
Mathematics is the supreme nostalgia of our time.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 21:53                             ` Christoph Lameter
@ 2008-05-14 22:00                               ` Matthew Wilcox
  2008-05-14 22:32                                 ` Christoph Lameter
  2008-05-14 22:34                                 ` Christoph Lameter
  0 siblings, 2 replies; 93+ messages in thread
From: Matthew Wilcox @ 2008-05-14 22:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, mpm, Zhang, Yanmin

On Wed, May 14, 2008 at 02:53:57PM -0700, Christoph Lameter wrote:
> > Note the complete lack of comparison between slub and slab here!  As far
> > as I know, slub still loses against slab by a few % -- but I haven't
> > finished running a comparison with -rc2 yet.
> 
> Indeed remote frees are slightly slower in some situations. Don't really
> dispute that. I am just not sure that the TPC test is really suffering
> from that symptom. I thought for a long time that the tbench regression
> was due to a similar effect too, until I got down to it.

Since there's no way we've found to date to get the TPC test to you,
how about we settle for analysing _this_ testcase which did show a
significant performance degradation for slub?

I don't think it's an unreasonable testcase either -- effectively it's
allocating memory on all CPUs and then freeing it all on one.  If that's
a worst-case scenario for slub, then slub isn't suitable for replacing
slab yet.
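
For illustration, the pattern can be skeletonized in user space roughly as
follows (a hypothetical model of the allocate-everywhere/free-on-one
pattern only, not the actual io-gen program, which drives IO through a
ram-backed scsi device; CPU pinning is omitted for brevity):

	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>

	#define NTHREADS 8
	#define NOBJS    10000

	static void *objs[NTHREADS][NOBJS];

	/* Each thread allocates its own set of objects... */
	static void *alloc_worker(void *arg)
	{
		long id = (long)arg;
		for (int i = 0; i < NOBJS; i++)
			objs[id][i] = malloc(256);
		return NULL;
	}

	int main(void)
	{
		pthread_t t[NTHREADS];

		for (long i = 0; i < NTHREADS; i++)
			pthread_create(&t[i], NULL, alloc_worker, (void *)i);
		for (int i = 0; i < NTHREADS; i++)
			pthread_join(t[i], NULL);

		/* ...and everything is freed from a single thread; in the
		 * kernel analogue every one of these is a remote free. */
		for (int i = 0; i < NTHREADS; i++)
			for (int j = 0; j < NOBJS; j++)
				free(objs[i][j]);

		puts("done");
		return 0;
	}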

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 22:00                               ` Matthew Wilcox
@ 2008-05-14 22:32                                 ` Christoph Lameter
  2008-05-14 22:34                                 ` Christoph Lameter
  1 sibling, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-14 22:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, mpm, Zhang, Yanmin

On Wed, 14 May 2008, Matthew Wilcox wrote:

> Since there's no way we've found to date to get the TPC test to you,
> how about we settle for analysing _this_ testcase which did show a
> significant performance degradation for slub?
> 
> I don't think it's an unreasonable testcase either -- effectively it's
> allocating memory on all CPUs and then freeing it all on one.  If that's
> a worst-case scenario for slub, then slub isn't suitable for replacing
> slab yet.

Indeed that is a worst-case scenario due to the finer-grained locking. The
opposite side of that is that fast concurrent freeing of objects from two
processors will have higher performance in slub since there is
significantly less global lock contention and less work with expiring
objects and moving them around. (If you hit the queue limits then SLAB
will do synchronous merging of objects into slabs; it is then no longer
able to hide the object handling overhead in cache_reap().)


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 22:00                               ` Matthew Wilcox
  2008-05-14 22:32                                 ` Christoph Lameter
@ 2008-05-14 22:34                                 ` Christoph Lameter
  1 sibling, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-14 22:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, mpm, Zhang, Yanmin

On Wed, 14 May 2008, Matthew Wilcox wrote:

> Since there's no way we've found to date to get the TPC test to you,
> how about we settle for analysing _this_ testcase which did show a
> significant performance degradation for slub?

Could I get the latest version of the test? Or was the March 31st version 
the latest? No later changes?


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 18:03               ` Christoph Lameter
  2008-05-14 18:18                 ` Matt Mackall
@ 2008-05-15  3:26                 ` Zhang, Yanmin
  2008-05-15 17:05                   ` Christoph Lameter
  1 sibling, 1 reply; 93+ messages in thread
From: Zhang, Yanmin @ 2008-05-15  3:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, mpm, Matthew Wilcox


On Wed, 2008-05-14 at 11:03 -0700, Christoph Lameter wrote:
> On Wed, 14 May 2008, Andi Kleen wrote:
> 
> > iirc profiling analysis showed that the problem was the page lock
> > serialization (in particular the slab_lock() in __slab_free). That
> > was on 2.6.24.2
> 
> Do you have a URL?
> 
> > I think the problem is that this atomic operation thrashes cache lines
> > around. Really counting cycles on instructions is not that interesting,
> > but minimizing the cache thrashing is. And for that it looks like slub
> > is worse.
> 
> It can thrash cachelines if objects from the same slab page are freed 
> simultaneously on multiple processors. That occurred in the hackbench 
> regression that we addressed with the dynamic configuration of slab sizes.
The hackbench regression is because of slow allocation instead of slow freeing.
With dynamic configuration of slab sizes, fast allocation becomes 97% (the bad
one is 68%), but fast free is always 8~9% with or without the patch.



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15  3:26                 ` Zhang, Yanmin
@ 2008-05-15 17:05                   ` Christoph Lameter
  2008-05-15 17:49                     ` Matthew Wilcox
  2008-05-16  5:16                     ` Zhang, Yanmin
  0 siblings, 2 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-15 17:05 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, mpm, Matthew Wilcox

On Thu, 15 May 2008, Zhang, Yanmin wrote:

> > It can thrash cachelines if objects from the same slab page are freed 
> > simultaneously on multiple processors. That occurred in the hackbench 
> > regression that we addressed with the dynamic configuration of slab sizes.
> The hackbench regression is because of slow allocation instead of slow freeing.
> With dynamic configuration of slab sizes, fast allocation becomes 97% (the bad
> one is 68%), but fast free is always 8~9% with or without the patch.

Thanks for using the slab statistics. I wish I had these numbers for the 
TPC benchmark. That would allow us to understand what is going on while it 
is running.

The frees in the hackbench were slow because partial list updates occurred
too frequently. The first fix was to let slabs sit longer on the partial
list. The other was the increase of the slab sizes, which also increases
the per cpu slab size and therefore the number of objects allocatable
without a round trip to the page allocator. Freeing to a per cpu slab never
requires partial list updates. So the frees also benefitted from the larger
slab sizes. But the effect shows up in the count of partial list updates,
not in the fast/free column.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-14 21:54                             ` Matt Mackall
@ 2008-05-15 17:15                               ` Christoph Lameter
  0 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-15 17:15 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, Matthew Wilcox, Zhang,
	Yanmin

On Wed, 14 May 2008, Matt Mackall wrote:

> > If we stripped the NUMA stuff out and made it an SMP-only allocator for
> > enterprise apps then the code might become much smaller and simpler. I
> > guess Arjan suggested something similar in the past. But that would
> > result in SLAB no longer being a general allocator.
> 
> What does this have to do with anything? I'm not talking about going
> back to SLAB. I'm talking about plugging the use cases where SLUB
> currently loses to SLAB. That's what has to happen before SLAB can be
> obsoleted.

Both allocators have different designs, which leads to different behavior.
I do not think the expectation that one must always best the other is
reasonable or even achievable.

I'd be glad if we had some means of increasing the performance in the 
currently known cases where remote slab free becomes an issue by avoiding 
the atomic op.

AFAICT we so far have been able to compensate for the additional atomic op 
with a reduced cache footprint and less complexity overall on remote frees 
and also through improvements in alloc behavior. I hope that the current 
improvements in 2.6.26 are sufficient to address the concerns with TP-C 
(which I do not have direct access to and frankly I know very little about 
the setup etc). We are still not sure exactly why TP-C has a problem. The 
slab statistics were added to figure that one out. We can get a view of 
what is going on without having access to the system.

I think the current way of compensating for that atomic op is better than
going back to the queue mess. Maybe there is a way to make limited use of
queues that avoids the atomic op, but so far I have not found one. Maybe
someone else looking at it will have better ideas.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 18/21] Filesystem: Socket inode defragmentation
  2008-05-13 13:28   ` Evgeniy Polyakov
@ 2008-05-15 17:40     ` Christoph Lameter
  2008-05-15 18:23       ` Evgeniy Polyakov
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-15 17:40 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: akpm, linux-kernel, netdev, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, Pekka Enberg, mpm

On Tue, 13 May 2008, Evgeniy Polyakov wrote:

> Out of curiosity, how can you drop socket inode, since it is always
> attached to socket which is removed automatically when connection is
> closed. Any force of dropping socket inode can only result in connection
> drop, i.e. there are no inodes, which are placed in cache and are not
> yet freed, if there are no attached sockets.
> 
> So question is how does it work for sockets?

All inodes are inactivated and put on an LRU before they are freed. Those
could be reclaimed by inode defrag. Socket inode defrag is not that
important. It just shows that this can be applied in a general way.



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-12  0:24   ` David Chinner
@ 2008-05-15 17:42     ` Christoph Lameter
  2008-05-15 23:10       ` David Chinner
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-15 17:42 UTC (permalink / raw)
  To: David Chinner
  Cc: akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, Pekka Enberg, mpm

On Mon, 12 May 2008, David Chinner wrote:

> If you are going to clean bufferheads (or pages), please clean entire
> mappings via ->writepages as it leads to far superior I/O patterns
> and a far higher aggregate rate of page cleaning.....

That brings up another issue: Let's say I use writepages on a large file
(a couple of gigabytes). How much do you want to write back?


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15 17:05                   ` Christoph Lameter
@ 2008-05-15 17:49                     ` Matthew Wilcox
  2008-05-15 17:58                       ` Christoph Lameter
                                         ` (2 more replies)
  2008-05-16  5:16                     ` Zhang, Yanmin
  1 sibling, 3 replies; 93+ messages in thread
From: Matthew Wilcox @ 2008-05-15 17:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Zhang, Yanmin, Andi Kleen, Pekka Enberg, KOSAKI Motohiro,
	Rik van Riel, akpm, linux-kernel, linux-fsdevel, Mel Gorman, mpm

On Thu, May 15, 2008 at 10:05:35AM -0700, Christoph Lameter wrote:
> Thanks for using the slab statistics. I wish I had these numbers for the 
> TPC benchmark. That would allow us to understand what is going on while it 
> is running.

Hang on, you want slab statistics for the TPC run?  You didn't tell me
that.  We're trying to gather oprofile data (and having trouble because
the machine crashes when we start using oprofile -- this is with the git
tree you/pekka put together for us to test).

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15 17:49                     ` Matthew Wilcox
@ 2008-05-15 17:58                       ` Christoph Lameter
  2008-05-15 18:13                         ` Matthew Wilcox
  2008-05-15 18:19                       ` Eric Dumazet
  2008-05-15 18:29                       ` Vegard Nossum
  2 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-15 17:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zhang, Yanmin, Andi Kleen, Pekka Enberg, KOSAKI Motohiro,
	Rik van Riel, akpm, linux-kernel, linux-fsdevel, Mel Gorman, mpm

On Thu, 15 May 2008, Matthew Wilcox wrote:

> On Thu, May 15, 2008 at 10:05:35AM -0700, Christoph Lameter wrote:
> > Thanks for using the slab statistics. I wish I had these numbers for the 
> > TPC benchmark. That would allow us to understand what is going on while it 
> > is running.
> 
> Hang on, you want slab statistics for the TPC run?  You didn't tell me
> that.  We're trying to gather oprofile data (and having trouble because
> the machine crashes when we start using oprofile -- this is with the git
> tree you/pekka put together for us to test).

Well we talked about this when you sent me the test program. I just
thought that it would be logical to do the same for the real case.

Details of the crash please?

You could just start with 2.6.25.X which already contains the slab 
statistics.

Also, re: the test program: pinning a process does increase the
performance by orders of magnitude. Are you sure that the application was
properly tuned for an 8p configuration? Pinning is usually not necessary
for lower numbers of processors because the scheduler thrashing effect is
less of an issue.  If the test program is an accurate representation of
the TP-C benchmark then you can drastically increase its performance by
doing the same to the real test.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15 17:58                       ` Christoph Lameter
@ 2008-05-15 18:13                         ` Matthew Wilcox
  2008-05-15 18:43                           ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Matthew Wilcox @ 2008-05-15 18:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Zhang, Yanmin, Andi Kleen, Pekka Enberg, KOSAKI Motohiro,
	Rik van Riel, akpm, linux-kernel, linux-fsdevel, Mel Gorman, mpm

On Thu, May 15, 2008 at 10:58:01AM -0700, Christoph Lameter wrote:
> On Thu, 15 May 2008, Matthew Wilcox wrote:
> 
> > On Thu, May 15, 2008 at 10:05:35AM -0700, Christoph Lameter wrote:
> > > Thanks for using the slab statistics. I wish I had these numbers for the 
> > > TPC benchmark. That would allow us to understand what is going on while it 
> > > is running.
> > 
> > Hang on, you want slab statistics for the TPC run?  You didn't tell me
> > that.  We're trying to gather oprofile data (and having trouble because
> > the machine crashes when we start using oprofile -- this is with the git
> > tree you/pekka put together for us to test).
> 
> Well we talked about this when you sent me the test program. I just
> thought that it would be logical to do the same for the real case.

You ran the test ... you didn't say "It would be helpful if you could
get these results for me for TPC-C".

> Details of the crash please?

I don't have any.

> You could just start with 2.6.25.X which already contains the slab 
> statistics.

Certainly.  Exactly how does collecting these stats work?  Am I supposed
to zero the counters after the TPC has done its initial ramp-up?  What
commands should I run, and at exactly which points?

> Also, re: the test program: pinning a process does increase the
> performance by orders of magnitude. Are you sure that the application was
> properly tuned for an 8p configuration? Pinning is usually not necessary
> for lower numbers of processors because the scheduler thrashing effect is
> less of an issue.  If the test program is an accurate representation of
> the TP-C benchmark then you can drastically increase its performance by
> doing the same to the real test.

The application does nothing except submit IO and wait for it to complete.
It doesn't need to be tuned.  It's not an accurate representation of
TPC-C, it just simulates the amount of IO that a TPC-C run will generate
(and simulates it coming from all CPUs, which is accurate).

I don't want to get into details of how a TPC benchmark is tuned, because
it's not relevant.  Trust me, there are people who dedicate months of
their lives per year to tuning how TPC runs are scheduled.

The pinning I was talking about was pinning the scsi_ram_0 kernel thread
to one CPU to simulate interrupts being tied to one CPU.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15 17:49                     ` Matthew Wilcox
  2008-05-15 17:58                       ` Christoph Lameter
@ 2008-05-15 18:19                       ` Eric Dumazet
  2008-05-15 18:29                       ` Vegard Nossum
  2 siblings, 0 replies; 93+ messages in thread
From: Eric Dumazet @ 2008-05-15 18:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Lameter, Zhang, Yanmin, Andi Kleen, Pekka Enberg,
	KOSAKI Motohiro, Rik van Riel, akpm, linux-kernel, linux-fsdevel,
	Mel Gorman, mpm

Matthew Wilcox wrote:
> On Thu, May 15, 2008 at 10:05:35AM -0700, Christoph Lameter wrote:
>   
>> Thanks for using the slab statistics. I wish I had these numbers for the 
>> TPC benchmark. That would allow us to understand what is going on while it 
>> is running.
>>     
>
> Hang on, you want slab statistics for the TPC run?  You didn't tell me
> that.  We're trying to gather oprofile data (and having trouble because
> the machine crashes when we start using oprofile -- this is with the git
> tree you/pekka put together for us to test).
>
>   
Hum, you might try to apply commit 
44c81433e8b05dbc85985d939046f10f95901184 or commit
8b8b498836942c0c855333d357d121c0adeefbd9

oprofile data are definitely wanted :)






^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 18/21] Filesystem: Socket inode defragmentation
  2008-05-15 17:40     ` Christoph Lameter
@ 2008-05-15 18:23       ` Evgeniy Polyakov
  0 siblings, 0 replies; 93+ messages in thread
From: Evgeniy Polyakov @ 2008-05-15 18:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, netdev, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, Pekka Enberg, mpm

On Thu, May 15, 2008 at 10:40:11AM -0700, Christoph Lameter (clameter@sgi.com) wrote:
> All inodes are inactivated and put on a lru before they are freed. Those 

I have to check my memory, but iput()->destroy_inode() highlights first for sockets...

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15 17:49                     ` Matthew Wilcox
  2008-05-15 17:58                       ` Christoph Lameter
  2008-05-15 18:19                       ` Eric Dumazet
@ 2008-05-15 18:29                       ` Vegard Nossum
  2 siblings, 0 replies; 93+ messages in thread
From: Vegard Nossum @ 2008-05-15 18:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Lameter, Zhang, Yanmin, Andi Kleen, Pekka Enberg,
	KOSAKI Motohiro, Rik van Riel, akpm, linux-kernel, linux-fsdevel,
	Mel Gorman, mpm

On Thu, May 15, 2008 at 7:49 PM, Matthew Wilcox <matthew@wil.cx> wrote:
> On Thu, May 15, 2008 at 10:05:35AM -0700, Christoph Lameter wrote:
>> Thanks for using the slab statistics. I wish I had these numbers for the
>> TPC benchmark. That would allow us to understand what is going on while it
>> is running.
>
> Hang on, you want slab statistics for the TPC run?  You didn't tell me
> that.  We're trying to gather oprofile data (and having trouble because
> the machine crashes when we start using oprofile -- this is with the git
> tree you/pekka put together for us to test).

Hi,

oprofile was recently fixed, maybe try cherry-picking these will help:

http://git.kernel.org/?p=linux/kernel/git/smurf/linux-trees.git;a=commit;h=7ded2dcf5f2c30889d7ac743ed64fff272ec190d
http://git.kernel.org/?p=linux/kernel/git/smurf/linux-trees.git;a=commit;h=08bc5caced1f322255f44880529a651e204a38eb
http://git.kernel.org/?p=linux/kernel/git/smurf/linux-trees.git;a=commit;h=2b56af59ed24d25be0282de9ae98c290e03a7dd9

Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15 18:13                         ` Matthew Wilcox
@ 2008-05-15 18:43                           ` Christoph Lameter
  2008-05-15 18:51                             ` Matthew Wilcox
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-15 18:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zhang, Yanmin, Andi Kleen, Pekka Enberg, KOSAKI Motohiro,
	Rik van Riel, akpm, linux-kernel, linux-fsdevel, Mel Gorman, mpm

On Thu, 15 May 2008, Matthew Wilcox wrote:

> > Details of the crash please?
> 
> I don't have any.

Well the amount of information we are getting has always been the main 
factor in delaying this further and further. Sigh.

> > You could just start with 2.6.25.X which already contains the slab 
> > statistics.
> 
> Certainly.  Exactly how does collecting these stats work?  Am I supposed
> to zero the counters after the TPC has done its initial ramp-up?  What
> commands should I run, and at exactly which points?

Compile slabinfo and then do f.e. slabinfo -AD (this is documented in the 
help text provided when enabling statistics).

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15 18:43                           ` Christoph Lameter
@ 2008-05-15 18:51                             ` Matthew Wilcox
  2008-05-15 19:09                               ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Matthew Wilcox @ 2008-05-15 18:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Zhang, Yanmin, Andi Kleen, Pekka Enberg, KOSAKI Motohiro,
	Rik van Riel, akpm, linux-kernel, linux-fsdevel, Mel Gorman, mpm

On Thu, May 15, 2008 at 11:43:40AM -0700, Christoph Lameter wrote:
> On Thu, 15 May 2008, Matthew Wilcox wrote:
> 
> > > Details of the crash please?
> > 
> > I don't have any.
> 
> Well the amount of information we are getting has always been the main 
> factor in delaying this further and further. Sigh.

Or possibly your assumptions have been the main factor.  I gave you a
reproducer for this problem 6 weeks ago.  As far as I can tell, you
haven't run it since.

> > > You could just start with 2.6.25.X which already contains the slab 
> > > statistics.
> > 
> > Certainly.  Exactly how does collecting these stats work?  Am I supposed
> > to zero the counters after the TPC has done its initial ramp-up?  What
> > commands should I run, and at exactly which points?
> 
> Compile slabinfo and then do f.e. slabinfo -AD (this is documented in the 
> help text provided when enabling statistics).

That's an utterly unhelpful answer.  Let me try asking again.

Exactly how does collecting these stats work?  Am I supposed to zero
the counters after the TPC has done its initial ramp-up?  What commands
should I run, and at exactly which points?

Otherwise I'll get something wrong and these numbers will be useless to
you.  Or that's what you'll claim anyway.

For reference the helptext says:

          SLUB statistics are useful to debug SLUB's allocation behavior in
          order to find ways to optimize the allocator. This should never be
          enabled for production use since keeping statistics slows down
          the allocator by a few percentage points. The slabinfo command
          supports the determination of the most active slabs to figure
          out which slabs are relevant to a particular load.
          Try running: slabinfo -DA

By the way, when you say 'compile slabinfo', you mean the file shipped
as Documentation/vm/slabinfo.c (rather than, say, something out of tree?)

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15 18:51                             ` Matthew Wilcox
@ 2008-05-15 19:09                               ` Christoph Lameter
  2008-05-15 19:29                                 ` Matthew Wilcox
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-15 19:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zhang, Yanmin, Andi Kleen, Pekka Enberg, KOSAKI Motohiro,
	Rik van Riel, akpm, linux-kernel, linux-fsdevel, Mel Gorman, mpm

On Thu, 15 May 2008, Matthew Wilcox wrote:

> Or possibly your assumptions have been the main factor.  I gave you a
> reproducer for this problem 6 weeks ago.  As far as I can tell, you
> haven't run it since.

Assumptions may be the issue. My own "reproducer" for remote frees is 
available from my git tree and I usually prefer to run my own. We 
discussed the results of that program last fall. You stated yesterday that 
your code is proprietary. I am not sure what I am allowed to do with the 
code. I did not know that it was proprietary before yesterday, and I would 
have just forwarded that code to Pekka yesterday had I not caught that 
message in time.

I thought that what you provided was a test program to exercise 
and optimize the scsi subsystem?

> > > > You could just start with 2.6.25.X which already contains the slab 
> > > > statistics.
> > > 
> > > Certainly.  Exactly how does collecting these stats work?  Am I supposed
> > > to zero the counters after the TPC has done its initial ramp-up?  What
> > > commands should I run, and at exactly which points?
> > 
> > Compile slabinfo and then do f.e. slabinfo -AD (this is documented in the 
> > help text provided when enabling statistics).
> 
> That's an utterly unhelpful answer.  Let me try asking again.
> 
> Exactly how does collecting these stats work?  Am I supposed to zero
> the counters after the TPC has done its initial ramp-up?  What commands
> should I run, and at exactly which points?

There is no way of zeroing the counters. Run slabinfo -AD after the 
test application has been running for a while. If you want a differential 
then you have to take two datapoints.
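
For example (a sketch, nothing slabinfo-specific):

	slabinfo -AD > slabinfo.before
	... run the workload ...
	slabinfo -AD > slabinfo.after
	diff -u slabinfo.before slabinfo.after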
 
> Otherwise I'll get something wrong and these numbers will be useless to
> you.  Or that's what you'll claim anyway.

No. I guess I will end up with a lot of guesswork about what is going on on 
the system since the information is limited for some reason.

> For reference the helptext says:
> 
>           SLUB statistics are useful to debug SLUB's allocation behavior in
>           order to find ways to optimize the allocator. This should never be
>           enabled for production use since keeping statistics slows down
>           the allocator by a few percentage points. The slabinfo command
>           supports the determination of the most active slabs to figure
>           out which slabs are relevant to a particular load.
>           Try running: slabinfo -DA
> 
> By the way, when you say 'compile slabinfo', you mean the file shipped
> as Documentation/vm/slabinfo.c (rather than, say, something out of tree?)

Yes. 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15 19:09                               ` Christoph Lameter
@ 2008-05-15 19:29                                 ` Matthew Wilcox
  2008-05-15 20:14                                   ` Matthew Wilcox
  2008-05-16 19:06                                   ` Christoph Lameter
  0 siblings, 2 replies; 93+ messages in thread
From: Matthew Wilcox @ 2008-05-15 19:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Zhang, Yanmin, Andi Kleen, Pekka Enberg, KOSAKI Motohiro,
	Rik van Riel, akpm, linux-kernel, linux-fsdevel, Mel Gorman, mpm

On Thu, May 15, 2008 at 12:09:06PM -0700, Christoph Lameter wrote:
> Assumptions may be the issue. My own "reproducer" for remote frees is 
> available from my git tree and I usually prefer to run my own. We 

No doubt you prefer to run a test which fails to show a problem with
your code.  How about you try running a test which does show a problem?

> discussed the results of that program last fall. You stated yesterday that 
> your code is proprietary. I am not sure what I am allowed to do with the 
> code. I did not know that it was proprietary before yesterday and I would 
> have just forwarded that code to Pekka yesterday if I would not have 
> caught that message in time.

I'm surprised you're so cavalier about copyright.  There was nothing in
that code which permitted you to redistribute it.

> I thought that what you provided was a test program to exercise 
> and optimize the scsi subsystem?

Why would you think that?  The subject of the email was "Slub test
program".

> > > Compile slabinfo and then do f.e. slabinfo -AD (this is documented in the 
> > > help text provided when enabling statistics).
> > 
> > That's an utterly unhelpful answer.  Let me try asking again.
> > 
> > Exactly how does collecting these stats work?  Am I supposed to zero
> > the counters after the TPC has done its initial ramp-up?  What commands
> > should I run, and at exactly which points?
> 
> There is no way of zeroing the counters. Run slabinfo -AD after the 
> test application has been running for awhile. If you want a differential 
> then you have to take two datapoints.

Is a differential interesting to you?

> > Otherwise I'll get something wrong and these numbers will be useless to
> > you.  Or that's what you'll claim anyway.
> 
> No. I guess I will end up with a lot of guess work of what is going on on 
> the system since the information is limited for some reason.

They're your statistics.  Tell me what you need.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15 19:29                                 ` Matthew Wilcox
@ 2008-05-15 20:14                                   ` Matthew Wilcox
  2008-05-15 20:30                                     ` Pekka Enberg
  2008-05-16 19:17                                     ` Christoph Lameter
  2008-05-16 19:06                                   ` Christoph Lameter
  1 sibling, 2 replies; 93+ messages in thread
From: Matthew Wilcox @ 2008-05-15 20:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Zhang, Yanmin, Andi Kleen, Pekka Enberg, KOSAKI Motohiro,
	Rik van Riel, akpm, linux-kernel, linux-fsdevel, Mel Gorman, mpm

On Thu, May 15, 2008 at 01:29:59PM -0600, Matthew Wilcox wrote:
> On Thu, May 15, 2008 at 12:09:06PM -0700, Christoph Lameter wrote:
> > Assumptions may be the issue. My own "reproducer" for remote frees is 
> > available from my git tree and I usually prefer to run my own. We 
> 
> No doubt you prefer to run a test which fails to show a problem with
> your code.  How about you try running a test which does show a problem?

This is rather interesting.  Since Christoph refuses to, here's my
results with 8f40f67, first with slab:

willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4
CPU 0 completed 1000000 ops in 52.817 seconds; 18933 ops per second
CPU 2 completed 1000000 ops in 56.391 seconds; 17733 ops per second
CPU 3 completed 1000000 ops in 57.009 seconds; 17541 ops per second
CPU 1 completed 1000000 ops in 57.591 seconds; 17363 ops per second
willy@piggy:~$ sudo taskset -p 1 941
pid 941's current affinity mask: f
pid 941's new affinity mask: 1
willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4
CPU 2 completed 1000000 ops in 46.740 seconds; 21394 ops per second
CPU 0 completed 1000000 ops in 48.716 seconds; 20527 ops per second
CPU 3 completed 1000000 ops in 59.255 seconds; 16876 ops per second
CPU 1 completed 1000000 ops in 60.473 seconds; 16536 ops per second

(the pid is that of scsi_ram_0)

Now, change the config to slub:

--- 64-slab/.config     2008-05-15 15:21:31.000000000 -0400
+++ 64-slub/.config     2008-05-15 15:37:45.000000000 -0400
-# Thu May 15 15:21:31 2008
+# Thu May 15 15:37:45 2008
-CONFIG_SLAB=y
-# CONFIG_SLUB is not set
+CONFIG_SLUB_DEBUG=y
+# CONFIG_SLAB is not set
+CONFIG_SLUB=y
-# CONFIG_DEBUG_SLAB is not set
+# CONFIG_SLUB_DEBUG_ON is not set
+# CONFIG_SLUB_STATS is not set

and we get slightly better results:

willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4
CPU 0 completed 1000000 ops in 45.848 seconds; 21811 ops per second
CPU 2 completed 1000000 ops in 50.789 seconds; 19689 ops per second
CPU 3 completed 1000000 ops in 55.876 seconds; 17896 ops per second
CPU 1 completed 1000000 ops in 56.941 seconds; 17562 ops per second
willy@piggy:~$ sudo taskset -p 1 1001
pid 1001's current affinity mask: f
pid 1001's new affinity mask: 1
willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4
CPU 2 completed 1000000 ops in 45.713 seconds; 21875 ops per second
CPU 0 completed 1000000 ops in 47.020 seconds; 21267 ops per second
CPU 3 completed 1000000 ops in 58.692 seconds; 17038 ops per second
CPU 1 completed 1000000 ops in 60.389 seconds; 16559 ops per second

Slub's clearly in the lead, right?  Maybe.  Here's the results we get
with 2.6.25+slab:

willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4
CPU 3 completed 1000000 ops in 48.709 seconds; 20530 ops per second
CPU 1 completed 1000000 ops in 50.181 seconds; 19927 ops per second
CPU 0 completed 1000000 ops in 53.511 seconds; 18687 ops per second
CPU 2 completed 1000000 ops in 54.169 seconds; 18460 ops per second
willy@piggy:~$ sudo taskset -p 1 930
pid 930's current affinity mask: f
pid 930's new affinity mask: 1
willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4
CPU 2 completed 1000000 ops in 40.568 seconds; 24649 ops per second
CPU 0 completed 1000000 ops in 47.986 seconds; 20839 ops per second
CPU 3 completed 1000000 ops in 55.944 seconds; 17875 ops per second
CPU 1 completed 1000000 ops in 56.180 seconds; 17799 ops per second

I think I'm going to try backing out some of the recent patches that
have gone into /slab/ and see if it's been regressing.
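
Something like this, say (a sketch, assuming the suspects are in mm/slab.c):

	git log --pretty=oneline v2.6.25.. -- mm/slab.c
	git revert <suspect-commit>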

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15 20:14                                   ` Matthew Wilcox
@ 2008-05-15 20:30                                     ` Pekka Enberg
  2008-05-16 19:17                                     ` Christoph Lameter
  1 sibling, 0 replies; 93+ messages in thread
From: Pekka Enberg @ 2008-05-15 20:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Lameter, Zhang, Yanmin, Andi Kleen, KOSAKI Motohiro,
	Rik van Riel, akpm, linux-kernel, linux-fsdevel, Mel Gorman, mpm

Matthew Wilcox wrote:
> This is rather interesting.  Since Christoph refuses to, here's my
> results with 8f40f67, first with slab:
> 
> Now, change the config to slub:

[snip]

> and we get slightly better results:
> 
> willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4
> CPU 0 completed 1000000 ops in 45.848 seconds; 21811 ops per second
> CPU 2 completed 1000000 ops in 50.789 seconds; 19689 ops per second
> CPU 3 completed 1000000 ops in 55.876 seconds; 17896 ops per second
> CPU 1 completed 1000000 ops in 56.941 seconds; 17562 ops per second
> willy@piggy:~$ sudo taskset -p 1 1001
> pid 1001's current affinity mask: f
> pid 1001's new affinity mask: 1
> willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4
> CPU 2 completed 1000000 ops in 45.713 seconds; 21875 ops per second
> CPU 0 completed 1000000 ops in 47.020 seconds; 21267 ops per second
> CPU 3 completed 1000000 ops in 58.692 seconds; 17038 ops per second
> CPU 1 completed 1000000 ops in 60.389 seconds; 16559 ops per second

slabinfo -A -r output before and after the run would be nice (you need 
CONFIG_SLUB_STATS enabled and Documentation/vm/slabinfo.c compiled for 
that). A separate oprofile run would be nice as well. Thanks!

		Pekka

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-15 17:42     ` Christoph Lameter
@ 2008-05-15 23:10       ` David Chinner
  2008-05-16 17:01         ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: David Chinner @ 2008-05-15 23:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Chinner, akpm, linux-kernel, linux-fsdevel, Mel Gorman,
	andi, Rik van Riel, Pekka Enberg, mpm

On Thu, May 15, 2008 at 10:42:15AM -0700, Christoph Lameter wrote:
> On Mon, 12 May 2008, David Chinner wrote:
> 
> > If you are going to clean bufferheads (or pages), please clean entire
> > mappings via ->writepages as it leads to far superior I/O patterns
> > and a far higher aggregate rate of page cleaning.....
> 
> That brings up another issue: Lets say I use writepages on a large file 
> (couple of gig). How much do you want to write back?

We're out of memory. I'd suggest writing back as much as you can
without blocking, e.g. treat it like pdflush and say 1024 pages, or
like balance_dirty_pages() and write a 'write_chunk' back from the
mapping (i.e. sync_writeback_pages()).

Any of these are better from an I/O perspective than single page
writeback....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15 17:05                   ` Christoph Lameter
  2008-05-15 17:49                     ` Matthew Wilcox
@ 2008-05-16  5:16                     ` Zhang, Yanmin
  1 sibling, 0 replies; 93+ messages in thread
From: Zhang, Yanmin @ 2008-05-16  5:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Pekka Enberg, KOSAKI Motohiro, Rik van Riel, akpm,
	linux-kernel, linux-fsdevel, Mel Gorman, mpm, Matthew Wilcox


On Thu, 2008-05-15 at 10:05 -0700, Christoph Lameter wrote:
> On Thu, 15 May 2008, Zhang, Yanmin wrote:
> 
> > > It can thrash cachelines if objects from the same slab page are freed 
> > > simultaneously on multiple processors. That occurred in the hackbench 
> > > regression that we addressed with the dynamic configuration of slab sizes.
> > The hackbench regression is because of slow allocation instead of slow freeing.
> > With the dynamic configuration of slab sizes, fast allocation becomes 97% (the bad
> > one is 68%), but fast free is always 8~9% with/without the patch.
> 
> Thanks for using the slab statistics. I wish I had these numbers for the 
> TPC benchmark. That would allow us to understand what is going on while it 
> is running.
> 
> The frees in the hackbench were slow because partial list updates occurred 
> too frequently. The first fix was to let slabs sit longer on the partial 
> list.
I forgot that. 2.6.24 merged the patch.

> The other was the increase of the slab sizes which also increases 
> the per cpu slab size and therefore the objects allocatable without a 
> round trip to the page allocator.
That is what I am talking about. 2.6.26-rc merged the patch.

>  Freeing to a per-cpu slab never requires 
> partial list updates. So the frees also benefitted from the larger slab 
> sizes. But the effect shows up in the count of partial list updates, not in 
> the fast/free column.
I agree. It might be better if SLUB could be optimized further for the case
where the slow free percentage is high, because the page lock might ping-pong
among processors if multiple processors access the same slab at the same time.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-15 23:10       ` David Chinner
@ 2008-05-16 17:01         ` Christoph Lameter
  2008-05-19  5:45           ` David Chinner
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-16 17:01 UTC (permalink / raw)
  To: David Chinner
  Cc: akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, Pekka Enberg, mpm

On Fri, 16 May 2008, David Chinner wrote:

> On Thu, May 15, 2008 at 10:42:15AM -0700, Christoph Lameter wrote:
> > On Mon, 12 May 2008, David Chinner wrote:
> > 
> > > If you are going to clean bufferheads (or pages), please clean entire
> > > mappings via ->writepages as it leads to far superior I/O patterns
> > > and a far higher aggregate rate of page cleaning.....
> > 
> > That brings up another issue: Lets say I use writepages on a large file 
> > (couple of gig). How much do you want to write back?
> 
> We're out of memory. I'd suggest writing back as much as you can
> without blocking.  e.g. treat it like pdflush and say 1024 pages, or
> like balance_dirty_pages() and write a 'write_chunk' back from the
> mapping (i.e.  sync_writeback_pages()).

Why are we out of memory? How do you trigger such a special writeout?
 
> Any of these are better from an I/O perspective than single page
> writeback....

But then the filesystem can do tricks like writing out the surrounding areas 
as needed. The filesystem can likely estimate better how much writeout 
makes sense.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15 19:29                                 ` Matthew Wilcox
  2008-05-15 20:14                                   ` Matthew Wilcox
@ 2008-05-16 19:06                                   ` Christoph Lameter
  1 sibling, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-16 19:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zhang, Yanmin, Andi Kleen, Pekka Enberg, KOSAKI Motohiro,
	Rik van Riel, akpm, linux-kernel, linux-fsdevel, Mel Gorman, mpm

On Thu, 15 May 2008, Matthew Wilcox wrote:

> On Thu, May 15, 2008 at 12:09:06PM -0700, Christoph Lameter wrote:
> > Assumptions may be the issue. My own "reproducer" for remote frees is 
> > available from my git tree and I usually prefer to run my own. We 
> 
> No doubt you prefer to run a test which fails to show a problem with
> your code.  How about you try running a test which does show a problem?

The test was designed to show the worst case effect of the additional 
atomic op and it does its job. Look at the tests branch of my vm git tree.

> > There is no way of zeroing the counters. Run slabinfo -AD after the 
> > test application has been running for awhile. If you want a differential 
> > then you have to take two datapoints.
> 
> Is a differential interesting to you?

Depends on how much other stuff is going on before.

> 
> > > Otherwise I'll get something wrong and these numbers will be useless to
> > > you.  Or that's what you'll claim anyway.
> > 
> > No. I guess I will end up with a lot of guess work of what is going on on 
> > the system since the information is limited for some reason.
> 
> They're your statistics.  Tell me what you need.

The output of slabinfo -AD.... 



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 21/21] slab defrag: Obsolete SLAB
  2008-05-15 20:14                                   ` Matthew Wilcox
  2008-05-15 20:30                                     ` Pekka Enberg
@ 2008-05-16 19:17                                     ` Christoph Lameter
  1 sibling, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-16 19:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zhang, Yanmin, Andi Kleen, Pekka Enberg, KOSAKI Motohiro,
	Rik van Riel, akpm, linux-kernel, linux-fsdevel, Mel Gorman, mpm

On Thu, 15 May 2008, Matthew Wilcox wrote:

> > No doubt you prefer to run a test which fails to show a problem with
> > your code.  How about you try running a test which does show a problem?
> 
> This is rather interesting.  Since Christoph refuses to, here's my
> results with 8f40f67, first with slab:

I sure wish you would follow the discussions instead of having paranoid 
thoughts about me not running tests that show regressions. See my 
extensive test suite that shows the worst cases in my vm git tree.

> I think I'm going to try backing out some of the recent patches that
> have gone into /slab/ and see if it's been regressing.

Hmmm... Interesting. Could you post the output of slabinfo -AD?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-16 17:01         ` Christoph Lameter
@ 2008-05-19  5:45           ` David Chinner
  2008-05-19 16:44             ` Christoph Lameter
  2008-05-20 22:53             ` Jamie Lokier
  0 siblings, 2 replies; 93+ messages in thread
From: David Chinner @ 2008-05-19  5:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Chinner, akpm, linux-kernel, linux-fsdevel, Mel Gorman,
	andi, Rik van Riel, Pekka Enberg, mpm

On Fri, May 16, 2008 at 10:01:38AM -0700, Christoph Lameter wrote:
> On Fri, 16 May 2008, David Chinner wrote:
> 
> > On Thu, May 15, 2008 at 10:42:15AM -0700, Christoph Lameter wrote:
> > > On Mon, 12 May 2008, David Chinner wrote:
> > > 
> > > > If you are going to clean bufferheads (or pages), please clean entire
> > > > mappings via ->writepages as it leads to far superior I/O patterns
> > > > and a far higher aggregate rate of page cleaning.....
> > > 
> > > That brings up another issue: Lets say I use writepages on a large file 
> > > (couple of gig). How much do you want to write back?
> > 
> > We're out of memory. I'd suggest writing back as much as you can
> > without blocking.  e.g. treat it like pdflush and say 1024 pages, or
> > like balance_dirty_pages() and write a 'write_chunk' back from the
> > mapping (i.e.  sync_writeback_pages()).
> 
> Why are we out of memory?

Defragmentation is triggered as part of the usual memory reclaim
process. Which implies we've run out of free memory, correct?

> How do you trigger such a special writeout?

filemap_fdatawrite_range() perhaps?

> > Any of these are better from an I/O perspective than single page
> > writeback....
> 
> But then filesystem can do tricks like writing out the surrounding areas 
> as needed. The filesystem likely can estimate better how much writeout 
> makes sense.

Pushing write-around into a method that is only supposed to write
the single page that is passed to it is a pretty bad abuse of the
API. Especially as we have many simple, ranged writeback methods
you could call. filemap_fdatawrite_range(), do_writepages(),
->writepages, etc.

FWIW, look at the mess of layering violations that write clustering
causes in XFS because we have to do this to keep allocation overhead
and fragmentation down to a minimum. It's a nasty hack to mitigate
the impact of the awful I/O patterns we see from the VM - suggesting
that all filesystems do this just so you don't have to call a
slightly smarter writeback primitive is insane....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-19  5:45           ` David Chinner
@ 2008-05-19 16:44             ` Christoph Lameter
  2008-05-20  0:25               ` David Chinner
  2008-05-20 22:53             ` Jamie Lokier
  1 sibling, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2008-05-19 16:44 UTC (permalink / raw)
  To: David Chinner
  Cc: akpm, linux-kernel, linux-fsdevel, Mel Gorman, andi,
	Rik van Riel, Pekka Enberg, mpm

On Mon, 19 May 2008, David Chinner wrote:

> Defragmentation is triggered as part of the usual memory reclaim
> process. Which implies we've run out of free memory, correct?

Yes but we have already reclaimed some memory.

> > How do you trigger such a special writeout?
> 
> filemap_fdatawrite_range() perhaps?

Could you provide me with such a patch? I would not know how much to write out. 
If we had such a method then we could also use that for the swap case 
where we also write out single pages?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-19 16:44             ` Christoph Lameter
@ 2008-05-20  0:25               ` David Chinner
  2008-05-20  6:56                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 93+ messages in thread
From: David Chinner @ 2008-05-20  0:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Chinner, akpm, linux-kernel, linux-fsdevel, Mel Gorman,
	andi, Rik van Riel, Pekka Enberg, mpm

On Mon, May 19, 2008 at 09:44:11AM -0700, Christoph Lameter wrote:
> On Mon, 19 May 2008, David Chinner wrote:
> 
> > Defragmentation is triggered as part of the usual memory reclaim
> > process. Which implies we've run out of free memory, correct?
> 
> Yes but we have already reclaimed some memory.
> 
> > > How do you trigger such a special writeout?
> > 
> > filemap_fdatawrite_range() perhaps?
> 
> Could you provide me with such a patch? I would not know how much to write out. 
> If we had such a method then we could also use that for the swap case 
> where we also write out single pages?

How hard is it? I don't have time right now to do this, but it's essentially:

	mapping = page->mapping;
	......
-	mapping->a_ops->writepage(page, wbc);
+	filemap_fdatawrite_range(mapping, start, end);

Where [start,end] span page->index and are large enough
to get a substantially sized I/O to disk (say at least SWAP_CLUSTER_MAX
pages, preferably larger for 4k page size machines).
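
Something like the following (an untested sketch only; note that the
range arguments are byte offsets, not page indexes):

	struct address_space *mapping = page->mapping;
	pgoff_t first = page->index > SWAP_CLUSTER_MAX / 2 ?
			page->index - SWAP_CLUSTER_MAX / 2 : 0;
	loff_t start = (loff_t)first << PAGE_CACHE_SHIFT;
	loff_t end = ((loff_t)(first + SWAP_CLUSTER_MAX) << PAGE_CACHE_SHIFT) - 1;

	/* clean a SWAP_CLUSTER_MAX-page window instead of a single page */
	filemap_fdatawrite_range(mapping, start, end);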

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-20  0:25               ` David Chinner
@ 2008-05-20  6:56                 ` Evgeniy Polyakov
  2008-05-20 21:46                   ` David Chinner
  0 siblings, 1 reply; 93+ messages in thread
From: Evgeniy Polyakov @ 2008-05-20  6:56 UTC (permalink / raw)
  To: David Chinner
  Cc: Christoph Lameter, akpm, linux-kernel, linux-fsdevel, Mel Gorman,
	andi, Rik van Riel, Pekka Enberg, mpm

On Tue, May 20, 2008 at 10:25:03AM +1000, David Chinner (dgc@sgi.com) wrote:
> +	filemap_fdatawrite_range(mapping, start, end);
> 
> Where [start,end] span page->index and are large enough
> to get a substantially sized I/O to disk (say at least SWAP_CLUSTER_MAX
> pages, preferably larger for 4k page size machines).

Or just sync_inode().

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-20  6:56                 ` Evgeniy Polyakov
@ 2008-05-20 21:46                   ` David Chinner
  2008-05-20 22:25                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 93+ messages in thread
From: David Chinner @ 2008-05-20 21:46 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Chinner, Christoph Lameter, akpm, linux-kernel,
	linux-fsdevel, Mel Gorman, andi, Rik van Riel, Pekka Enberg, mpm

On Tue, May 20, 2008 at 10:56:23AM +0400, Evgeniy Polyakov wrote:
> On Tue, May 20, 2008 at 10:25:03AM +1000, David Chinner (dgc@sgi.com) wrote:
> > +	filemap_fdatawrite_range(mapping, start, end);
> > 
> > Where [start,end] span page->index and are large enough
> > to get a substantially sized I/O to disk (say at least SWAP_CLUSTER_MAX
> > pages, preferably larger for 4k page size machines).
> 
> Or just sync_inode().

Oh, god no. Let's not put the inode_lock right at the top of
the VM page cleaning path. We don't need to modify inode state,
the superblock dirty lists, etc - all we need to do is write
dirty pages on a given mapping in a more efficient manner.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-20 21:46                   ` David Chinner
@ 2008-05-20 22:25                     ` Evgeniy Polyakov
  2008-05-20 23:19                       ` David Chinner
  2008-05-20 23:22                       ` [patch 10/21] buffer heads: Support slab defrag Evgeniy Polyakov
  0 siblings, 2 replies; 93+ messages in thread
From: Evgeniy Polyakov @ 2008-05-20 22:25 UTC (permalink / raw)
  To: David Chinner
  Cc: Christoph Lameter, akpm, linux-kernel, linux-fsdevel, Mel Gorman,
	andi, Rik van Riel, Pekka Enberg, mpm

On Wed, May 21, 2008 at 07:46:17AM +1000, David Chinner (dgc@sgi.com) wrote:
> Oh, god no. Let's not put the inode_lock right at the top of
> the VM page cleaning path. We don't need to modify inode state,
> the superblock dirty lists, etc - all we need to do is write
> dirty pages on a given mapping in a more efficient manner.

I'm not advocating that, but having swap on reclaim does not hurt
anyone, this is essentially the same, but with different underlying
storage. System will do that anyway sooner or later during usual
writeback, which in turn can be a result of the same reclaim...

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-19  5:45           ` David Chinner
  2008-05-19 16:44             ` Christoph Lameter
@ 2008-05-20 22:53             ` Jamie Lokier
  1 sibling, 0 replies; 93+ messages in thread
From: Jamie Lokier @ 2008-05-20 22:53 UTC (permalink / raw)
  To: David Chinner
  Cc: Christoph Lameter, akpm, linux-kernel, linux-fsdevel, Mel Gorman,
	andi, Rik van Riel, Pekka Enberg, mpm

David Chinner wrote:
> > Why are we out of memory?
> 
> Defragmentation is triggered as part of the usual memory reclaim
> process. Which implies we've run out of free memory, correct?

I don't think that's true on no-MMU.  Defragmentation can be needed
often on no-MMU when there's lots of free memory, just in the wrong
places.

-- Jamie

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-20 22:25                     ` Evgeniy Polyakov
@ 2008-05-20 23:19                       ` David Chinner
  2008-05-20 23:28                         ` Andrew Morton
  2008-05-20 23:22                       ` [patch 10/21] buffer heads: Support slab defrag Evgeniy Polyakov
  1 sibling, 1 reply; 93+ messages in thread
From: David Chinner @ 2008-05-20 23:19 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Chinner, Christoph Lameter, akpm, linux-kernel,
	linux-fsdevel, Mel Gorman, andi, Rik van Riel, Pekka Enberg, mpm

On Wed, May 21, 2008 at 02:25:05AM +0400, Evgeniy Polyakov wrote:
> On Wed, May 21, 2008 at 07:46:17AM +1000, David Chinner (dgc@sgi.com) wrote:
> > Oh, god no. Let's not put the inode_lock right at the top of the VM page
> > cleaning path. We don't need to modify inode state, the superblock dirty
> > lists, etc - all we need to do is write dirty pages on a given mapping in
> > a more efficient manner.
> 
> I'm not advocating that, but having swap on reclaim does not hurt anyone,
> this is essentially the same, but with different underlying storage.

Sure. But my point is simply that sync_inode() is far too
heavy-weight to be used in a reclaim context. The fact that it holds
the inode_lock will interfere with normal writeback via pdflush and
that could potentially slow down writeback even more.

e.g. think of kswapd threads running on 20 nodes of a NUMA machine
all at once writing back dirty memory (yes, it happens). If we use
sync_inode() to write back dirty mappings we would then have at
least 20 CPUs serialising on the inode_lock trying to write back
pages. If we instead use a thin wrapper around ->writepages() then
they can all run in parallel through the filesystem(s), block
devices, etc rather than being serialised at the highest possible
layer....

> System
> will do that anyway sooner or later during usual writeback, which in turn
> can be a result of the same reclaim...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-20 22:25                     ` Evgeniy Polyakov
  2008-05-20 23:19                       ` David Chinner
@ 2008-05-20 23:22                       ` Evgeniy Polyakov
  2008-05-20 23:30                         ` David Chinner
  2008-05-21  1:56                         ` Christoph Lameter
  1 sibling, 2 replies; 93+ messages in thread
From: Evgeniy Polyakov @ 2008-05-20 23:22 UTC (permalink / raw)
  To: David Chinner
  Cc: Christoph Lameter, akpm, linux-kernel, linux-fsdevel, Mel Gorman,
	andi, Rik van Riel, Pekka Enberg, mpm

On Wed, May 21, 2008 at 02:25:05AM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > Oh, god no. Let's not put the inode_lock right at the top of
> > the VM page cleaning path. We don't need to modify inode state,
> > the superblock dirty lists, etc - all we need to do is write
> > dirty pages on a given mapping in a more efficient manner.
> 
> I'm not advocating that, but having swap on reclaim does not hurt
> anyone, this is essentially the same, but with different underlying
> storage. System will do that anyway sooner or later during usual
> writeback, which in turn can be a result of the same reclaim...

And actually having tiny operations under inode_lock is the last thing
to worry about when we are about to start writing pages to disk because
memory is so fragmented that we need to move things around.

That is the simplest from the typing viewpoint; one can also do
something like this:

struct address_space *mapping = page->mapping;
struct backing_dev_info *bdi = mapping->backing_dev_info;
struct writeback_control wbc = {
	.bdi = bdi,
	.sync_mode = WB_SYNC_ALL, /* likely we want to wait... */
	.older_than_this = NULL,
	.nr_to_write = 13,	/* arbitrary small batch */
	.range_cyclic = 0,
	.range_start = start,	/* byte offsets, not page indexes */
	.range_end = end,
};

do_writepages(mapping, &wbc);

Christoph, is this the example you wanted to check out? It will only try to
write .nr_to_write pages between .range_start and .range_end without
syncing the inode info itself.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-20 23:19                       ` David Chinner
@ 2008-05-20 23:28                         ` Andrew Morton
  2008-05-21  6:15                           ` Evgeniy Polyakov
  0 siblings, 1 reply; 93+ messages in thread
From: Andrew Morton @ 2008-05-20 23:28 UTC (permalink / raw)
  To: David Chinner
  Cc: Evgeniy Polyakov, Christoph Lameter, linux-kernel, linux-fsdevel,
	Mel Gorman, andi, Rik van Riel, Pekka Enberg, mpm

On Wed, 21 May 2008 09:19:42 +1000 David Chinner <dgc@sgi.com> wrote:

> sync_inode() is far too
> heavy-weight to be used in a reclaim context

It's more than efficiency.  There are lots and lots of things we cannot
do in direct-reclaim context.

a) Can't lock pages (well we kinda sorta could, but generally code
   will just trylock)

b) Cannot rely on the inode or the address_space being present in
   memory after we have unlocked the page.

c) Cannot run iput().  Or at least, we couldn't five or six years
   ago.  afaik nobody has investigated whether the situation is now
   better or worse.

d) lots of deadlock scenarios - need to test __GFP_FS basically everywhere
   in which you share code with normal writeback paths.

Plus e), f), g) and h).  Direct-reclaim is a hostile environment. 
Things like b) are a real killer - nasty, subtle, rare,
memory-pressure-dependent crashes.
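
To make d) concrete, the usual guard looks something like this (a sketch;
scan_control is the private struct that mm/vmscan.c passes around):

	if (!(sc->gfp_mask & __GFP_FS))
		goto keep_locked;	/* never re-enter the fs from reclaim */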


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-20 23:22                       ` [patch 10/21] buffer heads: Support slab defrag Evgeniy Polyakov
@ 2008-05-20 23:30                         ` David Chinner
  2008-05-21  6:20                           ` Evgeniy Polyakov
  2008-05-21  1:56                         ` Christoph Lameter
  1 sibling, 1 reply; 93+ messages in thread
From: David Chinner @ 2008-05-20 23:30 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Chinner, Christoph Lameter, akpm, linux-kernel,
	linux-fsdevel, Mel Gorman, andi, Rik van Riel, Pekka Enberg, mpm

On Wed, May 21, 2008 at 03:22:56AM +0400, Evgeniy Polyakov wrote:
> On Wed, May 21, 2008 at 02:25:05AM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > > Oh, god no. Let's not put the inode_lock right at the top of
> > > the VM page cleaning path. We don't need to modify inode state,
> > > the superblock dirty lists, etc - all we need to do is write
> > > dirty pages on a given mapping in a more efficient manner.
> > 
> > I'm not advocating that, but having swap on reclaim does not hurt
> > anyone, this is essentially the same, but with different underlying
> > storage. System will do that anyway sooner or later during usual
> > writeback, which in turn can be a result of the same reclaim...
> 
> And actually having tiny operations under inode_lock is the last thing
> to worry about when we are about to start writing pages to disk because
> memory is so fragmented that we need to move things around.
> 
> That is the simplest from the typing viewpoint; one can also do
> something like this:
> 
> struct address_space *mapping = page->mapping;
> struct backing_dev_info *bdi = mapping->backing_dev_info;
> struct writeback_control wbc = {
> 	.bdi = bdi,
> 	.sync_mode = WB_SYNC_ALL, /* likely we want to wait... */
> 	.older_than_this = NULL,
> 	.nr_to_write = 13,	/* arbitrary small batch */
> 	.range_cyclic = 0,
> 	.range_start = start,	/* byte offsets, not page indexes */
> 	.range_end = end,
> };
> 
> do_writepages(mapping, &wbc);

Which is the exact implementation of

	filemap_fdatawrite_range(mapping, start, end);

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-20 23:22                       ` [patch 10/21] buffer heads: Support slab defrag Evgeniy Polyakov
  2008-05-20 23:30                         ` David Chinner
@ 2008-05-21  1:56                         ` Christoph Lameter
  1 sibling, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-05-21  1:56 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Chinner, akpm, linux-kernel, linux-fsdevel, Mel Gorman,
	andi, Rik van Riel, Pekka Enberg, mpm

On Wed, 21 May 2008, Evgeniy Polyakov wrote:

> Christoph, is this the example you wanted to check out? It will only try to
> write .nr_to_write pages between .range_start and .range_end without
> syncing the inode info itself.

Well that is what Dave wants. I'd rather go the safe route for now and 
defer this until later. I think you are much more of an expert on the 
filesystems and I/O paths than I am, so I'd rather take my hands off as 
soon as possible.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-20 23:28                         ` Andrew Morton
@ 2008-05-21  6:15                           ` Evgeniy Polyakov
  2008-05-21  6:24                             ` Andrew Morton
  0 siblings, 1 reply; 93+ messages in thread
From: Evgeniy Polyakov @ 2008-05-21  6:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Chinner, Christoph Lameter, linux-kernel, linux-fsdevel,
	Mel Gorman, andi, Rik van Riel, Pekka Enberg, mpm

On Tue, May 20, 2008 at 04:28:16PM -0700, Andrew Morton (akpm@linux-foundation.org) wrote:
> It's more than efficiency.  There are lots and lots of things we cannot
> do in direct-reclaim context.
> 
> a) Can't lock pages (well we kinda sorta could, but generally code
>    will just trylock)
> 
> b) Cannot rely on the inode or the address_space being present in
>    memory after we have unlocked the page.
> 
> c) Cannot run iput().  Or at least, we couldn't five or six years
>    ago.  afaik nobody has investigated whether the situation is now
>    better or worse.
> 
> d) lots of deadlock scenarios - need to test __GFP_FS basically everywhere
>    in which you share code with normal writeback paths.
> 
> Plus e), f), g) and h).  Direct-reclaim is a hostile environment. 
> Things like b) are a real killer - nasty, subtle, rare,
> memory-pressure-dependent crashes.

Which basically means we cannot do direct writeback at reclaim time?..

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-20 23:30                         ` David Chinner
@ 2008-05-21  6:20                           ` Evgeniy Polyakov
  0 siblings, 0 replies; 93+ messages in thread
From: Evgeniy Polyakov @ 2008-05-21  6:20 UTC (permalink / raw)
  To: David Chinner
  Cc: Christoph Lameter, akpm, linux-kernel, linux-fsdevel, Mel Gorman,
	andi, Rik van Riel, Pekka Enberg, mpm

On Wed, May 21, 2008 at 09:30:15AM +1000, David Chinner (dgc@sgi.com) wrote:
> Which is the exact implementation of
> 
> 	filemap_fdatawrite_range(mapping, start, end);

Cool, I did not know that, probably because it is not exported :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [patch 10/21] buffer heads: Support slab defrag
  2008-05-21  6:15                           ` Evgeniy Polyakov
@ 2008-05-21  6:24                             ` Andrew Morton
  2008-05-21 17:52                               ` iput() in reclaim context Hugh Dickins
  0 siblings, 1 reply; 93+ messages in thread
From: Andrew Morton @ 2008-05-21  6:24 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Chinner, Christoph Lameter, linux-kernel, linux-fsdevel,
	Mel Gorman, andi, Rik van Riel, Pekka Enberg, mpm

On Wed, 21 May 2008 10:15:32 +0400 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> On Tue, May 20, 2008 at 04:28:16PM -0700, Andrew Morton (akpm@linux-foundation.org) wrote:
> > It's more than efficiency.  There are lots and lots of things we cannot
> > do in direct-reclaim context.
> > 
> > a) Can't lock pages (well we kinda sorta could, but generally code
> >    will just trylock)
> > 
> > b) Cannot rely on the inode or the address_space being present in
> >    memory after we have unlocked the page.
> > 
> > c) Cannot run iput().  Or at least, we couldn't five or six years
> >    ago.  afaik nobody has investigated whether the situation is now
> >    better or worse.
> > 
> > d) lots of deadlock scenarios - need to test __GFP_FS basically everywhere
> >    in which you share code with normal writeback paths.
> > 
> > Plus e), f), g) and h).  Direct-reclaim is a hostile environment. 
> > Things like b) are a real killer - nasty, subtle, rare,
> > memory-pressure-dependent crashes.
> 
> Which basically means we cannot do direct writeback at reclaim time?..
> 

Well, we _can_, but doing so within the present constraints is delicate.

An implementation which locked all the to-be-written pages up front, then
wrote them out, and which was careful not to touch the inode or
address_space after the last page is unlocked could work.

Or perhaps add a new lock to the inode and then in reclaim

a) lock a page on the LRU, thus pinning the address_space and inode.

b) take some new sleeping lock in the inode

c) unlock that page and now proceed to do writeback.  But still
   honouring !GFP_FS.

and teach the unmount code to take the per-inode locks too, to ensure
that reclaim has got out of there before zapping the inodes.  Perhaps a
per-superblock lock rather than per-inode, dunno.
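
In rough code, the above might look something like this (an untested
sketch; i_reclaim_mutex is a made-up field, not an existing one):

	struct inode *inode;

	if (TestSetPageLocked(page))		/* a) pin mapping and inode */
		return;
	inode = page->mapping->host;
	mutex_lock(&inode->i_reclaim_mutex);	/* b) new sleeping lock */
	unlock_page(page);
	/* c) now write back, still honouring !GFP_FS */
	if (gfp_mask & __GFP_FS)
		filemap_fdatawrite_range(inode->i_mapping, 0, LLONG_MAX);
	mutex_unlock(&inode->i_reclaim_mutex);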

But we won't be able to just dive in there and call the existing
writeback functions from within reclaim.  Because

a) callers can hold all sorts of locks, including implicit ones such
   as journal_start() and

b) reclaim doesn't have a reference on the page's inode, and the
   inode and address_space can vanish if reclaim isn't holding a lock
   on one of the address_space's pages.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* iput() in reclaim context
  2008-05-21  6:24                             ` Andrew Morton
@ 2008-05-21 17:52                               ` Hugh Dickins
  2008-05-21 17:58                                 ` Evgeniy Polyakov
  2008-05-21 18:12                                 ` Andrew Morton
  0 siblings, 2 replies; 93+ messages in thread
From: Hugh Dickins @ 2008-05-21 17:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Evgeniy Polyakov, linux-kernel, linux-fsdevel

On Tue, 20 May 2008, Andrew Morton wrote:
> On Wed, 21 May 2008 10:15:32 +0400 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > On Tue, May 20, 2008 at 04:28:16PM -0700, Andrew Morton (akpm@linux-foundation.org) wrote:
> > > It's more than efficiency.  There are lots and lots of things we cannot
> > > do in direct-reclaim context.
> > > 
> > > ...
> > > 
> > > c) Cannot run iput().  Or at least, we couldn't five or six years
> > >    ago.  afaik nobody has investigated whether the situation is now
> > >    better or worse.

I happened to notice your remark in the buffer heads defrag thread.
Do you remember what that limitation was about?

Because just a few months ago I discovered a shmem race which I fixed
by doing igrab+iput in shmem_writepage, in the reclaim context.  Feeling
guilty now: I'd better investigate, but would welcome a starting pointer.

(If I'm lucky, it'll be that the generic code in vmscan.c cannot
use iput, but particular filesystems might themselves be safe to.)

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: iput() in reclaim context
  2008-05-21 17:52                               ` iput() in reclaim context Hugh Dickins
@ 2008-05-21 17:58                                 ` Evgeniy Polyakov
  2008-05-21 18:12                                 ` Andrew Morton
  1 sibling, 0 replies; 93+ messages in thread
From: Evgeniy Polyakov @ 2008-05-21 17:58 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-fsdevel

Hi Hugh.

On Wed, May 21, 2008 at 06:52:27PM +0100, Hugh Dickins (hugh@veritas.com) wrote:
> I happened to notice your remark in the buffer heads defrag thread.
> Do you remember what that limitation was about?
> 
> Because just a few months ago I discovered a shmem race which I fixed
> by doing igrab+iput in shmem_writepage, in the reclaim context.  Feeling
> guilty now: I'd better investigate, but would welcome a starting pointer.
> 
> (If I'm lucky, it'll be that the generic code in vmscan.c cannot
> use iput, but particular filesystems might themselves be safe to.)

If we are talking about the same things, it's the waiting for pages to be
synced (either written back or truncated) when an inode is about to be
destroyed. Thus reclaim can sleep waiting for pages to be synced, which
it is about to move somewhere itself. Deadlock. The same for writepage:
if we drop the inode there it can wait for pages to be synced, which in turn
requires writeback, where we are sleeping already...

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: iput() in reclaim context
  2008-05-21 17:52                               ` iput() in reclaim context Hugh Dickins
  2008-05-21 17:58                                 ` Evgeniy Polyakov
@ 2008-05-21 18:12                                 ` Andrew Morton
  1 sibling, 0 replies; 93+ messages in thread
From: Andrew Morton @ 2008-05-21 18:12 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Evgeniy Polyakov, linux-kernel, linux-fsdevel

On Wed, 21 May 2008 18:52:27 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:

> On Tue, 20 May 2008, Andrew Morton wrote:
> > On Wed, 21 May 2008 10:15:32 +0400 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > > On Tue, May 20, 2008 at 04:28:16PM -0700, Andrew Morton (akpm@linux-foundation.org) wrote:
> > > > It's more than efficiency.  There are lots and lots of things we cannot
> > > > do in direct-reclaim context.
> > > > 
> > > > ...
> > > > 
> > > > c) Cannot run iput().  Or at least, we couldn't five or six years
> > > >    ago.  afaik nobody has investigated whether the situation is now
> > > >    better or worse.
> 
> I happened to notice your remark in the buffer heads defrag thread.
> Do you remember what that limitation was about?

Ages and ages ago.  I expect it was a deadlock thing.  iput_final() can
end up calling things like write_inode() which can want to do things
like opening a transaction against filesystem A while already having
one open against filesystem B.  Which is both deadlockable and BUGable.
It will take other embarrassing locks too, probably.

> Because just a few months ago I discovered a shmem race which I fixed
> by doing igrab+iput in shmem_writepage, in the reclaim context.  Feeling
> guilty now: I'd better investigate, but would welcome a starting pointer.
> 
> (If I'm lucky, it'll be that the generic code in vmscan.c cannot
> use iput, but particular filesystems might themselves be safe to.)

Yes, it was specific to the direct-reclaim calling context.

^ permalink raw reply	[flat|nested] 93+ messages in thread

end of thread, other threads:[~2008-05-21 18:14 UTC | newest]

Thread overview: 93+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-10  3:08 [patch 00/21] Slab Fragmentation Reduction V12 Christoph Lameter
2008-05-10  3:08 ` [patch 01/21] slub: Add defrag_ratio field and sysfs support Christoph Lameter
2008-05-10  3:08 ` [patch 02/21] slub: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
2008-05-10  3:08 ` [patch 03/21] slub: Add get() and kick() methods Christoph Lameter
2008-05-10  3:08 ` [patch 04/21] slub: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
2008-05-10  3:08 ` [patch 05/21] slub: Slab defrag core Christoph Lameter
2008-05-10  3:08 ` [patch 06/21] slub: Add KICKABLE to avoid repeated kick() attempts Christoph Lameter
2008-05-10  3:08 ` [patch 07/21] slub: Extend slabinfo to support -D and -F options Christoph Lameter
2008-05-10  3:08 ` [patch 08/21] slub: add defrag statistics Christoph Lameter
2008-05-10  3:08 ` [patch 09/21] slub: Trigger defragmentation from memory reclaim Christoph Lameter
2008-05-10  3:08 ` [patch 10/21] buffer heads: Support slab defrag Christoph Lameter
2008-05-12  0:24   ` David Chinner
2008-05-15 17:42     ` Christoph Lameter
2008-05-15 23:10       ` David Chinner
2008-05-16 17:01         ` Christoph Lameter
2008-05-19  5:45           ` David Chinner
2008-05-19 16:44             ` Christoph Lameter
2008-05-20  0:25               ` David Chinner
2008-05-20  6:56                 ` Evgeniy Polyakov
2008-05-20 21:46                   ` David Chinner
2008-05-20 22:25                     ` Evgeniy Polyakov
2008-05-20 23:19                       ` David Chinner
2008-05-20 23:28                         ` Andrew Morton
2008-05-21  6:15                           ` Evgeniy Polyakov
2008-05-21  6:24                             ` Andrew Morton
2008-05-21 17:52                               ` iput() in reclaim context Hugh Dickins
2008-05-21 17:58                                 ` Evgeniy Polyakov
2008-05-21 18:12                                 ` Andrew Morton
2008-05-20 23:22                       ` [patch 10/21] buffer heads: Support slab defrag Evgeniy Polyakov
2008-05-20 23:30                         ` David Chinner
2008-05-21  6:20                           ` Evgeniy Polyakov
2008-05-21  1:56                         ` Christoph Lameter
2008-05-20 22:53             ` Jamie Lokier
2008-05-10  3:08 ` [patch 11/21] inodes: Support generic defragmentation Christoph Lameter
2008-05-10  3:08 ` [patch 12/21] Filesystem: Ext2 filesystem defrag Christoph Lameter
2008-05-10  3:08 ` [patch 13/21] Filesystem: Ext3 " Christoph Lameter
2008-05-10  3:08 ` [patch 14/21] Filesystem: Ext4 " Christoph Lameter
2008-05-10  3:08 ` [patch 15/21] Filesystem: XFS slab defragmentation Christoph Lameter
2008-05-10  6:55   ` Christoph Hellwig
2008-05-10  3:08 ` [patch 16/21] Filesystem: /proc filesystem support for slab defrag Christoph Lameter
2008-05-10  3:08 ` [patch 17/21] Filesystem: Slab defrag: Reiserfs support Christoph Lameter
2008-05-10  3:08 ` [patch 18/21] Filesystem: Socket inode defragmentation Christoph Lameter
2008-05-13 13:28   ` Evgeniy Polyakov
2008-05-15 17:40     ` Christoph Lameter
2008-05-15 18:23       ` Evgeniy Polyakov
2008-05-10  3:08 ` [patch 19/21] dentries: Add constructor Christoph Lameter
2008-05-10  3:08 ` [patch 20/21] dentries: dentry defragmentation Christoph Lameter
2008-05-10  3:08 ` [patch 21/21] slab defrag: Obsolete SLAB Christoph Lameter
2008-05-10  9:53   ` Andi Kleen
2008-05-11  2:15     ` Rik van Riel
2008-05-12  7:38       ` KOSAKI Motohiro
2008-05-12  7:54         ` Pekka Enberg
2008-05-12 10:08           ` Andi Kleen
2008-05-12 10:23             ` Pekka Enberg
2008-05-14 17:30               ` Christoph Lameter
2008-05-14 17:29           ` Christoph Lameter
2008-05-14 17:49             ` Andi Kleen
2008-05-14 18:03               ` Christoph Lameter
2008-05-14 18:18                 ` Matt Mackall
2008-05-14 19:21                   ` Christoph Lameter
2008-05-14 19:49                     ` Matt Mackall
2008-05-14 20:33                       ` Christoph Lameter
2008-05-14 21:02                         ` Matt Mackall
2008-05-14 21:26                           ` Christoph Lameter
2008-05-14 21:54                             ` Matt Mackall
2008-05-15 17:15                               ` Christoph Lameter
2008-05-15  3:26                 ` Zhang, Yanmin
2008-05-15 17:05                   ` Christoph Lameter
2008-05-15 17:49                     ` Matthew Wilcox
2008-05-15 17:58                       ` Christoph Lameter
2008-05-15 18:13                         ` Matthew Wilcox
2008-05-15 18:43                           ` Christoph Lameter
2008-05-15 18:51                             ` Matthew Wilcox
2008-05-15 19:09                               ` Christoph Lameter
2008-05-15 19:29                                 ` Matthew Wilcox
2008-05-15 20:14                                   ` Matthew Wilcox
2008-05-15 20:30                                     ` Pekka Enberg
2008-05-16 19:17                                     ` Christoph Lameter
2008-05-16 19:06                                   ` Christoph Lameter
2008-05-15 18:19                       ` Eric Dumazet
2008-05-15 18:29                       ` Vegard Nossum
2008-05-16  5:16                     ` Zhang, Yanmin
2008-05-14 18:05               ` Christoph Lameter
2008-05-14 20:46                 ` Christoph Lameter
2008-05-14 20:58                   ` Matthew Wilcox
2008-05-14 21:00                     ` Christoph Lameter
2008-05-14 21:21                       ` Matthew Wilcox
2008-05-14 21:33                         ` Christoph Lameter
2008-05-14 21:43                           ` Matthew Wilcox
2008-05-14 21:53                             ` Christoph Lameter
2008-05-14 22:00                               ` Matthew Wilcox
2008-05-14 22:32                                 ` Christoph Lameter
2008-05-14 22:34                                 ` Christoph Lameter
