* [patch 0/3] slub partial list thrashing performance degradation
@ 2009-03-30  5:43 David Rientjes
  2009-03-30  5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes
  2009-03-30  6:38 ` [patch 0/3] slub partial list thrashing performance degradation Pekka Enberg
  0 siblings, 2 replies; 28+ messages in thread

From: David Rientjes @ 2009-03-30  5:43 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel

SLUB causes a performance degradation in comparison to SLAB when a
workload has an object allocation and freeing pattern such that it spends
more time in partial list handling than utilizing the fastpaths.

This usually occurs when freeing to a non-cpu slab, either because of
remote cpu freeing or freeing to a full or partial slab.  When the cpu
slab is later replaced with the freeing slab, it can only satisfy a
limited number of allocations before becoming full and requiring
additional partial list handling.  When the slowpath to fastpath ratio
becomes high, this partial list handling causes the entire allocator to
become very slow for the specific workload.

The bash script at the end of this email (inline) illustrates the
performance degradation well.  It uses the netperf TCP_RR benchmark to
measure transfer rates with various thread counts, each being a multiple
of the number of cores.  The transfer rates are reported as an aggregate
of the individual thread results.

CONFIG_SLUB_STATS demonstrates that the kmalloc-256 and kmalloc-2048
caches are performing quite poorly:

	cache		ALLOC_FASTPATH	ALLOC_SLOWPATH
	kmalloc-256	98125871	31585955
	kmalloc-2048	77243698	52347453

	cache		FREE_FASTPATH	FREE_SLOWPATH
	kmalloc-256	173624		129538000
	kmalloc-2048	90520		129500630

The majority of slowpath allocations were from the partial list
(30786261, or 97.5%, for kmalloc-256; 51688159, or 98.7%, for
kmalloc-2048).  A large percentage of frees required the slab to be added
back to the partial list.
For kmalloc-256, 30786630 (23.8%) of slowpath frees required partial list
handling.  For kmalloc-2048, 51688697 (39.9%) of slowpath frees required
partial list handling.

On my 16-core machines with 64G of ram, these are the results:

	# threads	SLAB		SLUB		SLUB+patchset
	16		69892		71592		69505
	32		126490		95373		119731
	48		138050		113072		125014
	64		169240		149043		158919
	80		192294		172035		179679
	96		197779		187849		192154
	112		217283		204962		209988
	128		229848		217547		223507
	144		238550		232369		234565
	160		250333		239871		244789
	176		256878		242712		248971
	192		261611		243182		255596

[ The SLUB+patchset results were attained with the latest git plus this
  patchset and slab_thrash_ratio set at 20 for both the kmalloc-256 and
  the kmalloc-2048 cache. ]

Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/slub_def.h |    4 +
 mm/slub.c                |  138 +++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 122 insertions(+), 20 deletions(-)

#!/bin/bash

TIME=60			# seconds
HOSTNAME=<hostname>	# netserver
NR_CPUS=$(grep ^processor /proc/cpuinfo | wc -l)
echo NR_CPUS=$NR_CPUS

run_netperf() {
	for i in $(seq 1 $1); do
		netperf -H $HOSTNAME -t TCP_RR -l $TIME &
	done
}

ITERATIONS=0
while [ $ITERATIONS -lt 12 ]; do
	RATE=0
	ITERATIONS=$[$ITERATIONS + 1]
	THREADS=$[$NR_CPUS * $ITERATIONS]
	RESULTS=$(run_netperf $THREADS | grep -v '[a-zA-Z]' | awk '{ print $6 }')

	for j in $RESULTS; do
		RATE=$[$RATE + ${j/.*}]
	done
	echo threads=$THREADS rate=$RATE
done

^ permalink raw reply	[flat|nested] 28+ messages in thread
* [patch 1/3] slub: add per-cache slab thrash ratio
  2009-03-30  5:43 [patch 0/3] slub partial list thrashing performance degradation David Rientjes
@ 2009-03-30  5:43 ` David Rientjes
  2009-03-30  5:43   ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes
  ` (2 more replies)
  2009-03-30  6:38 ` [patch 0/3] slub partial list thrashing performance degradation Pekka Enberg
  1 sibling, 3 replies; 28+ messages in thread

From: David Rientjes @ 2009-03-30  5:43 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel

Adds /sys/kernel/slab/cache/slab_thrash_ratio, which represents the
percentage of a slab's objects that the fastpath must fulfill to not be
considered thrashing on a per-cpu basis[*].

"Thrashing" here is defined as the constant swapping of the cpu slab such
that the slowpath is followed the majority of the time because the
refilled cpu slab can only accommodate a small number of allocations.
This occurs when the object allocation and freeing pattern for a cache is
such that it spends more time swapping the cpu slab than fulfilling
fastpath allocations.

 [*] A single instance of the thrash ratio not being reached in the
     fastpath does not indicate the cpu cache is thrashing.  A
     pre-defined value will later be added to determine how many times
     the ratio must not be reached before a cache is actually thrashing.

This is defined as a ratio based on the number of objects in a cache's
slab.  It is automatically recalculated when
/sys/kernel/slab/cache/order is changed so that the same ratio is
preserved.

The netperf TCP_RR benchmark illustrates slab thrashing very well with a
large number of threads.  With a test length of 60 seconds, the following
thread counts were used to show the effect of the allocation and freeing
pattern of such a workload.
Before this patchset:

	threads		Transfer Rate (per sec)
	16		71592
	32		95373
	48		113072
	64		149043
	80		172035
	96		187849
	112		204962
	128		217547
	144		232369
	160		239871
	176		242712
	192		243182

To identify the thrashing caches, the same workload was run with
CONFIG_SLUB_STATS enabled.  The following caches are obviously performing
very poorly:

	cache		ALLOC_FASTPATH	ALLOC_SLOWPATH
	kmalloc-256	98125871	31585955
	kmalloc-2048	77243698	52347453

	cache		FREE_FASTPATH	FREE_SLOWPATH
	kmalloc-256	173624		129538000
	kmalloc-2048	90520		129500630

After this patchset (both caches with slab_thrash_ratios of 20):

	threads		Transfer Rate (per sec)
	16		69505
	32		119731
	48		125014
	64		158919
	80		179679
	96		192154
	112		209988
	128		223507
	144		234565
	160		244789
	176		248971
	192		255596

Although some slabs may accommodate fewer objects than others when
contiguous memory cannot be allocated for a cache's order, the ratio is
still based on the configured `order', since slabs that can fulfill such
a requirement will exist on the partial list.

The value is stored in terms of the number of objects that the ratio
represents, not the ratio itself.  This avoids costly arithmetic in the
slowpath for a calculation that could otherwise be done only when
`slab_thrash_ratio' or `order' is changed.  This also adjusts the
configured ratio to one that can actually be represented in terms of
whole numbers: for example, if slab_thrash_ratio is set to 20 for a cache
with 64 objects, the effective ratio is actually 3:16 (or 18.75%).  This
is what is shown when reading the ratio, since it is better to report the
actual ratio than a pseudo substitute.

The slab_thrash_ratio for each cache does not have a non-zero default
(yet?).
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/slub_def.h |    1 +
 mm/slub.c                |   29 +++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -94,6 +94,7 @@ struct kmem_cache {
 #ifdef CONFIG_SLUB_DEBUG
 	struct kobject kobj;	/* For sysfs */
 #endif
+	u16 min_free_watermark;	/* Calculated from slab thrash ratio */

 #ifdef CONFIG_NUMA
 	/*
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2186,6 +2186,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
 	unsigned long flags = s->flags;
 	unsigned long size = s->objsize;
 	unsigned long align = s->align;
+	u16 thrash_ratio = 0;
 	int order;

 	/*
@@ -2291,10 +2292,13 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
 	/*
 	 * Determine the number of objects per slab
 	 */
+	if (oo_objects(s->oo))
+		thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo);
 	s->oo = oo_make(order, size);
 	s->min = oo_make(get_order(size), size);
 	if (oo_objects(s->oo) > oo_objects(s->max))
 		s->max = s->oo;
+	s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100;

 	return !!oo_objects(s->oo);
@@ -2321,6 +2325,7 @@ static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
 	 */
 	set_min_partial(s, ilog2(s->size));
 	s->refcount = 1;
+	s->min_free_watermark = 0;
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 1000;
 #endif
@@ -4110,6 +4115,29 @@ static ssize_t remote_node_defrag_ratio_store(struct kmem_cache *s,
 SLAB_ATTR(remote_node_defrag_ratio);
 #endif

+static ssize_t slab_thrash_ratio_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n",
+		       s->min_free_watermark * 100 / oo_objects(s->oo));
+}
+
+static ssize_t slab_thrash_ratio_store(struct kmem_cache *s, const char *buf,
+				       size_t length)
+{
+	unsigned long ratio;
+	int err;
+
+	err = strict_strtoul(buf, 10, &ratio);
+	if (err)
+		return err;
+
+	if (ratio <= 100)
+		s->min_free_watermark = oo_objects(s->oo) * ratio / 100;
+
+	return length;
+}
+SLAB_ATTR(slab_thrash_ratio);
+
 #ifdef CONFIG_SLUB_STATS
 static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
 {
@@ -4194,6 +4222,7 @@ static struct attribute *slab_attrs[] = {
 	&shrink_attr.attr,
 	&alloc_calls_attr.attr,
 	&free_calls_attr.attr,
+	&slab_thrash_ratio_attr.attr,
 #ifdef CONFIG_ZONE_DMA
 	&cache_dma_attr.attr,
 #endif
* [patch 2/3] slub: scan partial list for free slabs when thrashing
  2009-03-30  5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes
@ 2009-03-30  5:43   ` David Rientjes
  2009-03-30  5:43     ` [patch 3/3] slub: sort partial list " David Rientjes
  2009-03-30 14:37     ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter
  2009-03-30  7:11   ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg
  2009-03-30 14:30   ` Christoph Lameter
  2 siblings, 2 replies; 28+ messages in thread

From: David Rientjes @ 2009-03-30  5:43 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel

To determine when a slab is actually thrashing, it's insufficient to only
look at the most recent allocation path.  It's perfectly valid to swap
the cpu slab with a partial slab that contains very few free objects if
the goal is to fill it quickly, since slub no longer needs to track such
slabs.  This is inefficient, however, if an object will immediately be
freed so that the full slab must be readded to the partial list.  With
certain object allocation and freeing patterns, it is possible to spend
more time processing the partial list than utilizing the fastpaths.

We already have a per-cache min_free_watermark setting that is
configurable from userspace, which helps determine when we have excessive
partial list handling.  When a slab does not fulfill its watermark, it
suggests that the cache may be thrashing.  A pre-defined value,
SLAB_THRASHING_THRESHOLD (which defaults to 3), is used in conjunction
with this statistic to determine when a slab is actually thrashing.

Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter
is incremented.  This counter is cleared whenever the slowpath is
invoked.  This tracks how many fastpath allocations the cpu slab has
fulfilled before it must be refilled.
When the slowpath must be invoked, a slowpath counter is incremented if
the cpu slab did not fulfill the thrashing watermark; otherwise, it is
decremented.  When the slowpath counter is greater than or equal to
SLAB_THRASHING_THRESHOLD, the partial list is scanned for a slab that
will be able to fulfill at least the number of objects required to not be
considered thrashing.  If no such slabs are available, the remote nodes
are defragmented (if allowed) or a new slab is allocated.

If a cpu slab must be swapped because the allocation is for a different
node, both counters are cleared, since this doesn't indicate any
thrashing behavior.

When /sys/kernel/slab/cache/slab_thrash_ratio is not set, this includes
no functional change other than the incrementing of a fastpath counter
for the per-cpu cache.

A new statistic, /sys/kernel/slab/cache/deferred_partial, indicates how
many times a partial list scan was deferred because no slabs could
satisfy the requisite number of objects, for CONFIG_SLUB_STATS kernels.
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/slub_def.h |    3 +
 mm/slub.c                |   93 ++++++++++++++++++++++++++++++++++++---------
 2 files changed, 76 insertions(+), 20 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -30,6 +30,7 @@ enum stat_item {
 	DEACTIVATE_TO_TAIL,	/* Cpu slab was moved to the tail of partials */
 	DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
 	ORDER_FALLBACK,		/* Number of times fallback was necessary */
+	DEFERRED_PARTIAL,	/* Defer local partial list for lack of objs */
 	NR_SLUB_STAT_ITEMS };

 struct kmem_cache_cpu {
@@ -38,6 +39,8 @@ struct kmem_cache_cpu {
 	int node;		/* The node of the page (or -1 for debug) */
 	unsigned int offset;	/* Freepointer offset (in word units) */
 	unsigned int objsize;	/* Size of an object (from kmem_cache) */
+	u16 fastpath_allocs;	/* Consecutive fast allocs before slowpath */
+	u16 slowpath_allocs;	/* Consecutive slow allocs before watermark */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -134,6 +134,19 @@
  */
 #define MAX_PARTIAL 10

+/*
+ * Number of successive slowpath allocations that have failed to allocate at
+ * least the number of objects in the fastpath to not be slab thrashing (as
+ * defined by the cache's slab thrash ratio).
+ *
+ * When an allocation follows the slowpath, it increments a counter in its cpu
+ * cache.  If this counter exceeds the threshold, the partial list is scanned
+ * for a slab that will satisfy at least the cache's min_free_watermark in
+ * order for it to be used.  Otherwise, the slab with the most free objects is
+ * used.
+ */
+#define SLAB_THRASHING_THRESHOLD 3
+
 #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
 				SLAB_POISON | SLAB_STORE_USER)

@@ -1246,28 +1259,30 @@ static void remove_partial(struct kmem_cache *s, struct page *page)
 }

 /*
- * Lock slab and remove from the partial list.
+ * Remove from the partial list.
  *
- * Must hold list_lock.
+ * Must hold n->list_lock and slab_lock(page).
  */
-static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
-					struct page *page)
+static inline void freeze_slab(struct kmem_cache_node *n, struct page *page)
 {
-	if (slab_trylock(page)) {
-		list_del(&page->lru);
-		n->nr_partial--;
-		__SetPageSlubFrozen(page);
-		return 1;
-	}
-	return 0;
+	list_del(&page->lru);
+	n->nr_partial--;
+	__SetPageSlubFrozen(page);
+}
+
+static inline int skip_partial(struct kmem_cache *s, struct page *page)
+{
+	return (page->objects - page->inuse) < s->min_free_watermark;
 }

 /*
  * Try to allocate a partial slab from a specific node.
  */
-static struct page *get_partial_node(struct kmem_cache_node *n)
+static struct page *get_partial_node(struct kmem_cache *s,
+				     struct kmem_cache_node *n, int thrashing)
 {
 	struct page *page;
+	int locked = 0;

 	/*
 	 * Racy check. If we mistakenly see no partial slabs then we
@@ -1280,9 +1295,28 @@ static struct page *get_partial_node(struct kmem_cache_node *n)
 	spin_lock(&n->list_lock);
 	list_for_each_entry(page, &n->partial, lru)
-		if (lock_and_freeze_slab(n, page))
+		if (slab_trylock(page)) {
+			/*
+			 * When the cpu cache is partial list thrashing, it's
+			 * necessary to replace the cpu slab with one that will
+			 * accommodate at least s->min_free_watermark objects
+			 * to avoid excessive list_lock contention and cache
+			 * polluting.
+			 *
+			 * If no such slabs exist on the partial list, remote
+			 * nodes are defragmented if allowed.
+			 */
+			if (thrashing && skip_partial(s, page)) {
+				slab_unlock(page);
+				locked++;
+				continue;
+			}
+			freeze_slab(n, page);
 			goto out;
+		}
 	page = NULL;
+	if (locked)
+		stat(get_cpu_slab(s, raw_smp_processor_id()),
+		     DEFERRED_PARTIAL);
 out:
 	spin_unlock(&n->list_lock);
 	return page;
@@ -1291,7 +1325,8 @@ out:
 /*
  * Get a page from somewhere. Search in increasing NUMA distances.
  */
-static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
+static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags,
+				    int thrashing)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
@@ -1330,7 +1365,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)

 		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 				n->nr_partial > s->min_partial) {
-			page = get_partial_node(n);
+			page = get_partial_node(s, n, thrashing);
 			if (page)
 				return page;
 		}
@@ -1342,16 +1377,17 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 /*
  * Get a partial page, lock it and return it.
  */
-static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node,
+				int thrashing)
 {
 	struct page *page;
 	int searchnode = (node == -1) ? numa_node_id() : node;

-	page = get_partial_node(get_node(s, searchnode));
+	page = get_partial_node(s, get_node(s, searchnode), thrashing);
 	if (page || (flags & __GFP_THISNODE))
 		return page;

-	return get_any_partial(s, flags);
+	return get_any_partial(s, flags, thrashing);
 }

 /*
@@ -1503,6 +1539,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 {
 	void **object;
 	struct page *new;
+	int is_empty = 0;

 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
@@ -1511,7 +1548,8 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		goto new_slab;

 	slab_lock(c->page);
-	if (unlikely(!node_match(c, node)))
+	is_empty = node_match(c, node);
+	if (unlikely(!is_empty))
 		goto another_slab;

 	stat(c, ALLOC_REFILL);
@@ -1536,7 +1574,17 @@ another_slab:
 	deactivate_slab(s, c);

 new_slab:
-	new = get_partial(s, gfpflags, node);
+	if (is_empty) {
+		if (c->fastpath_allocs < s->min_free_watermark)
+			c->slowpath_allocs++;
+		else if (c->slowpath_allocs)
+			c->slowpath_allocs--;
+	} else
+		c->slowpath_allocs = 0;
+	c->fastpath_allocs = 0;
+
+	new = get_partial(s, gfpflags, node,
+			  c->slowpath_allocs > SLAB_THRASHING_THRESHOLD);
 	if (new) {
 		c->page = new;
 		stat(c, ALLOC_FROM_PARTIAL);
@@ -1605,6 +1653,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
 	else {
 		object = c->freelist;
 		c->freelist = object[c->offset];
+		c->fastpath_allocs++;
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
@@ -1917,6 +1966,8 @@ static void init_kmem_cache_cpu(struct kmem_cache *s,
 	c->node = 0;
 	c->offset = s->offset / sizeof(void *);
 	c->objsize = s->objsize;
+	c->fastpath_allocs = 0;
+	c->slowpath_allocs = 0;
 #ifdef CONFIG_SLUB_STATS
 	memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned));
 #endif
@@ -4193,6 +4244,7 @@ STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate_to_head);
 STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail);
 STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
 STAT_ATTR(ORDER_FALLBACK, order_fallback);
+STAT_ATTR(DEFERRED_PARTIAL, deferred_partial);
 #endif

 static struct attribute *slab_attrs[] = {
@@ -4248,6 +4300,7 @@ static struct attribute *slab_attrs[] = {
 	&deactivate_to_tail_attr.attr,
 	&deactivate_remote_frees_attr.attr,
 	&order_fallback_attr.attr,
+	&deferred_partial_attr.attr,
 #endif
 	NULL
 };
* [patch 3/3] slub: sort partial list when thrashing
  2009-03-30  5:43     ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes
@ 2009-03-30  5:43       ` David Rientjes
  2009-03-30 14:41         ` Christoph Lameter
  2009-03-30 14:37     ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter
  1 sibling, 1 reply; 28+ messages in thread

From: David Rientjes @ 2009-03-30  5:43 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel

Caches that are cpu slab thrashing will scan their entire partial list
until a slab is found that will satisfy at least the requisite number of
allocations so that it will not be considered thrashing (as defined by
/sys/kernel/slab/cache/slab_thrash_ratio).

The partial list can be extremely long, and scanning it requires that
list_lock is held for that particular node.  This can be inefficient if
slabs at the head of the list are not appropriate cpu slab replacements.

When an object is freed, the number of free objects for its slab is
calculated if the cpu cache is currently thrashing.  If the slab can
satisfy the requisite number of allocations so that the slab thrash
ratio is exceeded, it is moved to the head of the partial list.  This
minimizes the time spent holding list_lock and can help cacheline
optimizations for recently freed objects.

Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/slub.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1258,6 +1258,13 @@ static void remove_partial(struct kmem_cache *s, struct page *page)
 	spin_unlock(&n->list_lock);
 }

+static void move_partial_to_head(struct kmem_cache_node *n, struct page *page)
+{
+	spin_lock(&n->list_lock);
+	list_move(&page->lru, &n->partial);
+	spin_unlock(&n->list_lock);
+}
+
 /*
  * Remove from the partial list.
  *
@@ -1720,6 +1727,15 @@ checks_ok:
 	if (unlikely(!prior)) {
 		add_partial(get_node(s, page_to_nid(page)), page, 1);
 		stat(c, FREE_ADD_PARTIAL);
+	} else if (c->slowpath_allocs >= SLAB_THRASHING_THRESHOLD) {
+		/*
+		 * If the cache is actively slab thrashing, it's necessary to
+		 * move partial slabs to the head of the list so there isn't
+		 * excessive partial list scanning while holding list_lock.
+		 */
+		if (!skip_partial(s, page))
+			move_partial_to_head(get_node(s, page_to_nid(page)),
+					     page);
 	}

 out_unlock:
* Re: [patch 3/3] slub: sort partial list when thrashing
  2009-03-30  5:43       ` [patch 3/3] slub: sort partial list " David Rientjes
@ 2009-03-30 14:41         ` Christoph Lameter
  2009-03-30 20:29           ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread

From: Christoph Lameter @ 2009-03-30 14:41 UTC (permalink / raw)
To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel

On Sun, 29 Mar 2009, David Rientjes wrote:

> @@ -1720,6 +1727,15 @@ checks_ok:
> 	if (unlikely(!prior)) {
> 		add_partial(get_node(s, page_to_nid(page)), page, 1);
> 		stat(c, FREE_ADD_PARTIAL);
> +	} else if (c->slowpath_allocs >= SLAB_THRASHING_THRESHOLD) {
> +		/*
> +		 * If the cache is actively slab thrashing, it's necessary to
> +		 * move partial slabs to the head of the list so there isn't
> +		 * excessive partial list scanning while holding list_lock.
> +		 */
> +		if (!skip_partial(s, page))
> +			move_partial_to_head(get_node(s, page_to_nid(page)),
> +					     page);
> 	}
>
> out_unlock:
>

This again adds code to a pretty hot path.

What is the impact of the additional hot path code?
* Re: [patch 3/3] slub: sort partial list when thrashing
  2009-03-30 14:41         ` Christoph Lameter
@ 2009-03-30 20:29           ` David Rientjes
  0 siblings, 0 replies; 28+ messages in thread

From: David Rientjes @ 2009-03-30 20:29 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel

On Mon, 30 Mar 2009, Christoph Lameter wrote:

> > @@ -1720,6 +1727,15 @@ checks_ok:
> > 	if (unlikely(!prior)) {
> > 		add_partial(get_node(s, page_to_nid(page)), page, 1);
> > 		stat(c, FREE_ADD_PARTIAL);
> > +	} else if (c->slowpath_allocs >= SLAB_THRASHING_THRESHOLD) {
> > +		/*
> > +		 * If the cache is actively slab thrashing, it's necessary to
> > +		 * move partial slabs to the head of the list so there isn't
> > +		 * excessive partial list scanning while holding list_lock.
> > +		 */
> > +		if (!skip_partial(s, page))
> > +			move_partial_to_head(get_node(s, page_to_nid(page)),
> > +					     page);
> > 	}
> >
> > out_unlock:
>
> This again adds code to a pretty hot path.
>
> What is the impact of the additional hot path code?
>

I'll be collecting more data now that there's a general desire for a
default slab_thrash_ratio value, so we'll implicitly see the performance
degradation for non-thrashing slab caches.

Mel had suggested a couple of benchmarks to try, and my hypothesis is
that they won't regress with a default ratio of 20 for all caches with
>= 20 objects per slab (at least a 4 object threshold for determining
thrash).
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
  2009-03-30  5:43     ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes
  2009-03-30  5:43       ` [patch 3/3] slub: sort partial list " David Rientjes
@ 2009-03-30 14:37     ` Christoph Lameter
  2009-03-30 20:22       ` David Rientjes
  2009-03-31  7:13       ` Pekka Enberg
  1 sibling, 2 replies; 28+ messages in thread

From: Christoph Lameter @ 2009-03-30 14:37 UTC (permalink / raw)
To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel

On Sun, 29 Mar 2009, David Rientjes wrote:

> Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter
> is incremented.  This counter is cleared whenever the slowpath is
> invoked.  This tracks how many fastpath allocations the cpu slab has
> fulfilled before it must be refilled.

That adds fastpath overhead and it shows for small objects in your tests.

> A new statistic, /sys/kernel/slab/cache/deferred_partial, indicates how
> many times a partial list was deferred because no slabs could satisfy
> the requisite number of objects for CONFIG_SLUB_STATS kernels.

Interesting approach.
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
  2009-03-30 14:37     ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter
@ 2009-03-30 20:22       ` David Rientjes
  2009-03-30 21:20         ` Christoph Lameter
  1 sibling, 1 reply; 28+ messages in thread

From: David Rientjes @ 2009-03-30 20:22 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel

On Mon, 30 Mar 2009, Christoph Lameter wrote:

> > Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter
> > is incremented.  This counter is cleared whenever the slowpath is
> > invoked.  This tracks how many fastpath allocations the cpu slab has
> > fulfilled before it must be refilled.
>
> That adds fastpath overhead and it shows for small objects in your tests.
>

Indeed, which is unavoidable in this case.

The only other way of tracking the "thrashing history" I can think of
would be bitshifting a 1 for slowpath and 0 for fastpath, for example,
into an unsigned long.  That, however, requires a hamming weight
calculation in the slowpath and doesn't scale nearly as well as simply
incrementing a counter.  If there are other approaches to tracking such
instances, I'd be interested to hear them.

Btw, is cl@linux.com your new email address, or are all
linux-foundation.org emails going to eventually migrate to the new
domain?
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
  2009-03-30 20:22       ` David Rientjes
@ 2009-03-30 21:20         ` Christoph Lameter
  0 siblings, 0 replies; 28+ messages in thread

From: Christoph Lameter @ 2009-03-30 21:20 UTC (permalink / raw)
To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel

On Mon, 30 Mar 2009, David Rientjes wrote:

> Btw, is cl@linux.com your new email address, or are all
> linux-foundation.org emails going to eventually migrate to the new
> domain?

Don't know.  I just got the new email address (from the LF), and it's
shorter than linux-foundation.org, so I started using it.
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
  2009-03-30 14:37     ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter
  2009-03-30 20:22       ` David Rientjes
@ 2009-03-31  7:13       ` Pekka Enberg
  2009-03-31  8:23         ` David Rientjes
  2009-03-31 13:23         ` Christoph Lameter
  1 sibling, 2 replies; 28+ messages in thread

From: Pekka Enberg @ 2009-03-31  7:13 UTC (permalink / raw)
To: Christoph Lameter; +Cc: David Rientjes, Nick Piggin, Martin Bligh, linux-kernel

On Sun, 29 Mar 2009, David Rientjes wrote:
> > Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter
> > is incremented.  This counter is cleared whenever the slowpath is
> > invoked.  This tracks how many fastpath allocations the cpu slab has
> > fulfilled before it must be refilled.

On Mon, 2009-03-30 at 10:37 -0400, Christoph Lameter wrote:
> That adds fastpath overhead and it shows for small objects in your tests.

Yup, and looking at this:

+	u16 fastpath_allocs;	/* Consecutive fast allocs before slowpath */
+	u16 slowpath_allocs;	/* Consecutive slow allocs before watermark */

How much do operations on u16 hurt on, say, x86-64?  It's nice that
sizeof(struct kmem_cache_cpu) is capped at 32 bytes, but on CPUs that
have bigger cache lines, the types could be wider.

Christoph, why is struct kmem_cache_cpu not __cacheline_aligned_in_smp,
btw?

			Pekka
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
  2009-03-31  7:13       ` Pekka Enberg
@ 2009-03-31  8:23         ` David Rientjes
  2009-03-31  8:49           ` Pekka Enberg
  1 sibling, 1 reply; 28+ messages in thread

From: David Rientjes @ 2009-03-31  8:23 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel

On Tue, 31 Mar 2009, Pekka Enberg wrote:

> On Mon, 2009-03-30 at 10:37 -0400, Christoph Lameter wrote:
> > That adds fastpath overhead and it shows for small objects in your tests.
>
> Yup, and looking at this:
>
> +	u16 fastpath_allocs;	/* Consecutive fast allocs before slowpath */
> +	u16 slowpath_allocs;	/* Consecutive slow allocs before watermark */
>
> How much do operations on u16 hurt on, say, x86-64?

As opposed to unsigned int?  These simply use the word variations of the
mov, test, cmp, and inc instructions instead of long.  It's the same
tradeoff as using the u16 slub fields within struct page, except here it's
not strictly required because of size limitations, but rather for
cacheline optimization.

> It's nice that
> sizeof(struct kmem_cache_cpu) is capped at 32 bytes but on CPUs that
> have bigger cache lines, the types could be wider.
>

Right, this would not change the packed size of the struct, whereas using
unsigned int would.

Since MAX_OBJS_PER_PAGE (which should really be renamed
MAX_OBJS_PER_SLAB) ensures there is no overflow for u16 types, the only
time fastpath_allocs would need to be wider is when the object size is
sufficiently small and there have been frees to the cpu slab so that it
overflows.  In that circumstance, slowpath_allocs would simply be
incremented, and it would be corrected the next time a cpu slab does
allocate beyond the threshold (SLAB_THRASHING_THRESHOLD should never be
1).  The chance of reaching the threshold on successive fastpath counter
overflows grows exponentially.
And since slowpath_allocs will never overflow, because it's capped at
SLAB_THRASHING_THRESHOLD + 1 (the cpu slab will be refilled with a slab
that ensures slowpath_allocs is decremented the next time the slowpath
is invoked), overflow isn't an immediate problem with either field.

> Christoph, why is struct kmem_cache_cpu not __cacheline_aligned_in_smp
> btw?
>

This was removed in 4c93c355d5d563f300df7e61ef753d7a064411e9.
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing 2009-03-31 8:23 ` David Rientjes @ 2009-03-31 8:49 ` Pekka Enberg 0 siblings, 0 replies; 28+ messages in thread From: Pekka Enberg @ 2009-03-31 8:49 UTC (permalink / raw) To: David Rientjes; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel Hi David, On Mon, 2009-03-30 at 10:37 -0400, Christoph Lameter wrote: > > > That adds fastpath overhead and it shows for small objects in your tests. On Tue, 31 Mar 2009, Pekka Enberg wrote: > > Yup, and looking at this: > > > > + u16 fastpath_allocs; /* Consecutive fast allocs before slowpath */ > > + u16 slowpath_allocs; /* Consecutive slow allocs before watermark */ > > > > How much do operations on u16 hurt on, say, x86-64? On Tue, 2009-03-31 at 01:23 -0700, David Rientjes wrote: > As opposed to unsigned int? These simply use the word variations of the > mov, test, cmp, and inc instructions instead of long. It's the same > tradeoff when using the u16 slub fields within struct page except it's not > strictly required in this instance because of size limitations, but rather > for cacheline optimization. I was thinking of partial register stalls. But looking at it on x86-64, the generated asm seems sane. I see tons of branch instructions, though, so simplifying this somehow: + if (is_empty) { + if (c->fastpath_allocs < s->min_free_watermark) + c->slowpath_allocs++; + else if (c->slowpath_allocs) + c->slowpath_allocs--; + } else + c->slowpath_allocs = 0; + c->fastpath_allocs = 0; would be most welcome. Pekka ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing 2009-03-31 7:13 ` Pekka Enberg 2009-03-31 8:23 ` David Rientjes @ 2009-03-31 13:23 ` Christoph Lameter 1 sibling, 0 replies; 28+ messages in thread From: Christoph Lameter @ 2009-03-31 13:23 UTC (permalink / raw) To: Pekka Enberg; +Cc: David Rientjes, Nick Piggin, Martin Bligh, linux-kernel On Tue, 31 Mar 2009, Pekka Enberg wrote: > On Sun, 29 Mar 2009, David Rientjes wrote: > > > Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter > > > is incremented. This counter is cleared whenever the slowpath is > > > invoked. This tracks how many fastpath allocations the cpu slab has > > > fulfilled before it must be refilled. > On Mon, 2009-03-30 at 10:37 -0400, Christoph Lameter wrote: > > That adds fastpath overhead and it shows for small objects in your tests. > > Yup, and looking at this: > > + u16 fastpath_allocs; /* Consecutive fast allocs before slowpath */ > + u16 slowpath_allocs; /* Consecutive slow allocs before watermark */ > > How much do operations on u16 hurt on, say, x86-64? It's nice that > sizeof(struct kmem_cache_cpu) is capped at 32 bytes but on CPUs that > have bigger cache lines, the types could be wider. > > Christoph, why is struct kmem_cache_cpu not __cacheline_aligned_in_smp > btw? Because it is either allocated using kmalloc and aligned to a cacheline boundary there, or the kmem_cache_cpu entries come from the percpu definition for kmem_cache_cpu. There we don't need cacheline alignment since they are tightly packed. If the cacheline size is 64 bytes then neighboring kmem_cache_cpus fit into one cacheline, which reduces cache footprint and increases cache hotness. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes 2009-03-30 5:43 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes @ 2009-03-30 7:11 ` Pekka Enberg 2009-03-30 8:41 ` David Rientjes 2009-03-30 15:54 ` Mel Gorman 2009-03-30 14:30 ` Christoph Lameter 2 siblings, 2 replies; 28+ messages in thread From: Pekka Enberg @ 2009-03-30 7:11 UTC (permalink / raw) To: David Rientjes Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel, Mel Gorman On Sun, 2009-03-29 at 22:43 -0700, David Rientjes wrote: > The slab_thrash_ratio for each cache does not have a non-zero default > (yet?). If we're going to merge this code, I think it would be better to put a non-zero default there; otherwise we won't be able to catch potential performance regressions or bugs. Furthermore, the optimization is not very useful at large scale if people need to enable it themselves. Maybe stick 20 there and run tbench, sysbench, et al. to see if it makes a difference? I'm cc'ing Mel in case he has some suggestions on how to test it. Pekka ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 7:11 ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg @ 2009-03-30 8:41 ` David Rientjes 2009-03-30 15:54 ` Mel Gorman 1 sibling, 0 replies; 28+ messages in thread From: David Rientjes @ 2009-03-30 8:41 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel, Mel Gorman On Mon, 30 Mar 2009, Pekka Enberg wrote: > > The slab_thrash_ratio for each cache does not have a non-zero default > > (yet?). > > If we're going to merge this code, I think it would be better to put a > non-zero default there; otherwise we won't be able to catch potential > performance regressions or bugs. Furthermore, the optimization is not > very useful at large scale if people need to enable it themselves. > > Maybe stick 20 there and run tbench, sysbench, et al. to see if it makes > a difference? I'm cc'ing Mel in case he has some suggestions on how to test > it. > It won't cause a regression if sane SLAB_THRASHING_THRESHOLD and slab_thrash_ratio values are set, since the contention on list_lock will always be slower than utilizing a cpu slab with more free objects when it's thrashing. I agree that there should be a default value, and I was originally going to propose the following as the fourth patch in the series, but I wanted to generate commentary on the approach first, and there's always a hesitation when changing the default behavior of the entire allocator for workloads with very specific behavior that trigger this type of problem. The fact that we need a tunable for this is unfortunate, but there doesn't seem to be any other way to detect such situations and adjust the partial list handling so that list_lock isn't contended so much and the allocation slowpath to fastpath ratio isn't so high. I'd be interested to hear other people's approaches. 
--- diff --git a/mm/slub.c b/mm/slub.c --- a/mm/slub.c +++ b/mm/slub.c @@ -147,6 +147,12 @@ */ #define SLAB_THRASHING_THRESHOLD 3 +/* + * Default slab thrash ratio, used to define when a slab is thrashing for a + * particular cpu. + */ +#define DEFAULT_SLAB_THRASH_RATIO 20 + #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \ SLAB_POISON | SLAB_STORE_USER) @@ -2392,7 +2398,14 @@ static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags, */ set_min_partial(s, ilog2(s->size)); s->refcount = 1; - s->min_free_watermark = 0; + s->min_free_watermark = oo_objects(s->oo) * + DEFAULT_SLAB_THRASH_RATIO / 100; + /* + * It doesn't make sense to define a slab as thrashing if its threshold + * is fewer than 4 objects. + */ + if (s->min_free_watermark < 4) + s->min_free_watermark = 0; #ifdef CONFIG_NUMA s->remote_node_defrag_ratio = 1000; #endif ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 7:11 ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg 2009-03-30 8:41 ` David Rientjes @ 2009-03-30 15:54 ` Mel Gorman 2009-03-30 20:38 ` David Rientjes 1 sibling, 1 reply; 28+ messages in thread From: Mel Gorman @ 2009-03-30 15:54 UTC (permalink / raw) To: Pekka Enberg Cc: David Rientjes, Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel On Mon, Mar 30, 2009 at 10:11:31AM +0300, Pekka Enberg wrote: > On Sun, 2009-03-29 at 22:43 -0700, David Rientjes wrote: > > The slab_thrash_ratio for each cache does not have a non-zero default > > (yet?). > > If we're going to merge this code, I think it would be better to put a > non-zero default there; otherwise we won't be able to catch potential > performance regressions or bugs. Furthermore, the optimization is not > very useful at large scale if people need to enable it themselves. > > Maybe stick 20 there and run tbench, sysbench, et al. to see if it makes > a difference? I'm cc'ing Mel in case he has some suggestions on how to test > it. > netperf and tbench will both pound the sl*b allocator far more than sysbench will, in my opinion, although I don't have figures on hand to back that up. In the case of netperf, it might be particularly obvious if the client is on one CPU and the server on another, because I believe that means all allocs happen on one CPU and all frees on another. I have a vague concern that such a tunable needs to exist at all, though, and wonder what workloads it can hurt when set to, say, 20 versus any other value. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 15:54 ` Mel Gorman @ 2009-03-30 20:38 ` David Rientjes 0 siblings, 0 replies; 28+ messages in thread From: David Rientjes @ 2009-03-30 20:38 UTC (permalink / raw) To: Mel Gorman Cc: Pekka Enberg, Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel On Mon, 30 Mar 2009, Mel Gorman wrote: > netperf and tbench will both pound the sl*b allocator far more than sysbench > will, in my opinion, although I don't have figures on hand to back that up. In > the case of netperf, it might be particularly obvious if the client is on one > CPU and the server on another, because I believe that means all allocs happen > on one CPU and all frees on another. > My results are for two 16-core 64G machines on the same rack, one running netserver and the other running netperf. > I have a vague concern that such a tunable needs to exist at all, though, > and wonder what workloads it can hurt when set to, say, 20 versus any > other value. > The tunable needs to exist unless a counter-proposal is made that fixes this slub performance degradation compared to using slab. I'd be very interested to hear other proposals on how to detect and remedy such situations in the allocator without the addition of a tunable. As I mentioned previously in response to Pekka, it won't cause a further regression if sane SLAB_THRASHING_THRESHOLD and slab_thrash_ratio values are chosen. The rules are pretty simple as described by the implementation: if a cpu slab can only allocate 20% of its objects three times in a row, we're going to choose a slab with more free objects from the partial list while holding list_lock, as opposed to constantly contending on it. This is particularly important for the netperf benchmark because the only cpu slabs that thrash are the ones with NUMA locality to the cpu taking the networking interrupt (because remote_node_defrag_ratio was unchanged from its default, meaning we avoid remote node defragmentation 98% of the time). 
I haven't measured the fastpath implications of non-thrashing caches (the increment in the alloc fastpath and the conditional in the alloc slowpath for partial list sorting) yet, but your suggested experiments should show that quite well. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes 2009-03-30 5:43 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes 2009-03-30 7:11 ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg @ 2009-03-30 14:30 ` Christoph Lameter 2009-03-30 20:12 ` David Rientjes 2 siblings, 1 reply; 28+ messages in thread From: Christoph Lameter @ 2009-03-30 14:30 UTC (permalink / raw) To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Sun, 29 Mar 2009, David Rientjes wrote: > @@ -2291,10 +2292,13 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order) > /* > * Determine the number of objects per slab > */ > + if (oo_objects(s->oo)) > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); > s->oo = oo_make(order, size); s->oo is set *after* you check it. Check oo_objects after the value has been set please. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 14:30 ` Christoph Lameter @ 2009-03-30 20:12 ` David Rientjes 2009-03-30 21:19 ` Christoph Lameter 0 siblings, 1 reply; 28+ messages in thread From: David Rientjes @ 2009-03-30 20:12 UTC (permalink / raw) To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Mon, 30 Mar 2009, Christoph Lameter wrote: > > @@ -2291,10 +2292,13 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order) > > /* > > * Determine the number of objects per slab > > */ > > + if (oo_objects(s->oo)) > > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); > > s->oo = oo_make(order, size); > > s->oo is set *after* you check it. Check oo_objects after the value has > been set please. > It's actually right the way I implemented it, oo_objects(s->oo) will be 0 when this is called for kmem_cache_open() meaning there is no preexisting slab_thrash_ratio. But this check is required when calculate_sizes() is called from order_store() to adjust the slab_thrash_ratio for the new objects per slab. The above check is saving the old thrash ratio so the new s->min_free_watermark value can be set following the oo_make(). This was mentioned in the changelog for this patch: The value is stored in terms of the number of objects that the ratio represents, not the ratio itself. This avoids costly arithmetic in the slowpath for a calculation that could otherwise be done only when `slab_thrash_ratio' or `order' is changed. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 20:12 ` David Rientjes @ 2009-03-30 21:19 ` Christoph Lameter 2009-03-30 22:48 ` David Rientjes 0 siblings, 1 reply; 28+ messages in thread From: Christoph Lameter @ 2009-03-30 21:19 UTC (permalink / raw) To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Mon, 30 Mar 2009, David Rientjes wrote: > > > + if (oo_objects(s->oo)) > > > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); > > > s->oo = oo_make(order, size); > > > > s->oo is set *after* you check it. Check oo_objects after the value has > > been set please. > > > > It's actually right the way I implemented it, oo_objects(s->oo) will be 0 > when this is called for kmem_cache_open() meaning there is no preexisting > slab_thrash_ratio. But this check is required when calculate_sizes() is > called from order_store() to adjust the slab_thrash_ratio for the new > objects per slab. The above check is saving the old thrash ratio so > the new s->min_free_watermark value can be set following the > oo_make(). This was mentioned in the changelog for this patch: > > The value is stored in terms of the number of objects that the > ratio represents, not the ratio itself. This avoids costly > arithmetic in the slowpath for a calculation that could otherwise > be done only when `slab_thrash_ratio' or `order' is changed. Then it's the wrong place to set it. Initializations are done in kmem_cache_open() after calculate_sizes() is called. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 21:19 ` Christoph Lameter @ 2009-03-30 22:48 ` David Rientjes 2009-03-31 4:44 ` David Rientjes 0 siblings, 1 reply; 28+ messages in thread From: David Rientjes @ 2009-03-30 22:48 UTC (permalink / raw) To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Mon, 30 Mar 2009, Christoph Lameter wrote: > > > > + if (oo_objects(s->oo)) > > > > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); > > > > s->oo = oo_make(order, size); > > > > > > s->oo is set *after* you check it. Check oo_objects after the value has > > > been set please. > > > > > > > It's actually right the way I implemented it, oo_objects(s->oo) will be 0 > > when this is called for kmem_cache_open() meaning there is no preexisting > > slab_thrash_ratio. But this check is required when calculate_sizes() is > > called from order_store() to adjust the slab_thrash_ratio for the new > > objects per slab. The above check is saving the old thrash ratio so > > the new s->min_free_watermark value can be set following the > > oo_make(). This was mentioned in the changelog for this patch: > > > > The value is stored in terms of the number of objects that the > > ratio represents, not the ratio itself. This avoids costly > > arithmetic in the slowpath for a calculation that could otherwise > > be done only when `slab_thrash_ratio' or `order' is changed. > > Then it's the wrong place to set it. Initializations are done in > kmem_cache_open() after calculate_sizes() is called. > The way the code is currently written, this acts as an initialization when there was no previous object count (i.e. it's coming from kmem_cache_open()) and acts as an adjustment when there was a previous count (i.e. /sys/kernel/slab/cache/order was changed). 
The only way to avoid adding this to calculate_sizes() would be to add logic to order_store() to adjust the watermark when the order changes, but that duplicates the same calculation that is required for initialization if s->min_free_watermark does get a default value in kmem_cache_open() as Pekka suggested. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 22:48 ` David Rientjes @ 2009-03-31 4:44 ` David Rientjes 2009-03-31 13:26 ` Christoph Lameter 0 siblings, 1 reply; 28+ messages in thread From: David Rientjes @ 2009-03-31 4:44 UTC (permalink / raw) To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Mon, 30 Mar 2009, David Rientjes wrote: > The way the code is currently written, this acts as an initialization when > there was no previous object count (i.e. its coming from > kmem_cache_open()) and acts as an adjustment when there was a previous > count (i.e. /sys/kernel/slab/cache/order was changed). The only way to > avoid adding this to calculate_sizes() would be to add logic to > order_store() to adjust the watermark when the order changes, but that > duplicates the same calculation that is required for initialization if > s->min_free_watermark does get a default value in kmem_cache_open() as > Pekka suggested. I applied the following to the patchset so that the initialization of the watermark is always separate from the updating as a result of changing /sys/kernel/slab/cache/order. Since the setting of a default watermark in kmem_cache_open() will require a calculation to find the corresponding min_free_watermark depending on the object size for a pre-defined default ratio, it will require the same calculation that is now in order_store(), but I agree it's simpler to understand and justifies the code duplication. 
--- diff --git a/mm/slub.c b/mm/slub.c index 76fa5a6..61ae612 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2187,7 +2187,6 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order) unsigned long flags = s->flags; unsigned long size = s->objsize; unsigned long align = s->align; - u16 thrash_ratio = 0; int order; /* @@ -2293,13 +2292,10 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order) /* * Determine the number of objects per slab */ - if (oo_objects(s->oo)) - thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); s->oo = oo_make(order, size); s->min = oo_make(get_order(size), size); if (oo_objects(s->oo) > oo_objects(s->max)) s->max = s->oo; - s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100; return !!oo_objects(s->oo); @@ -3824,6 +3820,7 @@ static ssize_t order_store(struct kmem_cache *s, const char *buf, size_t length) { unsigned long order; + unsigned long thrash_ratio; int err; err = strict_strtoul(buf, 10, &order); @@ -3833,7 +3830,9 @@ static ssize_t order_store(struct kmem_cache *s, if (order > slub_max_order || order < slub_min_order) return -EINVAL; + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); calculate_sizes(s, order); + s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100; return length; } ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-31 4:44 ` David Rientjes @ 2009-03-31 13:26 ` Christoph Lameter 2009-03-31 17:21 ` David Rientjes 0 siblings, 1 reply; 28+ messages in thread From: Christoph Lameter @ 2009-03-31 13:26 UTC (permalink / raw) To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Mon, 30 Mar 2009, David Rientjes wrote: > @@ -3833,7 +3830,9 @@ static ssize_t order_store(struct kmem_cache *s, > if (order > slub_max_order || order < slub_min_order) > return -EINVAL; > > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); > calculate_sizes(s, order); > + s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100; > return length; > } > Hmmm.. Still calculating the thrash ratio based on existing objects per slab and then resetting the objects per slab to a different number. Shouldn't the thrash_ratio simply be zapped to an initial value if the number of objects per slab changes? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-31 13:26 ` Christoph Lameter @ 2009-03-31 17:21 ` David Rientjes 2009-03-31 17:24 ` Christoph Lameter 0 siblings, 1 reply; 28+ messages in thread From: David Rientjes @ 2009-03-31 17:21 UTC (permalink / raw) To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Tue, 31 Mar 2009, Christoph Lameter wrote: > > @@ -3833,7 +3830,9 @@ static ssize_t order_store(struct kmem_cache *s, > > if (order > slub_max_order || order < slub_min_order) > > return -EINVAL; > > > > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); > > calculate_sizes(s, order); > > + s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100; > > return length; > > } > > > > Hmmm.. Still calculating the thrash ratio based on existing objects per > slab and then resetting the objects per slab to a different number. > Shouldn't the thrash_ratio simply be zapped to an initial value if the > number of objects per slab changes? > Each cache with >= 20 objects per slab will get a default slab_thrash_ratio of 20 in v2 of the series. If the order of a cache is subsequently tuned, the default slab_thrash_ratio would be cleared without the user's knowledge. I'd agree that it should be cleared if the tunable had object units instead of a ratio, but the ratio simply applies to any given order. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-31 17:21 ` David Rientjes @ 2009-03-31 17:24 ` Christoph Lameter 2009-03-31 17:35 ` David Rientjes 0 siblings, 1 reply; 28+ messages in thread From: Christoph Lameter @ 2009-03-31 17:24 UTC (permalink / raw) To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Tue, 31 Mar 2009, David Rientjes wrote: > I'd agree that it should be cleared if the tunable had object units > instead of a ratio, but the ratio simply applies to any given order. Right, but resetting the order usually has a significant impact on the thrashing behavior (if it exists). Why would we keep the thrashing ratio that was calculated for another slab configuration? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-31 17:24 ` Christoph Lameter @ 2009-03-31 17:35 ` David Rientjes 0 siblings, 0 replies; 28+ messages in thread From: David Rientjes @ 2009-03-31 17:35 UTC (permalink / raw) To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Tue, 31 Mar 2009, Christoph Lameter wrote: > > I'd agree that it should be cleared if the tunable had object units > > instead of a ratio, but the ratio simply applies to any given order. > > Right, but resetting the order usually has a significant impact on the > thrashing behavior (if it exists). Why would we keep the thrashing ratio > that was calculated for another slab configuration? > Either the default thrashing ratio is being used and is unchanged from boot time, in which case it will still apply to the new order, or the ratio has already been changed and userspace is responsible for tuning it again as the result of the new slab size. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 0/3] slub partial list thrashing performance degradation 2009-03-30 5:43 [patch 0/3] slub partial list thrashing performance degradation David Rientjes 2009-03-30 5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes @ 2009-03-30 6:38 ` Pekka Enberg 1 sibling, 0 replies; 28+ messages in thread From: Pekka Enberg @ 2009-03-30 6:38 UTC (permalink / raw) To: David Rientjes; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel On Sun, 2009-03-29 at 22:43 -0700, David Rientjes wrote: > SLUB causes a performance degradation in comparison to SLAB when a > workload has an object allocation and freeing pattern such that it spends > more time in partial list handling than utilizing the fastpaths. Christoph, Nick, any objections to merging this? The patches look sane and the numbers convincing enough to me. Pekka ^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 1/3] slub: add per-cache slab thrash ratio @ 2009-03-26 9:42 David Rientjes 2009-03-26 9:42 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes 0 siblings, 1 reply; 28+ messages in thread From: David Rientjes @ 2009-03-26 9:42 UTC (permalink / raw) To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel Adds /sys/kernel/slab/cache/slab_thrash_ratio, which represents the percentage of a slab's objects that the fastpath must fulfill to not be considered thrashing on a per-cpu basis[*]. "Thrashing" here is defined as the constant swapping of the cpu slab such that the slowpath is followed the majority of the time because the refilled cpu slab can only accommodate a small number of allocations. This occurs when the object allocation and freeing pattern for a cache is such that it spends more time swapping the cpu slab than fulfilling fastpath allocations. [*] A single instance of the thrash ratio not being reached in the fastpath does not indicate the cpu cache is thrashing. A pre-defined value will later be added to determine how many times the ratio must not be reached before a cache is actually thrashing. This is defined as a ratio based on the number of objects in a cache's slab. This is automatically changed when /sys/kernel/slab/cache/order is changed to reflect the same ratio. The netperf TCP_RR benchmark illustrates slab thrashing very well with a large number of threads. With a test length of 60 seconds, the following thread counts were used to show the effect of the allocation and freeing pattern of such a workload. Before this patchset: threads Transfer Rate (per sec) 10 66636.39 20 96311.02 40 103948.16 60 140977.62 80 166714.37 100 190431.35 200 244092.36 To identify the thrashing caches, the same workload was run with CONFIG_SLUB_STATS enabled. 
The following caches are obviously performing very poorly: cache ALLOC_FASTPATH ALLOC_SLOWPATH FREE_FASTPATH FREE_SLOWPATH kmalloc-256 45186169 15930724 88289 61028526 kmalloc-2048 33507239 27541884 46525 61002601 After this patchset (both caches with slab_thrash_ratios of 20): threads Transfer Rate (per sec) 10 68857.31 20 98335.04 40 124376.77 60 146014.14 80 177352.16 100 195467.61 200 245555.99 Although slabs may accommodate fewer objects than others when contiguous memory cannot be allocated for a cache's order, the ratio is still based on its configured `order' since slabs will exist on the partial list that will be able to fulfill such a requirement. The value is stored in terms of the number of objects that the ratio represents, not the ratio itself. This avoids costly arithmetic in the slowpath for a calculation that could otherwise be done only when `slab_thrash_ratio' or `order' is changed. This will also adjust the configured ratio to one that can actually be represented in terms of whole numbers: for example, if slab_thrash_ratio is set to 20 for a cache with 64 objects, the effective ratio is actually 3:16 (or 18.75%). This will be shown when reading the ratio since it is better to represent the actual ratio instead of a pseudo substitute. The slab_thrash_ratio for each cache does not have a non-zero default (yet?). 
Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: David Rientjes <rientjes@google.com> --- include/linux/slub_def.h | 1 + mm/slub.c | 29 +++++++++++++++++++++++++++++ 2 files changed, 30 insertions(+), 0 deletions(-) diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -94,6 +94,7 @@ struct kmem_cache { #ifdef CONFIG_SLUB_DEBUG struct kobject kobj; /* For sysfs */ #endif + u16 min_free_watermark; /* Calculated from slab thrash ratio */ #ifdef CONFIG_NUMA /* diff --git a/mm/slub.c b/mm/slub.c --- a/mm/slub.c +++ b/mm/slub.c @@ -2190,6 +2190,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order) unsigned long flags = s->flags; unsigned long size = s->objsize; unsigned long align = s->align; + u16 thrash_ratio = 0; int order; /* @@ -2295,10 +2296,13 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order) /* * Determine the number of objects per slab */ + if (oo_objects(s->oo)) + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); s->oo = oo_make(order, size); s->min = oo_make(get_order(size), size); if (oo_objects(s->oo) > oo_objects(s->max)) s->max = s->oo; + s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100; return !!oo_objects(s->oo); @@ -2320,6 +2324,7 @@ static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags, goto error; s->refcount = 1; + s->min_free_watermark = 0; #ifdef CONFIG_NUMA s->remote_node_defrag_ratio = 1000; #endif @@ -4089,6 +4094,29 @@ static ssize_t remote_node_defrag_ratio_store(struct kmem_cache *s, SLAB_ATTR(remote_node_defrag_ratio); #endif +static ssize_t slab_thrash_ratio_show(struct kmem_cache *s, char *buf) +{ + return sprintf(buf, "%d\n", + s->min_free_watermark * 100 / oo_objects(s->oo)); +} + +static ssize_t slab_thrash_ratio_store(struct kmem_cache *s, const char *buf, + size_t length) +{ + unsigned long ratio; + int err; + + err = 
strict_strtoul(buf, 10, &ratio); + if (err) + return err; + + if (ratio <= 100) + s->min_free_watermark = oo_objects(s->oo) * ratio / 100; + + return length; +} +SLAB_ATTR(slab_thrash_ratio); + #ifdef CONFIG_SLUB_STATS static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si) { @@ -4172,6 +4200,7 @@ static struct attribute *slab_attrs[] = { &shrink_attr.attr, &alloc_calls_attr.attr, &free_calls_attr.attr, + &slab_thrash_ratio_attr.attr, #ifdef CONFIG_ZONE_DMA &cache_dma_attr.attr, #endif ^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 2/3] slub: scan partial list for free slabs when thrashing 2009-03-26 9:42 [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes @ 2009-03-26 9:42 ` David Rientjes 0 siblings, 0 replies; 28+ messages in thread From: David Rientjes @ 2009-03-26 9:42 UTC (permalink / raw) To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel To determine when a slab is actually thrashing, it's insufficient to only look at the most recent allocation path. It's perfectly valid to swap the cpu slab with a partial slab that contains very few free objects if the goal is to quickly fill it since slub no longer needs to track such slabs. This is inefficient if an object will immediately be freed so that the full slab must be readded to the partial list. With certain object allocation and freeing patterns, it is possible to spend more time processing the partial list than utilizing the fastpaths. We already have a per-cache min_free_watermark setting that is configurable from userspace, which helps determine when we have excessive partial list handling. When a slab does not fulfill its watermark, it suggests that the cache may be thrashing. A pre-defined value, SLAB_THRASHING_THRESHOLD (which defaults to 3), is implemented to be used in conjunction with this statistic to determine when a slab is actually thrashing. Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter is incremented. This counter is cleared whenever the slowpath is invoked. This tracks how many fastpath allocations the cpu slab has fulfilled before it must be refilled. When the slowpath must be invoked, a slowpath counter is incremented if the cpu slab did not fulfill the thrashing watermark. Otherwise, it is decremented. When the slowpath counter is greater than or equal to SLAB_THRASHING_THRESHOLD, the partial list is scanned for a slab that will be able to fulfill at least the number of objects required to not be considered thrashing. 
If no such slabs are available, the remote nodes are defragmented (if allowed) or a new slab is allocated. If a cpu slab must be swapped because the allocation is for a different node, both counters are cleared since this doesn't indicate any thrashing behavior. When /sys/kernel/slab/cache/slab_thrash_ratio is not set, this does not include any functional change other than the incrementing of a fastpath counter for the per-cpu cache. A new statistic, /sys/kernel/slab/cache/deferred_partial, indicates how many times a partial list was deferred because no slabs could satisfy the requisite number of objects for CONFIG_SLUB_STATS kernels. Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: David Rientjes <rientjes@google.com> --- include/linux/slub_def.h | 3 + mm/slub.c | 93 ++++++++++++++++++++++++++++++++++++---------- 2 files changed, 76 insertions(+), 20 deletions(-) diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -30,6 +30,7 @@ enum stat_item { DEACTIVATE_TO_TAIL, /* Cpu slab was moved to the tail of partials */ DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */ ORDER_FALLBACK, /* Number of times fallback was necessary */ + DEFERRED_PARTIAL, /* Defer local partial list for lack of objs */ NR_SLUB_STAT_ITEMS }; struct kmem_cache_cpu { @@ -38,6 +39,8 @@ struct kmem_cache_cpu { int node; /* The node of the page (or -1 for debug) */ unsigned int offset; /* Freepointer offset (in word units) */ unsigned int objsize; /* Size of an object (from kmem_cache) */ + u16 fastpath_allocs; /* Consecutive fast allocs before slowpath */ + u16 slowpath_allocs; /* Consecutive slow allocs before watermark */ #ifdef CONFIG_SLUB_STATS unsigned stat[NR_SLUB_STAT_ITEMS]; #endif diff --git a/mm/slub.c b/mm/slub.c --- a/mm/slub.c +++ b/mm/slub.c @@ -134,6 +134,19 @@ */ #define MAX_PARTIAL 10 +/* + * Number of successive slowpath allocations 
that have failed to allocate at + * least the number of objects in the fastpath to not be slab thrashing (as + * defined by the cache's slab thrash ratio). + * + * When an allocation follows the slowpath, it increments a counter in its cpu + * cache. If this counter exceeds the threshold, the partial list is scanned + * for a slab that will satisfy at least the cache's min_free_watermark in + * order for it to be used. Otherwise, the slab with the most free objects is + * used. + */ +#define SLAB_THRASHING_THRESHOLD 3 + #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \ SLAB_POISON | SLAB_STORE_USER) @@ -1252,28 +1265,30 @@ static void remove_partial(struct kmem_cache *s, struct page *page) } /* - * Lock slab and remove from the partial list. + * Remove from the partial list. * - * Must hold list_lock. + * Must hold n->list_lock and slab_lock(page). */ -static inline int lock_and_freeze_slab(struct kmem_cache_node *n, - struct page *page) +static inline void freeze_slab(struct kmem_cache_node *n, struct page *page) { - if (slab_trylock(page)) { - list_del(&page->lru); - n->nr_partial--; - __SetPageSlubFrozen(page); - return 1; - } - return 0; + list_del(&page->lru); + n->nr_partial--; + __SetPageSlubFrozen(page); +} + +static inline int skip_partial(struct kmem_cache *s, struct page *page) +{ + return (page->objects - page->inuse) < s->min_free_watermark; } /* * Try to allocate a partial slab from a specific node. */ -static struct page *get_partial_node(struct kmem_cache_node *n) +static struct page *get_partial_node(struct kmem_cache *s, + struct kmem_cache_node *n, int thrashing) { struct page *page; + int locked = 0; /* * Racy check. 
If we mistakenly see no partial slabs then we @@ -1286,9 +1301,28 @@ static struct page *get_partial_node(struct kmem_cache_node *n) spin_lock(&n->list_lock); list_for_each_entry(page, &n->partial, lru) - if (lock_and_freeze_slab(n, page)) + if (slab_trylock(page)) { + /* + * When the cpu cache is partial list thrashing, it's + * necessary to replace the cpu slab with one that will + * accommodate at least s->min_free_watermark objects + * to avoid excessive list_lock contention and cache + * polluting. + * + * If no such slabs exist on the partial list, remote + * nodes are defragmented if allowed. + */ + if (thrashing && skip_partial(s, page)) { + slab_unlock(page); + locked++; + continue; + } + freeze_slab(n, page); goto out; + } page = NULL; + if (locked) + stat(get_cpu_slab(s, raw_smp_processor_id()), DEFERRED_PARTIAL); out: spin_unlock(&n->list_lock); return page; @@ -1297,7 +1331,8 @@ out: /* * Get a page from somewhere. Search in increasing NUMA distances. */ -static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags) +static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags, + int thrashing) { #ifdef CONFIG_NUMA struct zonelist *zonelist; @@ -1336,7 +1371,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags) if (n && cpuset_zone_allowed_hardwall(zone, flags) && n->nr_partial > n->min_partial) { - page = get_partial_node(n); + page = get_partial_node(s, n, thrashing); if (page) return page; } @@ -1348,16 +1383,17 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags) /* * Get a partial page, lock it and return it. */ -static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node) +static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node, + int thrashing) { struct page *page; int searchnode = (node == -1) ? 
numa_node_id() : node; - page = get_partial_node(get_node(s, searchnode)); + page = get_partial_node(s, get_node(s, searchnode), thrashing); if (page || (flags & __GFP_THISNODE)) return page; - return get_any_partial(s, flags); + return get_any_partial(s, flags, thrashing); } /* @@ -1509,6 +1545,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, { void **object; struct page *new; + int is_empty = 0; /* We handle __GFP_ZERO in the caller */ gfpflags &= ~__GFP_ZERO; @@ -1517,7 +1554,8 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, goto new_slab; slab_lock(c->page); - if (unlikely(!node_match(c, node))) + is_empty = node_match(c, node); + if (unlikely(!is_empty)) goto another_slab; stat(c, ALLOC_REFILL); @@ -1542,7 +1580,17 @@ another_slab: deactivate_slab(s, c); new_slab: - new = get_partial(s, gfpflags, node); + if (is_empty) { + if (c->fastpath_allocs < s->min_free_watermark) + c->slowpath_allocs++; + else if (c->slowpath_allocs) + c->slowpath_allocs--; + } else + c->slowpath_allocs = 0; + c->fastpath_allocs = 0; + + new = get_partial(s, gfpflags, node, + c->slowpath_allocs > SLAB_THRASHING_THRESHOLD); if (new) { c->page = new; stat(c, ALLOC_FROM_PARTIAL); @@ -1611,6 +1659,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s, else { object = c->freelist; c->freelist = object[c->offset]; + c->fastpath_allocs++; stat(c, ALLOC_FASTPATH); } local_irq_restore(flags); @@ -1919,6 +1968,8 @@ static void init_kmem_cache_cpu(struct kmem_cache *s, c->node = 0; c->offset = s->offset / sizeof(void *); c->objsize = s->objsize; + c->fastpath_allocs = 0; + c->slowpath_allocs = 0; #ifdef CONFIG_SLUB_STATS memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned)); #endif @@ -4172,6 +4223,7 @@ STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate_to_head); STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail); STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees); STAT_ATTR(ORDER_FALLBACK, order_fallback); 
+STAT_ATTR(DEFERRED_PARTIAL, deferred_partial); #endif static struct attribute *slab_attrs[] = { @@ -4226,6 +4278,7 @@ static struct attribute *slab_attrs[] = { &deactivate_to_tail_attr.attr, &deactivate_remote_frees_attr.attr, &order_fallback_attr.attr, + &deferred_partial_attr.attr, #endif NULL }; ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2009-03-31 17:37 UTC | newest] Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-03-30 5:43 [patch 0/3] slub partial list thrashing performance degradation David Rientjes 2009-03-30 5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes 2009-03-30 5:43 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes 2009-03-30 5:43 ` [patch 3/3] slub: sort parital list " David Rientjes 2009-03-30 14:41 ` Christoph Lameter 2009-03-30 20:29 ` David Rientjes 2009-03-30 14:37 ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter 2009-03-30 20:22 ` David Rientjes 2009-03-30 21:20 ` Christoph Lameter 2009-03-31 7:13 ` Pekka Enberg 2009-03-31 8:23 ` David Rientjes 2009-03-31 8:49 ` Pekka Enberg 2009-03-31 13:23 ` Christoph Lameter 2009-03-30 7:11 ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg 2009-03-30 8:41 ` David Rientjes 2009-03-30 15:54 ` Mel Gorman 2009-03-30 20:38 ` David Rientjes 2009-03-30 14:30 ` Christoph Lameter 2009-03-30 20:12 ` David Rientjes 2009-03-30 21:19 ` Christoph Lameter 2009-03-30 22:48 ` David Rientjes 2009-03-31 4:44 ` David Rientjes 2009-03-31 13:26 ` Christoph Lameter 2009-03-31 17:21 ` David Rientjes 2009-03-31 17:24 ` Christoph Lameter 2009-03-31 17:35 ` David Rientjes 2009-03-30 6:38 ` [patch 0/3] slub partial list thrashing performance degradation Pekka Enberg -- strict thread matches above, loose matches on Subject: below -- 2009-03-26 9:42 [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes 2009-03-26 9:42 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes