* [patch 0/3] slub partial list thrashing performance degradation
@ 2009-03-30 5:43 David Rientjes
2009-03-30 5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes
2009-03-30 6:38 ` [patch 0/3] slub partial list thrashing performance degradation Pekka Enberg
0 siblings, 2 replies; 28+ messages in thread
From: David Rientjes @ 2009-03-30 5:43 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel
SLUB causes a performance degradation in comparison to SLAB when a
workload has an object allocation and freeing pattern such that it spends
more time in partial list handling than utilizing the fastpaths.
This usually occurs when freeing to a non-cpu slab either due to remote
cpu freeing or freeing to a full or partial slab. When the cpu slab is
later replaced with the freeing slab, it can only satisfy a limited
number of allocations before becoming full and requiring additional
partial list handling.
When the slowpath to fastpath ratio becomes high, this partial list
handling causes the entire allocator to become very slow for the specific
workload.
The bash script at the end of this email (inline) illustrates the
performance degradation well. It uses the netperf TCP_RR benchmark to
measure transfer rates with various thread counts, each being multiples
of the number of cores. The transfer rates are reported as an aggregate
of the individual thread results.
CONFIG_SLUB_STATS demonstrates that the kmalloc-256 and kmalloc-2048
caches are performing quite poorly:
cache ALLOC_FASTPATH ALLOC_SLOWPATH
kmalloc-256 98125871 31585955
kmalloc-2048 77243698 52347453
cache FREE_FASTPATH FREE_SLOWPATH
kmalloc-256 173624 129538000
kmalloc-2048 90520 129500630
The majority of slowpath allocations were from the partial list
(30786261, or 97.5%, for kmalloc-256 and 51688159, or 98.7%, for
kmalloc-2048).
A large percentage of frees required the slab to be added back to the
partial list. For kmalloc-256, 30786630 (23.8%) of slowpath frees
required partial list handling. For kmalloc-2048, 51688697 (39.9%) of
slowpath frees required partial list handling.
On my 16-core machines with 64G of RAM, these are the results:
# threads SLAB SLUB SLUB+patchset
16 69892 71592 69505
32 126490 95373 119731
48 138050 113072 125014
64 169240 149043 158919
80 192294 172035 179679
96 197779 187849 192154
112 217283 204962 209988
128 229848 217547 223507
144 238550 232369 234565
160 250333 239871 244789
176 256878 242712 248971
192 261611 243182 255596
[ The SLUB+patchset results were attained with the latest git plus this
patchset and slab_thrash_ratio set at 20 for both the kmalloc-256 and
the kmalloc-2048 cache. ]
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/slub_def.h | 4 +
mm/slub.c | 138 +++++++++++++++++++++++++++++++++++++++-------
2 files changed, 122 insertions(+), 20 deletions(-)
#!/bin/bash
TIME=60 # seconds
HOSTNAME=<hostname> # netserver
NR_CPUS=$(grep ^processor /proc/cpuinfo | wc -l)
echo NR_CPUS=$NR_CPUS
run_netperf() {
	for i in $(seq 1 $1); do
		netperf -H $HOSTNAME -t TCP_RR -l $TIME &
	done
}
ITERATIONS=0
while [ $ITERATIONS -lt 12 ]; do
	RATE=0
	ITERATIONS=$((ITERATIONS + 1))
	THREADS=$((NR_CPUS * ITERATIONS))
	RESULTS=$(run_netperf $THREADS | grep -v '[a-zA-Z]' | awk '{ print $6 }')
	for j in $RESULTS; do
		RATE=$((RATE + ${j/.*}))
	done
	echo threads=$THREADS rate=$RATE
done
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 1/3] slub: add per-cache slab thrash ratio
2009-03-30 5:43 [patch 0/3] slub partial list thrashing performance degradation David Rientjes
@ 2009-03-30 5:43 ` David Rientjes
2009-03-30 5:43 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes
` (2 more replies)
2009-03-30 6:38 ` [patch 0/3] slub partial list thrashing performance degradation Pekka Enberg
1 sibling, 3 replies; 28+ messages in thread
From: David Rientjes @ 2009-03-30 5:43 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel
Adds /sys/kernel/slab/cache/slab_thrash_ratio, which represents the
percentage of a slab's objects that the fastpath must fulfill to not be
considered thrashing on a per-cpu basis[*].
"Thrashing" here is defined as the constant swapping of the cpu slab such
that the slowpath is followed the majority of the time because the
refilled cpu slab can only accommodate a small number of allocations.
This occurs when the object allocation and freeing pattern for a cache is
such that it spends more time swapping the cpu slab than fulfilling
fastpath allocations.
[*] A single instance of the thrash ratio not being reached in the
fastpath does not indicate the cpu cache is thrashing. A
pre-defined value will later be added to determine how many times
the ratio must not be reached before a cache is actually thrashing.
The ratio is based on the number of objects in a cache's slab. When
/sys/kernel/slab/cache/order is changed, the stored value is
automatically recalculated to preserve the same ratio.
The netperf TCP_RR benchmark illustrates slab thrashing very well with a
large number of threads. With a test length of 60 seconds, the following
thread counts were used to show the effect of the allocation and freeing
pattern of such a workload.
Before this patchset:
threads Transfer Rate (per sec)
16 71592
32 95373
48 113072
64 149043
80 172035
96 187849
112 204962
128 217547
144 232369
160 239871
176 242712
192 243182
To identify the thrashing caches, the same workload was run with
CONFIG_SLUB_STATS enabled. The following caches are obviously performing
very poorly:
cache ALLOC_FASTPATH ALLOC_SLOWPATH
kmalloc-256 98125871 31585955
kmalloc-2048 77243698 52347453
cache FREE_FASTPATH FREE_SLOWPATH
kmalloc-256 173624 129538000
kmalloc-2048 90520 129500630
After this patchset (both caches with slab_thrash_ratios of 20):
threads Transfer Rate (per sec)
16 69505
32 119731
48 125014
64 158919
80 179679
96 192154
112 209988
128 223507
144 234565
160 244789
176 248971
192 255596
Although some slabs may accommodate fewer objects than others when
contiguous memory cannot be allocated for the cache's order, the ratio is
still based on the configured `order', since slabs capable of fulfilling
the requirement will still exist on the partial list.
The value is stored in terms of the number of objects that the ratio
represents, not the ratio itself. This avoids costly arithmetic in the
slowpath for a calculation that could otherwise be done only when
`slab_thrash_ratio' or `order' is changed.
This will also adjust the configured ratio to one that can actually be
represented in whole numbers: for example, if slab_thrash_ratio is set
to 20 for a cache with 64 objects, the effective ratio is actually 3:16
(or 18.75%). The effective value is what is shown when reading the
ratio, since it is better to report the actual ratio than a pseudo
substitute.
The slab_thrash_ratio for each cache does not have a non-zero default
(yet?).
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/slub_def.h | 1 +
mm/slub.c | 29 +++++++++++++++++++++++++++++
2 files changed, 30 insertions(+), 0 deletions(-)
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -94,6 +94,7 @@ struct kmem_cache {
#ifdef CONFIG_SLUB_DEBUG
struct kobject kobj; /* For sysfs */
#endif
+ u16 min_free_watermark; /* Calculated from slab thrash ratio */
#ifdef CONFIG_NUMA
/*
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2186,6 +2186,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
unsigned long flags = s->flags;
unsigned long size = s->objsize;
unsigned long align = s->align;
+ u16 thrash_ratio = 0;
int order;
/*
@@ -2291,10 +2292,13 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
/*
* Determine the number of objects per slab
*/
+ if (oo_objects(s->oo))
+ thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo);
s->oo = oo_make(order, size);
s->min = oo_make(get_order(size), size);
if (oo_objects(s->oo) > oo_objects(s->max))
s->max = s->oo;
+ s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100;
return !!oo_objects(s->oo);
@@ -2321,6 +2325,7 @@ static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
*/
set_min_partial(s, ilog2(s->size));
s->refcount = 1;
+ s->min_free_watermark = 0;
#ifdef CONFIG_NUMA
s->remote_node_defrag_ratio = 1000;
#endif
@@ -4110,6 +4115,29 @@ static ssize_t remote_node_defrag_ratio_store(struct kmem_cache *s,
SLAB_ATTR(remote_node_defrag_ratio);
#endif
+static ssize_t slab_thrash_ratio_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n",
+ s->min_free_watermark * 100 / oo_objects(s->oo));
+}
+
+static ssize_t slab_thrash_ratio_store(struct kmem_cache *s, const char *buf,
+ size_t length)
+{
+ unsigned long ratio;
+ int err;
+
+ err = strict_strtoul(buf, 10, &ratio);
+ if (err)
+ return err;
+
+ if (ratio <= 100)
+ s->min_free_watermark = oo_objects(s->oo) * ratio / 100;
+
+ return length;
+}
+SLAB_ATTR(slab_thrash_ratio);
+
#ifdef CONFIG_SLUB_STATS
static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
{
@@ -4194,6 +4222,7 @@ static struct attribute *slab_attrs[] = {
&shrink_attr.attr,
&alloc_calls_attr.attr,
&free_calls_attr.attr,
+ &slab_thrash_ratio_attr.attr,
#ifdef CONFIG_ZONE_DMA
&cache_dma_attr.attr,
#endif
* [patch 2/3] slub: scan partial list for free slabs when thrashing
2009-03-30 5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes
@ 2009-03-30 5:43 ` David Rientjes
2009-03-30 5:43 ` [patch 3/3] slub: sort partial list " David Rientjes
2009-03-30 14:37 ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter
2009-03-30 7:11 ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg
2009-03-30 14:30 ` Christoph Lameter
2 siblings, 2 replies; 28+ messages in thread
From: David Rientjes @ 2009-03-30 5:43 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel
To determine when a slab is actually thrashing, it's insufficient to only
look at the most recent allocation path. It's perfectly valid to swap
the cpu slab with a partial slab that contains very few free objects if
the goal is to fill it quickly, since slub no longer needs to track such
slabs once they are full.
This is inefficient, however, if an object is immediately freed so that
the full slab must be re-added to the partial list. With certain object
allocation and freeing patterns, it is possible to spend more time
processing the partial list than utilizing the fastpaths.
We already have a per-cache min_free_watermark setting that is
configurable from userspace, which helps determine when we have excessive
partial list handling. When a slab does not fulfill its watermark, it
suggests that the cache may be thrashing. A pre-defined value,
SLAB_THRASHING_THRESHOLD (which defaults to 3), is implemented to be used
in conjunction with this statistic to determine when a slab is actually
thrashing.
Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter
is incremented. This counter is cleared whenever the slowpath is
invoked. This tracks how many fastpath allocations the cpu slab has
fulfilled before it must be refilled.
When the slowpath must be invoked, a slowpath counter is incremented if
the cpu slab did not fulfill the thrashing watermark. Otherwise, it is
decremented.
When the slowpath counter is greater than or equal to
SLAB_THRASHING_THRESHOLD, the partial list is scanned for a slab that
will be able to fulfill at least the number of objects required to not
be considered thrashing. If no such slabs are available, the remote
nodes are defragmented (if allowed) or a new slab is allocated.
If a cpu slab must be swapped because the allocation is for a different
node, both counters are cleared since this doesn't indicate any
thrashing behavior.
When /sys/kernel/slab/cache/slab_thrash_ratio is not set, this
introduces no functional change other than the incrementing of a
fastpath counter for the per-cpu cache.
For CONFIG_SLUB_STATS kernels, a new statistic,
/sys/kernel/slab/cache/deferred_partial, indicates how many times the
partial list was deferred because no slab could satisfy the requisite
number of objects.
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/slub_def.h | 3 +
mm/slub.c | 93 ++++++++++++++++++++++++++++++++++++----------
2 files changed, 76 insertions(+), 20 deletions(-)
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -30,6 +30,7 @@ enum stat_item {
DEACTIVATE_TO_TAIL, /* Cpu slab was moved to the tail of partials */
DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
ORDER_FALLBACK, /* Number of times fallback was necessary */
+ DEFERRED_PARTIAL, /* Defer local partial list for lack of objs */
NR_SLUB_STAT_ITEMS };
struct kmem_cache_cpu {
@@ -38,6 +39,8 @@ struct kmem_cache_cpu {
int node; /* The node of the page (or -1 for debug) */
unsigned int offset; /* Freepointer offset (in word units) */
unsigned int objsize; /* Size of an object (from kmem_cache) */
+ u16 fastpath_allocs; /* Consecutive fast allocs before slowpath */
+ u16 slowpath_allocs; /* Consecutive slow allocs before watermark */
#ifdef CONFIG_SLUB_STATS
unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -134,6 +134,19 @@
*/
#define MAX_PARTIAL 10
+/*
+ * Number of successive slowpath allocations that have failed to allocate at
+ * least the number of objects in the fastpath to not be slab thrashing (as
+ * defined by the cache's slab thrash ratio).
+ *
+ * When an allocation follows the slowpath, it increments a counter in its cpu
+ * cache. If this counter exceeds the threshold, the partial list is scanned
+ * for a slab that will satisfy at least the cache's min_free_watermark in
+ * order for it to be used. Otherwise, the slab with the most free objects is
+ * used.
+ */
+#define SLAB_THRASHING_THRESHOLD 3
+
#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
SLAB_POISON | SLAB_STORE_USER)
@@ -1246,28 +1259,30 @@ static void remove_partial(struct kmem_cache *s, struct page *page)
}
/*
- * Lock slab and remove from the partial list.
+ * Remove from the partial list.
*
- * Must hold list_lock.
+ * Must hold n->list_lock and slab_lock(page).
*/
-static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
- struct page *page)
+static inline void freeze_slab(struct kmem_cache_node *n, struct page *page)
{
- if (slab_trylock(page)) {
- list_del(&page->lru);
- n->nr_partial--;
- __SetPageSlubFrozen(page);
- return 1;
- }
- return 0;
+ list_del(&page->lru);
+ n->nr_partial--;
+ __SetPageSlubFrozen(page);
+}
+
+static inline int skip_partial(struct kmem_cache *s, struct page *page)
+{
+ return (page->objects - page->inuse) < s->min_free_watermark;
}
/*
* Try to allocate a partial slab from a specific node.
*/
-static struct page *get_partial_node(struct kmem_cache_node *n)
+static struct page *get_partial_node(struct kmem_cache *s,
+ struct kmem_cache_node *n, int thrashing)
{
struct page *page;
+ int locked = 0;
/*
* Racy check. If we mistakenly see no partial slabs then we
@@ -1280,9 +1295,28 @@ static struct page *get_partial_node(struct kmem_cache_node *n)
spin_lock(&n->list_lock);
list_for_each_entry(page, &n->partial, lru)
- if (lock_and_freeze_slab(n, page))
+ if (slab_trylock(page)) {
+ /*
+ * When the cpu cache is partial list thrashing, it's
+ * necessary to replace the cpu slab with one that will
+ * accommodate at least s->min_free_watermark objects
+ * to avoid excessive list_lock contention and cache
+ * polluting.
+ *
+ * If no such slabs exist on the partial list, remote
+ * nodes are defragmented if allowed.
+ */
+ if (thrashing && skip_partial(s, page)) {
+ slab_unlock(page);
+ locked++;
+ continue;
+ }
+ freeze_slab(n, page);
goto out;
+ }
page = NULL;
+ if (locked)
+ stat(get_cpu_slab(s, raw_smp_processor_id()), DEFERRED_PARTIAL);
out:
spin_unlock(&n->list_lock);
return page;
@@ -1291,7 +1325,8 @@ out:
/*
* Get a page from somewhere. Search in increasing NUMA distances.
*/
-static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
+static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags,
+ int thrashing)
{
#ifdef CONFIG_NUMA
struct zonelist *zonelist;
@@ -1330,7 +1365,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
n->nr_partial > s->min_partial) {
- page = get_partial_node(n);
+ page = get_partial_node(s, n, thrashing);
if (page)
return page;
}
@@ -1342,16 +1377,17 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
/*
* Get a partial page, lock it and return it.
*/
-static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node,
+ int thrashing)
{
struct page *page;
int searchnode = (node == -1) ? numa_node_id() : node;
- page = get_partial_node(get_node(s, searchnode));
+ page = get_partial_node(s, get_node(s, searchnode), thrashing);
if (page || (flags & __GFP_THISNODE))
return page;
- return get_any_partial(s, flags);
+ return get_any_partial(s, flags, thrashing);
}
/*
@@ -1503,6 +1539,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
{
void **object;
struct page *new;
+ int is_empty = 0;
/* We handle __GFP_ZERO in the caller */
gfpflags &= ~__GFP_ZERO;
@@ -1511,7 +1548,8 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto new_slab;
slab_lock(c->page);
- if (unlikely(!node_match(c, node)))
+ is_empty = node_match(c, node);
+ if (unlikely(!is_empty))
goto another_slab;
stat(c, ALLOC_REFILL);
@@ -1536,7 +1574,17 @@ another_slab:
deactivate_slab(s, c);
new_slab:
- new = get_partial(s, gfpflags, node);
+ if (is_empty) {
+ if (c->fastpath_allocs < s->min_free_watermark)
+ c->slowpath_allocs++;
+ else if (c->slowpath_allocs)
+ c->slowpath_allocs--;
+ } else
+ c->slowpath_allocs = 0;
+ c->fastpath_allocs = 0;
+
+ new = get_partial(s, gfpflags, node,
+ c->slowpath_allocs > SLAB_THRASHING_THRESHOLD);
if (new) {
c->page = new;
stat(c, ALLOC_FROM_PARTIAL);
@@ -1605,6 +1653,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
else {
object = c->freelist;
c->freelist = object[c->offset];
+ c->fastpath_allocs++;
stat(c, ALLOC_FASTPATH);
}
local_irq_restore(flags);
@@ -1917,6 +1966,8 @@ static void init_kmem_cache_cpu(struct kmem_cache *s,
c->node = 0;
c->offset = s->offset / sizeof(void *);
c->objsize = s->objsize;
+ c->fastpath_allocs = 0;
+ c->slowpath_allocs = 0;
#ifdef CONFIG_SLUB_STATS
memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned));
#endif
@@ -4193,6 +4244,7 @@ STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate_to_head);
STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail);
STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
STAT_ATTR(ORDER_FALLBACK, order_fallback);
+STAT_ATTR(DEFERRED_PARTIAL, deferred_partial);
#endif
static struct attribute *slab_attrs[] = {
@@ -4248,6 +4300,7 @@ static struct attribute *slab_attrs[] = {
&deactivate_to_tail_attr.attr,
&deactivate_remote_frees_attr.attr,
&order_fallback_attr.attr,
+ &deferred_partial_attr.attr,
#endif
NULL
};
* [patch 3/3] slub: sort partial list when thrashing
2009-03-30 5:43 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes
@ 2009-03-30 5:43 ` David Rientjes
2009-03-30 14:41 ` Christoph Lameter
2009-03-30 14:37 ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter
1 sibling, 1 reply; 28+ messages in thread
From: David Rientjes @ 2009-03-30 5:43 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel
Caches whose cpu slabs are thrashing will scan their entire partial list
until a slab is found that can satisfy at least the requisite number of
allocations to no longer be considered thrashing (as defined by
/sys/kernel/slab/cache/slab_thrash_ratio).
The partial list can be extremely long and its scanning requires that
list_lock is held for that particular node. This can be inefficient if
slabs at the head of the list are not appropriate cpu slab replacements.
When an object is freed, the number of free objects for its slab is
calculated if the cpu cache is currently thrashing. If it can satisfy
the requisite number of allocations so that the slab thrash ratio is
exceeded, it is moved to the head of the partial list. This minimizes
the time spent holding list_lock and can help cacheline optimizations
for recently freed objects.
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/slub.c | 16 ++++++++++++++++
1 files changed, 16 insertions(+), 0 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1258,6 +1258,13 @@ static void remove_partial(struct kmem_cache *s, struct page *page)
spin_unlock(&n->list_lock);
}
+static void move_partial_to_head(struct kmem_cache_node *n, struct page *page)
+{
+ spin_lock(&n->list_lock);
+ list_move(&page->lru, &n->partial);
+ spin_unlock(&n->list_lock);
+}
+
/*
* Remove from the partial list.
*
@@ -1720,6 +1727,15 @@ checks_ok:
if (unlikely(!prior)) {
add_partial(get_node(s, page_to_nid(page)), page, 1);
stat(c, FREE_ADD_PARTIAL);
+ } else if (c->slowpath_allocs >= SLAB_THRASHING_THRESHOLD) {
+ /*
+ * If the cache is actively slab thrashing, it's necessary to
+ * move partial slabs to the head of the list so there isn't
+ * excessive partial list scanning while holding list_lock.
+ */
+ if (!skip_partial(s, page))
+ move_partial_to_head(get_node(s, page_to_nid(page)),
+ page);
}
out_unlock:
* Re: [patch 0/3] slub partial list thrashing performance degradation
2009-03-30 5:43 [patch 0/3] slub partial list thrashing performance degradation David Rientjes
2009-03-30 5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes
@ 2009-03-30 6:38 ` Pekka Enberg
1 sibling, 0 replies; 28+ messages in thread
From: Pekka Enberg @ 2009-03-30 6:38 UTC (permalink / raw)
To: David Rientjes; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel
On Sun, 2009-03-29 at 22:43 -0700, David Rientjes wrote:
> SLUB causes a performance degradation in comparison to SLAB when a
> workload has an object allocation and freeing pattern such that it spends
> more time in partial list handling than utilizing the fastpaths.
Christoph, Nick, any objections to merging this? The patches look sane
and the numbers convincing enough to me.
Pekka
* Re: [patch 1/3] slub: add per-cache slab thrash ratio
2009-03-30 5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes
2009-03-30 5:43 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes
@ 2009-03-30 7:11 ` Pekka Enberg
2009-03-30 8:41 ` David Rientjes
2009-03-30 15:54 ` Mel Gorman
2009-03-30 14:30 ` Christoph Lameter
2 siblings, 2 replies; 28+ messages in thread
From: Pekka Enberg @ 2009-03-30 7:11 UTC (permalink / raw)
To: David Rientjes
Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel, Mel Gorman
On Sun, 2009-03-29 at 22:43 -0700, David Rientjes wrote:
> The slab_thrash_ratio for each cache does not have a non-zero default
> (yet?).
If we're going to merge this code, I think it would be better to put a
non-zero default there; otherwise we won't be able to hit potential
performance regressions or bugs. Furthermore, the optimization is not
very useful on large scale if people need to enable it themselves.
Maybe stick 20 there and run tbench, sysbench, et al to see if it makes
a difference? I'm cc'ing Mel in case he has some suggestions how to test
it.
Pekka
* Re: [patch 1/3] slub: add per-cache slab thrash ratio
2009-03-30 7:11 ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg
@ 2009-03-30 8:41 ` David Rientjes
2009-03-30 15:54 ` Mel Gorman
1 sibling, 0 replies; 28+ messages in thread
From: David Rientjes @ 2009-03-30 8:41 UTC (permalink / raw)
To: Pekka Enberg
Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel, Mel Gorman
On Mon, 30 Mar 2009, Pekka Enberg wrote:
> > The slab_thrash_ratio for each cache does not have a non-zero default
> > (yet?).
>
> If we're going to merge this code, I think it would be better to put a
> non-zero default there; otherwise we won't be able to hit potential
> performance regressions or bugs. Furthermore, the optimization is not
> very useful on large scale if people need to enable it themselves.
>
> Maybe stick 20 there and run tbench, sysbench, et al to see if it makes
> a difference? I'm cc'ing Mel in case he has some suggestions how to test
> it.
>
It won't cause a regression if sane SLAB_THRASHING_THRESHOLD and
slab_thrash_ratio values are set, since contending on list_lock will
always be slower than utilizing a cpu slab with more free objects when
the cache is thrashing.
I agree that there should be a default value, and I was originally going
to propose the following as the fourth patch in the series, but I wanted
to generate commentary on the approach first; there's always hesitation
about changing the default behavior of the entire allocator for
workloads with very specific behavior that triggers this type of problem.
The fact that we need a tunable for this is unfortunate, but there doesn't
seem to be any other way to detect such situations and adjust the partial
list handling so that list_lock isn't contended so much and the allocation
slowpath to fastpath ratio isn't so high.
I'd be interested to hear other people's approaches.
---
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -147,6 +147,12 @@
*/
#define SLAB_THRASHING_THRESHOLD 3
+/*
+ * Default slab thrash ratio, used to define when a slab is thrashing for a
+ * particular cpu.
+ */
+#define DEFAULT_SLAB_THRASH_RATIO 20
+
#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
SLAB_POISON | SLAB_STORE_USER)
@@ -2392,7 +2398,14 @@ static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
*/
set_min_partial(s, ilog2(s->size));
s->refcount = 1;
- s->min_free_watermark = 0;
+ s->min_free_watermark = oo_objects(s->oo) *
+ DEFAULT_SLAB_THRASH_RATIO / 100;
+ /*
+ * It doesn't make sense to define a slab as thrashing if its threshold
+ * is fewer than 4 objects.
+ */
+ if (s->min_free_watermark < 4)
+ s->min_free_watermark = 0;
#ifdef CONFIG_NUMA
s->remote_node_defrag_ratio = 1000;
#endif
* Re: [patch 1/3] slub: add per-cache slab thrash ratio
2009-03-30 5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes
2009-03-30 5:43 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes
2009-03-30 7:11 ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg
@ 2009-03-30 14:30 ` Christoph Lameter
2009-03-30 20:12 ` David Rientjes
2 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2009-03-30 14:30 UTC (permalink / raw)
To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel
On Sun, 29 Mar 2009, David Rientjes wrote:
> @@ -2291,10 +2292,13 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
> /*
> * Determine the number of objects per slab
> */
> + if (oo_objects(s->oo))
> + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo);
> s->oo = oo_make(order, size);
s->oo is set *after* you check it. Check oo_objects after the value has
been set please.
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
2009-03-30 5:43 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes
2009-03-30 5:43 ` [patch 3/3] slub: sort partial list " David Rientjes
@ 2009-03-30 14:37 ` Christoph Lameter
2009-03-30 20:22 ` David Rientjes
2009-03-31 7:13 ` Pekka Enberg
1 sibling, 2 replies; 28+ messages in thread
From: Christoph Lameter @ 2009-03-30 14:37 UTC (permalink / raw)
To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel
On Sun, 29 Mar 2009, David Rientjes wrote:
> Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter
> is incremented. This counter is cleared whenever the slowpath is
> invoked. This tracks how many fastpath allocations the cpu slab has
> fulfilled before it must be refilled.
That adds fastpath overhead and it shows for small objects in your tests.
> A new statistic, /sys/kernel/slab/cache/deferred_partial, indicates how
> many times a partial list was deferred because no slabs could satisfy
> the requisite number of objects for CONFIG_SLUB_STATS kernels.
Interesting approach.
* Re: [patch 3/3] slub: sort partial list when thrashing
2009-03-30 5:43 ` [patch 3/3] slub: sort partial list " David Rientjes
@ 2009-03-30 14:41 ` Christoph Lameter
2009-03-30 20:29 ` David Rientjes
0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2009-03-30 14:41 UTC (permalink / raw)
To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel
On Sun, 29 Mar 2009, David Rientjes wrote:
> @@ -1720,6 +1727,15 @@ checks_ok:
> if (unlikely(!prior)) {
> add_partial(get_node(s, page_to_nid(page)), page, 1);
> stat(c, FREE_ADD_PARTIAL);
> + } else if (c->slowpath_allocs >= SLAB_THRASHING_THRESHOLD) {
> + /*
> + * If the cache is actively slab thrashing, it's necessary to
> + * move partial slabs to the head of the list so there isn't
> + * excessive partial list scanning while holding list_lock.
> + */
> + if (!skip_partial(s, page))
> + move_partial_to_head(get_node(s, page_to_nid(page)),
> + page);
> }
>
> out_unlock:
>
This again adds code to a pretty hot path.
What is the impact of the additional hot path code?
* Re: [patch 1/3] slub: add per-cache slab thrash ratio
2009-03-30 7:11 ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg
2009-03-30 8:41 ` David Rientjes
@ 2009-03-30 15:54 ` Mel Gorman
2009-03-30 20:38 ` David Rientjes
1 sibling, 1 reply; 28+ messages in thread
From: Mel Gorman @ 2009-03-30 15:54 UTC (permalink / raw)
To: Pekka Enberg
Cc: David Rientjes, Christoph Lameter, Nick Piggin, Martin Bligh,
linux-kernel
On Mon, Mar 30, 2009 at 10:11:31AM +0300, Pekka Enberg wrote:
> On Sun, 2009-03-29 at 22:43 -0700, David Rientjes wrote:
> > The slab_thrash_ratio for each cache does not have a non-zero default
> > (yet?).
>
> If we're going to merge this code, I think it would be better to put a
> non-zero default there; otherwise we won't be able to hit potential
> performance regressions or bugs. Furthermore, the optimization is not
> very useful on large scale if people need to enable it themselves.
>
> Maybe stick 20 there and run tbench, sysbench, et al to see if it makes
> a difference? I'm cc'ing Mel in case he has some suggestions how to test
> it.
>
netperf and tbench will both pound the sl*b allocator far more than
sysbench will, in my opinion, although I don't have figures on hand to
back that up. In the case of netperf, it might be particularly obvious
if the client is on one CPU and the server on another, because I believe
that means all allocs happen on one CPU and all frees on another.
I have a vague concern that such a tunable needs to exist at all,
though, and wonder which workloads it might hurt when set to, say, 20
versus any other value.
--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [patch 1/3] slub: add per-cache slab thrash ratio
2009-03-30 14:30 ` Christoph Lameter
@ 2009-03-30 20:12 ` David Rientjes
2009-03-30 21:19 ` Christoph Lameter
0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2009-03-30 20:12 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel
On Mon, 30 Mar 2009, Christoph Lameter wrote:
> > @@ -2291,10 +2292,13 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
> > /*
> > * Determine the number of objects per slab
> > */
> > + if (oo_objects(s->oo))
> > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo);
> > s->oo = oo_make(order, size);
>
> s->oo is set *after* you check it. Check oo_objects after the value has
> been set please.
>
It's actually right the way I implemented it: oo_objects(s->oo) will be 0
when this is called from kmem_cache_open(), meaning there is no preexisting
slab_thrash_ratio. But this check is required when calculate_sizes() is
called from order_store() to adjust the slab_thrash_ratio for the new
objects per slab. The above check is saving the old thrash ratio so
the new s->min_free_watermark value can be set following the
oo_make(). This was mentioned in the changelog for this patch:
The value is stored in terms of the number of objects that the
ratio represents, not the ratio itself. This avoids costly
arithmetic in the slowpath for a calculation that could otherwise
be done only when `slab_thrash_ratio' or `order' is changed.
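The save/restore described here can be sketched in miniature (hypothetical struct and helper names; only the arithmetic mirrors the patch):

```c
#include <assert.h>

/* Hypothetical sketch of the save/restore David describes: the watermark
 * is stored in object units, so across a change in objects-per-slab the
 * ratio is recovered first and reapplied afterwards. */
struct cache {
	unsigned int objects;			/* objects per slab */
	unsigned int min_free_watermark;	/* thrash ratio, in objects */
};

static void resize(struct cache *s, unsigned int new_objects)
{
	unsigned int thrash_ratio = 0;

	if (s->objects)		/* zero on first initialization */
		thrash_ratio = s->min_free_watermark * 100 / s->objects;
	s->objects = new_objects;
	s->min_free_watermark = s->objects * thrash_ratio / 100;
}
```

On first initialization the saved ratio is 0, so the watermark stays 0; on a later resize the previous ratio carries over to the new object count.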
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
2009-03-30 14:37 ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter
@ 2009-03-30 20:22 ` David Rientjes
2009-03-30 21:20 ` Christoph Lameter
2009-03-31 7:13 ` Pekka Enberg
1 sibling, 1 reply; 28+ messages in thread
From: David Rientjes @ 2009-03-30 20:22 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel
On Mon, 30 Mar 2009, Christoph Lameter wrote:
> > Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter
> > is incrememted. This counter is cleared whenever the slowpath is
> > invoked. This tracks how many fastpath allocations the cpu slab has
> > fulfilled before it must be refilled.
>
> That adds fastpath overhead and it shows for small objects in your tests.
>
Indeed, which is unavoidable in this case. The only other way of tracking
the "thrashing history" I can think of would be bitshifting a 1 for
slowpath and 0 for fastpath, for example, into an unsigned long. That,
however, requires a hamming weight calculation in the slowpath and doesn't
scale nearly as well as simply incrementing a counter.
If there are other approaches to tracking such instances, I'd be interested
to hear them.
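For illustration, the two tracking schemes being compared might look like this in plain C (hypothetical names, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Counter scheme: one increment per fastpath alloc, cleared on slowpath. */
struct counter_hist { uint16_t fastpath_allocs; };

static void counter_fastpath(struct counter_hist *h) { h->fastpath_allocs++; }
static void counter_slowpath(struct counter_hist *h) { h->fastpath_allocs = 0; }

/* Bitmask scheme: shift in 1 for slowpath, 0 for fastpath.  Evaluating
 * the history then needs a population count (hamming weight) on every
 * slowpath check, which is the extra cost pointed out above. */
struct bitmask_hist { unsigned long bits; };

static void bitmask_record(struct bitmask_hist *h, int was_slowpath)
{
	h->bits = (h->bits << 1) | (was_slowpath ? 1UL : 0UL);
}

static int bitmask_slowpath_weight(const struct bitmask_hist *h)
{
	return __builtin_popcountl(h->bits);
}
```

The counter variant costs a single increment in the fastpath; the bitmask variant keeps more history but pays for the popcount on evaluation.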
Btw, is cl@linux.com your new email address or are all
linux-foundation.org emails going to eventually migrate to the new domain?
* Re: [patch 3/3] slub: sort partial list when thrashing
2009-03-30 14:41 ` Christoph Lameter
@ 2009-03-30 20:29 ` David Rientjes
0 siblings, 0 replies; 28+ messages in thread
From: David Rientjes @ 2009-03-30 20:29 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel
On Mon, 30 Mar 2009, Christoph Lameter wrote:
> > @@ -1720,6 +1727,15 @@ checks_ok:
> > if (unlikely(!prior)) {
> > add_partial(get_node(s, page_to_nid(page)), page, 1);
> > stat(c, FREE_ADD_PARTIAL);
> > + } else if (c->slowpath_allocs >= SLAB_THRASHING_THRESHOLD) {
> > + /*
> > + * If the cache is actively slab thrashing, it's necessary to
> > + * move partial slabs to the head of the list so there isn't
> > + * excessive partial list scanning while holding list_lock.
> > + */
> > + if (!skip_partial(s, page))
> > + move_partial_to_head(get_node(s, page_to_nid(page)),
> > + page);
> > }
> >
> > out_unlock:
> >
>
> This again adds code to a pretty hot path.
>
> What is the impact of the additional hot path code?
>
I'll be collecting more data now that there's a general desire for a
default slab_thrash_ratio value, so we'll implicitly see the performance
degradation for non-thrashing slab caches. Mel had suggested a couple of
benchmarks to try and my hypothesis is that they won't regress with a
default ratio of 20 for all caches with >= 20 objects per slab (at least a
4 object threshold for determining thrash).
* Re: [patch 1/3] slub: add per-cache slab thrash ratio
2009-03-30 15:54 ` Mel Gorman
@ 2009-03-30 20:38 ` David Rientjes
0 siblings, 0 replies; 28+ messages in thread
From: David Rientjes @ 2009-03-30 20:38 UTC (permalink / raw)
To: Mel Gorman
Cc: Pekka Enberg, Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel
On Mon, 30 Mar 2009, Mel Gorman wrote:
> netperf and tbench will both pound the sl*b allocator far more than sysbench
> will in my opinion although I don't have figures on-hand to back that up. In
> the case of netperf, it might be particular obvious if the client is on one
> CPU and the server on another because I believe that means all allocs happen
> on one CPU and all frees on another.
>
My results are for two 16-core 64G machines on the same rack, one running
netserver and the other running netperf.
> I have a vague concern that such a tunable needs to exist at all though
> and wonder what workloads it can hurt when set to 20 for example versus any
> other value.
>
The tunable needs to exist unless a counter proposal is made that fixes
this slub performance degradation compared to using slab. I'd be very
interested to hear other proposals on how to detect and remedy such
situations in the allocator without the addition of a tunable.
As I mentioned previously in response to Pekka, it won't cause a further
regression if sane SLAB_THRASHING_THRESHOLD and slab_thrash_ratio values
are chosen. The rules are pretty simple as described by the
implementation: if a cpu slab can only allocate 20% of its objects three
times in a row, we're going to choose a freer slab from the partial
list while holding list_lock as opposed to constantly contending on it.
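A minimal sketch of that detection rule, assuming the counter semantics from patch 2/3 (the helper name is invented; it also omits the is_empty distinction from the actual patch):

```c
#include <assert.h>

#define SLAB_THRASHING_THRESHOLD 3	/* three sub-watermark refills in a row */

struct cpu_cache {
	unsigned int fastpath_allocs;	/* allocs served since last refill */
	unsigned int slowpath_allocs;	/* consecutive sub-watermark refills */
};

/* min_free_watermark is the slab_thrash_ratio precomputed in objects,
 * e.g. 20% of 64 objects = 12. */
static int is_thrashing(struct cpu_cache *c, unsigned int min_free_watermark)
{
	if (c->fastpath_allocs < min_free_watermark)
		c->slowpath_allocs++;
	else if (c->slowpath_allocs)
		c->slowpath_allocs--;
	c->fastpath_allocs = 0;
	return c->slowpath_allocs >= SLAB_THRASHING_THRESHOLD;
}
```

A single bad refill does not trip the detector, and one healthy refill decays the history rather than clearing it.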
This is particularly important for the netperf benchmark because the only
cpu slabs that thrash are the ones with NUMA locality to the cpu taking
the networking interrupt (because remote_node_defrag_ratio was unchanged
from its default, meaning we avoid remote node defragmentation 98% of the
time).
I haven't measured the fastpath implications of non-thrashing caches (the
increment in the alloc fastpath and the conditional in the alloc slowpath
for partial list sorting) yet, but your suggested experiments should show
that quite well.
* Re: [patch 1/3] slub: add per-cache slab thrash ratio
2009-03-30 20:12 ` David Rientjes
@ 2009-03-30 21:19 ` Christoph Lameter
2009-03-30 22:48 ` David Rientjes
0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2009-03-30 21:19 UTC (permalink / raw)
To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel
On Mon, 30 Mar 2009, David Rientjes wrote:
> > > + if (oo_objects(s->oo))
> > > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo);
> > > s->oo = oo_make(order, size);
> >
> > s->oo is set *after* you check it. Check oo_objects after the value has
> > been set please.
> >
>
> It's actually right the way I implemented it, oo_objects(s->oo) will be 0
> when this is called for kmem_cache_open() meaning there is no preexisting
> slab_thrash_ratio. But this check is required when calculate_sizes() is
> called from order_store() to adjust the slab_thrash_ratio for the new
> objects per slab. The above check is saving the old thrash ratio so
> the new s->min_free_watermark value can be set following the
> oo_make(). This was mentioned in the changelog for this patch:
>
> The value is stored in terms of the number of objects that the
> ratio represents, not the ratio itself. This avoids costly
> arithmetic in the slowpath for a calculation that could otherwise
> be done only when `slab_thrash_ratio' or `order' is changed.
Then its the wrong place to set it. Initializations are done in
kmem_cache_open() after calculate_sizes are called.
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
2009-03-30 20:22 ` David Rientjes
@ 2009-03-30 21:20 ` Christoph Lameter
0 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2009-03-30 21:20 UTC (permalink / raw)
To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel
On Mon, 30 Mar 2009, David Rientjes wrote:
> Btw, is cl@linux.com your new email address or are all
> linux-foundation.org emails going to eventually migrate to the new domain?
Don't know. I just got the new email address (from the LF) and it's shorter
than linux-foundation.org, so I started using it.
* Re: [patch 1/3] slub: add per-cache slab thrash ratio
2009-03-30 21:19 ` Christoph Lameter
@ 2009-03-30 22:48 ` David Rientjes
2009-03-31 4:44 ` David Rientjes
0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2009-03-30 22:48 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel
On Mon, 30 Mar 2009, Christoph Lameter wrote:
> > > > + if (oo_objects(s->oo))
> > > > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo);
> > > > s->oo = oo_make(order, size);
> > >
> > > s->oo is set *after* you check it. Check oo_objects after the value has
> > > been set please.
> > >
> >
> > It's actually right the way I implemented it, oo_objects(s->oo) will be 0
> > when this is called for kmem_cache_open() meaning there is no preexisting
> > slab_thrash_ratio. But this check is required when calculate_sizes() is
> > called from order_store() to adjust the slab_thrash_ratio for the new
> > objects per slab. The above check is saving the old thrash ratio so
> > the new s->min_free_watermark value can be set following the
> > oo_make(). This was mentioned in the changelog for this patch:
> >
> > The value is stored in terms of the number of objects that the
> > ratio represents, not the ratio itself. This avoids costly
> > arithmetic in the slowpath for a calculation that could otherwise
> > be done only when `slab_thrash_ratio' or `order' is changed.
>
> Then its the wrong place to set it. Initializations are done in
> kmem_cache_open() after calculate_sizes are called.
>
The way the code is currently written, this acts as an initialization when
there was no previous object count (i.e. it's coming from
kmem_cache_open()) and acts as an adjustment when there was a previous
count (i.e. /sys/kernel/slab/cache/order was changed). The only way to
avoid adding this to calculate_sizes() would be to add logic to
order_store() to adjust the watermark when the order changes, but that
duplicates the same calculation that is required for initialization if
s->min_free_watermark does get a default value in kmem_cache_open() as
Pekka suggested.
* Re: [patch 1/3] slub: add per-cache slab thrash ratio
2009-03-30 22:48 ` David Rientjes
@ 2009-03-31 4:44 ` David Rientjes
2009-03-31 13:26 ` Christoph Lameter
0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2009-03-31 4:44 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel
On Mon, 30 Mar 2009, David Rientjes wrote:
> The way the code is currently written, this acts as an initialization when
> there was no previous object count (i.e. it's coming from
> kmem_cache_open()) and acts as an adjustment when there was a previous
> count (i.e. /sys/kernel/slab/cache/order was changed). The only way to
> avoid adding this to calculate_sizes() would be to add logic to
> order_store() to adjust the watermark when the order changes, but that
> duplicates the same calculation that is required for initialization if
> s->min_free_watermark does get a default value in kmem_cache_open() as
> Pekka suggested.
I applied the following to the patchset so that the initialization of the
watermark is always separate from the updating as a result of changing
/sys/kernel/slab/cache/order.
Since the setting of a default watermark in kmem_cache_open() will require
a calculation to find the corresponding min_free_watermark depending on
the object size for a pre-defined default ratio, it will require the same
calculation that is now in order_store(), but I agree it's simpler to
understand and justifies the code duplication.
---
diff --git a/mm/slub.c b/mm/slub.c
index 76fa5a6..61ae612 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2187,7 +2187,6 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
unsigned long flags = s->flags;
unsigned long size = s->objsize;
unsigned long align = s->align;
- u16 thrash_ratio = 0;
int order;
/*
@@ -2293,13 +2292,10 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
/*
* Determine the number of objects per slab
*/
- if (oo_objects(s->oo))
- thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo);
s->oo = oo_make(order, size);
s->min = oo_make(get_order(size), size);
if (oo_objects(s->oo) > oo_objects(s->max))
s->max = s->oo;
- s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100;
return !!oo_objects(s->oo);
@@ -3824,6 +3820,7 @@ static ssize_t order_store(struct kmem_cache *s,
const char *buf, size_t length)
{
unsigned long order;
+ unsigned long thrash_ratio;
int err;
err = strict_strtoul(buf, 10, &order);
@@ -3833,7 +3830,9 @@ static ssize_t order_store(struct kmem_cache *s,
if (order > slub_max_order || order < slub_min_order)
return -EINVAL;
+ thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo);
calculate_sizes(s, order);
+ s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100;
return length;
}
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
2009-03-30 14:37 ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter
2009-03-30 20:22 ` David Rientjes
@ 2009-03-31 7:13 ` Pekka Enberg
2009-03-31 8:23 ` David Rientjes
2009-03-31 13:23 ` Christoph Lameter
1 sibling, 2 replies; 28+ messages in thread
From: Pekka Enberg @ 2009-03-31 7:13 UTC (permalink / raw)
To: Christoph Lameter; +Cc: David Rientjes, Nick Piggin, Martin Bligh, linux-kernel
On Sun, 29 Mar 2009, David Rientjes wrote:
> > Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter
> > is incremented. This counter is cleared whenever the slowpath is
> > invoked. This tracks how many fastpath allocations the cpu slab has
> > fulfilled before it must be refilled.
On Mon, 2009-03-30 at 10:37 -0400, Christoph Lameter wrote:
> That adds fastpath overhead and it shows for small objects in your tests.
Yup, and looking at this:
+ u16 fastpath_allocs; /* Consecutive fast allocs before slowpath */
+ u16 slowpath_allocs; /* Consecutive slow allocs before watermark */
How much do operations on u16 hurt on, say, x86-64? It's nice that
sizeof(struct kmem_cache_cpu) is capped at 32 bytes but on CPUs that
have bigger cache lines, the types could be wider.
Christoph, why is struct kmem_cache_cpu not __cacheline_aligned_in_smp
btw?
Pekka
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
2009-03-31 7:13 ` Pekka Enberg
@ 2009-03-31 8:23 ` David Rientjes
2009-03-31 8:49 ` Pekka Enberg
2009-03-31 13:23 ` Christoph Lameter
1 sibling, 1 reply; 28+ messages in thread
From: David Rientjes @ 2009-03-31 8:23 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel
On Tue, 31 Mar 2009, Pekka Enberg wrote:
> On Mon, 2009-03-30 at 10:37 -0400, Christoph Lameter wrote:
> > That adds fastpath overhead and it shows for small objects in your tests.
>
> Yup, and looking at this:
>
> + u16 fastpath_allocs; /* Consecutive fast allocs before slowpath */
> + u16 slowpath_allocs; /* Consecutive slow allocs before watermark */
>
> How much do operations on u16 hurt on, say, x86-64?
As opposed to unsigned int? These simply use the word variations of the
mov, test, cmp, and inc instructions instead of long. It's the same
tradeoff as with the u16 slub fields within struct page, except that here
it isn't strictly required by size limitations but is done for cacheline
optimization.
> It's nice that
> sizeof(struct kmem_cache_cpu) is capped at 32 bytes but on CPUs that
> have bigger cache lines, the types could be wider.
>
Right, this would not change the unpacked size of the struct whereas using
unsigned int would.
Since MAX_OBJS_PER_PAGE (which should really be renamed MAX_OBJS_PER_SLAB)
ensures there is no overflow for u16 types, the only time fastpath_allocs
would need to be wider is when the object size is sufficiently small and
there had been frees to the cpu slab so that it overflows. In this
circumstance, slowpath_allocs would simply be incremented and it would be
corrected the next time a cpu slab does allocate beyond the threshold
(SLAB_THRASHING_THRESHOLD should never be 1). The chance of reaching the
threshold solely through successive fastpath counter overflows shrinks
exponentially.
And since slowpath_allocs will never overflow because it's capped at
SLAB_THRASHING_THRESHOLD + 1 (the cpu slab will be refilled with a slab
that will ensure slowpath_allocs will be decremented the next time the
slowpath is invoked), overflow isn't an immediate problem with either.
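The wraparound being discussed is ordinary modular u16 arithmetic, e.g. (hypothetical helpers, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* A u16 fastpath counter wraps modulo 65536, so a very long-lived cpu
 * slab can briefly read as "below watermark" after a wrap; per the
 * argument above, the spurious slowpath_allocs increment is corrected
 * the next time a refill allocates past the threshold. */
static uint16_t u16_inc(uint16_t counter)
{
	return (uint16_t)(counter + 1);	/* wraps to 0 after 65535 */
}

static int reads_below_watermark(uint16_t fastpath, uint16_t watermark)
{
	return fastpath < watermark;
}
```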
> Christoph, why is struct kmem_cache_cpu not __cacheline_aligned_in_smp
> btw?
>
This was removed in 4c93c355d5d563f300df7e61ef753d7a064411e9.
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
2009-03-31 8:23 ` David Rientjes
@ 2009-03-31 8:49 ` Pekka Enberg
0 siblings, 0 replies; 28+ messages in thread
From: Pekka Enberg @ 2009-03-31 8:49 UTC (permalink / raw)
To: David Rientjes; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel
Hi David,
On Mon, 2009-03-30 at 10:37 -0400, Christoph Lameter wrote:
> > > That adds fastpath overhead and it shows for small objects in your tests.
On Tue, 31 Mar 2009, Pekka Enberg wrote:
> > Yup, and looking at this:
> >
> > + u16 fastpath_allocs; /* Consecutive fast allocs before slowpath */
> > + u16 slowpath_allocs; /* Consecutive slow allocs before watermark */
> >
> > How much do operations on u16 hurt on, say, x86-64?
On Tue, 2009-03-31 at 01:23 -0700, David Rientjes wrote:
> As opposed to unsigned int? These simply use the word variations of the
> mov, test, cmp, and inc instructions instead of long. It's the same
> tradeoff when using the u16 slub fields within struct page except it's not
> strictly required in this instance because of size limitations, but rather
> for cacheline optimization.
I was thinking of partial register stalls. But looking at it on x86-64,
the generated asm seems sane. I see tons of branch instructions, though,
so simplifying this somehow:
+ if (is_empty) {
+ if (c->fastpath_allocs < s->min_free_watermark)
+ c->slowpath_allocs++;
+ else if (c->slowpath_allocs)
+ c->slowpath_allocs--;
+ } else
+ c->slowpath_allocs = 0;
+ c->fastpath_allocs = 0;
would be most welcome.
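One possible simplification along those lines (an untested sketch, not part of the patchset) replaces the saturating-decrement branch with arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical restatement of the quoted hunk with one fewer branch:
 * `s -= (s != 0)` is a saturating decrement, replacing the `else if`. */
static void update_history(uint16_t *slowpath_allocs, uint16_t fastpath_allocs,
			   uint16_t watermark, int is_empty)
{
	uint16_t s = *slowpath_allocs;

	if (!is_empty)
		s = 0;
	else if (fastpath_allocs < watermark)
		s++;
	else
		s -= (s != 0);	/* saturates at zero without a branch */
	*slowpath_allocs = s;
}
```

Whether this actually generates fewer branch instructions would have to be checked against the compiler output, as Pekka did above.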
Pekka
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
2009-03-31 7:13 ` Pekka Enberg
2009-03-31 8:23 ` David Rientjes
@ 2009-03-31 13:23 ` Christoph Lameter
1 sibling, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2009-03-31 13:23 UTC (permalink / raw)
To: Pekka Enberg; +Cc: David Rientjes, Nick Piggin, Martin Bligh, linux-kernel
On Tue, 31 Mar 2009, Pekka Enberg wrote:
> On Sun, 29 Mar 2009, David Rientjes wrote:
> > > Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter
> > > is incremented. This counter is cleared whenever the slowpath is
> > > invoked. This tracks how many fastpath allocations the cpu slab has
> > > fulfilled before it must be refilled.
>
> On Mon, 2009-03-30 at 10:37 -0400, Christoph Lameter wrote:
> > That adds fastpath overhead and it shows for small objects in your tests.
>
> Yup, and looking at this:
>
> + u16 fastpath_allocs; /* Consecutive fast allocs before slowpath */
> + u16 slowpath_allocs; /* Consecutive slow allocs before watermark */
>
> How much do operations on u16 hurt on, say, x86-64? It's nice that
> sizeof(struct kmem_cache_cpu) is capped at 32 bytes but on CPUs that
> have bigger cache lines, the types could be wider.
>
> Christoph, why is struct kmem_cache_cpu not __cacheline_aligned_in_smp
> btw?
Because it is either allocated using kmalloc and aligned to a cacheline
boundary there, or the kmem_cache_cpu entries come from the percpu
definition for kmem_cache_cpu. There we don't need cacheline alignment
since they are tightly packed. If the cacheline size is 64 bytes then
neighboring kmem_cache_cpus fit into one cacheline, which reduces cache
footprint and increases cache hotness.
* Re: [patch 1/3] slub: add per-cache slab thrash ratio
2009-03-31 4:44 ` David Rientjes
@ 2009-03-31 13:26 ` Christoph Lameter
2009-03-31 17:21 ` David Rientjes
0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2009-03-31 13:26 UTC (permalink / raw)
To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel
On Mon, 30 Mar 2009, David Rientjes wrote:
> @@ -3833,7 +3830,9 @@ static ssize_t order_store(struct kmem_cache *s,
> if (order > slub_max_order || order < slub_min_order)
> return -EINVAL;
>
> + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo);
> calculate_sizes(s, order);
> + s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100;
> return length;
> }
>
Hmmm.. Still calculating the thrash ratio based on the existing objects per
slab and then resetting the objects per slab to a different number.
Shouldn't the thrash_ratio simply be zapped to an initial value if the
number of objects per slab changes?
* Re: [patch 1/3] slub: add per-cache slab thrash ratio
2009-03-31 13:26 ` Christoph Lameter
@ 2009-03-31 17:21 ` David Rientjes
2009-03-31 17:24 ` Christoph Lameter
0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2009-03-31 17:21 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel
On Tue, 31 Mar 2009, Christoph Lameter wrote:
> > @@ -3833,7 +3830,9 @@ static ssize_t order_store(struct kmem_cache *s,
> > if (order > slub_max_order || order < slub_min_order)
> > return -EINVAL;
> >
> > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo);
> > calculate_sizes(s, order);
> > + s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100;
> > return length;
> > }
> >
>
> Hmmm.. Still calculating the thrash ratio based on the existing objects per
> slab and then resetting the objects per slab to a different number.
> Shouldn't the thrash_ratio simply be zapped to an initial value if the
> number of objects per slab changes?
>
Each cache with >= 20 objects per slab will get a default
slab_thrash_ratio of 20 in v2 of the series. If the order of a cache is
subsequently tuned, the default slab_thrash_ratio would be cleared without
the user's knowledge.
I'd agree that it should be cleared if the tunable had object units
instead of a ratio, but the ratio simply applies to any given order.
* Re: [patch 1/3] slub: add per-cache slab thrash ratio
2009-03-31 17:21 ` David Rientjes
@ 2009-03-31 17:24 ` Christoph Lameter
2009-03-31 17:35 ` David Rientjes
0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2009-03-31 17:24 UTC (permalink / raw)
To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel
On Tue, 31 Mar 2009, David Rientjes wrote:
> I'd agree that it should be cleared if the tunable had object units
> instead of a ratio, but the ratio simply applies to any given order.
Right, but resetting the order usually has a significant impact on the
thrashing behavior (if it exists). Why would we keep the thrashing ratio
that was calculated for another slab configuration?
* Re: [patch 1/3] slub: add per-cache slab thrash ratio
2009-03-31 17:24 ` Christoph Lameter
@ 2009-03-31 17:35 ` David Rientjes
0 siblings, 0 replies; 28+ messages in thread
From: David Rientjes @ 2009-03-31 17:35 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel
On Tue, 31 Mar 2009, Christoph Lameter wrote:
> > I'd agree that it should be cleared if the tunable had object units
> > instead of a ratio, but the ratio simply applies to any given order.
>
> Right, but resetting the order usually has a significant impact on the
> thrashing behavior (if it exists). Why would we keep the thrashing ratio
> that was calculated for another slab configuration?
>
Either the default thrashing ratio is being used and is unchanged from
boot time in which case it will still apply to the new order, or the ratio
has already been changed and userspace is responsible for tuning it again
as the result of the new slab size.
* [patch 1/3] slub: add per-cache slab thrash ratio
@ 2009-03-26 9:42 David Rientjes
0 siblings, 0 replies; 28+ messages in thread
From: David Rientjes @ 2009-03-26 9:42 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel
Adds /sys/kernel/slab/cache/slab_thrash_ratio, which represents the
percentage of a slab's objects that the fastpath must fulfill to not be
considered thrashing on a per-cpu basis[*].
"Thrashing" here is defined as the constant swapping of the cpu slab such
that the slowpath is followed the majority of the time because the
refilled cpu slab can only accommodate a small number of allocations.
This occurs when the object allocation and freeing pattern for a cache is
such that it spends more time swapping the cpu slab than fulfilling
fastpath allocations.
[*] A single instance of the thrash ratio not being reached in the
fastpath does not indicate the cpu cache is thrashing. A
pre-defined value will later be added to determine how many times
the ratio must not be reached before a cache is actually thrashing.
This is defined as a ratio based on the number of objects in a cache's
slab. This is automatically changed when /sys/kernel/slab/cache/order is
changed to reflect the same ratio.
The netperf TCP_RR benchmark illustrates slab thrashing very well with a
large number of threads. With a test length of 60 seconds, the following
thread counts were used to show the effect of the allocation and freeing
pattern of such a workload.
Before this patchset:
threads Transfer Rate (per sec)
10 66636.39
20 96311.02
40 103948.16
60 140977.62
80 166714.37
100 190431.35
200 244092.36
To identify the thrashing caches, the same workload was run with
CONFIG_SLUB_STATS enabled. The following caches are obviously performing
very poorly:
cache ALLOC_FASTPATH ALLOC_SLOWPATH FREE_FASTPATH FREE_SLOWPATH
kmalloc-256 45186169 15930724 88289 61028526
kmalloc-2048 33507239 27541884 46525 61002601
After this patchset (both caches with slab_thrash_ratios of 20):
threads Transfer Rate (per sec)
10 68857.31
20 98335.04
40 124376.77
60 146014.14
80 177352.16
100 195467.61
200 245555.99
Although some slabs may accommodate fewer objects than others when
contiguous memory cannot be allocated for a cache's order, the ratio is
still based on the configured `order' since the partial list will contain
slabs that can fulfill such a requirement.
The value is stored in terms of the number of objects that the ratio
represents, not the ratio itself. This avoids costly arithmetic in the
slowpath for a calculation that could otherwise be done only when
`slab_thrash_ratio' or `order' is changed.
This also will adjust the configured ratio to one that can actually be
represented in terms of whole numbers: for example, if slab_thrash_ratio
is set to 20 for a cache with 64 objects, the effective ratio is actually
3:16 (or 18.75%). This will be shown when reading the ratio since it is
better to represent the actual ratio instead of a pseudo substitute.
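The round trip in that example is plain integer arithmetic (helper names are hypothetical):

```c
#include <assert.h>

/* Worked example of the ratio<->watermark conversion described above.
 * For 64 objects and a requested ratio of 20, the stored watermark is
 * 64 * 20 / 100 = 12 objects; reading it back yields 12 * 100 / 64 = 18,
 * the truncated form of 18.75% (i.e. 3:16). */
static unsigned int ratio_to_watermark(unsigned int objects, unsigned int ratio)
{
	return objects * ratio / 100;
}

static unsigned int watermark_to_ratio(unsigned int objects,
				       unsigned int watermark)
{
	return watermark * 100 / objects;
}
```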
The slab_thrash_ratio for each cache does not have a non-zero default
(yet?).
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/slub_def.h | 1 +
mm/slub.c | 29 +++++++++++++++++++++++++++++
2 files changed, 30 insertions(+), 0 deletions(-)
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -94,6 +94,7 @@ struct kmem_cache {
#ifdef CONFIG_SLUB_DEBUG
struct kobject kobj; /* For sysfs */
#endif
+ u16 min_free_watermark; /* Calculated from slab thrash ratio */
#ifdef CONFIG_NUMA
/*
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2190,6 +2190,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
unsigned long flags = s->flags;
unsigned long size = s->objsize;
unsigned long align = s->align;
+ u16 thrash_ratio = 0;
int order;
/*
@@ -2295,10 +2296,13 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
/*
* Determine the number of objects per slab
*/
+ if (oo_objects(s->oo))
+ thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo);
s->oo = oo_make(order, size);
s->min = oo_make(get_order(size), size);
if (oo_objects(s->oo) > oo_objects(s->max))
s->max = s->oo;
+ s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100;
return !!oo_objects(s->oo);
@@ -2320,6 +2324,7 @@ static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
goto error;
s->refcount = 1;
+ s->min_free_watermark = 0;
#ifdef CONFIG_NUMA
s->remote_node_defrag_ratio = 1000;
#endif
@@ -4089,6 +4094,29 @@ static ssize_t remote_node_defrag_ratio_store(struct kmem_cache *s,
SLAB_ATTR(remote_node_defrag_ratio);
#endif
+static ssize_t slab_thrash_ratio_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n",
+ s->min_free_watermark * 100 / oo_objects(s->oo));
+}
+
+static ssize_t slab_thrash_ratio_store(struct kmem_cache *s, const char *buf,
+ size_t length)
+{
+ unsigned long ratio;
+ int err;
+
+ err = strict_strtoul(buf, 10, &ratio);
+ if (err)
+ return err;
+
+ if (ratio <= 100)
+ s->min_free_watermark = oo_objects(s->oo) * ratio / 100;
+
+ return length;
+}
+SLAB_ATTR(slab_thrash_ratio);
+
#ifdef CONFIG_SLUB_STATS
static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
{
@@ -4172,6 +4200,7 @@ static struct attribute *slab_attrs[] = {
&shrink_attr.attr,
&alloc_calls_attr.attr,
&free_calls_attr.attr,
+ &slab_thrash_ratio_attr.attr,
#ifdef CONFIG_ZONE_DMA
&cache_dma_attr.attr,
#endif
end of thread, other threads:[~2009-03-31 17:37 UTC | newest]
Thread overview: 28+ messages
2009-03-30 5:43 [patch 0/3] slub partial list thrashing performance degradation David Rientjes
2009-03-30 5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes
2009-03-30 5:43 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes
2009-03-30 5:43 ` [patch 3/3] slub: sort parital list " David Rientjes
2009-03-30 14:41 ` Christoph Lameter
2009-03-30 20:29 ` David Rientjes
2009-03-30 14:37 ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter
2009-03-30 20:22 ` David Rientjes
2009-03-30 21:20 ` Christoph Lameter
2009-03-31 7:13 ` Pekka Enberg
2009-03-31 8:23 ` David Rientjes
2009-03-31 8:49 ` Pekka Enberg
2009-03-31 13:23 ` Christoph Lameter
2009-03-30 7:11 ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg
2009-03-30 8:41 ` David Rientjes
2009-03-30 15:54 ` Mel Gorman
2009-03-30 20:38 ` David Rientjes
2009-03-30 14:30 ` Christoph Lameter
2009-03-30 20:12 ` David Rientjes
2009-03-30 21:19 ` Christoph Lameter
2009-03-30 22:48 ` David Rientjes
2009-03-31 4:44 ` David Rientjes
2009-03-31 13:26 ` Christoph Lameter
2009-03-31 17:21 ` David Rientjes
2009-03-31 17:24 ` Christoph Lameter
2009-03-31 17:35 ` David Rientjes
2009-03-30 6:38 ` [patch 0/3] slub partial list thrashing performance degradation Pekka Enberg
-- strict thread matches above, loose matches on Subject: below --
2009-03-26 9:42 [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes