* [patch 0/3] slub partial list thrashing performance degradation
@ 2009-03-30  5:43 David Rientjes
  2009-03-30  5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes
  2009-03-30  6:38 ` [patch 0/3] slub partial list thrashing performance degradation Pekka Enberg
  0 siblings, 2 replies; 28+ messages in thread

From: David Rientjes @ 2009-03-30  5:43 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel

SLUB causes a performance degradation in comparison to SLAB when a
workload has an object allocation and freeing pattern such that it spends
more time in partial list handling than utilizing the fastpaths.

This usually occurs when freeing to a non-cpu slab, either because of
remote cpu freeing or freeing to a full or partial slab.  When the cpu
slab is later replaced with the freeing slab, it can only satisfy a
limited number of allocations before becoming full and requiring
additional partial list handling.  When the slowpath to fastpath ratio
becomes high, this partial list handling causes the entire allocator to
become very slow for the specific workload.

The bash script at the end of this email (inline) illustrates the
performance degradation well.  It uses the netperf TCP_RR benchmark to
measure transfer rates with various thread counts, each being a multiple
of the number of cores.  The transfer rates are reported as an aggregate
of the individual thread results.

CONFIG_SLUB_STATS demonstrates that the kmalloc-256 and kmalloc-2048
caches are performing quite poorly:

	cache		ALLOC_FASTPATH	ALLOC_SLOWPATH
	kmalloc-256	98125871	31585955
	kmalloc-2048	77243698	52347453

	cache		FREE_FASTPATH	FREE_SLOWPATH
	kmalloc-256	173624		129538000
	kmalloc-2048	90520		129500630

The majority of slowpath allocations were from the partial list
(30786261, or 97.5%, for kmalloc-256; 51688159, or 98.7%, for
kmalloc-2048).  A large percentage of frees required the slab to be added
back to the partial list.
For kmalloc-256, 30786630 (23.8%) of slowpath frees required partial list
handling.  For kmalloc-2048, 51688697 (39.9%) of slowpath frees required
partial list handling.

On my 16-core machines with 64G of ram, these are the results:

	# threads	SLAB		SLUB		SLUB+patchset
	16		69892		71592		69505
	32		126490		95373		119731
	48		138050		113072		125014
	64		169240		149043		158919
	80		192294		172035		179679
	96		197779		187849		192154
	112		217283		204962		209988
	128		229848		217547		223507
	144		238550		232369		234565
	160		250333		239871		244789
	176		256878		242712		248971
	192		261611		243182		255596

[ The SLUB+patchset results were attained with the latest git plus this
  patchset and slab_thrash_ratio set at 20 for both the kmalloc-256 and
  the kmalloc-2048 cache. ]

Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/slub_def.h |    4 +
 mm/slub.c                |  138 +++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 122 insertions(+), 20 deletions(-)

#!/bin/bash

TIME=60			# seconds
HOSTNAME=<hostname>	# netserver
NR_CPUS=$(grep ^processor /proc/cpuinfo | wc -l)
echo NR_CPUS=$NR_CPUS

run_netperf() {
	for i in $(seq 1 $1); do
		netperf -H $HOSTNAME -t TCP_RR -l $TIME &
	done
}

ITERATIONS=0
while [ $ITERATIONS -lt 12 ]; do
	RATE=0
	ITERATIONS=$[$ITERATIONS + 1]
	THREADS=$[$NR_CPUS * $ITERATIONS]
	RESULTS=$(run_netperf $THREADS | grep -v '[a-zA-Z]' | awk '{ print $6 }')

	for j in $RESULTS; do
		RATE=$[$RATE + ${j/.*}]
	done
	echo threads=$THREADS rate=$RATE
done

^ permalink raw reply	[flat|nested] 28+ messages in thread
* [patch 1/3] slub: add per-cache slab thrash ratio
  2009-03-30  5:43 [patch 0/3] slub partial list thrashing performance degradation David Rientjes
@ 2009-03-30  5:43 ` David Rientjes
  2009-03-30  5:43   ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes
  ` (2 more replies)
  2009-03-30  6:38 ` [patch 0/3] slub partial list thrashing performance degradation Pekka Enberg
  1 sibling, 3 replies; 28+ messages in thread

From: David Rientjes @ 2009-03-30  5:43 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel

Adds /sys/kernel/slab/cache/slab_thrash_ratio, which represents the
percentage of a slab's objects that the fastpath must fulfill to not be
considered thrashing on a per-cpu basis[*].

"Thrashing" here is defined as the constant swapping of the cpu slab such
that the slowpath is followed the majority of the time because the
refilled cpu slab can only accommodate a small number of allocations.
This occurs when the object allocation and freeing pattern for a cache is
such that it spends more time swapping the cpu slab than fulfilling
fastpath allocations.

 [*] A single instance of the thrash ratio not being reached in the
     fastpath does not indicate the cpu cache is thrashing.  A
     pre-defined value will later be added to determine how many times
     the ratio must not be reached before a cache is actually thrashing.

This is defined as a ratio based on the number of objects in a cache's
slab.  It is automatically recalculated when
/sys/kernel/slab/cache/order is changed so that the same ratio is
preserved.

The netperf TCP_RR benchmark illustrates slab thrashing very well with a
large number of threads.  With a test length of 60 seconds, the following
thread counts were used to show the effect of the allocation and freeing
pattern of such a workload.
Before this patchset:

	threads		Transfer Rate (per sec)
	16		71592
	32		95373
	48		113072
	64		149043
	80		172035
	96		187849
	112		204962
	128		217547
	144		232369
	160		239871
	176		242712
	192		243182

To identify the thrashing caches, the same workload was run with
CONFIG_SLUB_STATS enabled.  The following caches are obviously performing
very poorly:

	cache		ALLOC_FASTPATH	ALLOC_SLOWPATH
	kmalloc-256	98125871	31585955
	kmalloc-2048	77243698	52347453

	cache		FREE_FASTPATH	FREE_SLOWPATH
	kmalloc-256	173624		129538000
	kmalloc-2048	90520		129500630

After this patchset (both caches with slab_thrash_ratios of 20):

	threads		Transfer Rate (per sec)
	16		69505
	32		119731
	48		125014
	64		158919
	80		179679
	96		192154
	112		209988
	128		223507
	144		234565
	160		244789
	176		248971
	192		255596

Although some slabs may accommodate fewer objects than others when
contiguous memory cannot be allocated for a cache's order, the ratio is
still based on the configured `order', since slabs that can fulfill such
a requirement will exist on the partial list.

The value is stored in terms of the number of objects that the ratio
represents, not the ratio itself.  This avoids costly arithmetic in the
slowpath for a calculation that could otherwise be done only when
`slab_thrash_ratio' or `order' is changed.  This also adjusts the
configured ratio to one that can actually be represented in terms of
whole numbers: for example, if slab_thrash_ratio is set to 20 for a cache
with 64 objects, the effective ratio is actually 3:16 (or 18.75%).  This
is what is shown when reading the ratio, since it is better to report the
actual ratio than a pseudo substitute.

The slab_thrash_ratio for each cache does not have a non-zero default
(yet?).
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/slub_def.h |    1 +
 mm/slub.c                |   29 +++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -94,6 +94,7 @@ struct kmem_cache {
 #ifdef CONFIG_SLUB_DEBUG
 	struct kobject kobj;	/* For sysfs */
 #endif
+	u16 min_free_watermark;	/* Calculated from slab thrash ratio */

 #ifdef CONFIG_NUMA
 	/*
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2186,6 +2186,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
 	unsigned long flags = s->flags;
 	unsigned long size = s->objsize;
 	unsigned long align = s->align;
+	u16 thrash_ratio = 0;
 	int order;

 	/*
@@ -2291,10 +2292,13 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
 	/*
 	 * Determine the number of objects per slab
 	 */
+	if (oo_objects(s->oo))
+		thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo);
 	s->oo = oo_make(order, size);
 	s->min = oo_make(get_order(size), size);
 	if (oo_objects(s->oo) > oo_objects(s->max))
 		s->max = s->oo;
+	s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100;

 	return !!oo_objects(s->oo);
@@ -2321,6 +2325,7 @@ static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
 	 */
 	set_min_partial(s, ilog2(s->size));
 	s->refcount = 1;
+	s->min_free_watermark = 0;
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 1000;
 #endif
@@ -4110,6 +4115,29 @@ static ssize_t remote_node_defrag_ratio_store(struct kmem_cache *s,
 SLAB_ATTR(remote_node_defrag_ratio);
 #endif

+static ssize_t slab_thrash_ratio_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n",
+		       s->min_free_watermark * 100 / oo_objects(s->oo));
+}
+
+static ssize_t slab_thrash_ratio_store(struct kmem_cache *s, const char *buf,
+				       size_t length)
+{
+	unsigned long ratio;
+	int err;
+
+	err = strict_strtoul(buf, 10, &ratio);
+	if (err)
+		return err;
+
+	if (ratio <= 100)
+		s->min_free_watermark = oo_objects(s->oo) * ratio / 100;
+
+	return length;
+}
+SLAB_ATTR(slab_thrash_ratio);
+
 #ifdef CONFIG_SLUB_STATS
 static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
 {
@@ -4194,6 +4222,7 @@ static struct attribute *slab_attrs[] = {
 	&shrink_attr.attr,
 	&alloc_calls_attr.attr,
 	&free_calls_attr.attr,
+	&slab_thrash_ratio_attr.attr,
 #ifdef CONFIG_ZONE_DMA
 	&cache_dma_attr.attr,
 #endif
* [patch 2/3] slub: scan partial list for free slabs when thrashing
  2009-03-30  5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes
@ 2009-03-30  5:43   ` David Rientjes
  2009-03-30  5:43     ` [patch 3/3] slub: sort partial list " David Rientjes
  2009-03-30 14:37     ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter
  2009-03-30  7:11   ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg
  2009-03-30 14:30   ` Christoph Lameter
  2 siblings, 2 replies; 28+ messages in thread

From: David Rientjes @ 2009-03-30  5:43 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel

To determine when a slab is actually thrashing, it's insufficient to only
look at the most recent allocation path.  It's perfectly valid to swap
the cpu slab with a partial slab that contains very few free objects if
the goal is to fill it quickly, since slub no longer needs to track such
slabs.  This is inefficient, however, if an object will immediately be
freed so that the full slab must be readded to the partial list.  With
certain object allocation and freeing patterns, it is possible to spend
more time processing the partial list than utilizing the fastpaths.

We already have a per-cache min_free_watermark setting that is
configurable from userspace, which helps determine when we have excessive
partial list handling.  When a slab does not fulfill its watermark, it
suggests that the cache may be thrashing.  A pre-defined value,
SLAB_THRASHING_THRESHOLD (which defaults to 3), is used in conjunction
with this statistic to determine when a slab is actually thrashing.

Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter
is incremented.  This counter is cleared whenever the slowpath is
invoked.  This tracks how many fastpath allocations the cpu slab has
fulfilled before it must be refilled.
When the slowpath must be invoked, a slowpath counter is incremented if
the cpu slab did not fulfill the thrashing watermark; otherwise, it is
decremented.  When the slowpath counter is greater than or equal to
SLAB_THRASHING_THRESHOLD, the partial list is scanned for a slab that
will be able to fulfill at least the number of objects required to not be
considered thrashing.  If no such slabs are available, the remote nodes
are defragmented (if allowed) or a new slab is allocated.

If a cpu slab must be swapped because the allocation is for a different
node, both counters are cleared, since this doesn't indicate any
thrashing behavior.

When /sys/kernel/slab/cache/slab_thrash_ratio is not set, this includes
no functional change other than the incrementing of a fastpath counter
for the per-cpu cache.

A new statistic, /sys/kernel/slab/cache/deferred_partial, indicates how
many times a partial list scan was deferred because no slabs could
satisfy the requisite number of objects, for CONFIG_SLUB_STATS kernels.
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/slub_def.h |    3 +
 mm/slub.c                |   93 ++++++++++++++++++++++++++++++++++++---------
 2 files changed, 76 insertions(+), 20 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -30,6 +30,7 @@ enum stat_item {
 	DEACTIVATE_TO_TAIL,	/* Cpu slab was moved to the tail of partials */
 	DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
 	ORDER_FALLBACK,		/* Number of times fallback was necessary */
+	DEFERRED_PARTIAL,	/* Defer local partial list for lack of objs */
 	NR_SLUB_STAT_ITEMS };

 struct kmem_cache_cpu {
@@ -38,6 +39,8 @@ struct kmem_cache_cpu {
 	int node;		/* The node of the page (or -1 for debug) */
 	unsigned int offset;	/* Freepointer offset (in word units) */
 	unsigned int objsize;	/* Size of an object (from kmem_cache) */
+	u16 fastpath_allocs;	/* Consecutive fast allocs before slowpath */
+	u16 slowpath_allocs;	/* Consecutive slow allocs before watermark */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -134,6 +134,19 @@
  */
 #define MAX_PARTIAL 10

+/*
+ * Number of successive slowpath allocations that have failed to allocate at
+ * least the number of objects in the fastpath to not be slab thrashing (as
+ * defined by the cache's slab thrash ratio).
+ *
+ * When an allocation follows the slowpath, it increments a counter in its cpu
+ * cache.  If this counter exceeds the threshold, the partial list is scanned
+ * for a slab that will satisfy at least the cache's min_free_watermark in
+ * order for it to be used.  Otherwise, the slab with the most free objects is
+ * used.
+ */
+#define SLAB_THRASHING_THRESHOLD 3
+
 #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
 				SLAB_POISON | SLAB_STORE_USER)

@@ -1246,28 +1259,30 @@ static void remove_partial(struct kmem_cache *s, struct page *page)
 }

 /*
- * Lock slab and remove from the partial list.
+ * Remove from the partial list.
  *
- * Must hold list_lock.
+ * Must hold n->list_lock and slab_lock(page).
  */
-static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
-					struct page *page)
+static inline void freeze_slab(struct kmem_cache_node *n, struct page *page)
 {
-	if (slab_trylock(page)) {
-		list_del(&page->lru);
-		n->nr_partial--;
-		__SetPageSlubFrozen(page);
-		return 1;
-	}
-	return 0;
+	list_del(&page->lru);
+	n->nr_partial--;
+	__SetPageSlubFrozen(page);
+}
+
+static inline int skip_partial(struct kmem_cache *s, struct page *page)
+{
+	return (page->objects - page->inuse) < s->min_free_watermark;
 }

 /*
  * Try to allocate a partial slab from a specific node.
  */
-static struct page *get_partial_node(struct kmem_cache_node *n)
+static struct page *get_partial_node(struct kmem_cache *s,
+				     struct kmem_cache_node *n, int thrashing)
 {
 	struct page *page;
+	int locked = 0;

 	/*
 	 * Racy check. If we mistakenly see no partial slabs then we
@@ -1280,9 +1295,28 @@ static struct page *get_partial_node(struct kmem_cache_node *n)
 	spin_lock(&n->list_lock);
 	list_for_each_entry(page, &n->partial, lru)
-		if (lock_and_freeze_slab(n, page))
+		if (slab_trylock(page)) {
+			/*
+			 * When the cpu cache is partial list thrashing, it's
+			 * necessary to replace the cpu slab with one that will
+			 * accommodate at least s->min_free_watermark objects
+			 * to avoid excessive list_lock contention and cache
+			 * polluting.
+			 *
+			 * If no such slabs exist on the partial list, remote
+			 * nodes are defragmented if allowed.
+			 */
+			if (thrashing && skip_partial(s, page)) {
+				slab_unlock(page);
+				locked++;
+				continue;
+			}
+			freeze_slab(n, page);
 			goto out;
+		}
 	page = NULL;
+	if (locked)
+		stat(get_cpu_slab(s, raw_smp_processor_id()),
+		     DEFERRED_PARTIAL);
 out:
 	spin_unlock(&n->list_lock);
 	return page;
@@ -1291,7 +1325,8 @@ out:
 /*
  * Get a page from somewhere. Search in increasing NUMA distances.
  */
-static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
+static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags,
+				    int thrashing)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
@@ -1330,7 +1365,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)

 		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 				n->nr_partial > s->min_partial) {
-			page = get_partial_node(n);
+			page = get_partial_node(s, n, thrashing);
 			if (page)
 				return page;
 		}
@@ -1342,16 +1377,17 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 /*
  * Get a partial page, lock it and return it.
  */
-static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node,
+				int thrashing)
 {
 	struct page *page;
 	int searchnode = (node == -1) ? numa_node_id() : node;

-	page = get_partial_node(get_node(s, searchnode));
+	page = get_partial_node(s, get_node(s, searchnode), thrashing);
 	if (page || (flags & __GFP_THISNODE))
 		return page;

-	return get_any_partial(s, flags);
+	return get_any_partial(s, flags, thrashing);
 }

 /*
@@ -1503,6 +1539,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 {
 	void **object;
 	struct page *new;
+	int is_empty = 0;

 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
@@ -1511,7 +1548,8 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		goto new_slab;

 	slab_lock(c->page);
-	if (unlikely(!node_match(c, node)))
+	is_empty = node_match(c, node);
+	if (unlikely(!is_empty))
 		goto another_slab;

 	stat(c, ALLOC_REFILL);
@@ -1536,7 +1574,17 @@ another_slab:
 	deactivate_slab(s, c);

 new_slab:
-	new = get_partial(s, gfpflags, node);
+	if (is_empty) {
+		if (c->fastpath_allocs < s->min_free_watermark)
+			c->slowpath_allocs++;
+		else if (c->slowpath_allocs)
+			c->slowpath_allocs--;
+	} else
+		c->slowpath_allocs = 0;
+	c->fastpath_allocs = 0;
+
+	new = get_partial(s, gfpflags, node,
+			  c->slowpath_allocs > SLAB_THRASHING_THRESHOLD);
 	if (new) {
 		c->page = new;
 		stat(c, ALLOC_FROM_PARTIAL);
@@ -1605,6 +1653,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
 	else {
 		object = c->freelist;
 		c->freelist = object[c->offset];
+		c->fastpath_allocs++;
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
@@ -1917,6 +1966,8 @@ static void init_kmem_cache_cpu(struct kmem_cache *s,
 	c->node = 0;
 	c->offset = s->offset / sizeof(void *);
 	c->objsize = s->objsize;
+	c->fastpath_allocs = 0;
+	c->slowpath_allocs = 0;
 #ifdef CONFIG_SLUB_STATS
 	memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned));
 #endif
@@ -4193,6 +4244,7 @@ STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate_to_head);
 STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail);
 STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
 STAT_ATTR(ORDER_FALLBACK, order_fallback);
+STAT_ATTR(DEFERRED_PARTIAL, deferred_partial);
 #endif

 static struct attribute *slab_attrs[] = {
@@ -4248,6 +4300,7 @@ static struct attribute *slab_attrs[] = {
 	&deactivate_to_tail_attr.attr,
 	&deactivate_remote_frees_attr.attr,
 	&order_fallback_attr.attr,
+	&deferred_partial_attr.attr,
 #endif
 	NULL
 };
* [patch 3/3] slub: sort partial list when thrashing
  2009-03-30  5:43     ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes
@ 2009-03-30  5:43       ` David Rientjes
  2009-03-30 14:41         ` Christoph Lameter
  2009-03-30 14:37     ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter
  1 sibling, 1 reply; 28+ messages in thread

From: David Rientjes @ 2009-03-30  5:43 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel

Caches that are cpu slab thrashing will scan their entire partial list
until a slab is found that will satisfy at least the requisite number of
allocations so that it will not be considered thrashing (as defined by
/sys/kernel/slab/cache/slab_thrash_ratio).

The partial list can be extremely long, and scanning it requires that
list_lock is held for that particular node.  This can be inefficient if
slabs at the head of the list are not appropriate cpu slab replacements.

When an object is freed, the number of free objects for its slab is
calculated if the cpu cache is currently thrashing.  If the slab can
satisfy the requisite number of allocations so that the slab thrash
ratio is exceeded, it is moved to the head of the partial list.  This
minimizes the time spent holding list_lock and can help cacheline
optimizations for recently freed objects.

Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/slub.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1258,6 +1258,13 @@ static void remove_partial(struct kmem_cache *s, struct page *page)
 	spin_unlock(&n->list_lock);
 }

+static void move_partial_to_head(struct kmem_cache_node *n, struct page *page)
+{
+	spin_lock(&n->list_lock);
+	list_move(&page->lru, &n->partial);
+	spin_unlock(&n->list_lock);
+}
+
 /*
  * Remove from the partial list.
  *
@@ -1720,6 +1727,15 @@ checks_ok:
 	if (unlikely(!prior)) {
 		add_partial(get_node(s, page_to_nid(page)), page, 1);
 		stat(c, FREE_ADD_PARTIAL);
+	} else if (c->slowpath_allocs >= SLAB_THRASHING_THRESHOLD) {
+		/*
+		 * If the cache is actively slab thrashing, it's necessary to
+		 * move partial slabs to the head of the list so there isn't
+		 * excessive partial list scanning while holding list_lock.
+		 */
+		if (!skip_partial(s, page))
+			move_partial_to_head(get_node(s, page_to_nid(page)),
+					     page);
 	}

 out_unlock:
* Re: [patch 3/3] slub: sort partial list when thrashing
  2009-03-30  5:43       ` [patch 3/3] slub: sort partial list " David Rientjes
@ 2009-03-30 14:41         ` Christoph Lameter
  2009-03-30 20:29           ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread

From: Christoph Lameter @ 2009-03-30 14:41 UTC (permalink / raw)
To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel

On Sun, 29 Mar 2009, David Rientjes wrote:

> @@ -1720,6 +1727,15 @@ checks_ok:
> 	if (unlikely(!prior)) {
> 		add_partial(get_node(s, page_to_nid(page)), page, 1);
> 		stat(c, FREE_ADD_PARTIAL);
> +	} else if (c->slowpath_allocs >= SLAB_THRASHING_THRESHOLD) {
> +		/*
> +		 * If the cache is actively slab thrashing, it's necessary to
> +		 * move partial slabs to the head of the list so there isn't
> +		 * excessive partial list scanning while holding list_lock.
> +		 */
> +		if (!skip_partial(s, page))
> +			move_partial_to_head(get_node(s, page_to_nid(page)),
> +					     page);
> 	}
>
> out_unlock:
>

This again adds code to a pretty hot path.

What is the impact of the additional hot path code?
* Re: [patch 3/3] slub: sort partial list when thrashing
  2009-03-30 14:41         ` Christoph Lameter
@ 2009-03-30 20:29           ` David Rientjes
  0 siblings, 0 replies; 28+ messages in thread

From: David Rientjes @ 2009-03-30 20:29 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel

On Mon, 30 Mar 2009, Christoph Lameter wrote:

> > @@ -1720,6 +1727,15 @@ checks_ok:
> > 	if (unlikely(!prior)) {
> > 		add_partial(get_node(s, page_to_nid(page)), page, 1);
> > 		stat(c, FREE_ADD_PARTIAL);
> > +	} else if (c->slowpath_allocs >= SLAB_THRASHING_THRESHOLD) {
> > +		/*
> > +		 * If the cache is actively slab thrashing, it's necessary to
> > +		 * move partial slabs to the head of the list so there isn't
> > +		 * excessive partial list scanning while holding list_lock.
> > +		 */
> > +		if (!skip_partial(s, page))
> > +			move_partial_to_head(get_node(s, page_to_nid(page)),
> > +					     page);
> > 	}
> >
> > out_unlock:
>
> This again adds code to a pretty hot path.
>
> What is the impact of the additional hot path code?
>

I'll be collecting more data now that there's a general desire for a
default slab_thrash_ratio value, so we'll implicitly see the performance
degradation for non-thrashing slab caches.

Mel had suggested a couple of benchmarks to try, and my hypothesis is
that they won't regress with a default ratio of 20 for all caches with
>= 20 objects per slab (at least a 4 object threshold for determining
thrash).
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
  2009-03-30  5:43     ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes
  2009-03-30  5:43       ` [patch 3/3] slub: sort partial list " David Rientjes
@ 2009-03-30 14:37     ` Christoph Lameter
  2009-03-30 20:22       ` David Rientjes
  2009-03-31  7:13       ` Pekka Enberg
  1 sibling, 2 replies; 28+ messages in thread

From: Christoph Lameter @ 2009-03-30 14:37 UTC (permalink / raw)
To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel

On Sun, 29 Mar 2009, David Rientjes wrote:

> Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter
> is incremented.  This counter is cleared whenever the slowpath is
> invoked.  This tracks how many fastpath allocations the cpu slab has
> fulfilled before it must be refilled.

That adds fastpath overhead and it shows for small objects in your tests.

> A new statistic, /sys/kernel/slab/cache/deferred_partial, indicates how
> many times a partial list was deferred because no slabs could satisfy
> the requisite number of objects for CONFIG_SLUB_STATS kernels.

Interesting approach.
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
  2009-03-30 14:37     ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter
@ 2009-03-30 20:22       ` David Rientjes
  2009-03-30 21:20         ` Christoph Lameter
  1 sibling, 1 reply; 28+ messages in thread

From: David Rientjes @ 2009-03-30 20:22 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel

On Mon, 30 Mar 2009, Christoph Lameter wrote:

> > Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter
> > is incremented.  This counter is cleared whenever the slowpath is
> > invoked.  This tracks how many fastpath allocations the cpu slab has
> > fulfilled before it must be refilled.
>
> That adds fastpath overhead and it shows for small objects in your tests.
>

Indeed, which is unavoidable in this case.

The only other way of tracking the "thrashing history" I can think of
would be bitshifting a 1 for slowpath and 0 for fastpath, for example,
into an unsigned long.  That, however, requires a hamming weight
calculation in the slowpath and doesn't scale nearly as well as simply
incrementing a counter.  If there are other approaches to tracking such
instances, I'd be interested to hear them.

Btw, is cl@linux.com your new email address, or are all
linux-foundation.org emails going to eventually migrate to the new
domain?
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
  2009-03-30 20:22       ` David Rientjes
@ 2009-03-30 21:20         ` Christoph Lameter
  0 siblings, 0 replies; 28+ messages in thread

From: Christoph Lameter @ 2009-03-30 21:20 UTC (permalink / raw)
To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel

On Mon, 30 Mar 2009, David Rientjes wrote:

> Btw, is cl@linux.com your new email address, or are all
> linux-foundation.org emails going to eventually migrate to the new
> domain?

Don't know.  I just got the new email address (from the LF), and it's
shorter than linux-foundation.org, so I started using it.
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
  2009-03-30 14:37     ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter
  2009-03-30 20:22       ` David Rientjes
@ 2009-03-31  7:13       ` Pekka Enberg
  2009-03-31  8:23         ` David Rientjes
  2009-03-31 13:23         ` Christoph Lameter
  1 sibling, 2 replies; 28+ messages in thread

From: Pekka Enberg @ 2009-03-31  7:13 UTC (permalink / raw)
To: Christoph Lameter; +Cc: David Rientjes, Nick Piggin, Martin Bligh, linux-kernel

On Sun, 29 Mar 2009, David Rientjes wrote:
> > Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter
> > is incremented.  This counter is cleared whenever the slowpath is
> > invoked.  This tracks how many fastpath allocations the cpu slab has
> > fulfilled before it must be refilled.

On Mon, 2009-03-30 at 10:37 -0400, Christoph Lameter wrote:
> That adds fastpath overhead and it shows for small objects in your tests.

Yup, and looking at this:

+	u16 fastpath_allocs;	/* Consecutive fast allocs before slowpath */
+	u16 slowpath_allocs;	/* Consecutive slow allocs before watermark */

How much do operations on u16 hurt on, say, x86-64?  It's nice that
sizeof(struct kmem_cache_cpu) is capped at 32 bytes, but on CPUs that
have bigger cache lines, the types could be wider.

Christoph, why is struct kmem_cache_cpu not __cacheline_aligned_in_smp,
btw?

			Pekka
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing
  2009-03-31  7:13       ` Pekka Enberg
@ 2009-03-31  8:23         ` David Rientjes
  2009-03-31  8:49           ` Pekka Enberg
  1 sibling, 1 reply; 28+ messages in thread

From: David Rientjes @ 2009-03-31  8:23 UTC (permalink / raw)
To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel

On Tue, 31 Mar 2009, Pekka Enberg wrote:

> On Mon, 2009-03-30 at 10:37 -0400, Christoph Lameter wrote:
> > That adds fastpath overhead and it shows for small objects in your tests.
>
> Yup, and looking at this:
>
> +	u16 fastpath_allocs;	/* Consecutive fast allocs before slowpath */
> +	u16 slowpath_allocs;	/* Consecutive slow allocs before watermark */
>
> How much do operations on u16 hurt on, say, x86-64?

As opposed to unsigned int?  These simply use the word variations of the
mov, test, cmp, and inc instructions instead of long.  It's the same
tradeoff as using the u16 slub fields within struct page, except here it's
not strictly required because of size limitations, but rather for
cacheline optimization.

> It's nice that
> sizeof(struct kmem_cache_cpu) is capped at 32 bytes but on CPUs that
> have bigger cache lines, the types could be wider.
>

Right, this would not change the packed size of the struct, whereas using
unsigned int would.

Since MAX_OBJS_PER_PAGE (which should really be renamed
MAX_OBJS_PER_SLAB) ensures there is no overflow for u16 types, the only
time fastpath_allocs would need to be wider is when the object size is
sufficiently small and there have been frees to the cpu slab so that it
overflows.  In that circumstance, slowpath_allocs would simply be
incremented, and it would be corrected the next time a cpu slab does
allocate beyond the threshold (SLAB_THRASHING_THRESHOLD should never be
1).  The chance of reaching the threshold on successive fastpath counter
overflows grows exponentially.
And since slowpath_allocs will never overflow, because it's capped at
SLAB_THRASHING_THRESHOLD + 1 (the cpu slab will be refilled with a slab
that ensures slowpath_allocs is decremented the next time the slowpath
is invoked), overflow isn't an immediate problem with either field.

> Christoph, why is struct kmem_cache_cpu not __cacheline_aligned_in_smp
> btw?
>

This was removed in 4c93c355d5d563f300df7e61ef753d7a064411e9.
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing 2009-03-31 8:23 ` David Rientjes @ 2009-03-31 8:49 ` Pekka Enberg 0 siblings, 0 replies; 28+ messages in thread From: Pekka Enberg @ 2009-03-31 8:49 UTC (permalink / raw) To: David Rientjes; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel Hi David, On Mon, 2009-03-30 at 10:37 -0400, Christoph Lameter wrote: > > > That adds fastpath overhead and it shows for small objects in your tests. On Tue, 31 Mar 2009, Pekka Enberg wrote: > > Yup, and looking at this: > > > > + u16 fastpath_allocs; /* Consecutive fast allocs before slowpath */ > > + u16 slowpath_allocs; /* Consecutive slow allocs before watermark */ > > > > How much do operations on u16 hurt on, say, x86-64? On Tue, 2009-03-31 at 01:23 -0700, David Rientjes wrote: > As opposed to unsigned int? These simply use the word variations of the > mov, test, cmp, and inc instructions instead of long. It's the same > tradeoff when using the u16 slub fields within struct page except it's not > strictly required in this instance because of size limitations, but rather > for cacheline optimization. I was thinking of partial register stalls. But looking at it on x86-64, the generated asm seems sane. I see tons of branch instructions, though, so simplifying this somehow: + if (is_empty) { + if (c->fastpath_allocs < s->min_free_watermark) + c->slowpath_allocs++; + else if (c->slowpath_allocs) + c->slowpath_allocs--; + } else + c->slowpath_allocs = 0; + c->fastpath_allocs = 0; would be most welcome. Pekka ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 2/3] slub: scan partial list for free slabs when thrashing 2009-03-31 7:13 ` Pekka Enberg 2009-03-31 8:23 ` David Rientjes @ 2009-03-31 13:23 ` Christoph Lameter 1 sibling, 0 replies; 28+ messages in thread From: Christoph Lameter @ 2009-03-31 13:23 UTC (permalink / raw) To: Pekka Enberg; +Cc: David Rientjes, Nick Piggin, Martin Bligh, linux-kernel On Tue, 31 Mar 2009, Pekka Enberg wrote: > On Sun, 29 Mar 2009, David Rientjes wrote: > > > Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter > > > is incremented. This counter is cleared whenever the slowpath is > > > invoked. This tracks how many fastpath allocations the cpu slab has > > > fulfilled before it must be refilled. > On Mon, 2009-03-30 at 10:37 -0400, Christoph Lameter wrote: > > That adds fastpath overhead and it shows for small objects in your tests. > > Yup, and looking at this: > > + u16 fastpath_allocs; /* Consecutive fast allocs before slowpath */ > + u16 slowpath_allocs; /* Consecutive slow allocs before watermark */ > > How much do operations on u16 hurt on, say, x86-64? It's nice that > sizeof(struct kmem_cache_cpu) is capped at 32 bytes but on CPUs that > have bigger cache lines, the types could be wider. > > Christoph, why is struct kmem_cache_cpu not __cacheline_aligned_in_smp > btw? Because it is either allocated using kmalloc and aligned to a cacheline boundary there, or the kmem_cache_cpu entries come from the percpu definition for kmem_cache_cpu. There we don't need cacheline alignment since they are tightly packed. If the cacheline size is 64 bytes then neighboring kmem_cache_cpus fit into one cacheline, which reduces cache footprint and increases cache hotness. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes 2009-03-30 5:43 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes @ 2009-03-30 7:11 ` Pekka Enberg 2009-03-30 8:41 ` David Rientjes 2009-03-30 15:54 ` Mel Gorman 2009-03-30 14:30 ` Christoph Lameter 2 siblings, 2 replies; 28+ messages in thread From: Pekka Enberg @ 2009-03-30 7:11 UTC (permalink / raw) To: David Rientjes Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel, Mel Gorman On Sun, 2009-03-29 at 22:43 -0700, David Rientjes wrote: > The slab_thrash_ratio for each cache does not have a non-zero default > (yet?). If we're going to merge this code, I think it would be better to put a non-zero default there; otherwise we won't be able to catch potential performance regressions or bugs. Furthermore, the optimization is not very useful at large scale if people need to enable it themselves. Maybe stick 20 there and run tbench, sysbench, et al. to see if it makes a difference? I'm cc'ing Mel in case he has some suggestions on how to test it. Pekka ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 7:11 ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg @ 2009-03-30 8:41 ` David Rientjes 2009-03-30 15:54 ` Mel Gorman 1 sibling, 0 replies; 28+ messages in thread From: David Rientjes @ 2009-03-30 8:41 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel, Mel Gorman On Mon, 30 Mar 2009, Pekka Enberg wrote: > > The slab_thrash_ratio for each cache does not have a non-zero default > > (yet?). > > If we're going to merge this code, I think it would be better to put a > non-zero default there; otherwise we won't be able to catch potential > performance regressions or bugs. Furthermore, the optimization is not > very useful at large scale if people need to enable it themselves. > > Maybe stick 20 there and run tbench, sysbench, et al. to see if it makes > a difference? I'm cc'ing Mel in case he has some suggestions on how to test > it. > It won't cause a regression if sane SLAB_THRASHING_THRESHOLD and slab_thrash_ratio values are set, since the contention on list_lock will always be slower than utilizing a cpu slab with more free objects when it's thrashing. I agree that there should be a default value, and I was originally going to propose the following as the fourth patch in the series, but I wanted to generate commentary on the approach first, and there's always a hesitation when changing the default behavior of the entire allocator for workloads with very specific behavior that trigger this type of problem. The fact that we need a tunable for this is unfortunate, but there doesn't seem to be any other way to detect such situations and adjust the partial list handling so that list_lock isn't contended so much and the allocation slowpath to fastpath ratio isn't so high. I'd be interested to hear other people's approaches. 
--- diff --git a/mm/slub.c b/mm/slub.c --- a/mm/slub.c +++ b/mm/slub.c @@ -147,6 +147,12 @@ */ #define SLAB_THRASHING_THRESHOLD 3 +/* + * Default slab thrash ratio, used to define when a slab is thrashing for a + * particular cpu. + */ +#define DEFAULT_SLAB_THRASH_RATIO 20 + #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \ SLAB_POISON | SLAB_STORE_USER) @@ -2392,7 +2398,14 @@ static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags, */ set_min_partial(s, ilog2(s->size)); s->refcount = 1; - s->min_free_watermark = 0; + s->min_free_watermark = oo_objects(s->oo) * + DEFAULT_SLAB_THRASH_RATIO / 100; + /* + * It doesn't make sense to define a slab as thrashing if its threshold + * is fewer than 4 objects. + */ + if (s->min_free_watermark < 4) + s->min_free_watermark = 0; #ifdef CONFIG_NUMA s->remote_node_defrag_ratio = 1000; #endif ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 7:11 ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg 2009-03-30 8:41 ` David Rientjes @ 2009-03-30 15:54 ` Mel Gorman 2009-03-30 20:38 ` David Rientjes 1 sibling, 1 reply; 28+ messages in thread From: Mel Gorman @ 2009-03-30 15:54 UTC (permalink / raw) To: Pekka Enberg Cc: David Rientjes, Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel On Mon, Mar 30, 2009 at 10:11:31AM +0300, Pekka Enberg wrote: > On Sun, 2009-03-29 at 22:43 -0700, David Rientjes wrote: > > The slab_thrash_ratio for each cache does not have a non-zero default > > (yet?). > > If we're going to merge this code, I think it would be better to put a > non-zero default there; otherwise we won't be able to catch potential > performance regressions or bugs. Furthermore, the optimization is not > very useful at large scale if people need to enable it themselves. > > Maybe stick 20 there and run tbench, sysbench, et al. to see if it makes > a difference? I'm cc'ing Mel in case he has some suggestions on how to test > it. > netperf and tbench will both pound the sl*b allocator far more than sysbench will, in my opinion, although I don't have figures on hand to back that up. In the case of netperf, it might be particularly obvious if the client is on one CPU and the server on another, because I believe that means all allocs happen on one CPU and all frees on another. I have a vague concern that such a tunable needs to exist at all, though, and wonder what workloads it can hurt when set to, say, 20 versus any other value. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 15:54 ` Mel Gorman @ 2009-03-30 20:38 ` David Rientjes 0 siblings, 0 replies; 28+ messages in thread From: David Rientjes @ 2009-03-30 20:38 UTC (permalink / raw) To: Mel Gorman Cc: Pekka Enberg, Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel On Mon, 30 Mar 2009, Mel Gorman wrote: > netperf and tbench will both pound the sl*b allocator far more than sysbench > will, in my opinion, although I don't have figures on hand to back that up. In > the case of netperf, it might be particularly obvious if the client is on one > CPU and the server on another, because I believe that means all allocs happen > on one CPU and all frees on another. > My results are for two 16-core 64G machines on the same rack, one running netserver and the other running netperf. > I have a vague concern that such a tunable needs to exist at all, though, > and wonder what workloads it can hurt when set to, say, 20 versus any > other value. > The tunable needs to exist unless a counter-proposal is made that fixes this slub performance degradation compared to using slab. I'd be very interested to hear other proposals on how to detect and remedy such situations in the allocator without the addition of a tunable. As I mentioned previously in response to Pekka, it won't cause a further regression if sane SLAB_THRASHING_THRESHOLD and slab_thrash_ratio values are chosen. The rules are pretty simple as described by the implementation: if a cpu slab can only allocate 20% of its objects three times in a row, we're going to choose a slab with more free objects from the partial list while holding list_lock, as opposed to constantly contending on it. This is particularly important for the netperf benchmark because the only cpu slabs that thrash are the ones with NUMA locality to the cpu taking the networking interrupt (because remote_node_defrag_ratio was unchanged from its default, meaning we avoid remote node defragmentation 98% of the time). 
I haven't measured the fastpath implications of non-thrashing caches (the increment in the alloc fastpath and the conditional in the alloc slowpath for partial list sorting) yet, but your suggested experiments should show that quite well. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes 2009-03-30 5:43 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes 2009-03-30 7:11 ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg @ 2009-03-30 14:30 ` Christoph Lameter 2009-03-30 20:12 ` David Rientjes 2 siblings, 1 reply; 28+ messages in thread From: Christoph Lameter @ 2009-03-30 14:30 UTC (permalink / raw) To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Sun, 29 Mar 2009, David Rientjes wrote: > @@ -2291,10 +2292,13 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order) > /* > * Determine the number of objects per slab > */ > + if (oo_objects(s->oo)) > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); > s->oo = oo_make(order, size); s->oo is set *after* you check it. Check oo_objects after the value has been set please. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 14:30 ` Christoph Lameter @ 2009-03-30 20:12 ` David Rientjes 2009-03-30 21:19 ` Christoph Lameter 0 siblings, 1 reply; 28+ messages in thread From: David Rientjes @ 2009-03-30 20:12 UTC (permalink / raw) To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Mon, 30 Mar 2009, Christoph Lameter wrote: > > @@ -2291,10 +2292,13 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order) > > /* > > * Determine the number of objects per slab > > */ > > + if (oo_objects(s->oo)) > > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); > > s->oo = oo_make(order, size); > > s->oo is set *after* you check it. Check oo_objects after the value has > been set please. > It's actually right the way I implemented it, oo_objects(s->oo) will be 0 when this is called for kmem_cache_open() meaning there is no preexisting slab_thrash_ratio. But this check is required when calculate_sizes() is called from order_store() to adjust the slab_thrash_ratio for the new objects per slab. The above check is saving the old thrash ratio so the new s->min_free_watermark value can be set following the oo_make(). This was mentioned in the changelog for this patch: The value is stored in terms of the number of objects that the ratio represents, not the ratio itself. This avoids costly arithmetic in the slowpath for a calculation that could otherwise be done only when `slab_thrash_ratio' or `order' is changed. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 20:12 ` David Rientjes @ 2009-03-30 21:19 ` Christoph Lameter 2009-03-30 22:48 ` David Rientjes 0 siblings, 1 reply; 28+ messages in thread From: Christoph Lameter @ 2009-03-30 21:19 UTC (permalink / raw) To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Mon, 30 Mar 2009, David Rientjes wrote: > > > + if (oo_objects(s->oo)) > > > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); > > > s->oo = oo_make(order, size); > > > > s->oo is set *after* you check it. Check oo_objects after the value has > > been set please. > > > > It's actually right the way I implemented it, oo_objects(s->oo) will be 0 > when this is called for kmem_cache_open() meaning there is no preexisting > slab_thrash_ratio. But this check is required when calculate_sizes() is > called from order_store() to adjust the slab_thrash_ratio for the new > objects per slab. The above check is saving the old thrash ratio so > the new s->min_free_watermark value can be set following the > oo_make(). This was mentioned in the changelog for this patch: > > The value is stored in terms of the number of objects that the > ratio represents, not the ratio itself. This avoids costly > arithmetic in the slowpath for a calculation that could otherwise > be done only when `slab_thrash_ratio' or `order' is changed. Then it's the wrong place to set it. Initializations are done in kmem_cache_open() after calculate_sizes() is called. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 21:19 ` Christoph Lameter @ 2009-03-30 22:48 ` David Rientjes 2009-03-31 4:44 ` David Rientjes 0 siblings, 1 reply; 28+ messages in thread From: David Rientjes @ 2009-03-30 22:48 UTC (permalink / raw) To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Mon, 30 Mar 2009, Christoph Lameter wrote: > > > > + if (oo_objects(s->oo)) > > > > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); > > > > s->oo = oo_make(order, size); > > > > > > s->oo is set *after* you check it. Check oo_objects after the value has > > > been set please. > > > > > > > It's actually right the way I implemented it, oo_objects(s->oo) will be 0 > > when this is called for kmem_cache_open() meaning there is no preexisting > > slab_thrash_ratio. But this check is required when calculate_sizes() is > > called from order_store() to adjust the slab_thrash_ratio for the new > > objects per slab. The above check is saving the old thrash ratio so > > the new s->min_free_watermark value can be set following the > > oo_make(). This was mentioned in the changelog for this patch: > > > > The value is stored in terms of the number of objects that the > > ratio represents, not the ratio itself. This avoids costly > > arithmetic in the slowpath for a calculation that could otherwise > > be done only when `slab_thrash_ratio' or `order' is changed. > > Then it's the wrong place to set it. Initializations are done in > kmem_cache_open() after calculate_sizes() is called. > The way the code is currently written, this acts as an initialization when there was no previous object count (i.e. it's coming from kmem_cache_open()) and acts as an adjustment when there was a previous count (i.e. /sys/kernel/slab/cache/order was changed). 
The only way to avoid adding this to calculate_sizes() would be to add logic to order_store() to adjust the watermark when the order changes, but that duplicates the same calculation that is required for initialization if s->min_free_watermark does get a default value in kmem_cache_open() as Pekka suggested. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-30 22:48 ` David Rientjes @ 2009-03-31 4:44 ` David Rientjes 2009-03-31 13:26 ` Christoph Lameter 0 siblings, 1 reply; 28+ messages in thread From: David Rientjes @ 2009-03-31 4:44 UTC (permalink / raw) To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Mon, 30 Mar 2009, David Rientjes wrote: > The way the code is currently written, this acts as an initialization when > there was no previous object count (i.e. its coming from > kmem_cache_open()) and acts as an adjustment when there was a previous > count (i.e. /sys/kernel/slab/cache/order was changed). The only way to > avoid adding this to calculate_sizes() would be to add logic to > order_store() to adjust the watermark when the order changes, but that > duplicates the same calculation that is required for initialization if > s->min_free_watermark does get a default value in kmem_cache_open() as > Pekka suggested. I applied the following to the patchset so that the initialization of the watermark is always separate from the updating as a result of changing /sys/kernel/slab/cache/order. Since the setting of a default watermark in kmem_cache_open() will require a calculation to find the corresponding min_free_watermark depending on the object size for a pre-defined default ratio, it will require the same calculation that is now in order_store(), but I agree it's simpler to understand and justifies the code duplication. 
--- diff --git a/mm/slub.c b/mm/slub.c index 76fa5a6..61ae612 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2187,7 +2187,6 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order) unsigned long flags = s->flags; unsigned long size = s->objsize; unsigned long align = s->align; - u16 thrash_ratio = 0; int order; /* @@ -2293,13 +2292,10 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order) /* * Determine the number of objects per slab */ - if (oo_objects(s->oo)) - thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); s->oo = oo_make(order, size); s->min = oo_make(get_order(size), size); if (oo_objects(s->oo) > oo_objects(s->max)) s->max = s->oo; - s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100; return !!oo_objects(s->oo); @@ -3824,6 +3820,7 @@ static ssize_t order_store(struct kmem_cache *s, const char *buf, size_t length) { unsigned long order; + unsigned long thrash_ratio; int err; err = strict_strtoul(buf, 10, &order); @@ -3833,7 +3830,9 @@ static ssize_t order_store(struct kmem_cache *s, if (order > slub_max_order || order < slub_min_order) return -EINVAL; + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); calculate_sizes(s, order); + s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100; return length; } ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-31 4:44 ` David Rientjes @ 2009-03-31 13:26 ` Christoph Lameter 2009-03-31 17:21 ` David Rientjes 0 siblings, 1 reply; 28+ messages in thread From: Christoph Lameter @ 2009-03-31 13:26 UTC (permalink / raw) To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Mon, 30 Mar 2009, David Rientjes wrote: > @@ -3833,7 +3830,9 @@ static ssize_t order_store(struct kmem_cache *s, > if (order > slub_max_order || order < slub_min_order) > return -EINVAL; > > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); > calculate_sizes(s, order); > + s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100; > return length; > } > Hmmm.. Still calculating the thrash ratio based on existing objects per slab and then resetting the objects per slab to a different number. Shouldn't the thrash_ratio simply be zapped to an initial value if the number of objects per slab changes? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-31 13:26 ` Christoph Lameter @ 2009-03-31 17:21 ` David Rientjes 2009-03-31 17:24 ` Christoph Lameter 0 siblings, 1 reply; 28+ messages in thread From: David Rientjes @ 2009-03-31 17:21 UTC (permalink / raw) To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Tue, 31 Mar 2009, Christoph Lameter wrote: > > @@ -3833,7 +3830,9 @@ static ssize_t order_store(struct kmem_cache *s, > > if (order > slub_max_order || order < slub_min_order) > > return -EINVAL; > > > > + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); > > calculate_sizes(s, order); > > + s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100; > > return length; > > } > > > > Hmmm.. Still calculating the thrash ratio based on existing objects per > slab and then resetting the objects per slab to a different number. > Shouldn't the thrash_ratio simply be zapped to an initial value if the > number of objects per slab changes? > Each cache with >= 20 objects per slab will get a default slab_thrash_ratio of 20 in v2 of the series. If the order of a cache is subsequently tuned, the default slab_thrash_ratio would be cleared without the user's knowledge. I'd agree that it should be cleared if the tunable had object units instead of a ratio, but the ratio simply applies to any given order. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-31 17:21 ` David Rientjes @ 2009-03-31 17:24 ` Christoph Lameter 2009-03-31 17:35 ` David Rientjes 0 siblings, 1 reply; 28+ messages in thread From: Christoph Lameter @ 2009-03-31 17:24 UTC (permalink / raw) To: David Rientjes; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Tue, 31 Mar 2009, David Rientjes wrote: > I'd agree that it should be cleared if the tunable had object units > instead of a ratio, but the ratio simply applies to any given order. Right, but resetting the order usually has a significant impact on the thrashing behavior (if it exists). Why would we keep the thrashing ratio that was calculated for another slab configuration? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 1/3] slub: add per-cache slab thrash ratio 2009-03-31 17:24 ` Christoph Lameter @ 2009-03-31 17:35 ` David Rientjes 0 siblings, 0 replies; 28+ messages in thread From: David Rientjes @ 2009-03-31 17:35 UTC (permalink / raw) To: Christoph Lameter; +Cc: Pekka Enberg, Nick Piggin, Martin Bligh, linux-kernel On Tue, 31 Mar 2009, Christoph Lameter wrote: > > I'd agree that it should be cleared if the tunable had object units > > instead of a ratio, but the ratio simply applies to any given order. > > Right, but resetting the order usually has a significant impact on the > thrashing behavior (if it exists). Why would we keep the thrashing ratio > that was calculated for another slab configuration? > Either the default thrashing ratio is being used and is unchanged from boot time, in which case it will still apply to the new order, or the ratio has already been changed and userspace is responsible for tuning it again as the result of the new slab size. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 0/3] slub partial list thrashing performance degradation 2009-03-30 5:43 [patch 0/3] slub partial list thrashing performance degradation David Rientjes 2009-03-30 5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes @ 2009-03-30 6:38 ` Pekka Enberg 1 sibling, 0 replies; 28+ messages in thread From: Pekka Enberg @ 2009-03-30 6:38 UTC (permalink / raw) To: David Rientjes; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel On Sun, 2009-03-29 at 22:43 -0700, David Rientjes wrote: > SLUB causes a performance degradation in comparison to SLAB when a > workload has an object allocation and freeing pattern such that it spends > more time in partial list handling than utilizing the fastpaths. Christoph, Nick, any objections to merging this? The patches look sane and the numbers convincing enough to me. Pekka ^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 1/3] slub: add per-cache slab thrash ratio @ 2009-03-26 9:42 David Rientjes 2009-03-26 9:42 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes 0 siblings, 1 reply; 28+ messages in thread From: David Rientjes @ 2009-03-26 9:42 UTC (permalink / raw) To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel Adds /sys/kernel/slab/cache/slab_thrash_ratio, which represents the percentage of a slab's objects that the fastpath must fulfill to not be considered thrashing on a per-cpu basis[*]. "Thrashing" here is defined as the constant swapping of the cpu slab such that the slowpath is followed the majority of the time because the refilled cpu slab can only accommodate a small number of allocations. This occurs when the object allocation and freeing pattern for a cache is such that it spends more time swapping the cpu slab than fulfilling fastpath allocations. [*] A single instance of the thrash ratio not being reached in the fastpath does not indicate the cpu cache is thrashing. A pre-defined value will later be added to determine how many times the ratio must not be reached before a cache is actually thrashing. This is defined as a ratio based on the number of objects in a cache's slab. This is automatically changed when /sys/kernel/slab/cache/order is changed to reflect the same ratio. The netperf TCP_RR benchmark illustrates slab thrashing very well with a large number of threads. With a test length of 60 seconds, the following thread counts were used to show the effect of the allocation and freeing pattern of such a workload. Before this patchset: threads Transfer Rate (per sec) 10 66636.39 20 96311.02 40 103948.16 60 140977.62 80 166714.37 100 190431.35 200 244092.36 To identify the thrashing caches, the same workload was run with CONFIG_SLUB_STATS enabled. 
The following caches are obviously performing very poorly: cache ALLOC_FASTPATH ALLOC_SLOWPATH FREE_FASTPATH FREE_SLOWPATH kmalloc-256 45186169 15930724 88289 61028526 kmalloc-2048 33507239 27541884 46525 61002601 After this patchset (both caches with slab_thrash_ratios of 20): threads Transfer Rate (per sec) 10 68857.31 20 98335.04 40 124376.77 60 146014.14 80 177352.16 100 195467.61 200 245555.99 Although slabs may accommodate fewer objects than others when contiguous memory cannot be allocated for a cache's order, the ratio is still based on its configured `order' since slabs will exist on the partial list that will be able to fulfill such a requirement. The value is stored in terms of the number of objects that the ratio represents, not the ratio itself. This avoids costly arithmetic in the slowpath for a calculation that could otherwise be done only when `slab_thrash_ratio' or `order' is changed. This will also adjust the configured ratio to one that can actually be represented in terms of whole numbers: for example, if slab_thrash_ratio is set to 20 for a cache with 64 objects, the effective ratio is actually 3:16 (or 18.75%). This will be shown when reading the ratio since it is better to represent the actual ratio instead of a pseudo substitute. The slab_thrash_ratio for each cache does not have a non-zero default (yet?). 
Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: David Rientjes <rientjes@google.com> --- include/linux/slub_def.h | 1 + mm/slub.c | 29 +++++++++++++++++++++++++++++ 2 files changed, 30 insertions(+), 0 deletions(-) diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -94,6 +94,7 @@ struct kmem_cache { #ifdef CONFIG_SLUB_DEBUG struct kobject kobj; /* For sysfs */ #endif + u16 min_free_watermark; /* Calculated from slab thrash ratio */ #ifdef CONFIG_NUMA /* diff --git a/mm/slub.c b/mm/slub.c --- a/mm/slub.c +++ b/mm/slub.c @@ -2190,6 +2190,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order) unsigned long flags = s->flags; unsigned long size = s->objsize; unsigned long align = s->align; + u16 thrash_ratio = 0; int order; /* @@ -2295,10 +2296,13 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order) /* * Determine the number of objects per slab */ + if (oo_objects(s->oo)) + thrash_ratio = s->min_free_watermark * 100 / oo_objects(s->oo); s->oo = oo_make(order, size); s->min = oo_make(get_order(size), size); if (oo_objects(s->oo) > oo_objects(s->max)) s->max = s->oo; + s->min_free_watermark = oo_objects(s->oo) * thrash_ratio / 100; return !!oo_objects(s->oo); @@ -2320,6 +2324,7 @@ static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags, goto error; s->refcount = 1; + s->min_free_watermark = 0; #ifdef CONFIG_NUMA s->remote_node_defrag_ratio = 1000; #endif @@ -4089,6 +4094,29 @@ static ssize_t remote_node_defrag_ratio_store(struct kmem_cache *s, SLAB_ATTR(remote_node_defrag_ratio); #endif +static ssize_t slab_thrash_ratio_show(struct kmem_cache *s, char *buf) +{ + return sprintf(buf, "%d\n", + s->min_free_watermark * 100 / oo_objects(s->oo)); +} + +static ssize_t slab_thrash_ratio_store(struct kmem_cache *s, const char *buf, + size_t length) +{ + unsigned long ratio; + int err; + + err = 
strict_strtoul(buf, 10, &ratio); + if (err) + return err; + + if (ratio <= 100) + s->min_free_watermark = oo_objects(s->oo) * ratio / 100; + + return length; +} +SLAB_ATTR(slab_thrash_ratio); + #ifdef CONFIG_SLUB_STATS static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si) { @@ -4172,6 +4200,7 @@ static struct attribute *slab_attrs[] = { &shrink_attr.attr, &alloc_calls_attr.attr, &free_calls_attr.attr, + &slab_thrash_ratio_attr.attr, #ifdef CONFIG_ZONE_DMA &cache_dma_attr.attr, #endif ^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 2/3] slub: scan partial list for free slabs when thrashing 2009-03-26 9:42 [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes @ 2009-03-26 9:42 ` David Rientjes 0 siblings, 0 replies; 28+ messages in thread From: David Rientjes @ 2009-03-26 9:42 UTC (permalink / raw) To: Pekka Enberg; +Cc: Christoph Lameter, Nick Piggin, Martin Bligh, linux-kernel To determine when a slab is actually thrashing, it's insufficient to only look at the most recent allocation path. It's perfectly valid to swap the cpu slab with a partial slab that contains very few free objects if the goal is to quickly fill it since slub no longer needs to track such slabs. This is inefficient if an object will immediately be freed so that the full slab must be readded to the partial list. With certain object allocation and freeing patterns, it is possible to spend more time processing the partial list than utilizing the fastpaths. We already have a per-cache min_free_watermark setting that is configurable from userspace, which helps determine when we have excessive partial list handling. When a slab does not fulfill its watermark, it suggests that the cache may be thrashing. A pre-defined value, SLAB_THRASHING_THRESHOLD (which defaults to 3), is implemented to be used in conjunction with this statistic to determine when a slab is actually thrashing. Whenever a cpu cache satisfies a fastpath allocation, a fastpath counter is incremented. This counter is cleared whenever the slowpath is invoked. This tracks how many fastpath allocations the cpu slab has fulfilled before it must be refilled. When the slowpath must be invoked, a slowpath counter is incremented if the cpu slab did not fulfill the thrashing watermark. Otherwise, it is decremented. When the slowpath counter is greater than or equal to SLAB_THRASHING_THRESHOLD, the partial list is scanned for a slab that will be able to fulfill at least the number of objects required to not be considered thrashing. 
If no such slabs are available, the remote nodes are defragmented (if allowed) or a new slab is allocated. If a cpu slab must be swapped because the allocation is for a different node, both counters are cleared since this doesn't indicate any thrashing behavior. When /sys/kernel/slab/cache/slab_thrash_ratio is not set, this does not include any functional change other than the incrementing of a fastpath counter for the per-cpu cache. A new statistic, /sys/kernel/slab/cache/deferred_partial, indicates how many times a partial list was deferred because no slabs could satisfy the requisite number of objects for CONFIG_SLUB_STATS kernels. Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: David Rientjes <rientjes@google.com> --- include/linux/slub_def.h | 3 + mm/slub.c | 93 ++++++++++++++++++++++++++++++++++++---------- 2 files changed, 76 insertions(+), 20 deletions(-) diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -30,6 +30,7 @@ enum stat_item { DEACTIVATE_TO_TAIL, /* Cpu slab was moved to the tail of partials */ DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */ ORDER_FALLBACK, /* Number of times fallback was necessary */ + DEFERRED_PARTIAL, /* Defer local partial list for lack of objs */ NR_SLUB_STAT_ITEMS }; struct kmem_cache_cpu { @@ -38,6 +39,8 @@ struct kmem_cache_cpu { int node; /* The node of the page (or -1 for debug) */ unsigned int offset; /* Freepointer offset (in word units) */ unsigned int objsize; /* Size of an object (from kmem_cache) */ + u16 fastpath_allocs; /* Consecutive fast allocs before slowpath */ + u16 slowpath_allocs; /* Consecutive slow allocs before watermark */ #ifdef CONFIG_SLUB_STATS unsigned stat[NR_SLUB_STAT_ITEMS]; #endif diff --git a/mm/slub.c b/mm/slub.c --- a/mm/slub.c +++ b/mm/slub.c @@ -134,6 +134,19 @@ */ #define MAX_PARTIAL 10 +/* + * Number of successive slowpath allocations 
that have failed to allocate at + * least the number of objects in the fastpath to not be slab thrashing (as + * defined by the cache's slab thrash ratio). + * + * When an allocation follows the slowpath, it increments a counter in its cpu + * cache. If this counter exceeds the threshold, the partial list is scanned + * for a slab that will satisfy at least the cache's min_free_watermark in + * order for it to be used. Otherwise, the slab with the most free objects is + * used. + */ +#define SLAB_THRASHING_THRESHOLD 3 + #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \ SLAB_POISON | SLAB_STORE_USER) @@ -1252,28 +1265,30 @@ static void remove_partial(struct kmem_cache *s, struct page *page) } /* - * Lock slab and remove from the partial list. + * Remove from the partial list. * - * Must hold list_lock. + * Must hold n->list_lock and slab_lock(page). */ -static inline int lock_and_freeze_slab(struct kmem_cache_node *n, - struct page *page) +static inline void freeze_slab(struct kmem_cache_node *n, struct page *page) { - if (slab_trylock(page)) { - list_del(&page->lru); - n->nr_partial--; - __SetPageSlubFrozen(page); - return 1; - } - return 0; + list_del(&page->lru); + n->nr_partial--; + __SetPageSlubFrozen(page); +} + +static inline int skip_partial(struct kmem_cache *s, struct page *page) +{ + return (page->objects - page->inuse) < s->min_free_watermark; } /* * Try to allocate a partial slab from a specific node. */ -static struct page *get_partial_node(struct kmem_cache_node *n) +static struct page *get_partial_node(struct kmem_cache *s, + struct kmem_cache_node *n, int thrashing) { struct page *page; + int locked = 0; /* * Racy check. 
If we mistakenly see no partial slabs then we @@ -1286,9 +1301,28 @@ static struct page *get_partial_node(struct kmem_cache_node *n) spin_lock(&n->list_lock); list_for_each_entry(page, &n->partial, lru) - if (lock_and_freeze_slab(n, page)) + if (slab_trylock(page)) { + /* + * When the cpu cache is partial list thrashing, it's + * necessary to replace the cpu slab with one that will + * accommodate at least s->min_free_watermark objects + * to avoid excessive list_lock contention and cache + * polluting. + * + * If no such slabs exist on the partial list, remote + * nodes are defragmented if allowed. + */ + if (thrashing && skip_partial(s, page)) { + slab_unlock(page); + locked++; + continue; + } + freeze_slab(n, page); goto out; + } page = NULL; + if (locked) + stat(get_cpu_slab(s, raw_smp_processor_id()), DEFERRED_PARTIAL); out: spin_unlock(&n->list_lock); return page; @@ -1297,7 +1331,8 @@ out: /* * Get a page from somewhere. Search in increasing NUMA distances. */ -static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags) +static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags, + int thrashing) { #ifdef CONFIG_NUMA struct zonelist *zonelist; @@ -1336,7 +1371,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags) if (n && cpuset_zone_allowed_hardwall(zone, flags) && n->nr_partial > n->min_partial) { - page = get_partial_node(n); + page = get_partial_node(s, n, thrashing); if (page) return page; } @@ -1348,16 +1383,17 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags) /* * Get a partial page, lock it and return it. */ -static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node) +static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node, + int thrashing) { struct page *page; int searchnode = (node == -1) ? 
numa_node_id() : node; - page = get_partial_node(get_node(s, searchnode)); + page = get_partial_node(s, get_node(s, searchnode), thrashing); if (page || (flags & __GFP_THISNODE)) return page; - return get_any_partial(s, flags); + return get_any_partial(s, flags, thrashing); } /* @@ -1509,6 +1545,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, { void **object; struct page *new; + int is_empty = 0; /* We handle __GFP_ZERO in the caller */ gfpflags &= ~__GFP_ZERO; @@ -1517,7 +1554,8 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, goto new_slab; slab_lock(c->page); - if (unlikely(!node_match(c, node))) + is_empty = node_match(c, node); + if (unlikely(!is_empty)) goto another_slab; stat(c, ALLOC_REFILL); @@ -1542,7 +1580,17 @@ another_slab: deactivate_slab(s, c); new_slab: - new = get_partial(s, gfpflags, node); + if (is_empty) { + if (c->fastpath_allocs < s->min_free_watermark) + c->slowpath_allocs++; + else if (c->slowpath_allocs) + c->slowpath_allocs--; + } else + c->slowpath_allocs = 0; + c->fastpath_allocs = 0; + + new = get_partial(s, gfpflags, node, + c->slowpath_allocs > SLAB_THRASHING_THRESHOLD); if (new) { c->page = new; stat(c, ALLOC_FROM_PARTIAL); @@ -1611,6 +1659,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s, else { object = c->freelist; c->freelist = object[c->offset]; + c->fastpath_allocs++; stat(c, ALLOC_FASTPATH); } local_irq_restore(flags); @@ -1919,6 +1968,8 @@ static void init_kmem_cache_cpu(struct kmem_cache *s, c->node = 0; c->offset = s->offset / sizeof(void *); c->objsize = s->objsize; + c->fastpath_allocs = 0; + c->slowpath_allocs = 0; #ifdef CONFIG_SLUB_STATS memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned)); #endif @@ -4172,6 +4223,7 @@ STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate_to_head); STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail); STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees); STAT_ATTR(ORDER_FALLBACK, order_fallback); 
+STAT_ATTR(DEFERRED_PARTIAL, deferred_partial); #endif static struct attribute *slab_attrs[] = { @@ -4226,6 +4278,7 @@ static struct attribute *slab_attrs[] = { &deactivate_to_tail_attr.attr, &deactivate_remote_frees_attr.attr, &order_fallback_attr.attr, + &deferred_partial_attr.attr, #endif NULL }; ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2009-03-31 17:37 UTC | newest] Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-03-30 5:43 [patch 0/3] slub partial list thrashing performance degradation David Rientjes 2009-03-30 5:43 ` [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes 2009-03-30 5:43 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes 2009-03-30 5:43 ` [patch 3/3] slub: sort parital list " David Rientjes 2009-03-30 14:41 ` Christoph Lameter 2009-03-30 20:29 ` David Rientjes 2009-03-30 14:37 ` [patch 2/3] slub: scan partial list for free slabs " Christoph Lameter 2009-03-30 20:22 ` David Rientjes 2009-03-30 21:20 ` Christoph Lameter 2009-03-31 7:13 ` Pekka Enberg 2009-03-31 8:23 ` David Rientjes 2009-03-31 8:49 ` Pekka Enberg 2009-03-31 13:23 ` Christoph Lameter 2009-03-30 7:11 ` [patch 1/3] slub: add per-cache slab thrash ratio Pekka Enberg 2009-03-30 8:41 ` David Rientjes 2009-03-30 15:54 ` Mel Gorman 2009-03-30 20:38 ` David Rientjes 2009-03-30 14:30 ` Christoph Lameter 2009-03-30 20:12 ` David Rientjes 2009-03-30 21:19 ` Christoph Lameter 2009-03-30 22:48 ` David Rientjes 2009-03-31 4:44 ` David Rientjes 2009-03-31 13:26 ` Christoph Lameter 2009-03-31 17:21 ` David Rientjes 2009-03-31 17:24 ` Christoph Lameter 2009-03-31 17:35 ` David Rientjes 2009-03-30 6:38 ` [patch 0/3] slub partial list thrashing performance degradation Pekka Enberg -- strict thread matches above, loose matches on Subject: below -- 2009-03-26 9:42 [patch 1/3] slub: add per-cache slab thrash ratio David Rientjes 2009-03-26 9:42 ` [patch 2/3] slub: scan partial list for free slabs when thrashing David Rientjes