* [PATCH v3 0/4] Per-container dcache limitation
@ 2011-08-14 15:13 ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2011-08-14 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, containers, Pavel Emelyanov, Al Viro, Hugh Dickins,
      Nick Piggin, Andrea Arcangeli, Rik van Riel, Dave Hansen,
      James Bottomley, David Chinner, Glauber Costa

Hello,

This series is just like v2, except it addresses Eric's comments
regarding percpu variables.

Let me know if there are further comments, and I'll promptly address
them as well. Otherwise, I feel this is ready for inclusion.

Glauber Costa (4):
  factor out single-shrinker code
  Keep nr_dentry per super block
  limit nr_dentries per superblock
  parse options in the vfs level

 fs/dcache.c              |   47 ++++++++++++-
 fs/namespace.c           |  105 ++++++++++++++++++++++++++
 fs/super.c               |    5 +
 include/linux/dcache.h   |    4 +
 include/linux/fs.h       |    3 +
 include/linux/shrinker.h |    6 ++
 mm/vmscan.c              |  185 ++++++++++++++++++++++++----------------------
 7 files changed, 267 insertions(+), 88 deletions(-)

-- 
1.7.6

^ permalink raw reply	[flat|nested] 65+ messages in thread
* [PATCH v3 1/4] factor out single-shrinker code
@ 2011-08-14 15:13 ` Glauber Costa
From: Glauber Costa @ 2011-08-14 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, containers, Pavel Emelyanov, Al Viro, Hugh Dickins,
      Nick Piggin, Andrea Arcangeli, Rik van Riel, Dave Hansen,
      James Bottomley, David Chinner, Glauber Costa, Eric Dumazet

While shrinking our caches, vmscan.c passes through all registered
shrinkers, trying to free objects as it goes.

We would like to do that individually for some caches, like the
dcache, when certain conditions apply (for example, when we reach a
soon-to-exist maximum allowed size).

To avoid rewriting the same logic in more than one place, this patch
factors the shrink logic out into shrink_one_shrinker(), which we can
call from other places in the kernel.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/shrinker.h |    6 ++
 mm/vmscan.c              |  185 ++++++++++++++++++++++++----------------------
 2 files changed, 104 insertions(+), 87 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 790651b..c5db650 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -39,4 +39,10 @@ struct shrinker {
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 extern void register_shrinker(struct shrinker *);
 extern void unregister_shrinker(struct shrinker *);
+
+unsigned long shrink_one_shrinker(struct shrinker *shrinker,
+				  struct shrink_control *shrink,
+				  unsigned long nr_pages_scanned,
+				  unsigned long lru_pages);
+
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7ef6912..50dfc61 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -211,6 +211,102 @@ static inline int do_shrinker_shrink(struct shrinker *shrinker,
 }
 
 #define SHRINK_BATCH 128
+unsigned long shrink_one_shrinker(struct shrinker *shrinker,
+				  struct shrink_control *shrink,
+				  unsigned long nr_pages_scanned,
+				  unsigned long lru_pages)
+{
+	unsigned long ret = 0;
+	unsigned long long delta;
+	unsigned long total_scan;
+	unsigned long max_pass;
+	int shrink_ret = 0;
+	long nr;
+	long new_nr;
+	long batch_size = shrinker->batch ? shrinker->batch
+					  : SHRINK_BATCH;
+
+	/*
+	 * copy the current shrinker scan count into a local variable
+	 * and zero it so that other concurrent shrinker invocations
+	 * don't also do this scanning work.
+	 */
+	do {
+		nr = shrinker->nr;
+	} while (cmpxchg(&shrinker->nr, nr, 0) != nr);
+
+	total_scan = nr;
+	max_pass = do_shrinker_shrink(shrinker, shrink, 0);
+	delta = (4 * nr_pages_scanned) / shrinker->seeks;
+	delta *= max_pass;
+	do_div(delta, lru_pages + 1);
+	total_scan += delta;
+	if (total_scan < 0) {
+		printk(KERN_ERR "shrink_slab: %pF negative objects to "
+		       "delete nr=%ld\n",
+		       shrinker->shrink, total_scan);
+		total_scan = max_pass;
+	}
+
+	/*
+	 * We need to avoid excessive windup on filesystem shrinkers
+	 * due to large numbers of GFP_NOFS allocations causing the
+	 * shrinkers to return -1 all the time. This results in a large
+	 * nr being built up so when a shrink that can do some work
+	 * comes along it empties the entire cache due to nr >>>
+	 * max_pass.  This is bad for sustaining a working set in
+	 * memory.
+	 *
+	 * Hence only allow the shrinker to scan the entire cache when
+	 * a large delta change is calculated directly.
+	 */
+	if (delta < max_pass / 4)
+		total_scan = min(total_scan, max_pass / 2);
+
+	/*
+	 * Avoid risking looping forever due to too large nr value:
+	 * never try to free more than twice the estimate number of
+	 * freeable entries.
+	 */
+	if (total_scan > max_pass * 2)
+		total_scan = max_pass * 2;
+
+	trace_mm_shrink_slab_start(shrinker, shrink, nr,
+				   nr_pages_scanned, lru_pages,
+				   max_pass, delta, total_scan);
+
+	while (total_scan >= batch_size) {
+		int nr_before;
+
+		nr_before = do_shrinker_shrink(shrinker, shrink, 0);
+		shrink_ret = do_shrinker_shrink(shrinker, shrink,
+						batch_size);
+		if (shrink_ret == -1)
+			break;
+		if (shrink_ret < nr_before)
+			ret += nr_before - shrink_ret;
+		count_vm_events(SLABS_SCANNED, batch_size);
+		total_scan -= batch_size;
+
+		cond_resched();
+	}
+
+	/*
+	 * move the unused scan count back into the shrinker in a
+	 * manner that handles concurrent updates. If we exhausted the
+	 * scan, there is no need to do an update.
+	 */
+	do {
+		nr = shrinker->nr;
+		new_nr = total_scan + nr;
+		if (total_scan <= 0)
+			break;
+	} while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);
+
+	trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
+	return ret;
+}
+
 /*
  * Call the shrink functions to age shrinkable caches
  *
@@ -247,93 +343,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
-		unsigned long long delta;
-		unsigned long total_scan;
-		unsigned long max_pass;
-		int shrink_ret = 0;
-		long nr;
-		long new_nr;
-		long batch_size = shrinker->batch ? shrinker->batch
-						  : SHRINK_BATCH;
-
-		/*
-		 * copy the current shrinker scan count into a local variable
-		 * and zero it so that other concurrent shrinker invocations
-		 * don't also do this scanning work.
-		 */
-		do {
-			nr = shrinker->nr;
-		} while (cmpxchg(&shrinker->nr, nr, 0) != nr);
-
-		total_scan = nr;
-		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
-		delta = (4 * nr_pages_scanned) / shrinker->seeks;
-		delta *= max_pass;
-		do_div(delta, lru_pages + 1);
-		total_scan += delta;
-		if (total_scan < 0) {
-			printk(KERN_ERR "shrink_slab: %pF negative objects to "
-			       "delete nr=%ld\n",
-			       shrinker->shrink, total_scan);
-			total_scan = max_pass;
-		}
-
-		/*
-		 * We need to avoid excessive windup on filesystem shrinkers
-		 * due to large numbers of GFP_NOFS allocations causing the
-		 * shrinkers to return -1 all the time. This results in a large
-		 * nr being built up so when a shrink that can do some work
-		 * comes along it empties the entire cache due to nr >>>
-		 * max_pass.  This is bad for sustaining a working set in
-		 * memory.
-		 *
-		 * Hence only allow the shrinker to scan the entire cache when
-		 * a large delta change is calculated directly.
-		 */
-		if (delta < max_pass / 4)
-			total_scan = min(total_scan, max_pass / 2);
-
-		/*
-		 * Avoid risking looping forever due to too large nr value:
-		 * never try to free more than twice the estimate number of
-		 * freeable entries.
-		 */
-		if (total_scan > max_pass * 2)
-			total_scan = max_pass * 2;
-
-		trace_mm_shrink_slab_start(shrinker, shrink, nr,
-					nr_pages_scanned, lru_pages,
-					max_pass, delta, total_scan);
-
-		while (total_scan >= batch_size) {
-			int nr_before;
-
-			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
-			shrink_ret = do_shrinker_shrink(shrinker, shrink,
-							batch_size);
-			if (shrink_ret == -1)
-				break;
-			if (shrink_ret < nr_before)
-				ret += nr_before - shrink_ret;
-			count_vm_events(SLABS_SCANNED, batch_size);
-			total_scan -= batch_size;
-
-			cond_resched();
-		}
-
-		/*
-		 * move the unused scan count back into the shrinker in a
-		 * manner that handles concurrent updates. If we exhausted the
-		 * scan, there is no need to do an update.
-		 */
-		do {
-			nr = shrinker->nr;
-			new_nr = total_scan + nr;
-			if (total_scan <= 0)
-				break;
-		} while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);
-
-		trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
+		ret += shrink_one_shrinker(shrinker, shrink,
+					   nr_pages_scanned, lru_pages);
 	}
 	up_read(&shrinker_rwsem);
 out:
-- 
1.7.6
* Re: [PATCH v3 1/4] factor out single-shrinker code
  2011-08-14 15:13 ` Glauber Costa
@ 2011-08-15  6:43 ` Dave Chinner
From: Dave Chinner @ 2011-08-15 6:43 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov, Al Viro,
      Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
      Dave Hansen, James Bottomley, Eric Dumazet

On Sun, Aug 14, 2011 at 07:13:49PM +0400, Glauber Costa wrote:
> While shrinking our caches, vmscan.c passes through all
> registered shrinkers, trying to free objects as it goes.
>
> We would like to do that individually for some caches,
> like the dcache, when certain conditions apply (for
> example, when we reach a soon-to-exist maximum allowed size)
>
> To avoid re-writing the same logic at more than one place,
> this patch factors out the shrink logic at shrink_one_shrinker(),
> that we can call from other places of the kernel.
[...]
> +unsigned long shrink_one_shrinker(struct shrinker *shrinker,
> +				  struct shrink_control *shrink,
> +				  unsigned long nr_pages_scanned,
> +				  unsigned long lru_pages)
> +{

This isn't the right place to cut the code apart. The act of
shrinking a cache is separate to the act of calculating how much to
shrink. You're taking the vmscan interface and trying to hack around
it when direct calls to a single shrinker are needed.

[...]

This bit from here:

> +	while (total_scan >= batch_size) {
> +		int nr_before;
> +
> +		nr_before = do_shrinker_shrink(shrinker, shrink, 0);
> +		shrink_ret = do_shrinker_shrink(shrinker, shrink,
> +						batch_size);
> +		if (shrink_ret == -1)
> +			break;
> +		if (shrink_ret < nr_before)
> +			ret += nr_before - shrink_ret;
> +		count_vm_events(SLABS_SCANNED, batch_size);
> +		total_scan -= batch_size;
> +
> +		cond_resched();
> +	}

to here is the only common code. It takes a locked shrinker and a
set-up scan_control structure.

As it is, I've got a set of patches that I'm working on that fix the
garbage shrinker API and abstract this section of the code correctly
for external users. I'll try to get that finished for review later
this week.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* [PATCH v3 2/4] Keep nr_dentry per super block
@ 2011-08-14 15:13 ` Glauber Costa
From: Glauber Costa @ 2011-08-14 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, containers, Pavel Emelyanov, Al Viro, Hugh Dickins,
      Nick Piggin, Andrea Arcangeli, Rik van Riel, Dave Hansen,
      James Bottomley, David Chinner, Glauber Costa, Eric Dumazet

Now that we have per-sb shrinkers, it makes sense to have nr_dentries
stored per sb as well. We turn them into per-cpu counters so we can
keep accessing them without locking.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/dcache.c        |   12 +++++++++++-
 fs/super.c         |    4 ++++
 include/linux/fs.h |    2 ++
 3 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index b05aac3..815d9fd 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -151,7 +151,13 @@ static void __d_free(struct rcu_head *head)
 static void d_free(struct dentry *dentry)
 {
 	BUG_ON(dentry->d_count);
+	/*
+	 * It is cheaper to keep a global counter separate
+	 * than to scan through all superblocks when needed
+	 */
 	this_cpu_dec(nr_dentry);
+	percpu_counter_dec(&dentry->d_sb->s_nr_dentry);
+
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
@@ -1224,8 +1230,12 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	INIT_LIST_HEAD(&dentry->d_alias);
 	INIT_LIST_HEAD(&dentry->d_u.d_child);
 	d_set_d_op(dentry, dentry->d_sb->s_d_op);
-
+	/*
+	 * It is cheaper to keep a global counter separate
+	 * than to scan through all superblocks when needed
+	 */
 	this_cpu_inc(nr_dentry);
+	percpu_counter_inc(&dentry->d_sb->s_nr_dentry);
 
 	return dentry;
 }
diff --git a/fs/super.c b/fs/super.c
index 3f56a26..e95ac4f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -119,6 +119,9 @@ static struct super_block *alloc_super(struct file_system_type *type)
 			s = NULL;
 			goto out;
 		}
+
+		percpu_counter_init(&s->s_nr_dentry, 0);
+
 #ifdef CONFIG_SMP
 		s->s_files = alloc_percpu(struct list_head);
 		if (!s->s_files) {
@@ -199,6 +202,7 @@ static inline void destroy_super(struct super_block *s)
 #ifdef CONFIG_SMP
 	free_percpu(s->s_files);
 #endif
+	percpu_counter_destroy(&s->s_nr_dentry);
 	security_sb_free(s);
 	kfree(s->s_subtype);
 	kfree(s->s_options);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f23bcb7..8150f52 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1399,6 +1399,8 @@ struct super_block {
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	int			s_nr_dentry_unused;	/* # of dentry on lru */
 
+	struct percpu_counter	s_nr_dentry;	/* # of dentry on this sb */
+
 	/* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
 	spinlock_t		s_inode_lru_lock ____cacheline_aligned_in_smp;
 	struct list_head	s_inode_lru;		/* unused inode lru */
-- 
1.7.6
* Re: [PATCH v3 2/4] Keep nr_dentry per super block 2011-08-14 15:13 ` Glauber Costa @ 2011-08-14 17:38 ` Eric Dumazet -1 siblings, 0 replies; 65+ messages in thread From: Eric Dumazet @ 2011-08-14 17:38 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel, Dave Hansen, James Bottomley, David Chinner Le dimanche 14 août 2011 à 19:13 +0400, Glauber Costa a écrit : > Now that we have per-sb shrinkers, it makes sense to have nr_dentries > stored per sb as well. We turn them into per-cpu counters so we can > keep acessing them without locking. > > Signed-off-by: Glauber Costa <glommer@parallels.com> > } > diff --git a/fs/super.c b/fs/super.c > index 3f56a26..e95ac4f 100644 > --- a/fs/super.c > +++ b/fs/super.c > @@ -119,6 +119,9 @@ static struct super_block *alloc_super(struct file_system_type *type) > s = NULL; > goto out; > } > + > + percpu_counter_init(&s->s_nr_dentry, 0); > + You should take care of possible error return at this point. (We probably should check the percpu_counter_init() done in tcp_init(), but they are done at boot time, with nearly no chance of failure) ^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v3 3/4] limit nr_dentries per superblock 2011-08-14 15:13 ` Glauber Costa @ 2011-08-14 15:13 ` Glauber Costa -1 siblings, 0 replies; 65+ messages in thread From: Glauber Costa @ 2011-08-14 15:13 UTC (permalink / raw) To: linux-kernel Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Pavel Emelyanov, David Chinner, containers, Hugh Dickins, James Bottomley, Dave Hansen, Al Viro, linux-fsdevel This patch lays the foundation for us to limit the dcache size. Each super block can have only a maximum number of dentries under its sub-tree. Allocation fails if we're over the limit and the cache can't be pruned to free up space for the newcomers. Signed-off-by: Glauber Costa <glommer@parallels.com> CC: Dave Chinner <david@fromorbit.com> CC: Eric Dumazet <eric.dumazet@gmail.com> --- fs/dcache.c | 28 ++++++++++++++++++++++++++++ fs/super.c | 1 + include/linux/fs.h | 1 + 3 files changed, 30 insertions(+), 0 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 815d9fd..ddd02a2 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1180,6 +1180,31 @@ void shrink_dcache_parent(struct dentry * parent) } EXPORT_SYMBOL(shrink_dcache_parent); +static inline int dcache_mem_check(struct super_block *sb) +{ + struct shrink_control sc = { + .gfp_mask = GFP_KERNEL, + }; + + if (sb->s_nr_dentry_max == INT_MAX) + return 0; + + do { + int nr_dentry; + + nr_dentry = percpu_counter_read_positive(&sb->s_nr_dentry); + if (nr_dentry > sb->s_nr_dentry_max) + nr_dentry = + percpu_counter_sum_positive(&sb->s_nr_dentry); + if (nr_dentry < sb->s_nr_dentry_max) + return 0; + + /* nr_pages = 1, lru_pages = 0 should get delta ~ 2 */ + } while (shrink_one_shrinker(&sb->s_shrink, &sc, 1, 0)); + + return -ENOMEM; +} + /** * __d_alloc - allocate a dcache entry * @sb: filesystem it will belong to @@ -1195,6 +1220,9 @@
struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name) struct dentry *dentry; char *dname; + if (dcache_mem_check(sb)) + return NULL; + dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL); if (!dentry) return NULL; diff --git a/fs/super.c b/fs/super.c index e95ac4f..3db40fb 100644 --- a/fs/super.c +++ b/fs/super.c @@ -121,6 +121,7 @@ static struct super_block *alloc_super(struct file_system_type *type) } percpu_counter_init(&s->s_nr_dentry, 0); + s->s_nr_dentry_max = INT_MAX; #ifdef CONFIG_SMP s->s_files = alloc_percpu(struct list_head); diff --git a/include/linux/fs.h b/include/linux/fs.h index 8150f52..bc773c3 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1400,6 +1400,7 @@ struct super_block { int s_nr_dentry_unused; /* # of dentry on lru */ struct percpu_counter s_nr_dentry; /* # of dentry on this sb */ + int s_nr_dentry_max; /* max # of dentry on this sb*/ /* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */ spinlock_t s_inode_lru_lock ____cacheline_aligned_in_smp; -- 1.7.6 ^ permalink raw reply related [flat|nested] 65+ messages in thread
[parent not found: <1313334832-1150-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v3 3/4] limit nr_dentries per superblock [not found] ` <1313334832-1150-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2011-08-15 7:03 ` Dave Chinner 2011-08-15 7:12 ` Pekka Enberg 1 sibling, 0 replies; 65+ messages in thread From: Dave Chinner @ 2011-08-15 7:03 UTC (permalink / raw) To: Glauber Costa Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Pavel Emelyanov, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Hugh Dickins, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Dave Hansen, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On Sun, Aug 14, 2011 at 07:13:51PM +0400, Glauber Costa wrote: > This patch lays the foundation for us to limit the dcache size. > Each super block can have only a maximum amount of dentries under its > sub-tree. Allocation fails if we we're over limit and the cache > can't be pruned to free up space for the newcomers. > > Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> > CC: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> > CC: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> > --- > fs/dcache.c | 28 ++++++++++++++++++++++++++++ > fs/super.c | 1 + > include/linux/fs.h | 1 + > 3 files changed, 30 insertions(+), 0 deletions(-) > > diff --git a/fs/dcache.c b/fs/dcache.c > index 815d9fd..ddd02a2 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -1180,6 +1180,31 @@ void shrink_dcache_parent(struct dentry * parent) > } > EXPORT_SYMBOL(shrink_dcache_parent); > > +static inline int dcache_mem_check(struct super_block *sb) > +{ > + struct shrink_control sc = { > + .gfp_mask = GFP_KERNEL, > + }; > + > + if (sb->s_nr_dentry_max == INT_MAX) > + return 0; > + > + do { > + int nr_dentry; > + > + nr_dentry = percpu_counter_read_positive(&sb->s_nr_dentry); > + if (nr_dentry > sb->s_nr_dentry_max) > + nr_dentry = > + percpu_counter_sum_positive(&sb->s_nr_dentry); > + if (nr_dentry < sb->s_nr_dentry_max) > + return 0; > + > + /* 
nr_pages = 1, lru_pages = 0 should get delta ~ 2 */ > + } while (shrink_one_shrinker(&sb->s_shrink, &sc, 1, 0)); > + > + return -ENOMEM; > +} The changes here don't address the comments I made previously about the error range of the per-cpu counter and the update frequency of the global counter w.r.t. small delta changes of the counted amount. That is, you cannot expect percpu_counter_read_positive() to change when you decrement a counter by 2 counts (assuming both objects are freed), and there's no guarantee that percpu_counter_sum_positive() will drop below the threshold either while percpu_counter_read_positive() is returning numbers above the threshold. That makes this a very expensive and oft-repeated loop. Indeed, that even assumes that the 2 objects that are freed are dentries. If there are more inodes than dentries in memory, then the dentries won't even be trimmed, and you'll get stuck in the horribly expensive loop trimming the cache 2 inodes at a time until the caches start to balance. Shrinking must be done in large batches to be effective, not one object at a time. Further, the shrinker may not make progress freeing items, which will result in this code returning ENOMEM when the shrinker may simply have been passing over referenced (i.e. cache hot) dentries. That will return spurious ENOMEM results back to d_alloc, resulting in spurious failures that will be impossible to track down... This also demonstrates my comment about where you factored shrink_one_shrinker() to be the wrong place. You've already got a shrink_control structure, so adding a value to .nr_to_scan in the initialisation gets around the entire need to do all that guess work of running through the vmscan-specific algorithms to get 2 objects freed. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v3 3/4] limit nr_dentries per superblock [not found] ` <1313334832-1150-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2011-08-15 7:03 ` Dave Chinner @ 2011-08-15 7:12 ` Pekka Enberg 1 sibling, 0 replies; 65+ messages in thread From: Pekka Enberg @ 2011-08-15 7:12 UTC (permalink / raw) To: Glauber Costa Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Pavel Emelyanov, David Chinner, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Hugh Dickins, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Dave Hansen, Al Viro, David Rientjes, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Lameter On Sun, Aug 14, 2011 at 6:13 PM, Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote: > This patch lays the foundation for us to limit the dcache size. > Each super block can have only a maximum amount of dentries under its > sub-tree. Allocation fails if we we're over limit and the cache > can't be pruned to free up space for the newcomers. 
> > Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> > CC: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> > CC: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> > --- > fs/dcache.c | 28 ++++++++++++++++++++++++++++ > fs/super.c | 1 + > include/linux/fs.h | 1 + > 3 files changed, 30 insertions(+), 0 deletions(-) > > diff --git a/fs/dcache.c b/fs/dcache.c > index 815d9fd..ddd02a2 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -1180,6 +1180,31 @@ void shrink_dcache_parent(struct dentry * parent) > } > EXPORT_SYMBOL(shrink_dcache_parent); > > +static inline int dcache_mem_check(struct super_block *sb) > +{ > + struct shrink_control sc = { > + .gfp_mask = GFP_KERNEL, > + }; > + > + if (sb->s_nr_dentry_max == INT_MAX) > + return 0; > + > + do { > + int nr_dentry; > + > + nr_dentry = percpu_counter_read_positive(&sb->s_nr_dentry); > + if (nr_dentry > sb->s_nr_dentry_max) > + nr_dentry = > + percpu_counter_sum_positive(&sb->s_nr_dentry); > + if (nr_dentry < sb->s_nr_dentry_max) > + return 0; > + > + /* nr_pages = 1, lru_pages = 0 should get delta ~ 2 */ > + } while (shrink_one_shrinker(&sb->s_shrink, &sc, 1, 0)); > + > + return -ENOMEM; > +} > + > /** > * __d_alloc - allocate a dcache entry > * @sb: filesystem it will belong to > @@ -1195,6 +1220,9 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name) > struct dentry *dentry; > char *dname; > > + if (dcache_mem_check(sb)) > + return NULL; > + > dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL); > if (!dentry) > return NULL; > diff --git a/fs/super.c b/fs/super.c > index e95ac4f..3db40fb 100644 > --- a/fs/super.c > +++ b/fs/super.c > @@ -121,6 +121,7 @@ static struct super_block *alloc_super(struct file_system_type *type) > } > > percpu_counter_init(&s->s_nr_dentry, 0); > + s->s_nr_dentry_max = INT_MAX; > > #ifdef CONFIG_SMP > s->s_files = alloc_percpu(struct list_head); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 
8150f52..bc773c3 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1400,6 +1400,7 @@ struct super_block { > int s_nr_dentry_unused; /* # of dentry on lru */ > > struct percpu_counter s_nr_dentry; /* # of dentry on this sb */ > + int s_nr_dentry_max; /* max # of dentry on this sb*/ > > /* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */ > spinlock_t s_inode_lru_lock ____cacheline_aligned_in_smp; We track the total number of objects in mm/slub.c when CONFIG_SLUB_DEBUG is enabled (look for n->total_objects in the code). Have you considered extending that for this purpose? Pekka ^ permalink raw reply [flat|nested] 65+ messages in thread
[parent not found: <CAOJsxLFYD9DkEU5R+9fZr-uaznteP8-BUC_Ti+FnNFqtbOCHSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v3 3/4] limit nr_dentries per superblock [not found] ` <CAOJsxLFYD9DkEU5R+9fZr-uaznteP8-BUC_Ti+FnNFqtbOCHSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2011-08-15 10:46 ` Dave Chinner 0 siblings, 0 replies; 65+ messages in thread From: Dave Chinner @ 2011-08-15 10:46 UTC (permalink / raw) To: Pekka Enberg Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Pavel Emelyanov, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Hugh Dickins, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Al Viro, David Rientjes, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Lameter, Dave Hansen On Mon, Aug 15, 2011 at 10:12:06AM +0300, Pekka Enberg wrote: > On Sun, Aug 14, 2011 at 6:13 PM, Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote: > > This patch lays the foundation for us to limit the dcache size. > > Each super block can have only a maximum amount of dentries under its > > sub-tree. Allocation fails if we we're over limit and the cache > > can't be pruned to free up space for the newcomers. ..... > We track the total number of objects in mm/slub.c when > CONFIG_SLUB_DEBUG is enabled (look for n->total_objects in the code). > Have you considered extending that for this purpose? That's usage for the entire slab, though, and we don't have a dentry slab per superblock so I don't think that helps us. And with slab merging, I think that even if we did have a slab per superblock, they'd end up in the same slab context anyway, right? Ideally what we need is a slab, LRU and shrinkers all rolled into a single infrastructure handle so we can simply set them up per object, per context etc and not have to re-invent the wheel for every single slab cache/LRU/shrinker setup we have in the kernel. I've got a rough node-aware generic LRU/shrinker infrastructure prototype that is generic enough for most of the existing slab caches with shrinkers, but I haven't looked at what is needed to integrate it with the slab cache code. 
That's mainly because I don't like the idea of having to implement the same thing 3 times in 3 different ways and debug them all before anyone would consider it for inclusion in the kernel. Once I've sorted out the select_parent() use-the-LRU-for-disposal abuse and have a patch set that survives a 'rm -rf *' operation, maybe we can then talk about what is needed to integrate stuff into the slab caches.... Cheers, Dave. -- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 10:46   ` Dave Chinner
@ 2011-08-15 10:58     ` Pekka Enberg
  2011-08-15 11:05       ` Pavel Emelyanov
  1 sibling, 2 replies; 65+ messages in thread
From: Pekka Enberg @ 2011-08-15 10:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Glauber Costa, linux-kernel, linux-fsdevel, containers,
	Pavel Emelyanov, Al Viro, Hugh Dickins, Nick Piggin,
	Andrea Arcangeli, Rik van Riel, Dave Hansen, James Bottomley,
	Eric Dumazet, Christoph Lameter, David Rientjes

Hi Dave,

On Mon, Aug 15, 2011 at 1:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> That's usage for the entire slab, though, and we don't have a dentry
> slab per superblock so I don't think that helps us. And with slab
> merging, I think that even if we did have a slab per superblock,
> they'd end up in the same slab context anyway, right?

You could add a flag to disable slab merging but there's no sane way
to fix the per-superblock thing in slab.

On Mon, Aug 15, 2011 at 1:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> Ideally what we need is a slab, LRU and shrinkers all rolled into a
> single infrastructure handle so we can simply set them up per
> object, per context etc and not have to re-invent the wheel for
> every single slab cache/LRU/shrinker setup we have in the kernel.
>
> I've got a rough node-aware generic LRU/shrinker infrastructure
> prototype that is generic enough for most of the existing slab
> caches with shrinkers, but I haven't looked at what is needed to
> integrate it with the slab cache code. That's mainly because I don't
> like the idea of having to implement the same thing 3 times in 3
> different ways and debug them all before anyone would consider it
> for inclusion in the kernel.
>
> Once I've sorted out the select_parent() use-the-LRU-for-disposal
> abuse and have a patch set that survives a 'rm -rf *' operation,
> maybe we can then talk about what is needed to integrate stuff into
> the slab caches....

Well, now that I really understand what you're trying to do here, it's
probably best to keep slab as-is and implement "slab accounting" on
top of it.

You'd have something like you do now but in slightly more generic form:

	struct kmem_accounted_cache {
		struct kmem_cache *cache;
		/* ... statistics... */
	};

	void *kmem_accounted_alloc(struct kmem_accounted_cache *c)
	{
		if (/* within limits */)
			return kmem_cache_alloc(c->cache);

		return NULL;
	}

Does something like that make sense to you?

			Pekka
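Pekka's fragment above deliberately leaves the limit check as a comment. A minimal userspace model of the same idea might look like the sketch below; the `obj_size`, `used` and `limit` fields are invented here for illustration (the real structure would carry whatever statistics the accounting needs), and plain `malloc()`/`free()` stand in for `kmem_cache_alloc()`/`kmem_cache_free()`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace model of the "accounted cache" sketch. Field names are
 * illustrative, not an existing kernel API. */
struct kmem_accounted_cache {
	size_t obj_size;	/* size of each object in the cache */
	size_t used;		/* bytes currently accounted */
	size_t limit;		/* hard limit; 0 means unlimited */
};

void *kmem_accounted_alloc(struct kmem_accounted_cache *c)
{
	/* "within limits" check made concrete */
	if (c->limit && c->used + c->obj_size > c->limit)
		return NULL;		/* over limit: fail the allocation */
	c->used += c->obj_size;
	return malloc(c->obj_size);	/* stands in for kmem_cache_alloc() */
}

void kmem_accounted_free(struct kmem_accounted_cache *c, void *obj)
{
	free(obj);			/* stands in for kmem_cache_free() */
	c->used -= c->obj_size;
}
```

With `obj_size = 64` and `limit = 128`, the third allocation fails while the first two succeed, which is the whole point of the wrapper: the underlying cache is untouched, and only the accounting layer says no.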
* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 10:58     ` Pekka Enberg
@ 2011-08-15 11:05       ` Pavel Emelyanov
  1 sibling, 0 replies; 65+ messages in thread
From: Pavel Emelyanov @ 2011-08-15 11:05 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

On 08/15/2011 02:58 PM, Pekka Enberg wrote:
> Hi Dave,
>
> On Mon, Aug 15, 2011 at 1:46 PM, Dave Chinner <david@fromorbit.com> wrote:
>> That's usage for the entire slab, though, and we don't have a dentry
>> slab per superblock so I don't think that helps us. And with slab
>> merging, I think that even if we did have a slab per superblock,
>> they'd end up in the same slab context anyway, right?
>
> You could add a flag to disable slab merging but there's no sane way
> to fix the per-superblock thing in slab.
>
> On Mon, Aug 15, 2011 at 1:46 PM, Dave Chinner <david@fromorbit.com> wrote:
>> Ideally what we need is a slab, LRU and shrinkers all rolled into a
>> single infrastructure handle so we can simply set them up per
>> object, per context etc and not have to re-invent the wheel for
>> every single slab cache/LRU/shrinker setup we have in the kernel.
>>
>> I've got a rough node-aware generic LRU/shrinker infrastructure
>> prototype that is generic enough for most of the existing slab
>> caches with shrinkers, but I haven't looked at what is needed to
>> integrate it with the slab cache code. That's mainly because I don't
>> like the idea of having to implement the same thing 3 times in 3
>> different ways and debug them all before anyone would consider it
>> for inclusion in the kernel.
>>
>> Once I've sorted out the select_parent() use-the-LRU-for-disposal
>> abuse and have a patch set that survives a 'rm -rf *' operation,
>> maybe we can then talk about what is needed to integrate stuff into
>> the slab caches....
>
> Well, now that I really understand what you're trying to do here, it's
> probably best to keep slab as-is and implement "slab accounting" on
> top of it.
>
> You'd have something like you do now but in slightly more generic form:
>
> 	struct kmem_accounted_cache {
> 		struct kmem_cache *cache;
> 		/* ... statistics... */
> 	};
>
> 	void *kmem_accounted_alloc(struct kmem_accounted_cache *c)
> 	{
> 		if (/* within limits */)
> 			return kmem_cache_alloc(c->cache);
>
> 		return NULL;
> 	}
>
> Does something like that make sense to you?

This would make sense, since per-cgroup kernel memory management is one
of the things we'd love to have, but this particular idea will
definitely not work in the case where we keep the containers' files on
one partition, keeping each container in its own chroot environment.
* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 11:05       ` Pavel Emelyanov
@ 2011-08-15 11:14         ` Pekka Enberg
  0 siblings, 0 replies; 65+ messages in thread
From: Pekka Enberg @ 2011-08-15 11:14 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

Hi Pavel,

On Mon, Aug 15, 2011 at 2:05 PM, Pavel Emelyanov <xemul@parallels.com> wrote:
> This will make sense, since per-cgroup kernel memory management is one of the
> things we'd love to have, but this particular idea will definitely not work in case
> we keep the containers' files on one partition keeping each container in its own
> chroot environment.

And you want a per-container dcache limit? Will the containers share
the same superblock? Couldn't you simply do a per-container "struct
kmem_accounted_cache" in struct superblock?

			Pekka
* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 11:14         ` Pekka Enberg
@ 2011-08-15 11:32           ` Pavel Emelyanov
  0 siblings, 0 replies; 65+ messages in thread
From: Pavel Emelyanov @ 2011-08-15 11:32 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

On 08/15/2011 03:14 PM, Pekka Enberg wrote:
> Hi Pavel,
>
> On Mon, Aug 15, 2011 at 2:05 PM, Pavel Emelyanov <xemul@parallels.com> wrote:
>> This will make sense, since per-cgroup kernel memory management is one of the
>> things we'd love to have, but this particular idea will definitely not work in case
>> we keep the containers' files on one partition keeping each container in its own
>> chroot environment.
>
> And you want a per-container dcache limit?

To be more specific - we want to protect a node running more than one
container from any one of them growing the dcache without bound. One of
the solutions to this is a per-container dcache limit.

> Will the containers share the same superblock?

Yes, this is the typical scenario for both OpenVZ and LXC now.

> Couldn't you simply do a per-container "struct kmem_accounted_cache" in struct superblock?

If by this you mean "account for all the kmem associated with a
particular superblock" then this is OK for us, but this can't be done in
a simple

	if (used + size > limit)
		return -ENOMEM;
	else {
		used += size;
		return 0;
	}

manner, since once we hit the limit we should shrink the unused
dentries. And most of the patches are about this.
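Pavel's point - that hitting the limit should first trigger reclaim of unused dentries rather than an immediate failure - can be modelled in userspace roughly as below. All names here are invented for illustration; the actual patches hook the same idea into the per-superblock dentry LRU and the shrinker machinery:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model of "shrink on limit": when an allocation would exceed the
 * limit, first reclaim entries from an unused list (standing in for
 * the per-superblock dentry LRU), then recheck the limit. */

struct entry {
	struct entry *next;
};

struct dcache_model {
	size_t nr;		/* entries currently accounted */
	size_t limit;		/* per-superblock limit on entries */
	struct entry *unused;	/* reclaimable (unused) entries */
};

/* Reclaim up to 'want' unused entries; returns how many were freed. */
static size_t shrink_unused(struct dcache_model *d, size_t want)
{
	size_t freed = 0;

	while (freed < want && d->unused) {
		struct entry *e = d->unused;
		d->unused = e->next;
		free(e);
		d->nr--;
		freed++;
	}
	return freed;
}

struct entry *alloc_entry(struct dcache_model *d)
{
	if (d->nr + 1 > d->limit)
		shrink_unused(d, 1);	/* try to make room first */
	if (d->nr + 1 > d->limit)
		return NULL;		/* still over limit: hard failure */
	d->nr++;
	return calloc(1, sizeof(struct entry));
}

/* Mark an entry unused: it stays accounted but becomes reclaimable. */
void retire_entry(struct dcache_model *d, struct entry *e)
{
	e->next = d->unused;
	d->unused = e;
}
```

The difference from the plain accounting check is visible at the limit: an allocation fails only when nothing on the unused list can be reclaimed, which is exactly why a bare `used + size > limit` test is not enough on its own.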
* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 11:32           ` Pavel Emelyanov
@ 2011-08-15 11:55             ` Pekka Enberg
  0 siblings, 0 replies; 65+ messages in thread
From: Pekka Enberg @ 2011-08-15 11:55 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

On Mon, Aug 15, 2011 at 2:32 PM, Pavel Emelyanov <xemul@parallels.com> wrote:
>> Couldn't you simply do a per-container "struct kmem_accounted_cache" in struct superblock?
>
> If by this you mean "account for all the kmem associated with a particular superblock" then
> this is OK for us, but this can't be done in a simple
>
> 	if (used + size > limit)
> 		return -ENOMEM;
> 	else {
> 		used += size;
> 		return 0;
> 	}
>
> manner, since once we hit the limit we should shrink the unused dentries. And most of the
> patches are about this.

So you want to shrink dentries that were allocated in the container
that hit the dcache limit? How does this patch set deal with that? Or
are you referring to some other patches?

			Pekka
* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 11:55             ` Pekka Enberg
@ 2011-08-15 12:12               ` Pavel Emelyanov
  0 siblings, 0 replies; 65+ messages in thread
From: Pavel Emelyanov @ 2011-08-15 12:12 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

On 08/15/2011 03:55 PM, Pekka Enberg wrote:
> So you want to shrink dentries that were allocated in the container
> that hit the dcache limit? How does this patch set deal with that? Or
> are you referring to some other patches?

Yes, this patchset does only the API part. The full idea is implemented
(with a poor API, though) here:
http://lwn.net/Articles/441164/
* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 12:12               ` Pavel Emelyanov
@ 2011-08-15 12:23                 ` Pekka Enberg
  0 siblings, 0 replies; 65+ messages in thread
From: Pekka Enberg @ 2011-08-15 12:23 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

On Mon, Aug 15, 2011 at 3:12 PM, Pavel Emelyanov <xemul@parallels.com> wrote:
> Yes, this patchset does only the API part. The full idea is implemented (with poor API though) here:
> http://lwn.net/Articles/441164/

OK. I suppose it's best to keep it separate from the slab code for now
then. Maybe you're able to come up with something more generic that we
could make part of the slab APIs later.

			Pekka
* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 12:23                 ` Pekka Enberg
@ 2011-08-15 12:37                   ` Pavel Emelyanov
  0 siblings, 0 replies; 65+ messages in thread
From: Pavel Emelyanov @ 2011-08-15 12:37 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

On 08/15/2011 04:23 PM, Pekka Enberg wrote:
> OK. I suppose it's best to keep it separate from the slab code for now
> then. Maybe you're able to come up with something more generic that we
> could make part of the slab APIs later.

Yup! We also have plans for having control over the kernel memory usage
by containers, so we'll certainly get back to it :)
* Re: [PATCH v3 3/4] limit nr_dentries per superblock @ 2011-08-15 12:37 ` Pavel Emelyanov 0 siblings, 0 replies; 65+ messages in thread From: Pavel Emelyanov @ 2011-08-15 12:37 UTC (permalink / raw) To: Pekka Enberg Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel, containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet, Christoph Lameter, David Rientjes >> Yes, this patchset does only the API part. The full idea is implemented (with poor API though) here: >> http://lwn.net/Articles/441164/ > > OK. I suppose it's best to keep it separate of the slab code for now > then. Maybe you're able to come with something more generic that we > could make part of slab APIs later. Yup! We also have plans for having control over the kernel memory usage by containers, so we'll certainly get back to it :) > Pekka > . > ^ permalink raw reply [flat|nested] 65+ messages in thread
[parent not found: <CAOJsxLEVfGrzUQv0fOpOyw3AaOLOHcWbvLJL1NdrHS6M2j5o1g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v3 3/4] limit nr_dentries per superblock [not found] ` <CAOJsxLEVfGrzUQv0fOpOyw3AaOLOHcWbvLJL1NdrHS6M2j5o1g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2011-08-15 12:37 ` Pavel Emelyanov 0 siblings, 0 replies; 65+ messages in thread From: Pavel Emelyanov @ 2011-08-15 12:37 UTC (permalink / raw) To: Pekka Enberg Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Hugh Dickins, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Chinner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Al Viro, David Rientjes, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Lameter, Dave Hansen >> Yes, this patchset does only the API part. The full idea is implemented (with poor API though) here: >> http://lwn.net/Articles/441164/ > > OK. I suppose it's best to keep it separate of the slab code for now > then. Maybe you're able to come with something more generic that we > could make part of slab APIs later. Yup! We also have plans for having control over the kernel memory usage by containers, so we'll certainly get back to it :) > Pekka > . > ^ permalink raw reply [flat|nested] 65+ messages in thread
[parent not found: <4E490D47.8050105-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v3 3/4] limit nr_dentries per superblock [not found] ` <4E490D47.8050105-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2011-08-15 12:23 ` Pekka Enberg 0 siblings, 0 replies; 65+ messages in thread From: Pekka Enberg @ 2011-08-15 12:23 UTC (permalink / raw) To: Pavel Emelyanov Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Hugh Dickins, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Chinner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Al Viro, David Rientjes, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Lameter, Dave Hansen On Mon, Aug 15, 2011 at 3:12 PM, Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote: > On 08/15/2011 03:55 PM, Pekka Enberg wrote: >> On Mon, Aug 15, 2011 at 2:32 PM, Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote: >>>> Couldn't you simply do per-container "struct kmem_accounted_cache" in struct superblock? >>> >>> If by this you mean "account for all the kmem associated with particular superblock" then >>> this is OK for us, but this can't be done in a simple >>> >>> if (used + size > limit) >>> return -ENOMEM >>> else { >>> used += size; >>> return 0; >>> } >>> >>> manner, since once we hit the limit we should shrink the unused dentries. And most of the >>> patches are about this. >> >> So you want to shrink dentries that were allocated in the container >> that hit the dcache limit? How does this patch set deal with that? Or >> are you referring to some other patches? > > Yes, this patchset does only the API part. The full idea is implemented (with poor API though) here: > http://lwn.net/Articles/441164/ OK. I suppose it's best to keep it separate of the slab code for now then. Maybe you're able to come with something more generic that we could make part of slab APIs later. Pekka ^ permalink raw reply [flat|nested] 65+ messages in thread
[parent not found: <CAOJsxLErFcxuuqnWkRbOAHEFbeGrKp3YMZ-144=m16oBXCHJ2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v3 3/4] limit nr_dentries per superblock [not found] ` <CAOJsxLErFcxuuqnWkRbOAHEFbeGrKp3YMZ-144=m16oBXCHJ2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2011-08-15 12:12 ` Pavel Emelyanov 0 siblings, 0 replies; 65+ messages in thread From: Pavel Emelyanov @ 2011-08-15 12:12 UTC (permalink / raw) To: Pekka Enberg Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Hugh Dickins, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Chinner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Al Viro, David Rientjes, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Lameter, Dave Hansen On 08/15/2011 03:55 PM, Pekka Enberg wrote: > On Mon, Aug 15, 2011 at 2:32 PM, Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote: >>> Couldn't you simply do per-container "struct kmem_accounted_cache" in struct superblock? >> >> If by this you mean "account for all the kmem associated with particular superblock" then >> this is OK for us, but this can't be done in a simple >> >> if (used + size > limit) >> return -ENOMEM >> else { >> used += size; >> return 0; >> } >> >> manner, since once we hit the limit we should shrink the unused dentries. And most of the >> patches are about this. > > So you want to shrink dentries that were allocated in the container > that hit the dcache limit? How does this patch set deal with that? Or > are you referring to some other patches? Yes, this patchset does only the API part. The full idea is implemented (with poor API though) here: http://lwn.net/Articles/441164/ > Pekka > . > ^ permalink raw reply [flat|nested] 65+ messages in thread
[parent not found: <4E4903C1.9050207-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v3 3/4] limit nr_dentries per superblock [not found] ` <4E4903C1.9050207-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2011-08-15 11:55 ` Pekka Enberg 0 siblings, 0 replies; 65+ messages in thread From: Pekka Enberg @ 2011-08-15 11:55 UTC (permalink / raw) To: Pavel Emelyanov Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Hugh Dickins, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Chinner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Al Viro, David Rientjes, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Lameter, Dave Hansen On Mon, Aug 15, 2011 at 2:32 PM, Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote: >> Couldn't you simply do per-container "struct kmem_accounted_cache" in struct superblock? > > If by this you mean "account for all the kmem associated with particular superblock" then > this is OK for us, but this can't be done in a simple > > if (used + size > limit) > return -ENOMEM > else { > used += size; > return 0; > } > > manner, since once we hit the limit we should shrink the unused dentries. And most of the > patches are about this. So you want to shrink dentries that were allocated in the container that hit the dcache limit? How does this patch set deal with that? Or are you referring to some other patches? Pekka ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v3 3/4] limit nr_dentries per superblock [not found] ` <CAOJsxLFjZffC9i5kQsBWjGy3eWdj3VMB5awz=yPcwgeKb5BG0Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2011-08-15 11:32 ` Pavel Emelyanov 2011-08-16 2:11 ` Dave Chinner 1 sibling, 0 replies; 65+ messages in thread From: Pavel Emelyanov @ 2011-08-15 11:32 UTC (permalink / raw) To: Pekka Enberg Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Hugh Dickins, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Chinner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Al Viro, David Rientjes, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Lameter, Dave Hansen On 08/15/2011 03:14 PM, Pekka Enberg wrote: > Hi Pavel, > > On Mon, Aug 15, 2011 at 2:05 PM, Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote: >> This will make sense, since the kernel memory management per-cgroup is one of the >> things we'd live to have, but this particular idea will definitely not work in case >> we keep the containers' files on one partition keeping each container in its own >> chroot environment. > > And you want a per-container dcache limit? To be more specific - we want to protect the node with >1 containers from one of them growing the dcache infinitely. One of the solutions to this - per container dcache limit. > Will the containers share the same superblock? Yes, this is typical scenario for both OpenVZ and LXC now. > Couldn't you simply do per-container "struct kmem_accounted_cache" in struct superblock? If by this you mean "account for all the kmem associated with particular superblock" then this is OK for us, but this can't be done in a simple if (used + size > limit) return -ENOMEM else { used += size; return 0; } manner, since once we hit the limit we should shrink the unused dentries. And most of the patches are about this. > Pekka > . > ^ permalink raw reply [flat|nested] 65+ messages in thread
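Pavel's point, that the accounting cannot simply fail at the limit but must first shrink unused dentries, can be sketched in plain C. All names here (dcache_try_charge, shrink_unused, the dcache_account struct) are hypothetical stand-ins, not code from the patch set:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-superblock accounting state. */
struct dcache_account {
	long used;	/* dentries currently allocated */
	long limit;	/* the per-superblock cap */
	long unused;	/* dentries on the LRU that could be reclaimed */
};

/* Stand-in for walking the LRU and freeing up to 'nr' unused dentries. */
static long shrink_unused(struct dcache_account *a, long nr)
{
	long freed = nr < a->unused ? nr : a->unused;

	a->unused -= freed;
	a->used -= freed;
	return freed;
}

/*
 * Charge 'size' objects against the limit.  On pressure, try to shrink
 * the unused part of the cache first; fail only if that is not enough.
 */
static int dcache_try_charge(struct dcache_account *a, long size)
{
	if (a->used + size > a->limit) {
		long need = a->used + size - a->limit;

		if (shrink_unused(a, need) < need)
			return -ENOMEM;
	}
	a->used += size;
	return 0;
}
```

The point of the sketch is only the ordering: reclaim is attempted before the allocation is refused, which is what distinguishes this from the naive if (used + size > limit) check quoted above.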
* Re: [PATCH v3 3/4] limit nr_dentries per superblock [not found] ` <CAOJsxLFjZffC9i5kQsBWjGy3eWdj3VMB5awz=yPcwgeKb5BG0Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2011-08-15 11:32 ` Pavel Emelyanov @ 2011-08-16 2:11 ` Dave Chinner 1 sibling, 0 replies; 65+ messages in thread From: Dave Chinner @ 2011-08-16 2:11 UTC (permalink / raw) To: Pekka Enberg Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Pavel Emelyanov, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Hugh Dickins, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Al Viro, David Rientjes, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Lameter, Dave Hansen

On Mon, Aug 15, 2011 at 02:14:39PM +0300, Pekka Enberg wrote:
> Hi Pavel,
>
> On Mon, Aug 15, 2011 at 2:05 PM, Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:
> > This will make sense, since the kernel memory management per-cgroup is one of the
> > things we'd live to have, but this particular idea will definitely not work in case
> > we keep the containers' files on one partition keeping each container in its own
> > chroot environment.
>
> And you want a per-container dcache limit? Will the containers share
> the same superblock?

Yes, and that's one of the problems with the "arbitrary container"
approach to controlling the dentry cache size. Arbitrary containers
don't map easily to predictable and scalable LRU and reclaim
implementations. Hence right now the container scope is limited to
per-superblock.

> Couldn't you simply do per-container "struct
> kmem_accounted_cache" in struct superblock?

Probably could do it that way, but it's still not really an
integrated solution.
What we'll end up with is this LRU structure:

struct lru_node {
	struct list_head	lru;
	spinlock_t		lock;
	long			nr_items;
} ____cacheline_aligned_in_smp;

struct lru {
	struct kmem_accounted_cache	*cache;
	struct lru_node			lru_node[MAX_NUMNODES];
	nodemask_t			active_nodes;
	int (*isolate_item)(struct list_head *item);
	int (*dispose)(struct list_head *list);
}

Where the only thing that the lru->cache is used for is getting the
number of items allocated to the cache. Seems relatively pointless to
make that statistic abstraction for just a single value that we can
get via a simple per-cpu counter...

Then, when you consider SLUB has this structure for every individual
slab cache:

struct kmem_cache_node {
	spinlock_t list_lock;	/* Protect partial list and nr_partial */
	unsigned long nr_partial;
	struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG
	atomic_long_t nr_slabs;
	atomic_long_t total_objects;
	struct list_head full;
#endif
};

you can see why tight integration of the per-node LRU infrastructure
is appealing - there's no unnecessary duplication and the accounting
is done in the right spot. It also means there is only one shrinker
implementation for all slabs, with a couple of simple per-slab
callbacks for isolating objects for disposal and then to dispose of
them. This would mean that most slab caches that use shrinkers would
no longer need to implement their own LRU, get LRU scalability and
node-aware reclaim for free, have built in size limits, etc.

And FWIW, integrating the LRU shrinker mechanism into the slab cache
also provides the mechanisms needed for capping the size of the cache
as well as slab defragmentation. Much smarter things can be done when
you know both the age and the locality of objects. e.g. there's no
point preventing allocation from a slab due to maximum object count
limitations if there are partial pages in the slab cache because the
allocation can be done without increasing memory footprint.....

Cheers,

Dave.
-- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org ^ permalink raw reply [flat|nested] 65+ messages in thread
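Dave's isolate/dispose split can be illustrated with a userspace toy. This is not his prototype: the list helpers, the single-node LRU, and the function names are assumptions made only to keep the sketch self-contained. The generic walker pulls items off the cold end, moves the ones the isolate callback accepts onto a private batch list, and hands the whole batch to dispose:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal doubly-linked list, standing in for the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static void list_del(struct list_head *e)
{ e->prev->next = e->next; e->next->prev = e->prev; }
static void list_add(struct list_head *e, struct list_head *h)
{ e->next = h->next; e->prev = h; h->next->prev = e; h->next = e; }
static int list_empty(const struct list_head *h) { return h->next == h; }

/* One node of the LRU, as in the struct sketched above. */
struct lru_node {
	struct list_head lru;
	long nr_items;
};

struct lru {
	struct lru_node node;	/* toy version: a single node */
	int (*isolate_item)(struct list_head *item);
	long (*dispose)(struct list_head *list);
};

/* Generic shrink: isolate up to nr_to_scan items, then dispose of them. */
static long lru_shrink(struct lru *lru, long nr_to_scan)
{
	struct list_head batch;

	list_init(&batch);
	while (nr_to_scan-- > 0 && !list_empty(&lru->node.lru)) {
		struct list_head *item = lru->node.lru.prev;	/* cold end */

		list_del(item);
		if (lru->isolate_item(item)) {
			list_add(item, &batch);
			lru->node.nr_items--;
		} else {
			list_add(item, &lru->node.lru);	/* rotate to hot end */
		}
	}
	return list_empty(&batch) ? 0 : lru->dispose(&batch);
}

/* Example callbacks: take every item; count how many get disposed. */
static long disposed;
static int take_all(struct list_head *item) { (void)item; return 1; }
static long count_batch(struct list_head *list)
{
	long n = 0;
	struct list_head *p, *next;

	for (p = list->next; p != list; p = next) {
		next = p->next;
		list_del(p);
		n++;
	}
	disposed += n;
	return n;
}
```

The per-cache code shrinks to the two callbacks, which is the point being argued: the walking, batching, and accounting live once in the generic layer.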
* Re: [PATCH v3 3/4] limit nr_dentries per superblock [not found] ` <4E48FD8A.90401-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2011-08-15 11:14 ` Pekka Enberg 0 siblings, 0 replies; 65+ messages in thread From: Pekka Enberg @ 2011-08-15 11:14 UTC (permalink / raw) To: Pavel Emelyanov Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Hugh Dickins, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dave Chinner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Al Viro, David Rientjes, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Lameter, Dave Hansen Hi Pavel, On Mon, Aug 15, 2011 at 2:05 PM, Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote: > This will make sense, since the kernel memory management per-cgroup is one of the > things we'd live to have, but this particular idea will definitely not work in case > we keep the containers' files on one partition keeping each container in its own > chroot environment. And you want a per-container dcache limit? Will the containers share the same superblock? Couldn't you simply do per-container "struct kmem_accounted_cache" in struct superblock? Pekka ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v3 3/4] limit nr_dentries per superblock 2011-08-15 10:46 ` Dave Chinner 2011-08-15 10:58 ` Pekka Enberg @ 2011-08-15 10:58 ` Pekka Enberg 1 sibling, 0 replies; 65+ messages in thread From: Pekka Enberg @ 2011-08-15 10:58 UTC (permalink / raw) To: Dave Chinner Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Pavel Emelyanov, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Hugh Dickins, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Al Viro, David Rientjes, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Lameter, Dave Hansen Hi Dave, On Mon, Aug 15, 2011 at 1:46 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote: > That's usage for the entire slab, though, and we don't have a dentry > slab per superblock so I don't think that helps us. And with slab > merging, I think that even if we did have a slab per superblock, > they'd end up in the same slab context anyway, right? You could add a flag to disable slab merging but there's no sane way to fix the per-superblock thing in slab. On Mon, Aug 15, 2011 at 1:46 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote: > Ideally what we need is a slab, LRU and shrinkers all rolled into a > single infrastructure handle so we can simply set them up per > object, per context etc and not have to re-invent the wheel for > every single slab cache/LRU/shrinker setup we have in the kernel. > > I've got a rough node-aware generic LRU/shrinker infrastructure > prototype that is generic enough for most of the existing slab > caches with shrinkers, but I haven't looked at what is needed to > integrate it with the slab cache code. That's mainly because I don't > like the idea of having to implement the same thing 3 times in 3 > different ways and debug them all before anyone would consider it > for inclusion in the kernel. 
>
> Once I've sorted out the select_parent() use-the-LRU-for-disposal
> abuse and have a patch set that survives a 'rm -rf *' operation,
> maybe we can then talk about what is needed to integrate stuff into
> the slab caches....

Well, now that I really understand what you're trying to do here,
it's probably best to keep slab as-is and implement "slab accounting"
on top of it. You'd have something like you do now but in slightly
more generic form:

struct kmem_accounted_cache {
	struct kmem_cache *cache;

	/* ... statistics... */
}

void *kmem_accounted_alloc(struct kmem_accounted_cache *c)
{
	if (/* within limits */)
		return kmem_cache_alloc(c->cache);

	return NULL;
}

Does something like that make sense to you?

Pekka

^ permalink raw reply	[flat|nested] 65+ messages in thread
* [PATCH v3 4/4] parse options in the vfs level 2011-08-14 15:13 ` Glauber Costa @ 2011-08-14 15:13 ` Glauber Costa -1 siblings, 0 replies; 65+ messages in thread From: Glauber Costa @ 2011-08-14 15:13 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Pavel Emelyanov, David Chinner, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Hugh Dickins, James Bottomley, Dave Hansen, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

This patch introduces a simple generic vfs option parser. Right now, the
only option we have is to limit the size of the dcache. So any user that
wants a dcache entry limit can specify:

mount -o whatever_options,vfs_dcache_size=XXX <dev> <mntpoint>

It is supposed to work well with remounts, allowing the limit to change
multiple times over the course of the filesystem's lifecycle. I find
mount a natural interface for handling filesystem options, so that's
what I've chosen. Feel free to yell at it at will if you disagree.
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
CC: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
CC: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 fs/dcache.c            |    7 +++
 fs/namespace.c         |  105 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dcache.h |    4 ++
 3 files changed, 116 insertions(+), 0 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ddd02a2..0551305 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1205,6 +1205,13 @@ static inline int dcache_mem_check(struct super_block *sb)
 	return -ENOMEM;
 }
 
+int vfs_set_dcache_size(struct super_block *sb, int size)
+{
+	sb->s_nr_dentry_max = size;
+
+	return dcache_mem_check(sb);
+}
+
 /**
  * __d_alloc - allocate a dcache entry
  * @sb: filesystem it will belong to
diff --git a/fs/namespace.c b/fs/namespace.c
index 22bfe82..43b2cdb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -31,6 +31,7 @@
 #include <linux/idr.h>
 #include <linux/fs_struct.h>
 #include <linux/fsnotify.h>
+#include <linux/parser.h>
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
 #include "pnode.h"
@@ -958,6 +959,9 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 	};
 	const struct proc_fs_info *fs_infop;
 
+	if (sb->s_nr_dentry_max != INT_MAX)
+		seq_printf(m, ",vfs_dcache_size=%d",sb->s_nr_dentry_max);
+
 	for (fs_infop = fs_info; fs_infop->flag; fs_infop++) {
 		if (sb->s_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
@@ -2271,6 +2275,94 @@ int copy_mount_string(const void __user *data, char **where)
 	return 0;
 }
 
+static const match_table_t tokens = {
+	{1, "vfs_dcache_size=%u"},
+};
+
+struct vfs_options {
+	unsigned long vfs_dcache_size;
+};
+
+/**
+ * Generic option parsing for the VFS.
+ *
+ * Since most of the filesystems already do their own option parsing, and with
+ * very few code shared between them, this function strips out any options that
+ * we succeed in parsing ourselves. Passing them forward would just give the
+ * underlying fs an option it does not expect, leading it to fail.
+ *
+ * We don't yet have a pointer to the super block as well, since this is
+ * pre-mount. We accumulate in struct vfs_options whatever data we collected,
+ * and act on it later.
+ */
+static int vfs_parse_options(char *options, struct vfs_options *ops)
+{
+	substring_t args[MAX_OPT_ARGS];
+	int option;
+	char *p;
+	char *opt;
+	char *start = NULL;
+	int ret;
+
+	if (!options)
+		return 0;
+
+	opt = kstrdup(options, GFP_KERNEL);
+	if (!opt)
+		return 1;
+
+	ret = 1;
+
+	start = opt;
+	while ((p = strsep(&opt, ",")) != NULL) {
+		int token;
+		if (!*p)
+			continue;
+
+		/*
+		 * Initialize args struct so we know whether arg was
+		 * found; some options take optional arguments.
+		 */
+		args[0].to = args[0].from = 0;
+		token = match_token(p, tokens, args);
+		switch (token) {
+		case 1:
+			if (!args[0].from)
+				break;
+
+			if (match_int(&args[0], &option))
+				break;
+
+			if (option < DCACHE_MIN_SIZE) {
+				printk(KERN_INFO "dcache size %d smaller than "
+				       "minimum (%d)\n", option, DCACHE_MIN_SIZE);
+				option = DCACHE_MIN_SIZE;
+			}
+
+			ops->vfs_dcache_size = option;
+
+			/*
+			 * The actual filesystems don't expect any option
+			 * they don't understand to be received in the option
+			 * string. So we strip off anything we processed, and
+			 * give them a clean options string.
+			 */
+			ret = 0;
+			if (!opt) /* it is the last option listed */
+				*(options + (p - start)) = '\0';
+			else
+				strcpy(options + (p - start), opt);
+			break;
+		default:
+			ret = 0;
+			break;
+		}
+	}
+
+	kfree(start);
+	return ret;
+}
+
 /*
  * Flags is a 32-bit value that allows up to 31 non-fs dependent flags to
  * be given to the mount() call (ie: read-only, no-dev, no-suid etc).
@@ -2291,6 +2383,7 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 	struct path path;
 	int retval = 0;
 	int mnt_flags = 0;
+	struct vfs_options vfs_options;
 
 	/* Discard magic */
 	if ((flags & MS_MGC_MSK) == MS_MGC_VAL)
@@ -2318,6 +2411,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 	if (!(flags & MS_NOATIME))
 		mnt_flags |= MNT_RELATIME;
 
+
+	vfs_options.vfs_dcache_size = INT_MAX;
+	retval = vfs_parse_options(data_page, &vfs_options);
+	if (retval)
+		goto dput_out;
+
 	/* Separate the per-mountpoint flags */
 	if (flags & MS_NOSUID)
 		mnt_flags |= MNT_NOSUID;
@@ -2350,6 +2449,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 	else
 		retval = do_new_mount(&path, type_page, flags, mnt_flags,
 				      dev_name, data_page);
+
+	/* bind mounts get to respect their parents decision */
+	if (!retval && !(flags & MS_BIND))
+		vfs_set_dcache_size(path.mnt->mnt_sb,
+				    vfs_options.vfs_dcache_size);
+
 dput_out:
 	path_put(&path);
 	return retval;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d37d2a7..1a309f3 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -251,6 +251,10 @@ extern int d_invalidate(struct dentry *);
 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
 
+extern int vfs_set_dcache_size(struct super_block *sb, int size);
+
+#define DCACHE_MIN_SIZE 1024
+extern int vfs_set_dcache_size(struct super_block *sb, int size);
 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread
* Re: [PATCH v3 4/4] parse options in the vfs level 2011-08-14 15:13 ` Glauber Costa (?) @ 2011-08-14 15:39 ` Vasiliy Kulikov 2011-08-15 0:03 ` Glauber Costa 2011-08-15 0:03 ` Glauber Costa -1 siblings, 2 replies; 65+ messages in thread From: Vasiliy Kulikov @ 2011-08-14 15:39 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel, Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Pavel Emelyanov, David Chinner, containers, Hugh Dickins, James Bottomley, Dave Hansen, Al Viro, linux-fsdevel Hi Glauber, On Sun, Aug 14, 2011 at 19:13 +0400, Glauber Costa wrote: > +/** > + * Generic option parsing for the VFS. > + * > + * Since most of the filesystems already do their own option parsing, and with > + * very few code shared between them, this function strips out any options that > + * we succeed in parsing ourselves. Passing them forward would just give the > + * underlying fs an option it does not expect, leading it to fail. > + * > + * We don't yet have a pointer to the super block as well, since this is > + * pre-mount. We accumulate in struct vfs_options whatever data we collected, > + * and act on it later. > + */ > +static int vfs_parse_options(char *options, struct vfs_options *ops) > +{ > + substring_t args[MAX_OPT_ARGS]; > + int option; > + char *p; > + char *opt; > + char *start = NULL; > + int ret; > + > + if (!options) > + return 0; > + > + opt = kstrdup(options, GFP_KERNEL); > + if (!opt) > + return 1; > + > + ret = 1; > + > + start = opt; > + while ((p = strsep(&opt, ",")) != NULL) { > + int token; > + if (!*p) > + continue; > + > + /* > + * Initialize args struct so we know whether arg was > + * found; some options take optional arguments. > + */ > + args[0].to = args[0].from = 0; > + token = match_token(p, tokens, args); > + switch (token) { > + case 1: > + if (!args[0].from) > + break; > + > + if (match_int(&args[0], &option)) > + break; What if there are 2 passed options and the second fails? 
mount -o vfs_dcache_size=XXX,vfs_dcache_size=CRAP <dev> <mntpoint>

In this case you leave the second option and pass it to the fs option
parser (as you already set ret=0), which is wrong. I think you should
explicitly return 1 where you know the option is related to VFS, but you
failed to parse it. It would look even simpler than the current code.

(Yes, this is a rare situation, but I can imagine some program that
automatically adds mount options to the existing list and passes it to
mount.)

> +			if (option < DCACHE_MIN_SIZE) {
> +				printk(KERN_INFO "dcache size %d smaller than "
> +				       "minimum (%d)\n", option, DCACHE_MIN_SIZE);
> +				option = DCACHE_MIN_SIZE;
> +			}
> +
> +			ops->vfs_dcache_size = option;
> +
> +			/*
> +			 * The actual filesystems don't expect any option
> +			 * they don't understand to be received in the option
> +			 * string. So we strip off anything we processed, and
> +			 * give them a clean options string.
> +			 */
> +			ret = 0;
> +			if (!opt) /* it is the last option listed */
> +				*(options + (p - start)) = '\0';
> +			else
> +				strcpy(options + (p - start), opt);
> +			break;
> +		default:
> +			ret = 0;
> +			break;
> +		}
> +	}
> +
> +	kfree(start);
> +	return ret;
> +}

Thanks,

--
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH v3 4/4] parse options in the vfs level 2011-08-14 15:39 ` Vasiliy Kulikov @ 2011-08-15 0:03 ` Glauber Costa 2011-08-15 0:03 ` Glauber Costa 1 sibling, 0 replies; 65+ messages in thread From: Glauber Costa @ 2011-08-15 0:03 UTC (permalink / raw) To: Vasiliy Kulikov Cc: linux-kernel, Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Pavel Emelyanov, David Chinner, containers, Hugh Dickins, James Bottomley, Dave Hansen, Al Viro, linux-fsdevel On 08/14/2011 08:39 AM, Vasiliy Kulikov wrote: > Hi Glauber, > > On Sun, Aug 14, 2011 at 19:13 +0400, Glauber Costa wrote: >> +/** >> + * Generic option parsing for the VFS. >> + * >> + * Since most of the filesystems already do their own option parsing, and with >> + * very few code shared between them, this function strips out any options that >> + * we succeed in parsing ourselves. Passing them forward would just give the >> + * underlying fs an option it does not expect, leading it to fail. >> + * >> + * We don't yet have a pointer to the super block as well, since this is >> + * pre-mount. We accumulate in struct vfs_options whatever data we collected, >> + * and act on it later. >> + */ >> +static int vfs_parse_options(char *options, struct vfs_options *ops) >> +{ >> + substring_t args[MAX_OPT_ARGS]; >> + int option; >> + char *p; >> + char *opt; >> + char *start = NULL; >> + int ret; >> + >> + if (!options) >> + return 0; >> + >> + opt = kstrdup(options, GFP_KERNEL); >> + if (!opt) >> + return 1; >> + >> + ret = 1; >> + >> + start = opt; >> + while ((p = strsep(&opt, ",")) != NULL) { >> + int token; >> + if (!*p) >> + continue; >> + >> + /* >> + * Initialize args struct so we know whether arg was >> + * found; some options take optional arguments. 
>> + */ >> + args[0].to = args[0].from = 0; >> + token = match_token(p, tokens, args); >> + switch (token) { >> + case 1: >> + if (!args[0].from) >> + break; >> + >> + if (match_int(&args[0],&option)) >> + break; > > What if there are 2 passed options and the second fails? > > mount -o vfs_dcache_size=XXX,vfs_dcache_size=CRAP<dev> <mntpoint> > > In this case you leave the second option and pass it to the fs option > parser (as you already set ret=0), which is wrong. I think you should > explicitly return 1 where you know the option is related to VFS, but you > failed to parse it. It would look even simplier than current code. Good point, thank you. I agree. > (Yes, this is a rare situation, but I can imagine some program that > automatically adds mount options to the existing list and passes it to > mount.) > Absolutely. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v3 4/4] parse options in the vfs level [not found] ` <1313334832-1150-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2011-08-14 15:39 ` Vasiliy Kulikov @ 2011-08-15 7:09 ` Dave Chinner 1 sibling, 0 replies; 65+ messages in thread From: Dave Chinner @ 2011-08-15 7:09 UTC (permalink / raw) To: Glauber Costa Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Eric Dumazet, Pavel Emelyanov, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Hugh Dickins, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Dave Hansen, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On Sun, Aug 14, 2011 at 07:13:52PM +0400, Glauber Costa wrote: > This patch introduces a simple generic vfs option parser. > Right now, the only option we have is to limit the size of the dcache. > > So any user that wants to have a dcache entries limit, can specify: > > mount -o whatever_options,vfs_dcache_size=XXX <dev> <mntpoint> > > It is supposed to work well with remounts, allowing it to change > multiple times over the course of the filesystem's lifecycle. > > I find mount a natural interface for handling filesystem options, > so that's what I've chosen. Feel free to yell at it at will if > you disagree. It's already been noted by both myself and Al that there is not a 1:1 mapping between mount point and superblocks.... > @@ -2350,6 +2449,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page, > else > retval = do_new_mount(&path, type_page, flags, mnt_flags, > dev_name, data_page); > + > + /* bind mounts get to respect their parents decision */ > + if (!retval && !(flags & MS_BIND)) > + vfs_set_dcache_size(path.mnt->mnt_sb, > + vfs_options.vfs_dcache_size); > + And I'm not sure that silently ignoring it in certain cases is the best way around that problem. 
> dput_out: > path_put(&path); > return retval; > diff --git a/include/linux/dcache.h b/include/linux/dcache.h > index d37d2a7..1a309f3 100644 > --- a/include/linux/dcache.h > +++ b/include/linux/dcache.h > @@ -251,6 +251,10 @@ extern int d_invalidate(struct dentry *); > > /* only used at mount-time */ > extern struct dentry * d_alloc_root(struct inode *); > +extern int vfs_set_dcache_size(struct super_block *sb, int size); > + > +#define DCACHE_MIN_SIZE 1024 > +extern int vfs_set_dcache_size(struct super_block *sb, int size); Where did this minimum size come from and why was that figure chosen? Needs documenting. Cheers, Dave. -- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v3 4/4] parse options in the vfs level 2011-08-14 15:13 ` Glauber Costa ` (2 preceding siblings ...) (?) @ 2011-08-15 7:09 ` Dave Chinner 2011-08-24 2:19 ` Glauber Costa 2011-08-24 2:19 ` Glauber Costa -1 siblings, 2 replies; 65+ messages in thread From: Dave Chinner @ 2011-08-15 7:09 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet On Sun, Aug 14, 2011 at 07:13:52PM +0400, Glauber Costa wrote: > This patch introduces a simple generic vfs option parser. > Right now, the only option we have is to limit the size of the dcache. > > So any user that wants to have a dcache entries limit, can specify: > > mount -o whatever_options,vfs_dcache_size=XXX <dev> <mntpoint> > > It is supposed to work well with remounts, allowing it to change > multiple over the course of the filesystem's lifecycle. > > I find mount a natural interface for handling filesystem options, > so that's what I've choosen. Feel free to yell at it at will if > you disagree. It's already been noted by both myself and Al that there is not a 1:1 mapping between mount point and superblocks.... > @@ -2350,6 +2449,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page, > else > retval = do_new_mount(&path, type_page, flags, mnt_flags, > dev_name, data_page); > + > + /* bind mounts get to respect their parents decision */ > + if (!retval && !(flags & MS_BIND)) > + vfs_set_dcache_size(path.mnt->mnt_sb, > + vfs_options.vfs_dcache_size); > + And I'm not sure that silently ignoring it in certain cases is the best way around that problem. 
> dput_out: > path_put(&path); > return retval; > diff --git a/include/linux/dcache.h b/include/linux/dcache.h > index d37d2a7..1a309f3 100644 > --- a/include/linux/dcache.h > +++ b/include/linux/dcache.h > @@ -251,6 +251,10 @@ extern int d_invalidate(struct dentry *); > > /* only used at mount-time */ > extern struct dentry * d_alloc_root(struct inode *); > +extern int vfs_set_dcache_size(struct super_block *sb, int size); > + > +#define DCACHE_MIN_SIZE 1024 > +extern int vfs_set_dcache_size(struct super_block *sb, int size); Where did this minimum size come from and why was that figure chosen? Needs documenting. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v3 4/4] parse options in the vfs level 2011-08-15 7:09 ` Dave Chinner @ 2011-08-24 2:19 ` Glauber Costa 2011-08-24 2:19 ` Glauber Costa 1 sibling, 0 replies; 65+ messages in thread From: Glauber Costa @ 2011-08-24 2:19 UTC (permalink / raw) To: Dave Chinner Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet On 08/15/2011 04:09 AM, Dave Chinner wrote: > On Sun, Aug 14, 2011 at 07:13:52PM +0400, Glauber Costa wrote: >> This patch introduces a simple generic vfs option parser. >> Right now, the only option we have is to limit the size of the dcache. >> >> So any user that wants to have a dcache entries limit, can specify: >> >> mount -o whatever_options,vfs_dcache_size=XXX<dev> <mntpoint> >> >> It is supposed to work well with remounts, allowing it to change >> multiple over the course of the filesystem's lifecycle. >> >> I find mount a natural interface for handling filesystem options, >> so that's what I've choosen. Feel free to yell at it at will if >> you disagree. > > It's already been noted by both myself and Al that there is not a > 1:1 mapping between mount point and superblocks.... I had skipped that one in my pending patches. Well, it's already been noted by me as well. And by anybody. The problem, however, is not new. Filesystems exports fs-specific options that operate on the superblock as well. And we've been living alright with them. >> @@ -2350,6 +2449,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page, >> else >> retval = do_new_mount(&path, type_page, flags, mnt_flags, >> dev_name, data_page); >> + >> + /* bind mounts get to respect their parents decision */ >> + if (!retval&& !(flags& MS_BIND)) >> + vfs_set_dcache_size(path.mnt->mnt_sb, >> + vfs_options.vfs_dcache_size); >> + > > And I'm not sure that silently ignoring it in certain cases > is the best way around that problem. 
It is not "in certain cases". It is:
1) When bind mounting, and
2) When an error occurs.
Pretty well-defined. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v3 0/4] Per-container dcache limitation [not found] ` <1313334832-1150-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> ` (3 preceding siblings ...) 2011-08-14 15:13 ` Glauber Costa @ 2011-08-17 5:43 ` Dave Chinner 4 siblings, 0 replies; 65+ messages in thread From: Dave Chinner @ 2011-08-17 5:43 UTC (permalink / raw) To: Glauber Costa Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Pavel Emelyanov, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Hugh Dickins, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Dave Hansen, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On Sun, Aug 14, 2011 at 07:13:48PM +0400, Glauber Costa wrote: > Hello, > > This series is just like v2, except it addresses > Eric's comments regarding percpu variables. > > Let me know if there are further comments, and > I'll promptly address them as well. Otherwise, > I feel this is ready for inclusion Just out of curiosity, one thing I've noticed about dentries is that in general at any given point in time most dentries are unused. Under the workloads I'm testing, even when I have a million cached dentries, I only have roughly 7,000 accounted as used. That is, most of the dentries in the system are on a LRU and accounted in sb->s_nr_dentry_unused of their owner superblock. So rather than introduce a bunch of new infrastructure to track the number of dentries allocated, why not simply limit the number of dentries allowed on the LRU? We already track that, and the shrinker already operates on the LRU, so we don't really need any new infrastructure. The limiting can be done lazily - we don't need to limit the growth of dentries until we start to run out of memory. If the superblock shrinker is aware of the limits, then when it gets called by memory reclaim it can do all the work of reducing the number of items on the LRU down to the threshold at that time. 
IOWs, the limit has no impact on performance until memory is scarce, at which time memory reclaim enforces the limits on LRU size and clean up happens automatically. This also avoids all the problems of setting a limit lower than the number of active dentries required for the workload (i.e. avoids spurious ENOMEM errors trying to allocate dentries), allows overcommitment when memory is plentiful (which will benefit performance) but it brings the caches back to defined limits when memory is not plentiful (which solves the problem you are having). That seems like a much better solution to me, because it doesn't impact the normal working of workloads in the containers and avoids having to work out what the correct minimum size of the cache for each workload is. It's much easier to work out how many extra cached dentries you want the container to be able to have when memory is scarce... What do you think? Cheers, Dave. -- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v3 0/4] Per-container dcache limitation 2011-08-17 5:43 ` Dave Chinner @ 2011-08-17 18:44 ` Glauber Costa 2011-08-17 18:44 ` Glauber Costa 1 sibling, 0 replies; 65+ messages in thread From: Glauber Costa @ 2011-08-17 18:44 UTC (permalink / raw) To: Dave Chinner Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Pavel Emelyanov, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Hugh Dickins, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Dave Hansen, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On 08/16/2011 10:43 PM, Dave Chinner wrote: > On Sun, Aug 14, 2011 at 07:13:48PM +0400, Glauber Costa wrote: >> Hello, >> >> This series is just like v2, except it addresses >> Eric's comments regarding percpu variables. >> >> Let me know if there are further comments, and >> I'll promply address them as well. Otherwise, >> I feel this is ready for inclusion Hi David, I am not answering everything now, since I'm travelling, but let me get to this one: > Just out of couriousity, one thing I've noticed about dentries is > that in general at any given point in time most dentries are unused. > Under the workloads I'm testing, even when I have a million cached > dentries, I only have roughly 7,000 accounted as used. That is, most > of the dentries in the system are on a LRU and accounted in > sb->s_nr_dentry_unused of their owner superblock. > > So rather than introduce a bunch of new infrastructure to track the > number of dentries allocated, why not simply limit the number of > dentries allowed on the LRU? We already track that, and the shrinker > already operates on the LRU, so we don't really need any new > infrastructure. Because this only works well for cooperative workloads. And we can't really assume that in the virtualization world. One container can come up with a bogus workload - not even hard to write - that has the sole purpose of punishing every resource sharer of him. That's why we're putting limits on pinnable kernel memory. 
Normal workloads won't do. > > The limiting can be lazily - we don't need to limit the growth of > dentries until we start to run out of memory. If the superblock > shrinker is aware of the limits, then when it gets called by memory > reclaim it can do all the work of reducing the number of items on > the LRU down to the threshold at that time. Well, this idea itself can be considered, independent of which path we're taking. We can, if we want, allow the dentry cache to grow indefinitely if we're out of memory pressure. But it kinda defies the purpose of a hard limit... If we allow the dcache to grow over the hard cap just because memory is plentiful, then when it is *not* plentiful, we might be full of pinned entries that we can't release - and screw up somebody else's container, that behaved well all the time. > > IOWs, the limit has no impact on performance until memory is scarce, > at which time memory reclaim enforces the limits on LRU size and > clean up happens automatically. > > This also avoids all the problems of setting a limit lower than the > number of active dentries required for the workload (i.e. avoids > spurious ENOMEM errors trying to allocate dentries), allows > overcommitment when memory is plentiful (which will benefit > performance) but it brings the caches back to defined limits when > memory is not plentiful (which solves the problem you are having). No, this is not really the problem we're having. See above. About ENOMEM, I don't really see what's wrong with them here. For a container, running out of his assigned kernel memory, should be exactly the same as running out of real physical memory. I do agree that it changes the feeling of the system a little bit, because it then happens more often. But it is still right in principle. 
> > That seems like a much better solution to me, because it doesn't > impact the normal working of workloads in the containers and avoids > having to work out what the correct minimum size of the cache for > each workload is. It's much easier to work out how many extra cached > dentries you want the container to be able to have when memory is > scarce... > > What do you think? I think that for the reasons I went through above, dcache limiting and LRU size capping are really not equivalent in terms of solving this problem. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v3 0/4] Per-container dcache limitation 2011-08-17 18:44 ` Glauber Costa @ 2011-08-18 1:27 ` Dave Chinner [not found] ` <4E4C0C25.3-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 1 sibling, 0 replies; 65+ messages in thread From: Dave Chinner @ 2011-08-18 1:27 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel, Dave Hansen, James Bottomley On Wed, Aug 17, 2011 at 11:44:53AM -0700, Glauber Costa wrote: > On 08/16/2011 10:43 PM, Dave Chinner wrote: > >On Sun, Aug 14, 2011 at 07:13:48PM +0400, Glauber Costa wrote: > >>Hello, > >> > >>This series is just like v2, except it addresses > >>Eric's comments regarding percpu variables. > >> > >>Let me know if there are further comments, and > >>I'll promptly address them as well. Otherwise, > >>I feel this is ready for inclusion > > Hi David, > > I am not answering everything now, since I'm travelling, but let me > get to this one: > > >Just out of curiosity, one thing I've noticed about dentries is > >that in general at any given point in time most dentries are unused. > >Under the workloads I'm testing, even when I have a million cached > >dentries, I only have roughly 7,000 accounted as used. That is, most > >of the dentries in the system are on an LRU and accounted in > >sb->s_nr_dentry_unused of their owner superblock. > > > >So rather than introduce a bunch of new infrastructure to track the > >number of dentries allocated, why not simply limit the number of > >dentries allowed on the LRU? We already track that, and the shrinker > >already operates on the LRU, so we don't really need any new > >infrastructure. > Because this only works well for cooperative workloads. And we can't > really assume that in the virtualization world. One container can > come up with a bogus workload - not even hard to write - whose > sole purpose is to punish everyone it shares resources with. 
Sure, but as I've said before you can prevent the container from consuming too many dentries (via a hard limit) simply by adding an inode quota per container. This is exactly the sort of uncooperative behaviour filesystem quotas were invented to prevent. Perhaps we should separate the DOS case from the normal (co-operative) use case. As I mentioned previously, your inode-allocation-based DOS example (while (1); mkdir x; cd x; done type cases) is trivial to prevent with quotas. It was claimed that it was not possible to prevent with filesystem quotas; I left proving that as an exercise for the reader, but I feel I need to reiterate my point with an example. That is, if you can't create a million inodes in the container, you can't instantiate a million dentries in the container. For example, use project quotas on XFS to create directory tree containers with hard limits on the number of inodes: $ cat /etc/projects 12345:/mnt/scratch/projects/foo $ cat /etc/projid foo:12345 $ sudo mount -o prjquota,delaylog,nobarrier,logbsize=262144,inode64 /dev/vda /mnt/scratch $ mkdir -p /mnt/scratch/projects/foo $ sudo xfs_quota -x -c "project -s foo" /mnt/scratch Setting up project foo (path /mnt/scratch/projects/foo)... Setting up project foo (path /mnt/scratch/projects/foo)... Processed 2 (/etc/projects and cmdline) paths for project foo with recursion depth infinite (-1). 
$ sudo xfs_quota -x -c "limit -p ihard=1436 foo" /mnt/scratch $ sudo xfs_quota -c "quota -p foo" /mnt/scratch $ cd /mnt/scratch/projects/foo/ $ ~/src/fs_mark-3.3/dir-depth count 1435, err -1, err Disk quota exceeded pwd /mnt/scratch/projects/foo/x/x/x/x/x/x/x/x/[... roughly 1400 further /x path components elided ...]/x/x $ It stopped at 1435 directories because the container (/mnt/scratch/project/foo) ran out of inodes in its quota. No DOS there. And rather than an ENOMEM error (which could be caused by anything) the error is EDQUOT, which is a clear indication that a resource limit has been hit. That's a far better failure from a user perspective because they know -exactly- why their application failed - the container resource limits are too low.... IOWs, you don't need to touch the dentry cache -at all- to provide the per-subtree hard resource limiting you are trying to achieve - filesystem quotas can already achieve that for you. Project quotas used in this manner (as directory tree quotas) provide exactly the "per-subtree" hard resource limiting that you were trying to achieve with your original dentry mobs proposal. 
> >The limiting can be done lazily - we don't need to limit the growth of > >dentries until we start to run out of memory. If the superblock > >shrinker is aware of the limits, then when it gets called by memory > >reclaim it can do all the work of reducing the number of items on > >the LRU down to the threshold at that time. > > Well, this idea itself can be considered, independent of which path > we're taking. We can, if we want, allow the dentry cache to grow > indefinitely if there is no memory pressure. But it kinda defies > the > purpose of a hard limit... See my comments above about filesystem quotas providing hard limits. > >IOWs, the limit has no impact on performance until memory is scarce, > >at which time memory reclaim enforces the limits on LRU size and > >clean up happens automatically. > > > >This also avoids all the problems of setting a limit lower than the > >number of active dentries required for the workload (i.e. avoids > >spurious ENOMEM errors trying to allocate dentries), allows > >overcommitment when memory is plentiful (which will benefit > >performance) but it brings the caches back to defined limits when > >memory is not plentiful (which solves the problem you are having). > No, this is not really the problem we're having. > See above. > > About ENOMEM, I don't really see what's wrong with them here. Your backup program runs inside the container. Filesystem traversal balloons the dentry cache footprint, and so it is likely to trigger spurious ENOMEM when trying to read files in the container because it can't allocate a dentry for random files as it traverses. That'll be fun when it comes to restoring backups and discovering they aren't complete.... There's also the "WTF caused the ENOMEM error" problem I mentioned earlier.... > For a > container, running out of its assigned kernel memory should be > exactly the same as running out of real physical memory. 
I do agree > that it changes the feeling of the system a little bit, because it > then happens more often. But it is still right in principle. The difference is the degree - when the system runs out of memory, it tries -really hard- before failing the allocation. Indeed, it'll swap, it'll free memory in other subsystems, it'll back off on disk congestion, it will try multiple times to free memory, escalating priority each time it retries. IOWs, it jumps through all sorts of hoops to free memory before it finally fails. And then the memory, more often than not, comes from some subsystem other than the dentry cache, so it is rare that a dentry allocation actually relies on the dentry cache (and only the dentry cache) being shrunk to provide memory for the new dentry. Your dentry hard limit has no fallback or other mechanisms to try - if the VFS caches cannot be shrunk immediately, then ENOMEM will occur. There are no retries, there's no waiting for disk congestion to clear, there's no backoff, there's no increase in reclaim desperation as previous attempts to free dentries fail. This greatly increases the chances of ENOMEM from _d_alloc() way above what a normal machine would see, because it doesn't have any of the functionality that memory reclaim has. And, fundamentally, that sort of complexity doesn't belong in the dentry cache... Another interesting case to consider is internally fragmented dentry cache slabs, where the active population of the pages is sparse. This sort of population density is quite common on machines with sustained long term multiple workload usage (exactly what you'd expect on a containerised system). Hence dentry allocation can be done without increasing memory footprint at all. Likewise, freeing dentries won't free any memory at all. In this case, what has your hard limit bought you? 
These are the sorts of corner case problems that hard limits on cache sizes have. That's the problem I see with the hard limit approach: it looks simple, but it is full of corner cases when you look more deeply. Users are going to hit these corner cases and want to fix those "problems". We'll have to try to fix them without really even being able to reproduce them reliably. We'll end up growing heurisitics to try to detect when problems are about to happen, and complexity to try to avoid those corner case problems. We'll muddle along with something that sort of works for the cases we can reproduce, but ultimately is untestable and unverifiable. In contrast, a lazy lru limiting solution is simple to implement and verify and has none of the warts that hard limiting exposes to user applications. Hence I'd prefer to avoid all the warts of hard limiting by ignoring the DOS case that leads to requiring a hard limit as it can be solved by other existing means. Limiting the size of the inactive cache (generally dominates cache usage) seems like a much lower impact manner of acheiving the same thing. Like I said previously - I've had people asking me whether limiting the size of the inode cache is possible for the past 5 years, and all their use cases are solved by the lazy mechanism I described. I think that most of the OpenVZ dcache size problems will also go away with the lazy solution as well, as most workloads with a large dentry cache footprint don't actively reference (and therefore pin) the entire working set at the same time.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v3 0/4] Per-container dcache limitation 2011-08-18 1:27 ` Dave Chinner (?) @ 2011-08-22 11:42 ` Glauber Costa -1 siblings, 0 replies; 65+ messages in thread From: Glauber Costa @ 2011-08-22 11:42 UTC (permalink / raw) To: Dave Chinner Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Pavel Emelyanov, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Hugh Dickins, linux-kernel-u79uwXL29TY76Z2rM5mHXA, James Bottomley, Dave Hansen, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On 08/17/2011 10:27 PM, Dave Chinner wrote: > On Wed, Aug 17, 2011 at 11:44:53AM -0700, Glauber Costa wrote: >> On 08/16/2011 10:43 PM, Dave Chinner wrote: >>> On Sun, Aug 14, 2011 at 07:13:48PM +0400, Glauber Costa wrote: >>>> Hello, >>>> >>>> This series is just like v2, except it addresses >>>> Eric's comments regarding percpu variables. >>>> >>>> Let me know if there are further comments, and >>>> I'll promptly address them as well. Otherwise, >>>> I feel this is ready for inclusion >> >> Hi David, >> >> I am not answering everything now, since I'm travelling, but let me >> get to this one: >> >>> Just out of curiosity, one thing I've noticed about dentries is >>> that in general at any given point in time most dentries are unused. >>> Under the workloads I'm testing, even when I have a million cached >>> dentries, I only have roughly 7,000 accounted as used. That is, most >>> of the dentries in the system are on an LRU and accounted in >>> sb->s_nr_dentry_unused of their owner superblock. >>> >>> So rather than introduce a bunch of new infrastructure to track the >>> number of dentries allocated, why not simply limit the number of >>> dentries allowed on the LRU? We already track that, and the shrinker >>> already operates on the LRU, so we don't really need any new >>> infrastructure. >> Because this only works well for cooperative workloads. And we can't >> really assume that in the virtualization world. 
One container can >> come up with a bogus workload - not even hard to write - whose >> sole purpose is to punish everyone it shares resources with. > > Sure, but as I've said before you can prevent the container from > consuming too many dentries (via a hard limit) simply by adding an > inode quota per container. This is exactly the sort of > uncooperative behaviour filesystem quotas were invented to > prevent. > > Perhaps we should separate the DOS case from the normal > (co-operative) use case. > > As I mentioned previously, your inode-allocation-based DOS (while > (1); mkdir x; cd x; done type cases) example is trivial to prevent > with quotas. It was claimed that it was not possible to prevent with > filesystem quotas; I left proving that as an exercise for the > reader, but I feel I need to reiterate my point with an example. > > That is, if you can't create a million inodes in the container, you > can't instantiate a million dentries in the container. For example, > use project quotas on XFS to create directory tree containers with > hard limits on the number of inodes: David, The dentry -> inode relationship is an N:1 relationship. Therefore, it is hard to believe that your example below would still work if we were trying to fill the cache through link operations instead of operations like mkdir, which enforce a 1:1 relationship. Capping the dentry numbers, OTOH, caps the number of inodes as well. Although we *can* have inodes lying around in the caches without an associated dentry at some point in time, we cannot have inodes *pinned* in the cache without an associated dentry, so they will go away soon enough. So maybe it is the other way around here, and it is people who want inode capping who should model it after dentry cache capping. 
> > $ cat /etc/projects > 12345:/mnt/scratch/projects/foo > $ cat /etc/projid > foo:12345 > $ sudo mount -o prjquota,delaylog,nobarrier,logbsize=262144,inode64 /dev/vda /mnt/scratch > $ mkdir -p /mnt/scratch/projects/foo > $ sudo xfs_quota -x -c "project -s foo" /mnt/scratch > Setting up project foo (path /mnt/scratch/projects/foo)... > Setting up project foo (path /mnt/scratch/projects/foo)... > Processed 2 (/etc/projects and cmdline) paths for project foo with recursion depth infinite (-1). > $ sudo xfs_quota -x -c "limit -p ihard=1436 foo" /mnt/scratch > $ sudo xfs_quota -c "quota -p foo" /mnt/scratch > $ cd /mnt/scratch/projects/foo/ > $ ~/src/fs_mark-3.3/dir-depth > count 1435, err -1, err Disk quota exceeded pwd /mnt/scratch/projects/foo/x/x/x/x/x/x/x/x/[... roughly 1400 further /x path components elided ...]/x/x > $ > > It stopped at 1435 directories 
because the container > (/mnt/scratch/project/foo) ran out of inodes in its quota. No DOS > there. As said above, mkdir enforces a 1:1 relationship (because directories can't be hard linked) that can't be guaranteed in the general case. In the general case, one can have a dentry cache bigger than Linus' ego while instantiating only one inode in the process. > And rather than an ENOMEM error (could be caused by > anything) the error is EDQUOT, which is a clear indication that a > resource limit has been hit. That's a far better failure from a user > perspective because they know -exactly- why their application > failed - the container resource limits are too low.... Well, if this is the problem, I am happy returning EDQUOT if we fail to find room for more dentries, or anything else we can agree upon instead of ENOMEM. > > IOWs, you don't need to touch the dentry cache -at all- to provide > the per-subtree hard resource limiting you are trying to achieve - > filesystem quotas can already achieve that for you. Project quotas > used in this manner (as directory tree quotas) provide exactly the > "per-subtree" hard resource limiting that you were trying to achieve > with your original dentry mobs proposal. Well, I myself am done for now with the per-subtree proposal. I have been completely fine with per-sb for a while now. > >>> The limiting can be done lazily - we don't need to limit the growth of >>> dentries until we start to run out of memory. If the superblock >>> shrinker is aware of the limits, then when it gets called by memory >>> reclaim it can do all the work of reducing the number of items on >>> the LRU down to the threshold at that time. >> >> Well, this idea itself can be considered, independent of which path >> we're taking. We can, if we want, allow the dentry cache to grow >> indefinitely while we're free of memory pressure. But it kinda defies >> the >> purpose of a hard limit... > > See my comments above about filesystem quotas providing hard limits. 
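[The hard-link loophole argued above can be demonstrated without any quota setup at all. A minimal sketch (paths and counts are illustrative, not from the thread): an inode quota of one does not stop a task from instantiating an arbitrary number of directory entries, and hence dentries, against that single inode.]

```shell
# One inode, many dentries: hard links let a single inode back
# arbitrarily many names, so an inode quota alone cannot cap
# dentry-cache growth the way a dentry limit would.
demo=$(mktemp -d)
cd "$demo"
touch target                    # allocates exactly one inode
i=1
while [ "$i" -le 1000 ]; do
    ln target "link-$i"         # each link adds a dentry, no new inode
    i=$((i + 1))
done
stat -c %h target               # prints 1001: one inode, 1001 names
```

[Under an XFS project quota with `ihard=1`, the loop above would still run to completion, since only one inode is ever charged.]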
> >>> IOWs, the limit has no impact on performance until memory is scarce, >>> at which time memory reclaim enforces the limits on LRU size and >>> clean up happens automatically. >>> >>> This also avoids all the problems of setting a limit lower than the >>> number of active dentries required for the workload (i.e. avoids >>> spurious ENOMEM errors trying to allocate dentries), allows >>> overcommitment when memory is plentiful (which will benefit >>> performance) but it brings the caches back to defined limits when >>> memory is not plentiful (which solves the problem you are having). >> No, this is not really the problem we're having. >> See above. >> >> About ENOMEM, I don't really see what's wrong with it here. > > Your backup program runs inside the container. Filesystem traversal > balloons the dentry cache footprint, and so it is likely to trigger > spurious ENOMEM when trying to read files in the container because > it can't allocate a dentry for random files as it traverses. That'll > be fun when it comes to restoring backups and discovering they > aren't complete.... > > There's also the "WTF caused the ENOMEM error" problem I mentioned > earlier.... > >> For a >> container, running out of its assigned kernel memory should be >> exactly the same as running out of real physical memory. I do agree >> that it changes the feeling of the system a little bit, because it >> then happens more often. But it is still right in principle. > > The difference is the degree - when the system runs out of memory, > it tries -really hard- before failing the allocation. Indeed, it'll > swap, it'll free memory in other subsystems, it'll back off on disk > congestion, it will try multiple times to free memory, escalating > priority each time it retries. IOWs, it jumps through all sorts of > hoops to free memory before it finally fails. 
And then the memory, > more often than not, comes from some subsystem other than the dentry > cache, so it is rare that a dentry allocation actually relies on the > dentry cache (and only the dentry cache) being shrunk to provide > memory for the new dentry. This is an apples-to-oranges comparison. If instead of using the mechanism I proposed, we go for a quota-based mechanism like you mentioned, we'll fail just as often. Just with EDQUOT instead of ENOMEM. Hmm... while writing that, I just looked back at the code, and it seems it will be hard to return anything but ENOMEM, since it is part of the interface contract. OTOH, the inode allocation function goes for the same kind of contract - returning NULL in case of error - meaning that everything of this kind that does not involve the filesystem will end up the same way. Be it in the icache, or dcache. > Your dentry hard limit has no fallback or other mechanisms to try > - if the VFS caches cannot be shrunk immediately, then ENOMEM will > occur. There's no retries, there's no waiting for disk congestion > to clear, there's no backoff, there's no increase in reclaim > desperation as previous attempts to free dentries fail. This greatly > increases the chances of ENOMEM from _d_alloc() way above what a > normal machine would see because it doesn't have any of the > functionality that memory reclaim has. And, fundamentally, that sort > of complexity doesn't belong in the dentry cache... I don't see how/why a user application should care. An error means "Hey Mr. Userspace, something wrong happened", not "Hey Mr. Userspace, sorry, we tried really hard, but still could not do it". As far as a container is concerned, the only way to mimic the behavior you described would be to allow a single container to use up at most X bytes of general kernel memory. So when the dcache reaches the wall, it can borrow from somewhere else. Not that I am considering this... 
> > Another interesting case to consider is internally fragmented dentry > cache slabs, where the active population of the pages is sparse. > This sort of population density is quite common on machines with > sustained long term multiple workload usage (exactly what you'd > expect on a containerised system). Hence dentry allocation can be > done without increasing memory footprint at all. Likewise, freeing > dentries won't free any memory at all. In this case, what has your > hard limit bought you? An ENOMEM error in a situation where memory > allocation is actually free from a resource consumption perspective. Again, I don't think those hard limits are about used memory. So even if freeing a dentry may not free memory, I'm still fine with that. If the kernel as a whole needs memory later, it can do something to reclaim it. *Unless* I hold a dentry reference. So the solution to me seems to be not allowing more than X to be held in the first place. Again: I couldn't care less about how much *memory* it is actually using at a certain point in time. > > These are the sorts of corner case problems that hard limits on > cache sizes have. That's the problem I see with the hard limit > approach: it looks simple, but it is full of corner cases when you > look more deeply. Users are going to hit these corner cases and > want to fix those "problems". We'll have to try to fix them without > really even being able to reproduce them reliably. We'll end up > growing heuristics to try to detect when problems are about to > happen, and complexity to try to avoid those corner case problems. > We'll muddle along with something that sort of works for the cases > we can reproduce, but ultimately is untestable and unverifiable. In > contrast, a lazy LRU limiting solution is simple to implement and > verify and has none of the warts that hard limiting exposes to user > applications. What you described does not seem to me like a corner case. 
"By using this option you can't use more than X entries; if you do, you'll fail" sounds pretty precise to me. > > Hence I'd prefer to avoid all the warts of hard limiting by ignoring > the DOS case that leads to requiring a hard limit as it can be > solved by other existing means. Limiting the size of the inactive > cache (which generally dominates cache usage) seems like a much lower > impact manner of achieving the same thing. Again, I understand you, but I don't think we're really solving the same thing. > Like I said previously - I've had people asking me whether limiting > the size of the inode cache is possible for the past 5 years, and > all their use cases are solved by the lazy mechanism I described. I > think that most of the OpenVZ dcache size problems will also go away > with the lazy solution as well, as most workloads with a large > dentry cache footprint don't actively reference (and therefore pin) > the entire working set at the same time.... Except for the malicious ones, of course. > Cheers, Thank you very much for your time, Dave! ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v3 0/4] Per-container dcache limitation @ 2011-08-22 11:42 ` Glauber Costa 0 siblings, 0 replies; 65+ messages in thread From: Glauber Costa @ 2011-08-22 11:42 UTC (permalink / raw) To: Dave Chinner Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel, Dave Hansen, James Bottomley On 08/17/2011 10:27 PM, Dave Chinner wrote: > On Wed, Aug 17, 2011 at 11:44:53AM -0700, Glauber Costa wrote: >> On 08/16/2011 10:43 PM, Dave Chinner wrote: >>> On Sun, Aug 14, 2011 at 07:13:48PM +0400, Glauber Costa wrote: >>>> Hello, >>>> >>>> This series is just like v2, except it addresses >>>> Eric's comments regarding percpu variables. >>>> >>>> Let me know if there are further comments, and >>>> I'll promply address them as well. Otherwise, >>>> I feel this is ready for inclusion >> >> Hi David, >> >> I am not answering everything now, since I'm travelling, but let me >> get to this one: >> >>> Just out of couriousity, one thing I've noticed about dentries is >>> that in general at any given point in time most dentries are unused. >>> Under the workloads I'm testing, even when I have a million cached >>> dentries, I only have roughly 7,000 accounted as used. That is, most >>> of the dentries in the system are on a LRU and accounted in >>> sb->s_nr_dentry_unused of their owner superblock. >>> >>> So rather than introduce a bunch of new infrastructure to track the >>> number of dentries allocated, why not simply limit the number of >>> dentries allowed on the LRU? We already track that, and the shrinker >>> already operates on the LRU, so we don't really need any new >>> infrastructure. >> Because this only works well for cooperative workloads. And we can't >> really assume that in the virtualization world. One container can >> come up with a bogus workload - not even hard to write - that has >> the sole purpose of punishing every resource sharer of him. 
> > Sure, but as I've said before you can prevent the container from > consuming too many dentries (via a hard limit) simply by adding a > inode quota per container. This is exactly the sort of > uncooperative behaviour filesystem quotas were invented to > prevent. > > Perhaps we should separate the DOS case from the normal > (co-operative) use case. > > As i mentioned previously, your inode allocation based DOS (while > (1); mkdir x; cd x; done type cases) example is trivial to prevent > with quotas. It was claimed that is was not possible to prevent with > filesystem quotas, I left proving that as an exercise for the > reader,but I feel I need to re-iterate my point with an example. > > That is, if you can't create a million inodes in the container, you > can't instantiate a million dentries in the container. For example, > use project quotas on XFS to create directory tree containers with > hard limits on the number of inodes: David, The dentry -> inode relationship is a N:1 relationship. Therefore, it is hard to believe that your example below would still work if we were trying to fill the cache through link operations, instead of operations like mkdir, that enforce a 1:1 relationship. Caping the dentry numbers, OTOH, caps the # of inodes as well. Although we *do* can have inodes lying around in the caches without an associated dentry at some point in time, we cannot have inodes *pinned* into the cache without an associated dentry. So they will soon enough go away. So maybe it is the other way around here, and it is people that wants an inode capping that should model it after a dentry cache capping. > > $ cat /etc/projects > 12345:/mnt/scratch/projects/foo > $ cat /etc/projid > foo:12345 > $ sudo mount -o prjquota,delaylog,nobarrier,logbsize=262144,inode64 /dev/vda /mnt/scratch > $ mkdir -p /mnt/scratch/projects/foo > $ sudo xfs_quota -x -c "project -s foo" /mnt/scratch > Setting up project foo (path /mnt/scratch/projects/foo)... 
> Setting up project foo (path /mnt/scratch/projects/foo)...
> Processed 2 (/etc/projects and cmdline) paths for project foo with
> recursion depth infinite (-1).
> $ sudo xfs_quota -x -c "limit -p ihard=1436 foo" /mnt/scratch
> $ sudo xfs_quota -c "quota -p foo" /mnt/scratch
> $ cd /mnt/scratch/projects/foo/
> $ ~/src/fs_mark-3.3/dir-depth
> count 1435, err -1, err Disk quota exceeded
> pwd /mnt/scratch/projects/foo/x/x/x/[... 1435 levels of "x/" elided ...]
> $
>
> It stopped at 1435 directories because the container
> (/mnt/scratch/project/foo) ran out of inodes in its quota. No DOS
> there.

As said above, mkdir enforces a 1:1 relationship (because directories
can't be hard linked) that can't be guaranteed in the general case. For
the general case, one can have a dentry cache bigger than Linus' ego
while instantiating only one inode in the process.

> And rather than an ENOMEM error (which could be caused by anything),
> the error is EDQUOT, which is a clear indication that a resource
> limit has been hit. That's a far better failure from a user
> perspective because they know -exactly- why their application
> failed - the container resource limits are too low....

Well, if this is the problem, I am happy returning EDQUOT if we fail to
find room for more dentries, or anything else we can agree upon instead
of ENOMEM.

> IOWs, you don't need to touch the dentry cache -at all- to provide
> the per-subtree hard resource limiting you are trying to achieve -
> filesystem quotas can already achieve that for you. Project quotas
> used in this manner (as directory tree quotas) provide exactly the
> "per-subtree" hard resource limiting that you were trying to achieve
> with your original dentry mobs proposal.

Well, I myself am done for now with the per-subtree proposal. I have
been completely fine with per-sb for a while now.

>>> The limiting can be done lazily - we don't need to limit the growth
>>> of dentries until we start to run out of memory. If the superblock
>>> shrinker is aware of the limits, then when it gets called by memory
>>> reclaim it can do all the work of reducing the number of items on
>>> the LRU down to the threshold at that time.
>>
>> Well, this idea itself can be considered, independent of which path
>> we're taking. We can, if we want, allow the dentry cache to grow
>> indefinitely if we're out of memory pressure. But it kinda defies
>> the purpose of a hard limit...
>
> See my comments above about filesystem quotas providing hard limits.
>>> IOWs, the limit has no impact on performance until memory is
>>> scarce, at which time memory reclaim enforces the limits on LRU
>>> size and clean up happens automatically.
>>>
>>> This also avoids all the problems of setting a limit lower than the
>>> number of active dentries required for the workload (i.e. avoids
>>> spurious ENOMEM errors trying to allocate dentries), allows
>>> overcommitment when memory is plentiful (which will benefit
>>> performance) but it brings the caches back to defined limits when
>>> memory is not plentiful (which solves the problem you are having).
>>
>> No, this is not really the problem we're having. See above.
>>
>> About ENOMEM, I don't really see what's wrong with them here.
>
> Your backup program runs inside the container. Filesystem traversal
> balloons the dentry cache footprint, and so it is likely to trigger
> spurious ENOMEM when trying to read files in the container because
> it can't allocate a dentry for random files as it traverses. That'll
> be fun when it comes to restoring backups and discovering they
> aren't complete....
>
> There's also the "WTF caused the ENOMEM error" problem I mentioned
> earlier....
>
>> For a container, running out of its assigned kernel memory should be
>> exactly the same as running out of real physical memory. I do agree
>> that it changes the feel of the system a little bit, because it then
>> happens more often. But it is still right in principle.
>
> The difference is the degree - when the system runs out of memory,
> it tries -really hard- before failing the allocation. Indeed, it'll
> swap, it'll free memory in other subsystems, it'll back off on disk
> congestion, and it will try multiple times to free memory, escalating
> priority each time it retries. IOWs, it jumps through all sorts of
> hoops to free memory before it finally fails. And then the memory,
> more often than not, comes from some subsystem other than the dentry
> cache, so it is rare that a dentry allocation actually relies on the
> dentry cache (and only the dentry cache) being shrunk to provide
> memory for the new dentry.

This is an apples-to-oranges comparison. If, instead of the mechanism I
proposed, we go for a quota-based mechanism like you mentioned, we'll
fail just as often - just with EDQUOT instead of ENOMEM.

Hmm... while writing that I looked back at the code, and it seems it
will be hard to return anything but ENOMEM, since returning NULL on
error is part of the interface contract. OTOH, the inode allocation
function follows the same kind of contract - returning NULL in case of
error - meaning that everything of this kind that does not involve the
filesystem will end up the same way, be it in the icache or the dcache.

> Your dentry hard limit has no fallback or other mechanisms to try -
> if the VFS caches cannot be shrunk immediately, then ENOMEM will
> occur. There are no retries, there's no waiting for disk congestion
> to clear, there's no backoff, there's no increase in reclaim
> desperation as previous attempts to free dentries fail. This greatly
> increases the chances of ENOMEM from __d_alloc() way above what a
> normal machine would see, because it doesn't have any of the
> functionality that memory reclaim has.

I don't see how/why a user application should care. An error means "Hey
Mr. Userspace, something wrong happened", not "Hey Mr. Userspace,
sorry, we tried really hard, but still could not do it".

As far as a container is concerned, the only way to mimic the behavior
you described would be to allow a single container to use up at most X
bytes of general kernel memory, so that when the dcache hits the wall
it can borrow from somewhere else. Not that I am considering this...
> Another interesting case to consider is internally fragmented dentry
> cache slabs, where the active population of the pages is sparse.
> This sort of population density is quite common on machines with
> sustained long term multiple workload usage (exactly what you'd
> expect on a containerised system). Hence dentry allocation can be
> done without increasing memory footprint at all. Likewise, freeing
> dentries won't free any memory at all. In this case, what has your
> hard limit bought you? An ENOMEM error in a situation where memory
> allocation is actually free from a resource consumption perspective.

Again, I don't think those hard limits are about used memory. So if
freeing a dentry may not free memory, I'm still fine with that. If the
kernel as a whole needs memory later, it can do something to reclaim
it. *Unless* I hold a dentry reference. So the solution to me seems to
be not allowing more than X to be held in the first place. Again: I
couldn't care less about how much *memory* it is actually using at a
certain point in time.

> These are the sorts of corner case problems that hard limits on
> cache sizes have. That's the problem I see with the hard limit
> approach: it looks simple, but it is full of corner cases when you
> look more deeply. Users are going to hit these corner cases and
> want to fix those "problems". We'll have to try to fix them without
> really even being able to reproduce them reliably. We'll end up
> growing heuristics to try to detect when problems are about to
> happen, and complexity to try to avoid those corner case problems.
> We'll muddle along with something that sort of works for the cases
> we can reproduce, but ultimately is untestable and unverifiable. In
> contrast, a lazy LRU limiting solution is simple to implement and
> verify and has none of the warts that hard limiting exposes to user
> applications.

What you described does not seem like a corner case to me. "By using
this option you can't use more than X entries; if you do, you'll fail"
sounds pretty precise to me.

> Hence I'd prefer to avoid all the warts of hard limiting by ignoring
> the DOS case that leads to requiring a hard limit, as it can be
> solved by other existing means. Limiting the size of the inactive
> cache (which generally dominates cache usage) seems like a much
> lower impact manner of achieving the same thing.

Again, I understand you, but I don't think we're really solving the
same thing.

> Like I said previously - I've had people asking me whether limiting
> the size of the inode cache is possible for the past 5 years, and
> all their use cases are solved by the lazy mechanism I described. I
> think that most of the OpenVZ dcache size problems will also go away
> with the lazy solution as well, as most workloads with a large
> dentry cache footprint don't actively reference (and therefore pin)
> the entire working set at the same time....

Except for the malicious ones, of course.

> Cheers,

Thank you very much for your time, Dave!
* Re: [PATCH v3 0/4] Per-container dcache limitation
[not found] ` <4E4C0C25.3-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-08-18  1:27 ` Dave Chinner
0 siblings, 0 replies; 65+ messages in thread
From: Dave Chinner @ 2011-08-18  1:27 UTC (permalink / raw)
To: Glauber Costa
Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Pavel Emelyanov,
    containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    Hugh Dickins, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    James Bottomley, Dave Hansen, Al Viro,
    linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Wed, Aug 17, 2011 at 11:44:53AM -0700, Glauber Costa wrote:
> On 08/16/2011 10:43 PM, Dave Chinner wrote:
> > On Sun, Aug 14, 2011 at 07:13:48PM +0400, Glauber Costa wrote:
> >> Hello,
> >>
> >> This series is just like v2, except it addresses
> >> Eric's comments regarding percpu variables.
> >>
> >> Let me know if there are further comments, and
> >> I'll promptly address them as well. Otherwise,
> >> I feel this is ready for inclusion.
>
> Hi David,
>
> I am not answering everything now, since I'm travelling, but let me
> get to this one:
>
> > Just out of curiosity, one thing I've noticed about dentries is
> > that in general, at any given point in time, most dentries are
> > unused. Under the workloads I'm testing, even when I have a million
> > cached dentries, I only have roughly 7,000 accounted as used. That
> > is, most of the dentries in the system are on an LRU and accounted
> > in sb->s_nr_dentry_unused of their owner superblock.
> >
> > So rather than introduce a bunch of new infrastructure to track the
> > number of dentries allocated, why not simply limit the number of
> > dentries allowed on the LRU? We already track that, and the
> > shrinker already operates on the LRU, so we don't really need any
> > new infrastructure.
>
> Because this only works well for cooperative workloads. And we can't
> really assume that in the virtualization world. One container can
> come up with a bogus workload - not even hard to write - that has
> the sole purpose of punishing everyone it shares resources with.

Sure, but as I've said before, you can prevent a container from
consuming too many dentries (via a hard limit) simply by adding an
inode quota per container. This is exactly the sort of uncooperative
behaviour filesystem quotas were invented to prevent.

Perhaps we should separate the DOS case from the normal (co-operative)
use case. As I mentioned previously, your inode-allocation-based DOS
example (the "while (1); mkdir x; cd x; done" type of case) is trivial
to prevent with quotas. It was claimed that it was not possible to
prevent with filesystem quotas; I left proving that as an exercise for
the reader, but I feel I need to re-iterate my point with an example.
That is, if you can't create a million inodes in the container, you
can't instantiate a million dentries in the container.

For example, use project quotas on XFS to create directory tree
containers with hard limits on the number of inodes:

$ cat /etc/projects
12345:/mnt/scratch/projects/foo
$ cat /etc/projid
foo:12345
$ sudo mount -o prjquota,delaylog,nobarrier,logbsize=262144,inode64 /dev/vda /mnt/scratch
$ mkdir -p /mnt/scratch/projects/foo
$ sudo xfs_quota -x -c "project -s foo" /mnt/scratch
Setting up project foo (path /mnt/scratch/projects/foo)...
Setting up project foo (path /mnt/scratch/projects/foo)...
Processed 2 (/etc/projects and cmdline) paths for project foo with
recursion depth infinite (-1).
$ sudo xfs_quota -x -c "limit -p ihard=1436 foo" /mnt/scratch
$ sudo xfs_quota -c "quota -p foo" /mnt/scratch
$ cd /mnt/scratch/projects/foo/
$ ~/src/fs_mark-3.3/dir-depth
count 1435, err -1, err Disk quota exceeded
pwd /mnt/scratch/projects/foo/x/x/x/[... 1435 levels of "x/" elided ...]
$

It stopped at 1435 directories because the container
(/mnt/scratch/project/foo) ran out of inodes in its quota. No DOS
there. And rather than an ENOMEM error (which could be caused by
anything), the error is EDQUOT, which is a clear indication that a
resource limit has been hit. That's a far better failure from a user
perspective because they know -exactly- why their application failed -
the container resource limits are too low....

IOWs, you don't need to touch the dentry cache -at all- to provide the
per-subtree hard resource limiting you are trying to achieve -
filesystem quotas can already achieve that for you. Project quotas
used in this manner (as directory tree quotas) provide exactly the
"per-subtree" hard resource limiting that you were trying to achieve
with your original dentry mobs proposal.

> > The limiting can be done lazily - we don't need to limit the growth
> > of dentries until we start to run out of memory. If the superblock
> > shrinker is aware of the limits, then when it gets called by memory
> > reclaim it can do all the work of reducing the number of items on
> > the LRU down to the threshold at that time.
>
> Well, this idea itself can be considered, independent of which path
> we're taking. We can, if we want, allow the dentry cache to grow
> indefinitely if we're out of memory pressure. But it kinda defies
> the purpose of a hard limit...

See my comments above about filesystem quotas providing hard limits.

> > IOWs, the limit has no impact on performance until memory is
> > scarce, at which time memory reclaim enforces the limits on LRU
> > size and clean up happens automatically.
> >
> > This also avoids all the problems of setting a limit lower than the
> > number of active dentries required for the workload (i.e. avoids
> > spurious ENOMEM errors trying to allocate dentries), allows
> > overcommitment when memory is plentiful (which will benefit
> > performance) but it brings the caches back to defined limits when
> > memory is not plentiful (which solves the problem you are having).
> No, this is not really the problem we're having. See above.
>
> About ENOMEM, I don't really see what's wrong with them here.

Your backup program runs inside the container. Filesystem traversal
balloons the dentry cache footprint, and so it is likely to trigger
spurious ENOMEM when trying to read files in the container because it
can't allocate a dentry for random files as it traverses. That'll be
fun when it comes to restoring backups and discovering they aren't
complete....

There's also the "WTF caused the ENOMEM error" problem I mentioned
earlier....

> For a container, running out of its assigned kernel memory should be
> exactly the same as running out of real physical memory. I do agree
> that it changes the feel of the system a little bit, because it then
> happens more often. But it is still right in principle.

The difference is the degree - when the system runs out of memory, it
tries -really hard- before failing the allocation. Indeed, it'll swap,
it'll free memory in other subsystems, it'll back off on disk
congestion, and it will try multiple times to free memory, escalating
priority each time it retries. IOWs, it jumps through all sorts of
hoops to free memory before it finally fails. And then the memory,
more often than not, comes from some subsystem other than the dentry
cache, so it is rare that a dentry allocation actually relies on the
dentry cache (and only the dentry cache) being shrunk to provide
memory for the new dentry.

Your dentry hard limit has no fallback or other mechanisms to try - if
the VFS caches cannot be shrunk immediately, then ENOMEM will occur.
There are no retries, there's no waiting for disk congestion to clear,
there's no backoff, there's no increase in reclaim desperation as
previous attempts to free dentries fail. This greatly increases the
chances of ENOMEM from __d_alloc() way above what a normal machine
would see, because it doesn't have any of the functionality that
memory reclaim has. And, fundamentally, that sort of complexity
doesn't belong in the dentry cache...

Another interesting case to consider is internally fragmented dentry
cache slabs, where the active population of the pages is sparse. This
sort of population density is quite common on machines with sustained
long term multiple workload usage (exactly what you'd expect on a
containerised system). Hence dentry allocation can be done without
increasing memory footprint at all. Likewise, freeing dentries won't
free any memory at all. In this case, what has your hard limit bought
you? An ENOMEM error in a situation where memory allocation is
actually free from a resource consumption perspective.

These are the sorts of corner case problems that hard limits on cache
sizes have. That's the problem I see with the hard limit approach: it
looks simple, but it is full of corner cases when you look more
deeply. Users are going to hit these corner cases and want to fix
those "problems". We'll have to try to fix them without really even
being able to reproduce them reliably. We'll end up growing heuristics
to try to detect when problems are about to happen, and complexity to
try to avoid those corner case problems. We'll muddle along with
something that sort of works for the cases we can reproduce, but
ultimately is untestable and unverifiable. In contrast, a lazy LRU
limiting solution is simple to implement and verify and has none of
the warts that hard limiting exposes to user applications.

Hence I'd prefer to avoid all the warts of hard limiting by ignoring
the DOS case that leads to requiring a hard limit, as it can be solved
by other existing means. Limiting the size of the inactive cache
(which generally dominates cache usage) seems like a much lower impact
manner of achieving the same thing.

Like I said previously - I've had people asking me whether limiting
the size of the inode cache is possible for the past 5 years, and all
their use cases are solved by the lazy mechanism I described. I think
that most of the OpenVZ dcache size problems will also go away with
the lazy solution as well, as most workloads with a large dentry cache
footprint don't actively reference (and therefore pin) the entire
working set at the same time....

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org