* [PATCH v3 0/4] Per-container dcache limitation
@ 2011-08-14 15:13 ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2011-08-14 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, containers, Pavel Emelyanov, Al Viro,
	Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, James Bottomley, David Chinner, Glauber Costa

Hello,

This series is just like v2, except it addresses
Eric's comments regarding percpu variables.

Let me know if there are further comments, and
I'll promptly address them as well. Otherwise,
I feel this is ready for inclusion.

Glauber Costa (4):
  factor out single-shrinker code
  Keep nr_dentry per super block
  limit nr_dentries per superblock
  parse options in the vfs level

 fs/dcache.c              |   47 ++++++++++++-
 fs/namespace.c           |  105 ++++++++++++++++++++++++++
 fs/super.c               |    5 +
 include/linux/dcache.h   |    4 +
 include/linux/fs.h       |    3 +
 include/linux/shrinker.h |    6 ++
 mm/vmscan.c              |  185 ++++++++++++++++++++++++----------------------
 7 files changed, 267 insertions(+), 88 deletions(-)

-- 
1.7.6


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/4] factor out single-shrinker code
@ 2011-08-14 15:13     ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2011-08-14 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, containers, Pavel Emelyanov, Al Viro,
	Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, James Bottomley, David Chinner, Glauber Costa,
	Eric Dumazet

While shrinking our caches, vmscan.c passes through all
registered shrinkers, trying to free objects as it goes.

We would like to do that individually for some caches,
like the dcache, when certain conditions apply (for
example, when we reach a soon-to-exist maximum allowed size).

To avoid re-writing the same logic in more than one place,
this patch factors the shrink logic out into shrink_one_shrinker(),
which we can call from other places in the kernel.
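
As an illustration of the intended use (a sketch only, not part of this
patch: the predicate and function name below are made up, and patch 3/4
contains the real caller), a path that wants to press on one specific
cache could loop on the helper until it stops making progress:

	/*
	 * Sketch: shrink a single cache until its shrinker reports no
	 * further progress. nr_pages_scanned = 1 with lru_pages = 0
	 * makes the internal scan delta large relative to the cache.
	 */
	static int press_on_one_cache(struct shrinker *my_shrinker)
	{
		struct shrink_control sc = {
			.gfp_mask = GFP_KERNEL,
		};

		while (cache_over_limit()) {	/* hypothetical predicate */
			if (!shrink_one_shrinker(my_shrinker, &sc, 1, 0))
				return -ENOMEM;	/* freed nothing, give up */
		}
		return 0;
	}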

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/shrinker.h |    6 ++
 mm/vmscan.c              |  185 ++++++++++++++++++++++++----------------------
 2 files changed, 104 insertions(+), 87 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 790651b..c5db650 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -39,4 +39,10 @@ struct shrinker {
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 extern void register_shrinker(struct shrinker *);
 extern void unregister_shrinker(struct shrinker *);
+
+unsigned long shrink_one_shrinker(struct shrinker *shrinker,
+				  struct shrink_control *shrink,
+				  unsigned long nr_pages_scanned,
+				  unsigned long lru_pages);
+
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7ef6912..50dfc61 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -211,6 +211,102 @@ static inline int do_shrinker_shrink(struct shrinker *shrinker,
 }
 
 #define SHRINK_BATCH 128
+unsigned long shrink_one_shrinker(struct shrinker *shrinker,
+				  struct shrink_control *shrink,
+				  unsigned long nr_pages_scanned,
+				  unsigned long lru_pages)
+{
+	unsigned long ret = 0;
+	unsigned long long delta;
+	unsigned long total_scan;
+	unsigned long max_pass;
+	int shrink_ret = 0;
+	long nr;
+	long new_nr;
+	long batch_size = shrinker->batch ? shrinker->batch
+					  : SHRINK_BATCH;
+
+	/*
+	 * copy the current shrinker scan count into a local variable
+	 * and zero it so that other concurrent shrinker invocations
+	 * don't also do this scanning work.
+	 */
+	do {
+		nr = shrinker->nr;
+	} while (cmpxchg(&shrinker->nr, nr, 0) != nr);
+
+	total_scan = nr;
+	max_pass = do_shrinker_shrink(shrinker, shrink, 0);
+	delta = (4 * nr_pages_scanned) / shrinker->seeks;
+	delta *= max_pass;
+	do_div(delta, lru_pages + 1);
+	total_scan += delta;
+	if (total_scan < 0) {
+		printk(KERN_ERR "shrink_slab: %pF negative objects to "
+		       "delete nr=%ld\n",
+		       shrinker->shrink, total_scan);
+		total_scan = max_pass;
+	}
+
+	/*
+	 * We need to avoid excessive windup on filesystem shrinkers
+	 * due to large numbers of GFP_NOFS allocations causing the
+	 * shrinkers to return -1 all the time. This results in a large
+	 * nr being built up so when a shrink that can do some work
+	 * comes along it empties the entire cache due to nr >>>
+	 * max_pass.  This is bad for sustaining a working set in
+	 * memory.
+	 *
+	 * Hence only allow the shrinker to scan the entire cache when
+	 * a large delta change is calculated directly.
+	 */
+	if (delta < max_pass / 4)
+		total_scan = min(total_scan, max_pass / 2);
+
+	/*
+	 * Avoid risking looping forever due to too large nr value:
+	 * never try to free more than twice the estimate number of
+	 * freeable entries.
+	 */
+	if (total_scan > max_pass * 2)
+		total_scan = max_pass * 2;
+
+	trace_mm_shrink_slab_start(shrinker, shrink, nr,
+				nr_pages_scanned, lru_pages,
+				max_pass, delta, total_scan);
+
+	while (total_scan >= batch_size) {
+		int nr_before;
+
+		nr_before = do_shrinker_shrink(shrinker, shrink, 0);
+		shrink_ret = do_shrinker_shrink(shrinker, shrink,
+						batch_size);
+		if (shrink_ret == -1)
+			break;
+		if (shrink_ret < nr_before)
+			ret += nr_before - shrink_ret;
+		count_vm_events(SLABS_SCANNED, batch_size);
+		total_scan -= batch_size;
+
+		cond_resched();
+	}
+
+	/*
+	 * move the unused scan count back into the shrinker in a
+	 * manner that handles concurrent updates. If we exhausted the
+	 * scan, there is no need to do an update.
+	 */
+	do {
+		nr = shrinker->nr;
+		new_nr = total_scan + nr;
+		if (total_scan <= 0)
+			break;
+	} while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);
+
+	trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
+	return ret;
+}
+
 /*
  * Call the shrink functions to age shrinkable caches
  *
@@ -247,93 +343,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
-		unsigned long long delta;
-		unsigned long total_scan;
-		unsigned long max_pass;
-		int shrink_ret = 0;
-		long nr;
-		long new_nr;
-		long batch_size = shrinker->batch ? shrinker->batch
-						  : SHRINK_BATCH;
-
-		/*
-		 * copy the current shrinker scan count into a local variable
-		 * and zero it so that other concurrent shrinker invocations
-		 * don't also do this scanning work.
-		 */
-		do {
-			nr = shrinker->nr;
-		} while (cmpxchg(&shrinker->nr, nr, 0) != nr);
-
-		total_scan = nr;
-		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
-		delta = (4 * nr_pages_scanned) / shrinker->seeks;
-		delta *= max_pass;
-		do_div(delta, lru_pages + 1);
-		total_scan += delta;
-		if (total_scan < 0) {
-			printk(KERN_ERR "shrink_slab: %pF negative objects to "
-			       "delete nr=%ld\n",
-			       shrinker->shrink, total_scan);
-			total_scan = max_pass;
-		}
-
-		/*
-		 * We need to avoid excessive windup on filesystem shrinkers
-		 * due to large numbers of GFP_NOFS allocations causing the
-		 * shrinkers to return -1 all the time. This results in a large
-		 * nr being built up so when a shrink that can do some work
-		 * comes along it empties the entire cache due to nr >>>
-		 * max_pass.  This is bad for sustaining a working set in
-		 * memory.
-		 *
-		 * Hence only allow the shrinker to scan the entire cache when
-		 * a large delta change is calculated directly.
-		 */
-		if (delta < max_pass / 4)
-			total_scan = min(total_scan, max_pass / 2);
-
-		/*
-		 * Avoid risking looping forever due to too large nr value:
-		 * never try to free more than twice the estimate number of
-		 * freeable entries.
-		 */
-		if (total_scan > max_pass * 2)
-			total_scan = max_pass * 2;
-
-		trace_mm_shrink_slab_start(shrinker, shrink, nr,
-					nr_pages_scanned, lru_pages,
-					max_pass, delta, total_scan);
-
-		while (total_scan >= batch_size) {
-			int nr_before;
-
-			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
-			shrink_ret = do_shrinker_shrink(shrinker, shrink,
-							batch_size);
-			if (shrink_ret == -1)
-				break;
-			if (shrink_ret < nr_before)
-				ret += nr_before - shrink_ret;
-			count_vm_events(SLABS_SCANNED, batch_size);
-			total_scan -= batch_size;
-
-			cond_resched();
-		}
-
-		/*
-		 * move the unused scan count back into the shrinker in a
-		 * manner that handles concurrent updates. If we exhausted the
-		 * scan, there is no need to do an update.
-		 */
-		do {
-			nr = shrinker->nr;
-			new_nr = total_scan + nr;
-			if (total_scan <= 0)
-				break;
-		} while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);
-
-		trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
+		ret += shrink_one_shrinker(shrinker, shrink,
+					   nr_pages_scanned, lru_pages);
 	}
 	up_read(&shrinker_rwsem);
 out:
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 65+ messages in thread
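
[ Aside on the scan-count arithmetic moved into shrink_one_shrinker()
  above, with made-up numbers: take nr_pages_scanned = 1000,
  shrinker->seeks = DEFAULT_SEEKS = 2, max_pass = 10000 freeable
  objects, and lru_pages = 99999. Then

	delta  = (4 * 1000) / 2   = 2000
	delta *= 10000            = 20000000
	delta /= (99999 + 1)      -> 200 objects

  and since delta (200) < max_pass / 4 (2500), total_scan is clamped to
  at most max_pass / 2 = 5000: this is the windup protection the block
  comment describes, keeping one call from emptying the whole cache. ]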

* [PATCH v3 2/4] Keep nr_dentry per super block
@ 2011-08-14 15:13     ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2011-08-14 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, containers, Pavel Emelyanov, Al Viro,
	Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, James Bottomley, David Chinner, Glauber Costa,
	Eric Dumazet

Now that we have per-sb shrinkers, it makes sense to have nr_dentries
stored per sb as well. We turn them into per-cpu counters so we can
keep accessing them without locking.
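
For reference, the percpu_counter lifecycle this patch relies on looks
roughly like the sketch below (names made up; the check on
percpu_counter_init() anticipates Eric's review downthread):

	#include <linux/percpu_counter.h>

	static struct percpu_counter demo_counter;

	static int demo(void)
	{
		/* allocates per-cpu storage, so it can fail */
		if (percpu_counter_init(&demo_counter, 0))
			return -ENOMEM;

		percpu_counter_inc(&demo_counter);  /* lockless fast path */

		/* cheap, possibly stale estimate */
		pr_info("approx %lld\n",
			(long long)percpu_counter_read_positive(&demo_counter));
		/* exact, but sums across all cpus */
		pr_info("exact %lld\n",
			(long long)percpu_counter_sum_positive(&demo_counter));

		percpu_counter_dec(&demo_counter);
		percpu_counter_destroy(&demo_counter);
		return 0;
	}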

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/dcache.c        |   12 +++++++++++-
 fs/super.c         |    4 ++++
 include/linux/fs.h |    2 ++
 3 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index b05aac3..815d9fd 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -151,7 +151,13 @@ static void __d_free(struct rcu_head *head)
 static void d_free(struct dentry *dentry)
 {
 	BUG_ON(dentry->d_count);
+	/*
+	 * It is cheaper to keep a global counter separate
+	 * than to scan through all superblocks when needed
+	 */
 	this_cpu_dec(nr_dentry);
+	percpu_counter_dec(&dentry->d_sb->s_nr_dentry);
+
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
@@ -1224,8 +1230,12 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	INIT_LIST_HEAD(&dentry->d_alias);
 	INIT_LIST_HEAD(&dentry->d_u.d_child);
 	d_set_d_op(dentry, dentry->d_sb->s_d_op);
-
+	/*
+	 * It is cheaper to keep a global counter separate
+	 * than to scan through all superblocks when needed
+	 */
 	this_cpu_inc(nr_dentry);
+	percpu_counter_inc(&dentry->d_sb->s_nr_dentry);
 
 	return dentry;
 }
diff --git a/fs/super.c b/fs/super.c
index 3f56a26..e95ac4f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -119,6 +119,9 @@ static struct super_block *alloc_super(struct file_system_type *type)
 			s = NULL;
 			goto out;
 		}
+
+		percpu_counter_init(&s->s_nr_dentry, 0);
+
 #ifdef CONFIG_SMP
 		s->s_files = alloc_percpu(struct list_head);
 		if (!s->s_files) {
@@ -199,6 +202,7 @@ static inline void destroy_super(struct super_block *s)
 #ifdef CONFIG_SMP
 	free_percpu(s->s_files);
 #endif
+	percpu_counter_destroy(&s->s_nr_dentry);
 	security_sb_free(s);
 	kfree(s->s_subtype);
 	kfree(s->s_options);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f23bcb7..8150f52 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1399,6 +1399,8 @@ struct super_block {
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	int			s_nr_dentry_unused;	/* # of dentry on lru */
 
+	struct percpu_counter 	s_nr_dentry;		/* # of dentry on this sb */
+
 	/* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
 	spinlock_t		s_inode_lru_lock ____cacheline_aligned_in_smp;
 	struct list_head	s_inode_lru;		/* unused inode lru */
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 3/4] limit nr_dentries per superblock
@ 2011-08-14 15:13     ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2011-08-14 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, containers, Pavel Emelyanov, Al Viro,
	Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, James Bottomley, David Chinner, Glauber Costa,
	Eric Dumazet

This patch lays the foundation for us to limit the dcache size.
Each super block can have only a maximum number of dentries under its
sub-tree. Allocation fails if we're over the limit and the cache
can't be pruned to free up space for the newcomers.
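
A note on the check below: percpu_counter_read_positive() is only an
estimate, stale by up to roughly the per-cpu batch (32 by default)
times the number of cpus, so e.g. a 4-cpu box can be off by ~128
entries. That is why dcache_mem_check() consults the cheap estimate
first and only pays for the exact, cross-cpu
percpu_counter_sum_positive() once the estimate has crossed the limit.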

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/dcache.c        |   28 ++++++++++++++++++++++++++++
 fs/super.c         |    1 +
 include/linux/fs.h |    1 +
 3 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 815d9fd..ddd02a2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1180,6 +1180,31 @@ void shrink_dcache_parent(struct dentry * parent)
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
 
+static inline int dcache_mem_check(struct super_block *sb)
+{
+	struct shrink_control sc = {
+		.gfp_mask = GFP_KERNEL,
+	};
+
+	if (sb->s_nr_dentry_max == INT_MAX)
+		return 0;
+
+	do {
+		int nr_dentry;
+
+		nr_dentry = percpu_counter_read_positive(&sb->s_nr_dentry);
+		if (nr_dentry > sb->s_nr_dentry_max)
+			nr_dentry =
+				percpu_counter_sum_positive(&sb->s_nr_dentry);
+		if (nr_dentry < sb->s_nr_dentry_max)
+			return 0;
+
+	/* nr_pages = 1, lru_pages = 0 should get delta ~ 2 */
+	} while (shrink_one_shrinker(&sb->s_shrink, &sc, 1, 0));
+
+	return -ENOMEM;
+}
+
 /**
  * __d_alloc	-	allocate a dcache entry
  * @sb: filesystem it will belong to
@@ -1195,6 +1220,9 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	struct dentry *dentry;
 	char *dname;
 
+	if (dcache_mem_check(sb))
+		return NULL;
+
 	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
 	if (!dentry)
 		return NULL;
diff --git a/fs/super.c b/fs/super.c
index e95ac4f..3db40fb 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -121,6 +121,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		}
 
 		percpu_counter_init(&s->s_nr_dentry, 0);
+		s->s_nr_dentry_max = INT_MAX;
 
 #ifdef CONFIG_SMP
 		s->s_files = alloc_percpu(struct list_head);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8150f52..bc773c3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1400,6 +1400,7 @@ struct super_block {
 	int			s_nr_dentry_unused;	/* # of dentry on lru */
 
 	struct percpu_counter 	s_nr_dentry;		/* # of dentry on this sb */
+	int 			s_nr_dentry_max;	/* max # of dentry on this sb*/
 
 	/* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
 	spinlock_t		s_inode_lru_lock ____cacheline_aligned_in_smp;
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 65+ messages in thread
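
[ Aside on the "nr_pages = 1, lru_pages = 0 should get delta ~ 2"
  comment in dcache_mem_check() above: with shrinker->seeks =
  DEFAULT_SEEKS = 2, delta = (4 * 1) / 2 = 2; then delta *= max_pass
  and do_div(delta, 0 + 1) leave delta = 2 * max_pass, so total_scan is
  clamped to max_pass * 2. Each loop iteration may therefore scan the
  whole cache, and the loop exits once the shrinker frees nothing. ]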

* [PATCH v3 4/4] parse options in the vfs level
@ 2011-08-14 15:13     ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2011-08-14 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, containers, Pavel Emelyanov, Al Viro,
	Hugh Dickins, Nick Piggin, Andrea Arcangeli, Rik van Riel,
	Dave Hansen, James Bottomley, David Chinner, Glauber Costa,
	Eric Dumazet

This patch introduces a simple generic vfs option parser.
Right now, the only option we have is to limit the size of the dcache.

So any user that wants to have a dcache entries limit can specify:

  mount -o whatever_options,vfs_dcache_size=XXX <dev> <mntpoint>

It is supposed to work well with remounts, allowing the limit to
change multiple times over the course of the filesystem's lifecycle.

I find mount a natural interface for handling filesystem options,
so that's what I've chosen. Feel free to yell at it at will if
you disagree.
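
To illustrate the stripping behaviour implemented below (the option
values are made up for the example): if userspace passes

  mount -o noatime,vfs_dcache_size=4096,ro <dev> <mntpoint>

the VFS consumes vfs_dcache_size, and the filesystem's own option
parser receives just "noatime,ro".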

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/dcache.c            |    7 +++
 fs/namespace.c         |  106 ++++++++++++++++++++++++++
 include/linux/dcache.h |    3 ++
 3 files changed, 116 insertions(+), 0 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ddd02a2..0551305 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1205,6 +1205,13 @@ static inline int dcache_mem_check(struct super_block *sb)
 	return -ENOMEM;
 }
 
+int vfs_set_dcache_size(struct super_block *sb, int size)
+{
+	sb->s_nr_dentry_max = size;
+
+	return dcache_mem_check(sb);
+}
+
 /**
  * __d_alloc	-	allocate a dcache entry
  * @sb: filesystem it will belong to
diff --git a/fs/namespace.c b/fs/namespace.c
index 22bfe82..43b2cdb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -31,6 +31,7 @@
 #include <linux/idr.h>
 #include <linux/fs_struct.h>
 #include <linux/fsnotify.h>
+#include <linux/parser.h>
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
 #include "pnode.h"
@@ -958,6 +959,9 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 	};
 	const struct proc_fs_info *fs_infop;
 
+	if (sb->s_nr_dentry_max != INT_MAX)
+		seq_printf(m, ",vfs_dcache_size=%d", sb->s_nr_dentry_max);
+
 	for (fs_infop = fs_info; fs_infop->flag; fs_infop++) {
 		if (sb->s_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
@@ -2271,6 +2275,95 @@ int copy_mount_string(const void __user *data, char **where)
 	return 0;
 }
 
+static const match_table_t tokens = {
+	{1, "vfs_dcache_size=%u"},
+	{0, NULL}
+};
+
+struct vfs_options {
+	unsigned long vfs_dcache_size;
+};
+
+/**
+ * Generic option parsing for the VFS.
+ *
+ * Since most of the filesystems already do their own option parsing, and with
+ * very little code shared between them, this function strips out any options that
+ * we succeed in parsing ourselves. Passing them forward would just give the
+ * underlying fs an option it does not expect, leading it to fail.
+ *
+ * We don't yet have a pointer to the super block as well, since this is
+ * pre-mount. We accumulate in struct vfs_options whatever data we collected,
+ * and act on it later.
+ */
+static int vfs_parse_options(char *options, struct vfs_options *ops)
+{
+	substring_t args[MAX_OPT_ARGS];
+	int option;
+	char *p;
+	char *opt;
+	char *start = NULL;
+	int ret;
+
+	if (!options)
+		return 0;
+
+	opt = kstrdup(options, GFP_KERNEL);
+	if (!opt)
+		return 1;
+
+	ret = 1;
+
+	start = opt;
+	while ((p = strsep(&opt, ",")) != NULL) {
+		int token;
+		if (!*p)
+			continue;
+
+		/*
+		 * Initialize args struct so we know whether arg was
+		 * found; some options take optional arguments.
+		 */
+		args[0].to = args[0].from = 0;
+		token = match_token(p, tokens, args);
+		switch (token) {
+		case 1:
+			if (!args[0].from)
+				break;
+
+			if (match_int(&args[0], &option))
+				break;
+
+			if (option < DCACHE_MIN_SIZE) {
+				printk(KERN_INFO "dcache size %d smaller than "
+				       "minimum (%d)\n", option, DCACHE_MIN_SIZE);
+				option = DCACHE_MIN_SIZE;
+			}
+
+			ops->vfs_dcache_size = option;
+
+			/*
+			 * The actual filesystems don't expect any option
+			 * they don't understand to be received in the option
+			 * string. So we strip off anything we processed, and
+			 * give them a clean options string.
+			 */
+			ret = 0;
+			if (!opt) /* it is the last option listed */
+				*(options + (p - start)) = '\0';
+			else
+				strcpy(options + (p - start), opt);
+			break;
+		default:
+			ret = 0;
+			break;
+		}
+	}
+
+	kfree(start);
+	return ret;
+}
+
 /*
  * Flags is a 32-bit value that allows up to 31 non-fs dependent flags to
  * be given to the mount() call (ie: read-only, no-dev, no-suid etc).
@@ -2291,6 +2383,7 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 	struct path path;
 	int retval = 0;
 	int mnt_flags = 0;
+	struct vfs_options vfs_options;
 
 	/* Discard magic */
 	if ((flags & MS_MGC_MSK) == MS_MGC_VAL)
@@ -2318,6 +2411,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 	if (!(flags & MS_NOATIME))
 		mnt_flags |= MNT_RELATIME;
 
+
+	vfs_options.vfs_dcache_size = INT_MAX;
+	retval = vfs_parse_options(data_page, &vfs_options);
+	if (retval)
+		goto dput_out;
+
 	/* Separate the per-mountpoint flags */
 	if (flags & MS_NOSUID)
 		mnt_flags |= MNT_NOSUID;
@@ -2350,6 +2449,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 	else
 		retval = do_new_mount(&path, type_page, flags, mnt_flags,
 				      dev_name, data_page);
+
+	/* bind mounts get to respect their parent's decision */
+	if (!retval && !(flags & MS_BIND))
+		vfs_set_dcache_size(path.mnt->mnt_sb,
+				    vfs_options.vfs_dcache_size);
+
 dput_out:
 	path_put(&path);
 	return retval;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d37d2a7..1a309f3 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -251,6 +251,9 @@ extern int d_invalidate(struct dentry *);
 
 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
+
+#define DCACHE_MIN_SIZE 1024
+extern int vfs_set_dcache_size(struct super_block *sb, int size);
 
 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 4/4] parse options in the vfs level
  2011-08-14 15:13     ` Glauber Costa
  (?)
@ 2011-08-14 15:39     ` Vasiliy Kulikov
  2011-08-15  0:03       ` Glauber Costa
  2011-08-15  0:03       ` Glauber Costa
  -1 siblings, 2 replies; 65+ messages in thread
From: Vasiliy Kulikov @ 2011-08-14 15:39 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, Andrea Arcangeli, Rik van Riel, Nick Piggin,
	Eric Dumazet, Pavel Emelyanov, David Chinner, containers,
	Hugh Dickins, James Bottomley, Dave Hansen, Al Viro,
	linux-fsdevel

Hi Glauber,

On Sun, Aug 14, 2011 at 19:13 +0400, Glauber Costa wrote:
> +/**
> + * Generic option parsing for the VFS.
> + *
> + * Since most of the filesystems already do their own option parsing, and with
> + * very little code shared between them, this function strips out any options that
> + * we succeed in parsing ourselves. Passing them forward would just give the
> + * underlying fs an option it does not expect, leading it to fail.
> + *
> + * We don't yet have a pointer to the super block as well, since this is
> + * pre-mount. We accumulate in struct vfs_options whatever data we collected,
> + * and act on it later.
> + */
> +static int vfs_parse_options(char *options, struct vfs_options *ops)
> +{
> +	substring_t args[MAX_OPT_ARGS];
> +	int option;
> +	char *p;
> +	char *opt;
> +	char *start = NULL;
> +	int ret;
> +	
> +	if (!options)
> +		return 0;
> +
> +	opt = kstrdup(options, GFP_KERNEL);
> +	if (!opt)
> +		return 1;
> +	
> +	ret = 1;
> +
> +	start = opt;
> +	while ((p = strsep(&opt, ",")) != NULL) {
> +		int token;
> +		if (!*p)
> +			continue;
> +
> +		/*
> +		 * Initialize args struct so we know whether arg was
> +		 * found; some options take optional arguments.
> +		 */
> +		args[0].to = args[0].from = 0;
> +		token = match_token(p, tokens, args);
> +		switch (token) {
> +		case 1:
> +			if (!args[0].from)
> +				break;
> +
> +			if (match_int(&args[0], &option))
> +				break;

What if there are 2 passed options and the second fails?

   mount -o vfs_dcache_size=XXX,vfs_dcache_size=CRAP <dev> <mntpoint>

In this case you leave the second option and pass it to the fs option
parser (as you already set ret=0), which is wrong.  I think you should
explicitly return 1 where you know the option is related to the VFS, but you
failed to parse it.  It would look even simpler than the current code.

(Yes, this is a rare situation, but I can imagine some program that
automatically adds mount options to the existing list and passes it to
mount.)
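
[ A sketch of the suggested change (illustrative, untested): make
  case 1 fail hard instead of falling through --

	case 1:
		if (!args[0].from || match_int(&args[0], &option)) {
			/* recognisably a VFS option, but unparseable:
			 * fail the mount rather than leak the option
			 * to the fs parser */
			ret = 1;
			goto out;
		}
		...
  out:
	kfree(start);
	return ret;
  ]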

> +			if (option < DCACHE_MIN_SIZE) {
> +				printk(KERN_INFO "dcache size %d smaller than "
> +				       "minimum (%d)\n", option, DCACHE_MIN_SIZE);
> +				option = DCACHE_MIN_SIZE;
> +			}
> +
> +			ops->vfs_dcache_size = option;
> +
> +			/*
> +			 * The actual filesystems don't expect any option
> +			 * they don't understand to be received in the option
> +			 * string. So we strip off anything we processed, and
> +			 * give them a clean options string.
> +			 */
> +			ret = 0;
> +			if (!opt) /* it is the last option listed */
> +				*(options + (p - start)) = '\0';
> +			else
> +				strcpy(options + (p - start), opt);
> +			break;
> +		default:
> +			ret = 0;
> +			break;
> +		}
> +	}
> +
> +	kfree(start);
> +	return ret;
> +}

Thanks,

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 2/4] Keep nr_dentry per super block
  2011-08-14 15:13     ` Glauber Costa
@ 2011-08-14 17:38       ` Eric Dumazet
  -1 siblings, 0 replies; 65+ messages in thread
From: Eric Dumazet @ 2011-08-14 17:38 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov,
	Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, David Chinner

On Sunday, 14 August 2011 at 19:13 +0400, Glauber Costa wrote:
> Now that we have per-sb shrinkers, it makes sense to have nr_dentries
> stored per sb as well. We turn them into per-cpu counters so we can
> keep acessing them without locking.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>

>  }
> diff --git a/fs/super.c b/fs/super.c
> index 3f56a26..e95ac4f 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -119,6 +119,9 @@ static struct super_block *alloc_super(struct file_system_type *type)
>  			s = NULL;
>  			goto out;
>  		}
> +
> +		percpu_counter_init(&s->s_nr_dentry, 0);
> +

You should take care of the possible error return at this point.

(We probably should also check the percpu_counter_init() calls done in
tcp_init(), but those are done at boot time, with nearly no chance of
failure.)
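
[ Roughly, following the error convention visible in the alloc_super()
  hunk quoted above (illustrative sketch; it assumes security_sb_alloc()
  has already succeeded at this point):

	if (percpu_counter_init(&s->s_nr_dentry, 0)) {
		security_sb_free(s);
		kfree(s);
		s = NULL;
		goto out;
	}
  ]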

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 4/4] parse options in the vfs level
  2011-08-14 15:39     ` Vasiliy Kulikov
  2011-08-15  0:03       ` Glauber Costa
@ 2011-08-15  0:03       ` Glauber Costa
  1 sibling, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2011-08-15  0:03 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: linux-kernel, Andrea Arcangeli, Rik van Riel, Nick Piggin,
	Eric Dumazet, Pavel Emelyanov, David Chinner, containers,
	Hugh Dickins, James Bottomley, Dave Hansen, Al Viro,
	linux-fsdevel

On 08/14/2011 08:39 AM, Vasiliy Kulikov wrote:
> Hi Glauber,
>
> On Sun, Aug 14, 2011 at 19:13 +0400, Glauber Costa wrote:
>> +/**
>> + * Generic option parsing for the VFS.
>> + *
>> + * Since most of the filesystems already do their own option parsing, and with
>> + * very little code shared between them, this function strips out any options that
>> + * we succeed in parsing ourselves. Passing them forward would just give the
>> + * underlying fs an option it does not expect, leading it to fail.
>> + *
>> + * We don't yet have a pointer to the super block as well, since this is
>> + * pre-mount. We accumulate in struct vfs_options whatever data we collected,
>> + * and act on it later.
>> + */
>> +static int vfs_parse_options(char *options, struct vfs_options *ops)
>> +{
>> +	substring_t args[MAX_OPT_ARGS];
>> +	int option;
>> +	char *p;
>> +	char *opt;
>> +	char *start = NULL;
>> +	int ret;
>> +	
>> +	if (!options)
>> +		return 0;
>> +
>> +	opt = kstrdup(options, GFP_KERNEL);
>> +	if (!opt)
>> +		return 1;
>> +	
>> +	ret = 1;
>> +
>> +	start = opt;
>> +	while ((p = strsep(&opt, ",")) != NULL) {
>> +		int token;
>> +		if (!*p)
>> +			continue;
>> +
>> +		/*
>> +		 * Initialize args struct so we know whether arg was
>> +		 * found; some options take optional arguments.
>> +		 */
>> +		args[0].to = args[0].from = 0;
>> +		token = match_token(p, tokens, args);
>> +		switch (token) {
>> +		case 1:
>> +			if (!args[0].from)
>> +				break;
>> +
>> +			if (match_int(&args[0],&option))
>> +				break;
>
> What if there are 2 passed options and the second fails?
>
>     mount -o vfs_dcache_size=XXX,vfs_dcache_size=CRAP<dev>  <mntpoint>
>
> In this case you leave the second option and pass it to the fs option
> parser (as you already set ret=0), which is wrong.  I think you should
> explicitly return 1 where you know the option is related to VFS, but you
> failed to parse it.  It would look even simplier than current code.

Good point, thank you. I agree.

> (Yes, this is a rare situation, but I can imagine some program that
> automatically adds mount options to the existing list and passes it to
> mount.)
>
Absolutely.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 4/4] parse options in the vfs level
  2011-08-14 15:39     ` Vasiliy Kulikov
@ 2011-08-15  0:03       ` Glauber Costa
  2011-08-15  0:03       ` Glauber Costa
  1 sibling, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2011-08-15  0:03 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: linux-kernel, Andrea Arcangeli, Rik van Riel, Nick Piggin,
	Eric Dumazet, Pavel Emelyanov, David Chinner, containers,
	Hugh Dickins, James Bottomley, Dave Hansen, Al Viro,
	linux-fsdevel

On 08/14/2011 08:39 AM, Vasiliy Kulikov wrote:
> Hi Glauber,
>
> On Sun, Aug 14, 2011 at 19:13 +0400, Glauber Costa wrote:
>> +/**
>> + * Generic option parsing for the VFS.
>> + *
>> + * Since most of the filesystems already do their own option parsing, and with
>> + * very few code shared between them, this function strips out any options that
>> + * we succeed in parsing ourselves. Passing them forward would just give the
>> + * underlying fs an option it does not expect, leading it to fail.
>> + *
>> + * We don't have a pointer to the super block yet, since this is
>> + * pre-mount. We accumulate in struct vfs_options whatever data we collected,
>> + * and act on it later.
>> + */
>> +static int vfs_parse_options(char *options, struct vfs_options *ops)
>> +{
>> +	substring_t args[MAX_OPT_ARGS];
>> +	int option;
>> +	char *p;
>> +	char *opt;
>> +	char *start = NULL;
>> +	int ret;
>> +	
>> +	if (!options)
>> +		return 0;
>> +
>> +	opt = kstrdup(options, GFP_KERNEL);
>> +	if (!opt)
>> +		return 1;
>> +	
>> +	ret = 1;
>> +
>> +	start = opt;
>> +	while ((p = strsep(&opt, ",")) != NULL) {
>> +		int token;
>> +		if (!*p)
>> +			continue;
>> +
>> +		/*
>> +		 * Initialize args struct so we know whether arg was
>> +		 * found; some options take optional arguments.
>> +		 */
>> +		args[0].to = args[0].from = 0;
>> +		token = match_token(p, tokens, args);
>> +		switch (token) {
>> +		case 1:
>> +			if (!args[0].from)
>> +				break;
>> +
>> +			if (match_int(&args[0],&option))
>> +				break;
>
> What if there are 2 passed options and the second fails?
>
>     mount -o vfs_dcache_size=XXX,vfs_dcache_size=CRAP <dev>  <mntpoint>
>
> In this case you leave the second option and pass it to the fs option
> parser (as you already set ret=0), which is wrong.  I think you should
> explicitly return 1 where you know the option is related to VFS, but you
> failed to parse it.  It would look even simpler than the current code.

Good point, thank you. I agree.

> (Yes, this is a rare situation, but I can imagine some program that
> automatically adds mount options to the existing list and passes it to
> mount.)
>
Absolutely.
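
For reference, a sketch of the flow Vasiliy describes, reusing the names
from the quoted hunk (the out label, its cleanup, and the ops field name
are assumptions):

		case 1:
			/* the option is ours: a malformed value must fail
			 * the mount, not fall through to the filesystem's
			 * own option parser */
			if (!args[0].from || match_int(&args[0], &option)) {
				ret = 1;
				goto out;	/* assumed to free opt and return ret */
			}
			ops->vfs_dcache_size = option;	/* field name assumed */
			break;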

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/4] factor out single-shrinker code
  2011-08-14 15:13     ` Glauber Costa
  (?)
@ 2011-08-15  6:43     ` Dave Chinner
  -1 siblings, 0 replies; 65+ messages in thread
From: Dave Chinner @ 2011-08-15  6:43 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov,
	Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet

On Sun, Aug 14, 2011 at 07:13:49PM +0400, Glauber Costa wrote:
> While shrinking our caches, vmscan.c passes through all
> registered shrinkers, trying to free objects as it goes.
> 
> We would like to do that individually for some caches,
> like the dcache, when certain conditions apply (for
> example, when we reach a soon-to-exist maximum allowed size)
> 
> To avoid re-writing the same logic at more than one place,
> this patch factors out the shrink logic at shrink_one_shrinker(),
> that we can call from other places of the kernel.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/linux/shrinker.h |    6 ++
>  mm/vmscan.c              |  185 ++++++++++++++++++++++++----------------------
>  2 files changed, 104 insertions(+), 87 deletions(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 790651b..c5db650 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -39,4 +39,10 @@ struct shrinker {
>  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>  extern void register_shrinker(struct shrinker *);
>  extern void unregister_shrinker(struct shrinker *);
> +
> +unsigned long shrink_one_shrinker(struct shrinker *shrinker,
> +				  struct shrink_control *shrink,
> +				  unsigned long nr_pages_scanned,                      
> +				  unsigned long lru_pages);
> +
>  #endif
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7ef6912..50dfc61 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -211,6 +211,102 @@ static inline int do_shrinker_shrink(struct shrinker *shrinker,
>  }
>  
>  #define SHRINK_BATCH 128
> +unsigned long shrink_one_shrinker(struct shrinker *shrinker,
> +				  struct shrink_control *shrink,
> +				  unsigned long nr_pages_scanned,
> +				  unsigned long lru_pages)
> +{

This isn't the right place to cut the code apart. The act of
shrinking a cache is separate from the act of calculating how much to
shrink. You're taking the vmscan interface and trying to hack around
it when direct calls to a single shrinker are needed...

> +	unsigned long ret = 0;
> +	unsigned long long delta;
> +	unsigned long total_scan;
> +	unsigned long max_pass;
> +	int shrink_ret = 0;
> +	long nr;
> +	long new_nr;
> +	long batch_size = shrinker->batch ? shrinker->batch
> +					  : SHRINK_BATCH;
> +
> +	/*
> +	 * copy the current shrinker scan count into a local variable
> +	 * and zero it so that other concurrent shrinker invocations
> +	 * don't also do this scanning work.
> +	 */
> +	do {
> +		nr = shrinker->nr;
> +	} while (cmpxchg(&shrinker->nr, nr, 0) != nr);
> +
> +	total_scan = nr;
> +	max_pass = do_shrinker_shrink(shrinker, shrink, 0);
> +	delta = (4 * nr_pages_scanned) / shrinker->seeks;
> +	delta *= max_pass;
> +	do_div(delta, lru_pages + 1);
> +	total_scan += delta;
> +	if (total_scan < 0) {
> +		printk(KERN_ERR "shrink_slab: %pF negative objects to "
> +		       "delete nr=%ld\n",
> +		       shrinker->shrink, total_scan);
> +		total_scan = max_pass;
> +	}
> +
> +	/*
> +	 * We need to avoid excessive windup on filesystem shrinkers
> +	 * due to large numbers of GFP_NOFS allocations causing the
> +	 * shrinkers to return -1 all the time. This results in a large
> +	 * nr being built up so when a shrink that can do some work
> +	 * comes along it empties the entire cache due to nr >>>
> +	 * max_pass.  This is bad for sustaining a working set in
> +	 * memory.
> +	 *
> +	 * Hence only allow the shrinker to scan the entire cache when
> +	 * a large delta change is calculated directly.
> +	 */
> +	if (delta < max_pass / 4)
> +		total_scan = min(total_scan, max_pass / 2);
> +
> +	/*
> +	 * Avoid risking looping forever due to too large nr value:
> +	 * never try to free more than twice the estimate number of
> +	 * freeable entries.
> +	 */
> +	if (total_scan > max_pass * 2)
> +		total_scan = max_pass * 2;
> +
> +	trace_mm_shrink_slab_start(shrinker, shrink, nr,
> +				nr_pages_scanned, lru_pages,
> +				max_pass, delta, total_scan);
> +

This bit from here:

> +	while (total_scan >= batch_size) {
> +		int nr_before;
> +
> +		nr_before = do_shrinker_shrink(shrinker, shrink, 0);
> +		shrink_ret = do_shrinker_shrink(shrinker, shrink,
> +						batch_size);
> +		if (shrink_ret == -1)
> +			break;
> +		if (shrink_ret < nr_before)
> +			ret += nr_before - shrink_ret;
> +		count_vm_events(SLABS_SCANNED, batch_size);
> +		total_scan -= batch_size;
> +
> +		cond_resched();
> +	}

to here is the only common code. It takes a locked shrinker and a
set-up shrink_control structure. As it is, I've got a set of patches
that I'm working on that fix the garbage shrinker API and abstract
this section of the code correctly for external users. I'll try to
get that finished for review later this week.
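
Under that reading, the factored-out helper would be little more than the
batch loop quoted above; a sketch of that split (the helper name is
invented, the body is lifted from the quoted code):

	static unsigned long do_shrinker_scan(struct shrinker *shrinker,
					      struct shrink_control *sc,
					      unsigned long total_scan)
	{
		unsigned long freed = 0;
		long batch_size = shrinker->batch ? shrinker->batch
						  : SHRINK_BATCH;

		/* the caller has already decided how much to scan */
		while (total_scan >= batch_size) {
			int nr_before = do_shrinker_shrink(shrinker, sc, 0);
			int nr_after = do_shrinker_shrink(shrinker, sc,
							  batch_size);

			if (nr_after == -1)
				break;
			if (nr_after < nr_before)
				freed += nr_before - nr_after;
			count_vm_events(SLABS_SCANNED, batch_size);
			total_scan -= batch_size;

			cond_resched();
		}
		return freed;
	}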

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-14 15:13     ` Glauber Costa
  (?)
  (?)
@ 2011-08-15  7:03     ` Dave Chinner
  -1 siblings, 0 replies; 65+ messages in thread
From: Dave Chinner @ 2011-08-15  7:03 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov,
	Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet

On Sun, Aug 14, 2011 at 07:13:51PM +0400, Glauber Costa wrote:
> This patch lays the foundation for us to limit the dcache size.
> Each super block can have only a maximum number of dentries under its
> sub-tree. Allocation fails if we're over the limit and the cache
> can't be pruned to free up space for the newcomers.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  fs/dcache.c        |   28 ++++++++++++++++++++++++++++
>  fs/super.c         |    1 +
>  include/linux/fs.h |    1 +
>  3 files changed, 30 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 815d9fd..ddd02a2 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1180,6 +1180,31 @@ void shrink_dcache_parent(struct dentry * parent)
>  }
>  EXPORT_SYMBOL(shrink_dcache_parent);
>  
> +static inline int dcache_mem_check(struct super_block *sb)
> +{
> +	struct shrink_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +	};
> +
> +	if (sb->s_nr_dentry_max == INT_MAX)
> +		return 0;
> +
> +	do {
> +		int nr_dentry;
> +
> +		nr_dentry = percpu_counter_read_positive(&sb->s_nr_dentry);
> +		if (nr_dentry > sb->s_nr_dentry_max)
> +			nr_dentry =
> +				percpu_counter_sum_positive(&sb->s_nr_dentry);
> +		if (nr_dentry < sb->s_nr_dentry_max)
> +			return 0;
> +
> +	/* nr_pages = 1, lru_pages = 0 should get delta ~ 2 */
> +	} while (shrink_one_shrinker(&sb->s_shrink, &sc, 1, 0));
> +
> +	return -ENOMEM;
> +}

The changes here don't address the comments I made previously about
the error range of the per-cpu counter and the update frequency of
the global counter w.r.t. small delta changes of the counted
amount.

That is, you cannot expect percpu_counter_read_positive() to change
when you decrement a counter by 2 counts (assuming both objects are
freed), and there's no guarantee that percpu_counter_sum_positive()
will drop below the threshold either while
percpu_counter_read_positive() is returning numbers above the
threshold. That makes this a very expensive and oft-repeated loop.

Indeed, that even assumes that the 2 objects that are freed are
dentries. If there are more inodes than dentries in memory, the
dentries won't even be trimmed, and you'll get stuck in the horribly
expensive loop trimming the cache 2 inodes at a time until the
caches start to balance. Shrinking must be done in large batches to
be effective, not one object at a time.

Further, the shrinker may not make progress freeing items, which
will result in this code returning ENOMEM when the shrinker may
simply have been passing over referenced (i.e. cache hot) dentries.
That will return spurious ENOMEM results back to d_alloc, resulting
in spurious failures that will be impossible to track down...

This also demonstrates my point that shrink_one_shrinker() was
factored out at the wrong place. You've already got a
shrink_control structure, so adding a value to .nr_to_scan in the
initialisation gets around the entire need to do all that guesswork
of running through the vmscan-specific algorithms to get 2 objects
freed.
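
Concretely, dcache_mem_check() could then do something like this instead
of looping on shrink_one_shrinker() (a sketch; the batch figure is
illustrative and it assumes the per-sb shrinker callback does the pruning):

	struct shrink_control sc = {
		.gfp_mask = GFP_KERNEL,
		.nr_to_scan = 128,	/* a real batch, not ~2 objects */
	};

	/* prune this sb's caches directly, no vmscan guess work needed */
	sb->s_shrink.shrink(&sb->s_shrink, &sc);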

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 4/4] parse options in the vfs level
  2011-08-14 15:13     ` Glauber Costa
                       ` (2 preceding siblings ...)
  (?)
@ 2011-08-15  7:09     ` Dave Chinner
  2011-08-24  2:19       ` Glauber Costa
  2011-08-24  2:19       ` Glauber Costa
  -1 siblings, 2 replies; 65+ messages in thread
From: Dave Chinner @ 2011-08-15  7:09 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov,
	Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet

On Sun, Aug 14, 2011 at 07:13:52PM +0400, Glauber Costa wrote:
> This patch introduces a simple generic vfs option parser.
> Right now, the only option we have is to limit the size of the dcache.
> 
> So any user that wants to have a dcache entries limit, can specify:
> 
>   mount -o whatever_options,vfs_dcache_size=XXX <dev> <mntpoint>
> 
> It is supposed to work well with remounts, allowing it to change
> multiple times over the course of the filesystem's lifecycle.
> 
> I find mount a natural interface for handling filesystem options,
> so that's what I've chosen. Feel free to yell at it at will if
> you disagree.

It's already been noted by both myself and Al that there is not a
1:1 mapping between mount points and superblocks....

> @@ -2350,6 +2449,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
>  	else
>  		retval = do_new_mount(&path, type_page, flags, mnt_flags,
>  				      dev_name, data_page);
> +
> +	/* bind mounts get to respect their parents decision */
> +	if (!retval && !(flags & MS_BIND))
> +		vfs_set_dcache_size(path.mnt->mnt_sb,
> +				    vfs_options.vfs_dcache_size);
> +			

And I'm not sure that silently ignoring it in certain cases
is the best way around that problem.

>  dput_out:
>  	path_put(&path);
>  	return retval;
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index d37d2a7..1a309f3 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -251,6 +251,10 @@ extern int d_invalidate(struct dentry *);
>  
>  /* only used at mount-time */
>  extern struct dentry * d_alloc_root(struct inode *);
> +extern int vfs_set_dcache_size(struct super_block *sb, int size);
> +
> +#define DCACHE_MIN_SIZE 1024
> +extern int vfs_set_dcache_size(struct super_block *sb, int size);

Where did this minimum size come from and why was that figure
chosen? Needs documenting.
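
The body of vfs_set_dcache_size() is not quoted in this reply; purely as
an illustration of the clamp and comment being asked for (an assumption
about the patch's intent, not the actual code):

	/*
	 * DCACHE_MIN_SIZE is the lowest limit a user may set. Why 1024
	 * and not some other figure is exactly what needs documenting.
	 */
	int vfs_set_dcache_size(struct super_block *sb, int size)
	{
		if (size != INT_MAX && size < DCACHE_MIN_SIZE)
			return -EINVAL;

		sb->s_nr_dentry_max = size;
		return 0;
	}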

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-14 15:13     ` Glauber Costa
                       ` (2 preceding siblings ...)
  (?)
@ 2011-08-15  7:12     ` Pekka Enberg
       [not found]       ` <CAOJsxLFYD9DkEU5R+9fZr-uaznteP8-BUC_Ti+FnNFqtbOCHSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2011-08-15 10:46       ` Dave Chinner
  -1 siblings, 2 replies; 65+ messages in thread
From: Pekka Enberg @ 2011-08-15  7:12 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov,
	Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, David Chinner,
	Eric Dumazet, Christoph Lameter, David Rientjes

On Sun, Aug 14, 2011 at 6:13 PM, Glauber Costa <glommer@parallels.com> wrote:
> This patch lays the foundation for us to limit the dcache size.
> Each super block can have only a maximum number of dentries under its
> sub-tree. Allocation fails if we're over the limit and the cache
> can't be pruned to free up space for the newcomers.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  fs/dcache.c        |   28 ++++++++++++++++++++++++++++
>  fs/super.c         |    1 +
>  include/linux/fs.h |    1 +
>  3 files changed, 30 insertions(+), 0 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 815d9fd..ddd02a2 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1180,6 +1180,31 @@ void shrink_dcache_parent(struct dentry * parent)
>  }
>  EXPORT_SYMBOL(shrink_dcache_parent);
>
> +static inline int dcache_mem_check(struct super_block *sb)
> +{
> +       struct shrink_control sc = {
> +               .gfp_mask = GFP_KERNEL,
> +       };
> +
> +       if (sb->s_nr_dentry_max == INT_MAX)
> +               return 0;
> +
> +       do {
> +               int nr_dentry;
> +
> +               nr_dentry = percpu_counter_read_positive(&sb->s_nr_dentry);
> +               if (nr_dentry > sb->s_nr_dentry_max)
> +                       nr_dentry =
> +                               percpu_counter_sum_positive(&sb->s_nr_dentry);
> +               if (nr_dentry < sb->s_nr_dentry_max)
> +                       return 0;
> +
> +       /* nr_pages = 1, lru_pages = 0 should get delta ~ 2 */
> +       } while (shrink_one_shrinker(&sb->s_shrink, &sc, 1, 0));
> +
> +       return -ENOMEM;
> +}
> +
>  /**
>  * __d_alloc   -       allocate a dcache entry
>  * @sb: filesystem it will belong to
> @@ -1195,6 +1220,9 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
>        struct dentry *dentry;
>        char *dname;
>
> +       if (dcache_mem_check(sb))
> +               return NULL;
> +
>        dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
>        if (!dentry)
>                return NULL;
> diff --git a/fs/super.c b/fs/super.c
> index e95ac4f..3db40fb 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -121,6 +121,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
>                }
>
>                percpu_counter_init(&s->s_nr_dentry, 0);
> +               s->s_nr_dentry_max = INT_MAX;
>
>  #ifdef CONFIG_SMP
>                s->s_files = alloc_percpu(struct list_head);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 8150f52..bc773c3 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1400,6 +1400,7 @@ struct super_block {
>        int                     s_nr_dentry_unused;     /* # of dentry on lru */
>
>        struct percpu_counter   s_nr_dentry;            /* # of dentry on this sb */
> +       int                     s_nr_dentry_max;        /* max # of dentry on this sb*/
>
>        /* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
>        spinlock_t              s_inode_lru_lock ____cacheline_aligned_in_smp;

We track the total number of objects in mm/slub.c when
CONFIG_SLUB_DEBUG is enabled (look for n->total_objects in the code).
Have you considered extending that for this purpose?
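
For context, the accounting referred to lives in slub's per-node
structure, roughly like this (field layout approximate for kernels of
this era):

	struct kmem_cache_node {
		spinlock_t list_lock;
		unsigned long nr_partial;
		struct list_head partial;
	#ifdef CONFIG_SLUB_DEBUG
		atomic_long_t nr_slabs;
		atomic_long_t total_objects;	/* the counter in question */
		struct list_head full;
	#endif
	};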

                                Pekka

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15  7:12     ` Pekka Enberg
       [not found]       ` <CAOJsxLFYD9DkEU5R+9fZr-uaznteP8-BUC_Ti+FnNFqtbOCHSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-08-15 10:46       ` Dave Chinner
  2011-08-15 10:58         ` Pekka Enberg
  2011-08-15 10:58         ` Pekka Enberg
  1 sibling, 2 replies; 65+ messages in thread
From: Dave Chinner @ 2011-08-15 10:46 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Glauber Costa, linux-kernel, linux-fsdevel, containers,
	Pavel Emelyanov, Al Viro, Hugh Dickins, Nick Piggin,
	Andrea Arcangeli, Rik van Riel, Dave Hansen, James Bottomley,
	Eric Dumazet, Christoph Lameter, David Rientjes

On Mon, Aug 15, 2011 at 10:12:06AM +0300, Pekka Enberg wrote:
> On Sun, Aug 14, 2011 at 6:13 PM, Glauber Costa <glommer@parallels.com> wrote:
> > This patch lays the foundation for us to limit the dcache size.
> > Each super block can have only a maximum number of dentries under its
> > sub-tree. Allocation fails if we're over the limit and the cache
> > can't be pruned to free up space for the newcomers.
.....
> We track the total number of objects in mm/slub.c when
> CONFIG_SLUB_DEBUG is enabled (look for n->total_objects in the code).
> Have you considered extending that for this purpose?

That's usage for the entire slab, though, and we don't have a dentry
slab per superblock so I don't think that helps us. And with slab
merging, I think that even if we did have a slab per superblock,
they'd end up in the same slab context anyway, right?

Ideally what we need is a slab, LRU and shrinkers all rolled into a
single infrastructure handle so we can simply set them up per
object, per context etc and not have to re-invent the wheel for
every single slab cache/LRU/shrinker setup we have in the kernel.
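
Purely as an illustration of that shape (every name here is hypothetical;
nothing like it exists yet):

	struct managed_cache {
		struct kmem_cache	*slab;		/* allocation */
		struct mc_lru		lru;		/* node-aware LRU */
		struct shrinker		shrinker;	/* reclaim entry point */
		/* per-context limits and statistics would hang off here */
	};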

I've got a rough node-aware generic LRU/shrinker infrastructure
prototype that is generic enough for most of the existing slab
caches with shrinkers, but I haven't looked at what is needed to
integrate it with the slab cache code. That's mainly because I don't
like the idea of having to implement the same thing 3 times in 3
different ways and debug them all before anyone would consider it
for inclusion in the kernel.

Once I've sorted out the select_parent() use-the-LRU-for-disposal
abuse and have a patch set that survives a 'rm -rf *' operation,
maybe we can then talk about what is needed to integrate stuff into
the slab caches....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 10:46       ` Dave Chinner
@ 2011-08-15 10:58         ` Pekka Enberg
       [not found]           ` <CAOJsxLGFJmqO-W=itQbO4Mh4DxSD4wrHOC8gQ5bWL5aE1YYeQw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2011-08-15 11:05             ` Pavel Emelyanov
  2011-08-15 10:58         ` Pekka Enberg
  1 sibling, 2 replies; 65+ messages in thread
From: Pekka Enberg @ 2011-08-15 10:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Glauber Costa, linux-kernel, linux-fsdevel, containers,
	Pavel Emelyanov, Al Viro, Hugh Dickins, Nick Piggin,
	Andrea Arcangeli, Rik van Riel, Dave Hansen, James Bottomley,
	Eric Dumazet, Christoph Lameter, David Rientjes

Hi Dave,

On Mon, Aug 15, 2011 at 1:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> That's usage for the entire slab, though, and we don't have a dentry
> slab per superblock so I don't think that helps us. And with slab
> merging, I think that even if we did have a slab per superblock,
> they'd end up in the same slab context anyway, right?

You could add a flag to disable slab merging but there's no sane way
to fix the per-superblock thing in slab.

On Mon, Aug 15, 2011 at 1:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> Ideally what we need is a slab, LRU and shrinkers all rolled into a
> single infrastructure handle so we can simply set them up per
> object, per context etc and not have to re-invent the wheel for
> every single slab cache/LRU/shrinker setup we have in the kernel.
>
> I've got a rough node-aware generic LRU/shrinker infrastructure
> prototype that is generic enough for most of the existing slab
> caches with shrinkers, but I haven't looked at what is needed to
> integrate it with the slab cache code. That's mainly because I don't
> like the idea of having to implement the same thing 3 times in 3
> different ways and debug them all before anyone would consider it
> for inclusion in the kernel.
>
> Once I've sorted out the select_parent() use-the-LRU-for-disposal
> abuse and have a patch set that survives a 'rm -rf *' operation,
> maybe we can then talk about what is needed to integrate stuff into
> the slab caches....

Well, now that I really understand what you're trying to do here, it's
probably best to keep slab as-is and implement "slab accounting" on
top of it.

You'd have something like you do now but in slightly more generic form:

  struct kmem_accounted_cache {
                  struct kmem_cache *cache;
                  /* ... statistics... */
  }

  void *kmem_accounted_alloc(struct kmem_accounted_cache *c)
  {
          if (/* within limits */)
                  return kmem_cache_alloc(c->cache);

          return NULL;
  }

Does something like that make sense to you?
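
One way the "/* within limits */" test might be filled in, layering
assumed counter fields on top of the sketch above:

	struct kmem_accounted_cache {
		struct kmem_cache *cache;
		atomic_long_t nr_objects;	/* assumed statistics */
		long max_objects;
	};

	void *kmem_accounted_alloc(struct kmem_accounted_cache *c, gfp_t gfp)
	{
		void *obj;

		if (atomic_long_read(&c->nr_objects) >= c->max_objects)
			return NULL;	/* or try shrinking first */

		obj = kmem_cache_alloc(c->cache, gfp);
		if (obj)
			atomic_long_inc(&c->nr_objects);
		return obj;
	}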

                        Pekka

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 10:58         ` Pekka Enberg
@ 2011-08-15 11:05             ` Pavel Emelyanov
  2011-08-15 11:05             ` Pavel Emelyanov
  1 sibling, 0 replies; 65+ messages in thread
From: Pavel Emelyanov @ 2011-08-15 11:05 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

On 08/15/2011 02:58 PM, Pekka Enberg wrote:
> Hi Dave,
> 
> On Mon, Aug 15, 2011 at 1:46 PM, Dave Chinner <david@fromorbit.com> wrote:
>> That's usage for the entire slab, though, and we don't have a dentry
>> slab per superblock so I don't think that helps us. And with slab
>> merging, I think that even if we did have a slab per superblock,
>> they'd end up in the same slab context anyway, right?
> 
> You could add a flag to disable slab merging but there's no sane way
> to fix the per-superblock thing in slab.
> 
> On Mon, Aug 15, 2011 at 1:46 PM, Dave Chinner <david@fromorbit.com> wrote:
>> Ideally what we need is a slab, LRU and shrinkers all rolled into a
>> single infrastructure handle so we can simply set them up per
>> object, per context etc and not have to re-invent the wheel for
>> every single slab cache/LRU/shrinker setup we have in the kernel.
>>
>> I've got a rough node-aware generic LRU/shrinker infrastructure
>> prototype that is generic enough for most of the existing slab
>> caches with shrinkers, but I haven't looked at what is needed to
>> integrate it with the slab cache code. That's mainly because I don't
>> like the idea of having to implement the same thing 3 times in 3
>> different ways and debug them all before anyone would consider it
>> for inclusion in the kernel.
>>
>> Once I've sorted out the select_parent() use-the-LRU-for-disposal
>> abuse and have a patch set that survives a 'rm -rf *' operation,
>> maybe we can then talk about what is needed to integrate stuff into
>> the slab caches....
> 
> Well, now that I really understand what you're trying to do here, it's
> probably best to keep slab as-is and implement "slab accounting" on
> top of it.
> 
> You'd have something like you do now but in slightly more generic form:
> 
>   struct kmem_accounted_cache {
>                   struct kmem_cache *cache;
>                   /* ... statistics... */
>   }
> 
>   void *kmem_accounted_alloc(struct kmem_accounted_cache *c)
>   {
>           if (/* within limits */)
>                   return kmem_cache_alloc(c->cache);
> 
>           return NULL;
>   }
> 
> Does something like that make sense to you?

This would make sense, since per-cgroup kernel memory management is one of the
things we'd love to have, but this particular idea will definitely not work when
we keep the containers' files on one partition, keeping each container in its own
chroot environment.

>                         Pekka
> .
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 11:05             ` Pavel Emelyanov
@ 2011-08-15 11:14               ` Pekka Enberg
  -1 siblings, 0 replies; 65+ messages in thread
From: Pekka Enberg @ 2011-08-15 11:14 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

Hi Pavel,

On Mon, Aug 15, 2011 at 2:05 PM, Pavel Emelyanov <xemul@parallels.com> wrote:
> This will make sense, since per-cgroup kernel memory management is one of the
> things we'd love to have, but this particular idea will definitely not work in the
> case where we keep the containers' files on one partition, with each container in
> its own chroot environment.

And you want a per-container dcache limit? Will the containers share
the same superblock? Couldn't you simply do per-container "struct
kmem_accounted_cache" in struct superblock?

                        Pekka

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 11:14               ` Pekka Enberg
@ 2011-08-15 11:32                 ` Pavel Emelyanov
  -1 siblings, 0 replies; 65+ messages in thread
From: Pavel Emelyanov @ 2011-08-15 11:32 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

On 08/15/2011 03:14 PM, Pekka Enberg wrote:
> Hi Pavel,
> 
> On Mon, Aug 15, 2011 at 2:05 PM, Pavel Emelyanov <xemul@parallels.com> wrote:
>> This will make sense, since per-cgroup kernel memory management is one of the
>> things we'd love to have, but this particular idea will definitely not work in the
>> case where we keep the containers' files on one partition, with each container in
>> its own chroot environment.
> 
> And you want a per-container dcache limit? 

To be more specific - we want to protect a node hosting more than one container
from any one of them growing the dcache without bound. One solution to this is a
per-container dcache limit.

> Will the containers share the same superblock? 

Yes, this is a typical scenario for both OpenVZ and LXC now.

> Couldn't you simply do per-container "struct kmem_accounted_cache" in struct superblock?

If by this you mean "account for all the kmem associated with a particular superblock"
then this is OK for us, but this can't be done in a simple

	if (used + size > limit)
		return -ENOMEM;
	else {
		used += size;
		return 0;
	}

manner, since once we hit the limit we should shrink the unused dentries. And most of the
patches are about this.
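
To make the difference concrete, the charge path has to look more like the
sketch below than like the bare check above (a hypothetical illustration only:
sb_shrink_unused() and the s_nr_dentry* fields are made-up names, not the ones
the series uses, and locking is elided):

	/*
	 * Illustrative only: charge one dentry against a per-superblock
	 * limit, trying to evict unused dentries before failing.  A real
	 * version also needs to close the race between the check and the
	 * increment.
	 */
	static int sb_charge_dentry(struct super_block *sb)
	{
		if (sb->s_nr_dentry < sb->s_nr_dentry_max)
			goto charge;

		/* At the limit: try to evict unused dentries first. */
		sb_shrink_unused(sb);

		if (sb->s_nr_dentry >= sb->s_nr_dentry_max)
			return -ENOMEM;	/* nothing was reclaimable */
	charge:
		sb->s_nr_dentry++;
		return 0;
	}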

>                         Pekka
> .
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 11:32                 ` Pavel Emelyanov
@ 2011-08-15 11:55                   ` Pekka Enberg
  -1 siblings, 0 replies; 65+ messages in thread
From: Pekka Enberg @ 2011-08-15 11:55 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

On Mon, Aug 15, 2011 at 2:32 PM, Pavel Emelyanov <xemul@parallels.com> wrote:
>> Couldn't you simply do per-container "struct kmem_accounted_cache" in struct superblock?
>
> If by this you mean "account for all the kmem associated with particular superblock" then
> this is OK for us, but this can't be done in a simple
>
>        if (used + size > limit)
>                return -ENOMEM;
>        else {
>                used += size;
>                return 0;
>        }
>
> manner, since once we hit the limit we should shrink the unused dentries. And most of the
> patches are about this.

So you want to shrink dentries that were allocated in the container
that hit the dcache limit? How does this patch set deal with that? Or
are you referring to some other patches?

                         Pekka

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 11:55                   ` Pekka Enberg
@ 2011-08-15 12:12                     ` Pavel Emelyanov
  -1 siblings, 0 replies; 65+ messages in thread
From: Pavel Emelyanov @ 2011-08-15 12:12 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

On 08/15/2011 03:55 PM, Pekka Enberg wrote:
> On Mon, Aug 15, 2011 at 2:32 PM, Pavel Emelyanov <xemul@parallels.com> wrote:
>>> Couldn't you simply do per-container "struct kmem_accounted_cache" in struct superblock?
>>
>> If by this you mean "account for all the kmem associated with particular superblock" then
>> this is OK for us, but this can't be done in a simple
>>
>>        if (used + size > limit)
>>                return -ENOMEM;
>>        else {
>>                used += size;
>>                return 0;
>>        }
>>
>> manner, since once we hit the limit we should shrink the unused dentries. And most of the
>> patches are about this.
> 
> So you want to shrink dentries that were allocated in the container
> that hit the dcache limit? How does this patch set deal with that? Or
> are you referring to some other patches?

Yes, this patchset does only the API part. The full idea is implemented (with poor API though) here:
http://lwn.net/Articles/441164/

>                          Pekka
> .
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 12:12                     ` Pavel Emelyanov
@ 2011-08-15 12:23                       ` Pekka Enberg
  -1 siblings, 0 replies; 65+ messages in thread
From: Pekka Enberg @ 2011-08-15 12:23 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

On Mon, Aug 15, 2011 at 3:12 PM, Pavel Emelyanov <xemul@parallels.com> wrote:
> On 08/15/2011 03:55 PM, Pekka Enberg wrote:
>> On Mon, Aug 15, 2011 at 2:32 PM, Pavel Emelyanov <xemul@parallels.com> wrote:
>>>> Couldn't you simply do per-container "struct kmem_accounted_cache" in struct superblock?
>>>
>>> If by this you mean "account for all the kmem associated with particular superblock" then
>>> this is OK for us, but this can't be done in a simple
>>>
>>>        if (used + size > limit)
>>>                return -ENOMEM;
>>>        else {
>>>                used += size;
>>>                return 0;
>>>        }
>>>
>>> manner, since once we hit the limit we should shrink the unused dentries. And most of the
>>> patches are about this.
>>
>> So you want to shrink dentries that were allocated in the container
>> that hit the dcache limit? How does this patch set deal with that? Or
>> are you referring to some other patches?
>
> Yes, this patchset does only the API part. The full idea is implemented (with poor API though) here:
> http://lwn.net/Articles/441164/

OK. I suppose it's best to keep it separate from the slab code for now
then. Maybe you're able to come up with something more generic that we
could make part of the slab APIs later.

                            Pekka

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 12:23                       ` Pekka Enberg
@ 2011-08-15 12:37                         ` Pavel Emelyanov
  -1 siblings, 0 replies; 65+ messages in thread
From: Pavel Emelyanov @ 2011-08-15 12:37 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Dave Chinner, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

>> Yes, this patchset does only the API part. The full idea is implemented (with poor API though) here:
>> http://lwn.net/Articles/441164/
> 
> OK. I suppose it's best to keep it separate from the slab code for now
> then. Maybe you're able to come up with something more generic that we
> could make part of the slab APIs later.

Yup! We also have plans for controlling kernel memory usage by containers, so
we'll certainly get back to it :)

>                             Pekka
> .
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 3/4] limit nr_dentries per superblock
  2011-08-15 11:14               ` Pekka Enberg
@ 2011-08-16  2:11                 ` Dave Chinner
  -1 siblings, 0 replies; 65+ messages in thread
From: Dave Chinner @ 2011-08-16  2:11 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Pavel Emelyanov, Glauber Costa, linux-kernel, linux-fsdevel,
	containers, Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet,
	Christoph Lameter, David Rientjes

On Mon, Aug 15, 2011 at 02:14:39PM +0300, Pekka Enberg wrote:
> Hi Pavel,
> 
> On Mon, Aug 15, 2011 at 2:05 PM, Pavel Emelyanov <xemul@parallels.com> wrote:
> > This will make sense, since per-cgroup kernel memory management is one of the
> > things we'd love to have, but this particular idea will definitely not work in the
> > case where we keep the containers' files on one partition, with each container in
> > its own chroot environment.
> 
> And you want a per-container dcache limit? Will the containers share
> the same superblock?

Yes, and that's one of the problems with the "arbitrary container"
approach to controlling the dentry cache size. Arbitrary containers
don't map easily to predictable and scalable LRU and reclaim
implementations. Hence right now the container scope is limited to
per-superblock.

> Couldn't you simply do per-container "struct
> kmem_accounted_cache" in struct superblock?

Probably could do it that way, but it's still not really an
integrated solution. What we'll end up with is this LRU structure:

struct lru_node {
	struct list_head		lru;
	spinlock_t			lock;
	long				nr_items;
} ____cacheline_aligned_in_smp;

struct lru {
	struct kmem_accounted_cache	*cache;
	struct lru_node			lru_node[MAX_NUMNODES];
	nodemask_t			active_nodes;
	int (*isolate_item)(struct list_head *item);
	int (*dispose)(struct list_head *list);
}

Where the only thing that the lru->cache is used for is getting the
number of items allocated to the cache. Seems relatively pointless
to make that statistic abstraction for just a single value that we
can get via a simple per-cpu counter...

Then, when you consider SLUB has this structure for every individual
slab cache:

struct kmem_cache_node {
        spinlock_t list_lock;   /* Protect partial list and nr_partial */
        unsigned long nr_partial;
        struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG
        atomic_long_t nr_slabs;
        atomic_long_t total_objects;
        struct list_head full;
#endif
};

you can see why tight integration of the per-node LRU infrastructure
is appealing - there's no unnecessary duplication and the accounting
is done in the right spot. It also means there is only one shrinker
implementation for all slabs, with a couple of simple per-slab
callbacks for isolating objects for disposal and then to dispose of
them. This would mean that most slab caches that use shrinkers would
no longer need to implement their own LRU, get LRU scalability and
node-aware reclaim for free, have built in size limits, etc.
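
With those two callbacks, the single generic shrinker could be roughly the
following (a sketch over the struct lru above, not working code: it assumes
isolate_item() returns nonzero for objects that must be skipped, and it elides
the locking and batching subtleties a real implementation needs):

	static long lru_shrink(struct lru *lru, long nr_to_scan)
	{
		long freed = 0;
		int node;

		for_each_node_mask(node, lru->active_nodes) {
			struct lru_node *ln = &lru->lru_node[node];
			struct list_head *item, *next;
			LIST_HEAD(dispose);

			spin_lock(&ln->lock);
			list_for_each_safe(item, next, &ln->lru) {
				if (nr_to_scan-- <= 0)
					break;
				/* cache decides if this object can go now */
				if (lru->isolate_item(item))
					continue;
				list_move(item, &dispose);
				ln->nr_items--;
			}
			spin_unlock(&ln->lock);

			/* free the isolated batch outside the LRU lock */
			freed += lru->dispose(&dispose);
		}
		return freed;
	}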

And FWIW, integrating the LRU shrinker mechanism into the slab cache
also provides the mechanisms needed for capping the size of the
cache as well as slab defragmentation.  Much smarter things can be
done when you know both the age and the locality of objects. e.g.
there's no point preventing allocation from a slab due to maximum
object count limitations if there are partial pages in the slab
cache because the allocation can be done without increasing memory
footprint.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 0/4] Per-container dcache limitation
  2011-08-14 15:13 ` Glauber Costa
  (?)
  (?)
@ 2011-08-17  5:43 ` Dave Chinner
  2011-08-17 18:44   ` Glauber Costa
  2011-08-17 18:44   ` Glauber Costa
  -1 siblings, 2 replies; 65+ messages in thread
From: Dave Chinner @ 2011-08-17  5:43 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov,
	Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley

On Sun, Aug 14, 2011 at 07:13:48PM +0400, Glauber Costa wrote:
> Hello,
> 
> This series is just like v2, except it addresses
> Eric's comments regarding percpu variables.
> 
> Let me know if there are further comments, and
> I'll promply address them as well. Otherwise,
> I feel this is ready for inclusion

Just out of curiosity, one thing I've noticed about dentries is
that in general at any given point in time most dentries are unused.
Under the workloads I'm testing, even when I have a million cached
dentries, I only have roughly 7,000 accounted as used.  That is, most
of the dentries in the system are on a LRU and accounted in
sb->s_nr_dentry_unused of their owner superblock.

So rather than introduce a bunch of new infrastructure to track the
number of dentries allocated, why not simply limit the number of
dentries allowed on the LRU? We already track that, and the shrinker
already operates on the LRU, so we don't really need any new
infrastructure.

The limiting can be done lazily - we don't need to limit the growth of
dentries until we start to run out of memory. If the superblock
shrinker is aware of the limits, then when it gets called by memory
reclaim it can do all the work of reducing the number of items on
the LRU down to the threshold at that time.

IOWs, the limit has no impact on performance until memory is scarce,
at which time memory reclaim enforces the limits on LRU size and
clean up happens automatically.
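
In code, the lazy enforcement could be as small as the sketch below, run from
the per-superblock shrinker (illustrative only: s_nr_dentry_max is a made-up
field, and prune_dcache_sb() stands in for whatever per-sb LRU pruning helper
is available):

	static void sb_enforce_dentry_limit(struct super_block *sb)
	{
		/* trim the unused-dentry LRU back under the ceiling */
		long excess = sb->s_nr_dentry_unused - sb->s_nr_dentry_max;

		if (excess > 0)
			prune_dcache_sb(sb, excess);
	}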

This also avoids all the problems of setting a limit lower than the
number of active dentries required for the workload (i.e. avoids
spurious ENOMEM errors trying to allocate dentries), allows
overcommitment when memory is plentiful (which will benefit
performance) but it brings the caches back to defined limits when
memory is not plentiful (which solves the problem you are having).

That seems like a much better solution to me, because it doesn't
impact the normal working of workloads in the containers and avoids
having to work out what the correct minimum size of the cache for
each workload is. It's much easier to work out how many extra cached
dentries you want the container to be able to have when memory is
scarce...

What do you think?

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 0/4] Per-container dcache limitation
  2011-08-17  5:43 ` Dave Chinner
  2011-08-17 18:44   ` Glauber Costa
@ 2011-08-17 18:44   ` Glauber Costa
  2011-08-18  1:27       ` Dave Chinner
       [not found]     ` <4E4C0C25.3-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  1 sibling, 2 replies; 65+ messages in thread
From: Glauber Costa @ 2011-08-17 18:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov,
	Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley

On 08/16/2011 10:43 PM, Dave Chinner wrote:
> On Sun, Aug 14, 2011 at 07:13:48PM +0400, Glauber Costa wrote:
>> Hello,
>>
>> This series is just like v2, except it addresses
>> Eric's comments regarding percpu variables.
>>
>> Let me know if there are further comments, and
>> I'll promply address them as well. Otherwise,
>> I feel this is ready for inclusion

Hi David,

I am not answering everything now, since I'm travelling, but let me get 
to this one:

> Just out of curiosity, one thing I've noticed about dentries is
> that in general at any given point in time most dentries are unused.
> Under the workloads I'm testing, even when I have a million cached
> dentries, I only have roughly 7,000 accounted as used.  That is, most
> of the dentries in the system are on a LRU and accounted in
> sb->s_nr_dentry_unused of their owner superblock.
>
> So rather than introduce a bunch of new infrastructure to track the
> number of dentries allocated, why not simply limit the number of
> dentries allowed on the LRU? We already track that, and the shrinker
> already operates on the LRU, so we don't really need any new
> infrastructure.
Because this only works well for cooperative workloads. And we can't
really assume that in the virtualization world. One container can come
up with a bogus workload - not even hard to write - whose sole purpose
is punishing everyone it shares resources with.

That's why we're putting limits on pinnable kernel memory. Assuming
well-behaved workloads won't do.

>
> The limiting can be done lazily - we don't need to limit the growth of
> dentries until we start to run out of memory. If the superblock
> shrinker is aware of the limits, then when it gets called by memory
> reclaim it can do all the work of reducing the number of items on
> the LRU down to the threshold at that time.

Well, this idea itself can be considered, independent of which path
we're taking. We can, if we want, allow the dentry cache to grow
indefinitely while we're not under memory pressure. But it kinda defies
the purpose of a hard limit...

If we allow the dcache to grow over the hard cap just because memory is
plentiful, then when it is *not* plentiful, we might be full of pinned
entries that we can't release - and screw up somebody else's container,
one that behaved well the whole time.

>
> IOWs, the limit has no impact on performance until memory is scarce,
> at which time memory reclaim enforces the limits on LRU size and
> clean up happens automatically.
>
> This also avoids all the problems of setting a limit lower than the
> number of active dentries required for the workload (i.e. avoids
> spurious ENOMEM errors trying to allocate dentries), allows
> overcommitment when memory is plentiful (which will benefit
> performance) but it brings the caches back to defined limits when
> memory is not plentiful (which solves the problem you are having).
No, this is not really the problem we're having.
See above.

About ENOMEM, I don't really see what's wrong with it here. For a
container, running out of its assigned kernel memory should be exactly
the same as running out of real physical memory. I do agree that it
changes the feel of the system a little, because it then happens more
often. But it is still right in principle.

>
> That seems like a much better solution to me, because it doesn't
> impact the normal working of workloads in the containers and avoids
> having to work out what the correct minimum size of the cache for
> each workload is. It's much easier to work out how many extra cached
> dentries you want the container to be able to have when memory is
> scarce...
>
> What do you think?

I think that for the reasons I went through above, dcache limiting and 
LRU size capping are really not equivalent in terms of solving this problem.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 0/4] Per-container dcache limitation
       [not found]     ` <4E4C0C25.3-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-08-18  1:27       ` Dave Chinner
  0 siblings, 0 replies; 65+ messages in thread
From: Dave Chinner @ 2011-08-18  1:27 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Pavel Emelyanov,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Hugh Dickins, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	James Bottomley, Dave Hansen, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Wed, Aug 17, 2011 at 11:44:53AM -0700, Glauber Costa wrote:
> On 08/16/2011 10:43 PM, Dave Chinner wrote:
> >On Sun, Aug 14, 2011 at 07:13:48PM +0400, Glauber Costa wrote:
> >>Hello,
> >>
> >>This series is just like v2, except it addresses
> >>Eric's comments regarding percpu variables.
> >>
> >>Let me know if there are further comments, and
> >>I'll promply address them as well. Otherwise,
> >>I feel this is ready for inclusion
> 
> Hi David,
> 
> I am not answering everything now, since I'm travelling, but let me
> get to this one:
> 
> >Just out of curiosity, one thing I've noticed about dentries is
> >that in general at any given point in time most dentries are unused.
> >Under the workloads I'm testing, even when I have a million cached
> >dentries, I only have roughly 7,000 accounted as used.  That is, most
> >of the dentries in the system are on a LRU and accounted in
> >sb->s_nr_dentry_unused of their owner superblock.
> >
> >So rather than introduce a bunch of new infrastructure to track the
> >number of dentries allocated, why not simply limit the number of
> >dentries allowed on the LRU? We already track that, and the shrinker
> >already operates on the LRU, so we don't really need any new
> >infrastructure.
> Because this only works well for cooperative workloads. And we can't
> really assume that in the virtualization world. One container can
> come up with a bogus workload - not even hard to write - whose sole
> purpose is punishing everyone it shares resources with.

Sure, but as I've said before you can prevent the container from
consuming too many dentries (via a hard limit) simply by adding an
inode quota per container.  This is exactly the sort of
uncooperative behaviour filesystem quotas were invented to
prevent.

Perhaps we should separate the DOS case from the normal
(co-operative) use case.

As I mentioned previously, your inode-allocation-based DOS example
(while :; do mkdir x; cd x; done type cases) is trivial to prevent
with quotas. It was claimed that it was not possible to prevent with
filesystem quotas; I left proving that as an exercise for the
reader, but I feel I need to re-iterate my point with an example.

That is, if you can't create a million inodes in the container, you
can't instantiate a million dentries in the container.  For example,
use project quotas on XFS to create directory tree containers with
hard limits on the number of inodes:

$ cat /etc/projects 
12345:/mnt/scratch/projects/foo
$ cat /etc/projid 
foo:12345
$ sudo mount -o prjquota,delaylog,nobarrier,logbsize=262144,inode64 /dev/vda /mnt/scratch
$ mkdir -p /mnt/scratch/projects/foo
$ sudo xfs_quota -x -c "project -s foo" /mnt/scratch
Setting up project foo (path /mnt/scratch/projects/foo)...
Setting up project foo (path /mnt/scratch/projects/foo)...
Processed 2 (/etc/projects and cmdline) paths for project foo with recursion depth infinite (-1).
$ sudo xfs_quota -x -c "limit -p ihard=1436 foo" /mnt/scratch
$ sudo xfs_quota -c "quota -p foo" /mnt/scratch
$ cd /mnt/scratch/projects/foo/
$ ~/src/fs_mark-3.3/dir-depth 
count 1435, err -1, err Disk quota exceeded pwd /mnt/scratch/projects/foo/x/x/x/[...]/x
(the full pwd, 1435 nested "x" directories, is trimmed here)
$ 

It stopped at 1435 directories because the container
(/mnt/scratch/project/foo) ran out of inodes in its quota. No DOS
there. And rather than an ENOMEM error (could be caused by
anything) the error is EDQUOT, which is a clear indication that a
resource limit has been hit. That's a far better failure from a user
perspective because they know -exactly- why their application
failed - the container resource limits are too low....

IOWs, you don't need to touch the dentry cache -at all- to provide
the per-subtree hard resource limiting you are trying to achieve -
filesystem quotas can already achieve that for you. Project quotas
used in this manner (as directory tree quotas) provide exactly the
"per-subtree" hard resource limiting that you were trying to achieve
with your original dentry mobs proposal.

> >The limiting can be done lazily - we don't need to limit the growth of
> >dentries until we start to run out of memory. If the superblock
> >shrinker is aware of the limits, then when it gets called by memory
> >reclaim it can do all the work of reducing the number of items on
> >the LRU down to the threshold at that time.
> 
> Well, this idea itself can be considered, independently of which
> path we're taking. We can, if we want, allow the dentry cache to
> grow indefinitely if we're not under memory pressure. But it kinda
> defies the purpose of a hard limit...

See my comments above about filesystem quotas providing hard limits.

> >IOWs, the limit has no impact on performance until memory is scarce,
> >at which time memory reclaim enforces the limits on LRU size and
> >clean up happens automatically.
> >
> >This also avoids all the problems of setting a limit lower than the
> >number of active dentries required for the workload (i.e. avoids
> >spurious ENOMEM errors trying to allocate dentries), allows
> >overcommitment when memory is plentiful (which will benefit
> >performance) but it brings the caches back to defined limits when
> >memory is not plentiful (which solves the problem you are having).
> No, this is not really the problem we're having.
> See above.
> 
> About ENOMEM, I don't really see what's wrong with them here.

Your backup program runs inside the container. Filesystem traversal
balloons the dentry cache footprint, and so it is likely to trigger
spurious ENOMEM when trying to read files in the container because
it can't allocate a dentry for random files as it traverses. That'll
be fun when it comes to restoring backups and discovering they
aren't complete....

There's also the "WTF caused the ENOMEM error" problem I mentioned
earlier....

> For a
> container, running out of its assigned kernel memory should be
> exactly the same as running out of real physical memory. I do agree
> that it changes the feel of the system a little bit, because it
> then happens more often. But it is still right in principle.

The difference is the degree - when the system runs out of memory,
it tries -really hard- before failing the allocation. Indeed, it'll
swap, it'll free memory in other subsystems, it'll back off on disk
congestion, it will try multiple times to free memory, escalating
priority each time it retries. IOWs, it jumps through all sorts of
hoops to free memory before it finally fails. And then the memory,
more often than not, comes from some subsystem other than the dentry
cache, so it is rare that a dentry allocation actually relies on the
dentry cache (and only the dentry cache) being shrunk to provide
memory for the new dentry.

Your dentry hard limit has no fallback or other mechanisms to try
- if the VFS caches cannot be shrunk immediately, then ENOMEM will
occur.  There's no retries, there's no waiting for disk congestion
to clear, there's no backoff, there's no increase in reclaim
desperation as previous attempts to free dentries fail. This greatly
increases the chances of ENOMEM from _d_alloc() way above what a
normal machine would see because it doesn't have any of the
functionality that memory reclaim has. And, fundamentally, that sort
of complexity doesn't belong in the dentry cache...
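
To make the contrast concrete, here is a rough sketch of what a hard
limit check on the allocation path amounts to. The field names
(s_nr_dentry, s_nr_dentry_max) are my own invention for illustration,
not taken from your patches:

/*
 * Hypothetical hard-limit check on the dentry allocation path.
 * Note the single prune attempt: no retry loop, no congestion
 * backoff, no escalating scan priority before giving up.
 */
static struct dentry *d_alloc_limited(struct dentry *parent,
				      const struct qstr *name)
{
	struct super_block *sb = parent->d_sb;

	if (percpu_counter_read_positive(&sb->s_nr_dentry) >=
	    sb->s_nr_dentry_max) {
		prune_dcache_sb(sb, 128);	/* one shot at the LRU */
		if (percpu_counter_read_positive(&sb->s_nr_dentry) >=
		    sb->s_nr_dentry_max)
			return NULL;	/* callers report -ENOMEM */
	}
	return d_alloc(parent, name);
}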

Another interesting case to consider is internally fragmented dentry
cache slabs, where the active population of the pages is sparse.
This sort of population density is quite common on machines with
sustained long term multiple workload usage (exactly what you'd
expect on a containerised system). Hence dentry allocation can be
done without increasing memory footprint at all. Likewise, freeing
dentries won't free any memory at all. In this case, what has your
hard limit bought you? An ENOMEM error in a situation where memory
allocation is actually free from a resource consumption perspective.

These are the sorts of corner case problems that hard limits on
cache sizes have. That's the problem I see with the hard limit
approach: it looks simple, but it is full of corner cases when you
look more deeply. Users are going to hit these corner cases and
want to fix those "problems". We'll have to try to fix them without
really even being able to reproduce them reliably. We'll end up
growing heuristics to try to detect when problems are about to
happen, and complexity to try to avoid those corner case problems.
We'll muddle along with something that sort of works for the cases
we can reproduce, but ultimately is untestable and unverifiable. In
contrast, a lazy LRU limiting solution is simple to implement and
verify, and has none of the warts that hard limiting exposes to user
applications.
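
To be concrete about the alternative, here is a rough sketch of the
lazy scheme hooked into the per-sb shrinker. sb->s_nr_dentry_unused
already exists; s_dentry_lru_limit is an invented knob, purely for
illustration:

/*
 * Hypothetical lazy LRU limiting in the per-sb shrinker. Nothing is
 * enforced at allocation time; when memory reclaim calls in, first
 * trim the unused-dentry LRU back under its configured ceiling, then
 * do whatever proportional scanning reclaim asked for.
 */
static void shrink_sb_dcache(struct super_block *sb, int nr_to_scan)
{
	int excess = sb->s_nr_dentry_unused - sb->s_dentry_lru_limit;

	if (excess > 0)
		prune_dcache_sb(sb, excess);	/* enforce the ceiling */
	if (nr_to_scan > 0)
		prune_dcache_sb(sb, nr_to_scan);	/* normal reclaim */
}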

Hence I'd prefer to avoid all the warts of hard limiting by ignoring
the DOS case that leads to requiring a hard limit, as it can be
solved by other existing means. Limiting the size of the inactive
cache (which generally dominates cache usage) seems like a much
lower impact manner of achieving the same thing.

Like I said previously - I've had people asking me whether limiting
the size of the inode cache is possible for the past 5 years, and
all their use cases are solved by the lazy mechanism I described. I
think that most of the OpenVZ dcache size problems will also go away
with the lazy solution as well, as most workloads with a large
dentry cache footprint don't actively reference (and therefore pin)
the entire working set at the same time....

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 0/4] Per-container dcache limitation
  2011-08-18  1:27       ` Dave Chinner
  (?)
@ 2011-08-22 11:42       ` Glauber Costa
  -1 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2011-08-22 11:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrea Arcangeli, Rik van Riel, Nick Piggin, Pavel Emelyanov,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Hugh Dickins, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	James Bottomley, Dave Hansen, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On 08/17/2011 10:27 PM, Dave Chinner wrote:
> On Wed, Aug 17, 2011 at 11:44:53AM -0700, Glauber Costa wrote:
>> On 08/16/2011 10:43 PM, Dave Chinner wrote:
>>> On Sun, Aug 14, 2011 at 07:13:48PM +0400, Glauber Costa wrote:
>>>> Hello,
>>>>
>>>> This series is just like v2, except it addresses
>>>> Eric's comments regarding percpu variables.
>>>>
>>>> Let me know if there are further comments, and
>>>> I'll promply address them as well. Otherwise,
>>>> I feel this is ready for inclusion
>>
>> Hi David,
>>
>> I am not answering everything now, since I'm travelling, but let me
>> get to this one:
>>
>>> Just out of curiosity, one thing I've noticed about dentries is
>>> that in general at any given point in time most dentries are unused.
>>> Under the workloads I'm testing, even when I have a million cached
>>> dentries, I only have roughly 7,000 accounted as used.  That is, most
>>> of the dentries in the system are on an LRU and accounted in
>>> sb->s_nr_dentry_unused of their owner superblock.
>>>
>>> So rather than introduce a bunch of new infrastructure to track the
>>> number of dentries allocated, why not simply limit the number of
>>> dentries allowed on the LRU? We already track that, and the shrinker
>>> already operates on the LRU, so we don't really need any new
>>> infrastructure.
>> Because this only works well for cooperative workloads. And we can't
>> really assume that in the virtualization world. One container can
>> come up with a bogus workload - not even hard to write - that has
>> the sole purpose of punishing everyone it shares resources with.
>
> Sure, but as I've said before you can prevent the container from
> consuming too many dentries (via a hard limit) simply by adding an
> inode quota per container.  This is exactly the sort of
> uncooperative behaviour filesystem quotas were invented to
> prevent.
>
> Perhaps we should separate the DOS case from the normal
> (co-operative) use case.
>
> As I mentioned previously, your inode allocation based DOS example
> (while :; do mkdir x; cd x; done type cases) is trivial to prevent
> with quotas. It was claimed that it was not possible to prevent with
> filesystem quotas; I left proving otherwise as an exercise for the
> reader, but I feel I need to reiterate my point with an example.
>
> That is, if you can't create a million inodes in the container, you
> can't instantiate a million dentries in the container.  For example,
> use project quotas on XFS to create directory tree containers with
> hard limits on the number of inodes:

David,

The dentry -> inode relationship is an N:1 relationship. Therefore,
it is hard to believe that your example below would still work if we
were trying to fill the cache through link operations, instead of
operations like mkdir, which enforce a 1:1 relationship.
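
A trivial userspace sketch of what I mean (hypothetical, the file
names are invented): a single regular file - one inode charged
against any inode quota - but as many dentries as you care to
instantiate:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat st;
	char name[32];
	int i;

	close(open("base", O_CREAT | O_WRONLY, 0644));
	for (i = 0; i < 1000000; i++) {
		snprintf(name, sizeof(name), "link-%d", i);
		if (link("base", name) < 0)
			break;	/* e.g. filesystem nlink limit hit */
		stat(name, &st);	/* pull the dentry into the dcache */
	}
	printf("%d names, still one inode\n", i);
	return 0;
}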

Capping the dentry numbers, OTOH, caps the # of inodes as well.
Although we *can* have inodes lying around in the caches without an
associated dentry at some point in time, we cannot have inodes
*pinned* in the cache without an associated dentry. So they will go
away soon enough.

So maybe it is the other way around here, and it is the people who
want inode capping who should model it after dentry cache capping.
>
> $ cat /etc/projects
> 12345:/mnt/scratch/projects/foo
> $ cat /etc/projid
> foo:12345
> $ sudo mount -o prjquota,delaylog,nobarrier,logbsize=262144,inode64 /dev/vda /mnt/scratch
> $ mkdir -p /mnt/scratch/projects/foo
> $ sudo xfs_quota -x -c "project -s foo" /mnt/scratch
> Setting up project foo (path /mnt/scratch/projects/foo)...
> Setting up project foo (path /mnt/scratch/projects/foo)...
> Processed 2 (/etc/projects and cmdline) paths for project foo with recursion depth infinite (-1).
> $ sudo xfs_quota -x -c "limit -p ihard=1436 foo" /mnt/scratch
> $ sudo xfs_quota -c "quota -p foo" /mnt/scratch
> $ cd /mnt/scratch/projects/foo/
> $ ~/src/fs_mark-3.3/dir-depth
> count 1435, err -1, err Disk quota exceeded pwd /mnt/scratch/projects/foo/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x [... rest of the nested x/ path trimmed ...]
> $
>
> It stopped at 1435 directories because the container
> (/mnt/scratch/project/foo) ran out of inodes in its quota. No DOS
> there.

As I said above, mkdir enforces a 1:1 relationship (because
directories can't be hard linked), which can't be guaranteed in the
general case. In the general case, one can have a dentry cache
bigger than Linus' ego while instantiating only one inode in the
process.

> And rather than an ENOMEM error (could be caused by
> anything) the error is EDQUOT, which is a clear indication that a
> resource limit has been hit. That's a far better failure from a user
> perspective because they know -exactly- why their application
> failed - the container resource limits are too low....
Well, if this is the problem, I am happy to return EDQUOT if we fail
to find room for more dentries, or anything else we can agree upon,
instead of ENOMEM.

>
> IOWs, you don't need to touch the dentry cache -at all- to provide
> the per-subtree hard resource limiting you are trying to achieve -
> filesystem quotas can already achieve that for you. Project quotas
> used in this manner (as directory tree quotas) provide exactly the
> "per-subtree" hard resource limiting that you were trying to achieve
> with your original dentry mobs proposal.
Well, I myself am done with the per-subtree proposal for now. I have
been completely fine with per-sb for a while now.

>
>>> The limiting can be done lazily - we don't need to limit the growth of
>>> dentries until we start to run out of memory. If the superblock
>>> shrinker is aware of the limits, then when it gets called by memory
>>> reclaim it can do all the work of reducing the number of items on
>>> the LRU down to the threshold at that time.
>>
>> Well, this idea itself can be considered, independently of which
>> path we're taking. We can, if we want, allow the dentry cache to
>> grow indefinitely if we're not under memory pressure. But it kinda
>> defies the purpose of a hard limit...
>
> See my comments above about filesystem quotas providing hard limits.
>
>>> IOWs, the limit has no impact on performance until memory is scarce,
>>> at which time memory reclaim enforces the limits on LRU size and
>>> clean up happens automatically.
>>>
>>> This also avoids all the problems of setting a limit lower than the
>>> number of active dentries required for the workload (i.e. avoids
>>> spurious ENOMEM errors trying to allocate dentries), allows
>>> overcommitment when memory is plentiful (which will benefit
>>> performance) but it brings the caches back to defined limits when
>>> memory is not plentiful (which solves the problem you are having).
>> No, this is not really the problem we're having.
>> See above.
>>
>> About ENOMEM, I don't really see what's wrong with them here.
>
> Your backup program runs inside the container. Filesystem traversal
> balloons the dentry cache footprint, and so it is likely to trigger
> spurious ENOMEM when trying to read files in the container because
> it can't allocate a dentry for random files as it traverses. That'll
> be fun when it comes to restoring backups and discovering they
> aren't complete....
>
> There's also the "WTF caused the ENOMEM error" problem I mentioned
> earlier....
>
>> For a
>> container, running out of its assigned kernel memory should be
>> exactly the same as running out of real physical memory. I do agree
>> that it changes the feel of the system a little bit, because it
>> then happens more often. But it is still right in principle.
>
> The difference is the degree - when the system runs out of memory,
> it tries -really hard- before failing the allocation. Indeed, it'll
> swap, it'll free memory in other subsystems, it'll back off on disk
> congestion, it will try multiple times to free memory, escalating
> priority each time it retries. IOWs, it jumps through all sorts of
> hoops to free memory before it finally fails. And then the memory,
> more often than not, comes from some subsystem other than the dentry
> cache, so it is rare that a dentry allocation actually relies on the
> dentry cache (and only the dentry cache) being shrunk to provide
> memory for the new dentry.

This is an apples-to-oranges comparison. If, instead of using the
mechanism I proposed, we go for a quota-based mechanism like you
mentioned, we'll fail just as often - just with EDQUOT instead of
ENOMEM.

Hmm... while writing that, I just looked back at the code, and it
seems it will be hard to return anything but ENOMEM, since it is part
of the interface contract. OTOH, the inode allocation function goes
for the same kind of contract - returning NULL in case of error -
meaning that everything of this kind that does not involve the
filesystem will end up the same way, be it in the icache or the dcache.
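
For reference, the caller-side pattern I mean (a representative
sketch, not any specific call site):

	struct dentry *dentry;

	/*
	 * d_alloc() can only signal failure by returning NULL, so a
	 * caller has no choice but to report every failure as
	 * -ENOMEM, whatever the underlying cause was.
	 */
	dentry = d_alloc(parent, &qname);
	if (!dentry)
		return ERR_PTR(-ENOMEM);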

> Your dentry hard limit has no fallback or other mechanisms to try
> - if the VFS caches cannot be shrunk immediately, then ENOMEM will
> occur.  There's no retries, there's no waiting for disk congestion
> to clear, there's no backoff, there's no increase in reclaim
> desperation as previous attempts to free dentries fail. This greatly
> increases the chances of ENOMEM from _d_alloc() way above what a
> normal machine would see because it doesn't have any of the
> functionality that memory reclaim has. And, fundamentally, that sort
> of complexity doesn't belong in the dentry cache...

I don't see how/why a user application should care. An error means
"Hey Mr. Userspace, something went wrong", not "Hey Mr. Userspace,
sorry, we tried really hard but still could not do it".

As far as a container is concerned, the only way to mimic the behavior
you described would be to allow a single container to use up at most X 
bytes of general kernel memory. So when the dcache reaches the wall, it 
can borrow from somewhere else. Not that I am considering this...

>
> Another interesting case to consider is internally fragmented dentry
> cache slabs, where the active population of the pages is sparse.
> This sort of population density is quite common on machines with
> sustained long term multiple workload usage (exactly what you'd
> expect on a containerised system). Hence dentry allocation can be
> done without increasing memory footprint at all. Likewise, freeing
> dentries won't free any memory at all. In this case, what has your
> hard limit bought you? An ENOMEM error in a situation where memory
> allocation is actually free from a resource consumption perspective.

Again, I don't think those hard limits are about used memory. So if
freeing a dentry does not free memory, I'm still fine with that. If
the kernel as a whole needs memory later, it can do something to
reclaim it - *unless* I hold a dentry reference. So the solution to
me seems to be not allowing more than X to be held in the first place.

Again: I couldn't care less about how much *memory* it is actually using 
at a certain point in time.
>
> These are the sorts of corner case problems that hard limits on
> cache sizes have. That's the problem I see with the hard limit
> approach: it looks simple, but it is full of corner cases when you
> look more deeply. Users are going to hit these corner cases and
> want to fix those "problems". We'll have to try to fix them without
> really even being able to reproduce them reliably. We'll end up
> growing heuristics to try to detect when problems are about to
> happen, and complexity to try to avoid those corner case problems.
> We'll muddle along with something that sort of works for the cases
> we can reproduce, but ultimately is untestable and unverifiable. In
> contrast, a lazy LRU limiting solution is simple to implement and
> verify, and has none of the warts that hard limiting exposes to user
> applications.

What you described does not seem to me like a corner case.
"By using this option you can't use more than X entries; if you do,
you'll fail" sounds pretty precise to me.

>
> Hence I'd prefer to avoid all the warts of hard limiting by ignoring
> the DOS case that leads to requiring a hard limit, as it can be
> solved by other existing means. Limiting the size of the inactive
> cache (which generally dominates cache usage) seems like a much
> lower impact manner of achieving the same thing.

Again, I understand you, but I don't think we're really solving the same 
thing.

> Like I said previously - I've had people asking me whether limiting
> the size of the inode cache is possible for the past 5 years, and
> all their use cases are solved by the lazy mechanism I described. I
> think that most of the OpenVZ dcache size problems will also go away
> with the lazy solution as well, as most workloads with a large
> dentry cache footprint don't actively reference (and therefore pin)
> the entire working set at the same time....

Except for the malicious ones, of course.

> Cheers,

Thank you very much for your time, Dave!

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 0/4] Per-container dcache limitation
  2011-08-18  1:27       ` Dave Chinner
@ 2011-08-22 11:42         ` Glauber Costa
  -1 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2011-08-22 11:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov,
	Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley

On 08/17/2011 10:27 PM, Dave Chinner wrote:
> On Wed, Aug 17, 2011 at 11:44:53AM -0700, Glauber Costa wrote:
>> On 08/16/2011 10:43 PM, Dave Chinner wrote:
>>> On Sun, Aug 14, 2011 at 07:13:48PM +0400, Glauber Costa wrote:
>>>> Hello,
>>>>
>>>> This series is just like v2, except it addresses
>>>> Eric's comments regarding percpu variables.
>>>>
>>>> Let me know if there are further comments, and
>>>> I'll promply address them as well. Otherwise,
>>>> I feel this is ready for inclusion
>>
>> Hi David,
>>
>> I am not answering everything now, since I'm travelling, but let me
>> get to this one:
>>
>>> Just out of couriousity, one thing I've noticed about dentries is
>>> that in general at any given point in time most dentries are unused.
>>> Under the workloads I'm testing, even when I have a million cached
>>> dentries, I only have roughly 7,000 accounted as used.  That is, most
>>> of the dentries in the system are on a LRU and accounted in
>>> sb->s_nr_dentry_unused of their owner superblock.
>>>
>>> So rather than introduce a bunch of new infrastructure to track the
>>> number of dentries allocated, why not simply limit the number of
>>> dentries allowed on the LRU? We already track that, and the shrinker
>>> already operates on the LRU, so we don't really need any new
>>> infrastructure.
>> Because this only works well for cooperative workloads. And we can't
>> really assume that in the virtualization world. One container can
>> come up with a bogus workload - not even hard to write - that has
>> the sole purpose of punishing every resource sharer of him.
>
> Sure, but as I've said before you can prevent the container from
> consuming too many dentries (via a hard limit) simply by adding a
> inode quota per container.  This is exactly the sort of
> uncooperative behaviour filesystem quotas were invented to
> prevent.
>
> Perhaps we should separate the DOS case from the normal
> (co-operative) use case.
>
> As i mentioned previously, your inode allocation based DOS (while
> (1); mkdir x; cd x; done type cases) example is trivial to prevent
> with quotas. It was claimed that is was not possible to prevent with
> filesystem quotas, I left proving that as an exercise for the
> reader,but I feel I need to re-iterate my point with an example.
>
> That is, if you can't create a million inodes in the container, you
> can't instantiate a million dentries in the container.  For example,
> use project quotas on XFS to create directory tree containers with
> hard limits on the number of inodes:

David,

The dentry -> inode relationship is a N:1 relationship. Therefore, it is
hard to believe that your example below would still work if we were 
trying to fill the cache through link operations, instead of operations 
like mkdir, that enforce a 1:1 relationship.

Caping the dentry numbers, OTOH, caps the # of inodes as well. Although 
we *do* can have inodes lying around in the caches without an associated 
dentry at some point in time, we cannot have inodes *pinned* into the 
cache without an associated dentry. So they will soon enough go away.

So maybe it is the other way around here, and it is people that wants an 
inode capping that should model it after a dentry cache capping.
>
> $ cat /etc/projects
> 12345:/mnt/scratch/projects/foo
> $ cat /etc/projid
> foo:12345
> $ sudo mount -o prjquota,delaylog,nobarrier,logbsize=262144,inode64 /dev/vda /mnt/scratch
> $ mkdir -p /mnt/scratch/projects/foo
> $ sudo xfs_quota -x -c "project -s foo" /mnt/scratch
> Setting up project foo (path /mnt/scratch/projects/foo)...
> Setting up project foo (path /mnt/scratch/projects/foo)...
> Processed 2 (/etc/projects and cmdline) paths for project foo with recursion depth infinite (-1).
> $ sudo xfs_quota -x -c "limit -p ihard=1436 foo" /mnt/scratch
> $ sudo xfs_quota -c "quota -p foo" /mnt/scratch
> $ cd /mnt/scratch/projects/foo/
> $ ~/src/fs_mark-3.3/dir-depth
> count 1435, err -1, err Disk quota exceeded pwd /mnt/scratch/projects/foo/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/
x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/
x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x
> $
>
> It stopped at 1435 directories because the container
> (/mnt/scratch/projects/foo) ran out of inodes in its quota. No DOS
> there.

As said above, mkdir enforces a 1:1 dentry-to-inode relationship 
(because directories can't be hard linked), and that relationship can't 
be guaranteed in the general case. In the general case, one can have a 
dentry cache bigger than Linus' ego while instantiating only one inode 
in the process.
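
To make that concrete, here is a hedged sketch of such a workload (a 
hypothetical test program, not something posted in this thread): every 
link(2) call instantiates a fresh dentry, while the inode count that a 
project quota watches stays at one.

/* Hypothetical sketch: grow the dentry cache while consuming a single
 * inode.  Hard links add directory entries (and dentries) without
 * allocating inodes, so an inode quota alone never fires. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char name[32];
	long i;
	int fd = open("seed", O_CREAT | O_WRONLY, 0644);

	if (fd < 0)
		return 1;
	close(fd);

	for (i = 0; ; i++) {
		snprintf(name, sizeof(name), "l.%ld", i);
		if (link("seed", name) < 0) {
			perror("link");	/* block quota/ENOSPC, eventually */
			break;
		}
	}
	return 0;
}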

> And rather than an ENOMEM error (could be caused by
> anything) the error is EDQUOT which is a clear indication that a
> resource limit has been hit. That's a far better failure from a user
> perspective because they know -exactly- why their application
> failed - the container resource limits are too low....

Well, if this is the problem, I am happy to return EDQUOT if we fail to 
find room for more dentries, or anything else we can agree upon instead 
of ENOMEM.

>
> IOWs, you don't need to touch the dentry cache -at all- to provide
> the per-subtree hard resource limiting you are trying to achieve -
> filesystem quotas can already achieve that for you. Project quotas
> used in this manner (as directory tree quotas) provide exactly the
> "per-subtree" hard resource limiting that you were trying to achieve
> with your original dentry mobs proposal.

Well, I myself am done with the per-subtree proposal for now. I have 
been completely fine with per-sb for a while now.

>
>>> The limiting can be done lazily - we don't need to limit the growth of
>>> dentries until we start to run out of memory. If the superblock
>>> shrinker is aware of the limits, then when it gets called by memory
>>> reclaim it can do all the work of reducing the number of items on
>>> the LRU down to the threshold at that time.
>>
>> Well, this idea itself can be considered, independent of which path
>> we're taking. We can, if we want, allow the dentry cache to grow
>> indefinitely when there is no memory pressure. But it kinda defies
>> the purpose of a hard limit...
>
> See my comments above about filesystem quotas providing hard limits.
>
>>> IOWs, the limit has no impact on performance until memory is scarce,
>>> at which time memory reclaim enforces the limits on LRU size and
>>> clean up happens automatically.
>>>
>>> This also avoids all the problems of setting a limit lower than the
>>> number of active dentries required for the workload (i.e. avoids
>>> spurious ENOMEM errors trying to allocate dentries), allows
>>> overcommitment when memory is plentiful (which will benefit
>>> performance) but it brings the caches back to defined limits when
>>> memory is not plentiful (which solves the problem you are having).
>> No, this is not really the problem we're having.
>> See above.
>>
>> About ENOMEM, I don't really see what's wrong with them here.
>
> Your backup program runs inside the container. Filesystem traversal
> balloons the dentry cache footprint, and so it is likely to trigger
> spurious ENOMEM when trying to read files in the container because
> it can't allocate a dentry for random files as it traverses. That'll
> be fun when it comes to restoring backups and discovering they
> aren't complete....
>
> There's also the "WTF caused the ENOMEM error" problem I mentioned
> earlier....
>
>> For a
>> container, running out of its assigned kernel memory should be
>> exactly the same as running out of real physical memory. I do agree
>> that it changes the feeling of the system a little bit, because it
>> then happens more often. But it is still right in principle.
>
> The difference is the degree - when the system runs out of memory,
> it tries -really hard- before failing the allocation. Indeed, it'll
> swap, it'll free memory in other subsystems, it'll back off on disk
> congestion, it will try multiple times to free memory, escalating
> priority each time it retries. IOWs, it jumps through all sorts of
> hoops to free memory before it finally fails. And then the memory,
> more often than not, comes from some subsystem other than the dentry
> cache, so it is rare that a dentry allocation actually relies on the
> dentry cache (and only the dentry cache) being shrunk to provide
> memory for the new dentry.
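
(For reference, a condensed paraphrase of the escalation described 
above, modeled on mm/vmscan.c's do_try_to_free_pages() of this era - 
names kept but heavily simplified, so treat it as a sketch rather than 
the literal kernel code:)

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		sc->nr_scanned = 0;
		shrink_zones(priority, zonelist, sc);
		shrink_slab(shrink, sc->nr_scanned, lru_pages);
		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
			goto out;	/* freed enough - allocation succeeds */
		/* back off when previous passes hit disk congestion */
		if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
			congestion_wait(BLK_RW_ASYNC, HZ/10);
	}
	/* every priority level failed - only now does ENOMEM surface */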

This is an apples-to-oranges comparison. If instead of using the mechanism 
I proposed, we go for a quota-based mechanism like you mentioned, we'll 
fail just as often. Just with EDQUOT instead of ENOMEM.

Hmm... while writing that, I looked back at the code, and it seems it 
will be hard to return anything but ENOMEM, since returning NULL on 
failure is part of the interface contract. OTOH, the inode allocation 
function follows the same kind of contract - returning NULL in case of 
error - meaning that everything of this kind that does not involve the 
filesystem will end up the same way, be it in the icache or the dcache.
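
To make the contract point concrete, a hypothetical sketch (the field 
and helper names below are assumptions for illustration, not the posted 
patches):

struct dentry *d_alloc_capped(struct super_block *sb,
			      const struct qstr *name)
{
	/*
	 * s_nr_dentry / s_nr_dentry_max are assumed per-sb members.
	 * __d_alloc() reports failure by returning NULL, so the only
	 * error a caller can propagate from here is -ENOMEM.
	 */
	if (sb->s_nr_dentry_max &&
	    percpu_counter_read_positive(&sb->s_nr_dentry) >=
	    sb->s_nr_dentry_max)
		return NULL;
	return __d_alloc(sb, name);
}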

> Your dentry hard limit has no fallback or other mechanisms to try
> - if the VFS caches cannot be shrunk immediately, then ENOMEM will
> occur.  There are no retries, there's no waiting for disk congestion
> to clear, there's no backoff, there's no increase in reclaim
> desperation as previous attempts to free dentries fail. This greatly
> increases the chances of ENOMEM from _d_alloc() way above what a
> normal machine would see because it doesn't have any of the
> functionality that memory reclaim has. And, fundamentally, that sort
> of complexity doesn't belong in the dentry cache...

I don't see how/why a user application should care. An error means "Hey 
Mr. Userspace, something went wrong", not "Hey Mr. Userspace, sorry, we 
tried really hard, but still could not do it".

As far as a container is concerned, the only way to mimic the behavior 
you described would be to allow a single container to use up at most X 
bytes of general kernel memory. So when the dcache reaches the wall, it 
can borrow from somewhere else. Not that I am considering this...

>
> Another interesting case to consider is internally fragmented dentry
> cache slabs, where the active population of the pages is sparse.
> This sort of population density is quite common on machines with
> sustained long term multiple workload usage (exactly what you'd
> expect on a containerised system). Hence dentry allocation can be
> done without increasing memory footprint at all. Likewise, freeing
> dentries won't free any memory at all. In this case, what has your
> hard limit bought you? An ENOMEM error in a situation where memory
> allocation is actually free from a resource consumption perspective.

Again, I don't think those hard limits are about used memory. So even if 
freeing a dentry does not free memory, I'm still fine with that. If the 
kernel as a whole needs memory later, it can do something to reclaim it 
- *unless* I hold a dentry reference. So the solution, to me, seems to 
be not allowing more than X to be held in the first place.

Again: I couldn't care less about how much *memory* it is actually using 
at a certain point in time.
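
(As a hypothetical illustration of "held": any process in the container 
can pin dentries beyond the shrinker's reach just by keeping file 
descriptors open. The program below is illustrative, not from the 
thread:)

/* Each open fd pins the dentries along its path (positive refcount),
 * so reclaim cannot free them however hard it escalates. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char name[32];
	int i;

	for (i = 0; i < 100000; i++) {
		snprintf(name, sizeof(name), "pin.%d", i);
		if (open(name, O_CREAT | O_RDONLY, 0644) < 0)
			break;	/* stops at RLIMIT_NOFILE, typically */
	}
	pause();	/* hold the references indefinitely */
	return 0;
}
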
>
> These are the sorts of corner case problems that hard limits on
> cache sizes have. That's the problem I see with the hard limit
> approach: it looks simple, but it is full of corner cases when you
> look more deeply. Users are going to hit these corner cases and
> want to fix those "problems". We'll have to try to fix them without
> really even being able to reproduce them reliably. We'll end up
> growing heuristics to try to detect when problems are about to
> happen, and complexity to try to avoid those corner case problems.
> We'll muddle along with something that sort of works for the cases
> we can reproduce, but ultimately is untestable and unverifiable. In
> contrast, a lazy LRU limiting solution is simple to implement and
> verify and has none of the warts that hard limiting exposes to user
> applications.

What you described does not seem to me to be a corner case.
"By using this option you can't use more than X entries; if you do, 
you'll fail" sounds pretty precise to me.

>
> Hence I'd prefer to avoid all the warts of hard limiting by ignoring
> the DOS case that leads to requiring a hard limit as it can be
> solved by other existing means. Limiting the size of the inactive
> cache (which generally dominates cache usage) seems like a much lower
> impact manner of achieving the same thing.

Again, I understand you, but I don't think we're really solving the same 
thing.

> Like I said previously - I've had people asking me whether limiting
> the size of the inode cache is possible for the past 5 years, and
> all their use cases are solved by the lazy mechanism I described. I
> think that most of the OpenVZ dcache size problems will also go away
> with the lazy solution as well, as most workloads with a large
> dentry cache footprint don't actively reference (and therefore pin)
> the entire working set at the same time....

Except for the malicious ones, of course.

> Cheers,

Thank you very much for your time, Dave!

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 4/4] parse options in the vfs level
  2011-08-15  7:09     ` Dave Chinner
@ 2011-08-24  2:19       ` Glauber Costa
  2011-08-24  2:19       ` Glauber Costa
  1 sibling, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2011-08-24  2:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-fsdevel, containers, Pavel Emelyanov,
	Al Viro, Hugh Dickins, Nick Piggin, Andrea Arcangeli,
	Rik van Riel, Dave Hansen, James Bottomley, Eric Dumazet

On 08/15/2011 04:09 AM, Dave Chinner wrote:
> On Sun, Aug 14, 2011 at 07:13:52PM +0400, Glauber Costa wrote:
>> This patch introduces a simple generic vfs option parser.
>> Right now, the only option we have is to limit the size of the dcache.
>>
>> So any user that wants to have a dcache entries limit, can specify:
>>
>>    mount -o whatever_options,vfs_dcache_size=XXX <dev> <mntpoint>
>>
>> It is supposed to work well with remounts, allowing it to change
>> multiple times over the course of the filesystem's lifecycle.
>>
>> I find mount a natural interface for handling filesystem options,
>> so that's what I've chosen. Feel free to yell at it at will if
>> you disagree.
>
> It's already been noted by both myself and Al that there is not a
> 1:1 mapping between mount point and superblocks....

I had skipped that one in my pending patches.

Well, it's already been noted by me as well - and by everybody else.
The problem, however, is not new. Filesystems export fs-specific
options that operate on the superblock as well, and we've been living 
fine with them.

>> @@ -2350,6 +2449,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
>>   	else
>>   		retval = do_new_mount(&path, type_page, flags, mnt_flags,
>>   				      dev_name, data_page);
>> +
>> +	/* bind mounts get to respect their parent's decision */
>> +	if (!retval && !(flags & MS_BIND))
>> +		vfs_set_dcache_size(path.mnt->mnt_sb,
>> +				    vfs_options.vfs_dcache_size);
>> +
>
> And I'm not sure that silently ignoring it in certain cases
> is the best way around that problem.

It is not "in certain cases". It is : 1) When bind mounting, and 2) When 
an error occurs. Pretty well-defined.
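
(For reference, a hypothetical sketch of the kind of generic parser such 
a patch needs, built on the kernel's linux/parser.h helpers. The token 
and function names are illustrative, not the posted code, and a complete 
version would also strsep() the comma-separated option string:)

#include <linux/parser.h>

enum { Opt_vfs_dcache_size, Opt_vfs_err };

static const match_table_t vfs_tokens = {
	{ Opt_vfs_dcache_size,	"vfs_dcache_size=%u" },
	{ Opt_vfs_err,		NULL },
};

/* Peel one VFS-level option off the string in do_mount(), before the
 * filesystem's own parser ever sees it. */
static int vfs_parse_option(char *p, unsigned long *dcache_size)
{
	substring_t args[MAX_OPT_ARGS];
	int n;

	if (match_token(p, vfs_tokens, args) != Opt_vfs_dcache_size)
		return -EINVAL;
	if (match_int(&args[0], &n))
		return -EINVAL;
	*dcache_size = n;
	return 0;
}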

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2011-08-24  2:20 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-14 15:13 [PATCH v3 0/4] Per-container dcache limitation Glauber Costa
2011-08-14 15:13 ` Glauber Costa
     [not found] ` <1313334832-1150-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-08-14 15:13   ` [PATCH v3 1/4] factor out single-shrinker code Glauber Costa
2011-08-14 15:13     ` Glauber Costa
2011-08-15  6:43     ` Dave Chinner
     [not found]     ` <1313334832-1150-2-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-08-15  6:43       ` Dave Chinner
2011-08-14 15:13   ` [PATCH v3 2/4] Keep nr_dentry per super block Glauber Costa
2011-08-14 15:13     ` Glauber Costa
2011-08-14 17:38     ` Eric Dumazet
2011-08-14 17:38       ` Eric Dumazet
     [not found]     ` <1313334832-1150-3-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-08-14 17:38       ` Eric Dumazet
2011-08-14 15:13   ` [PATCH v3 3/4] limit nr_dentries per superblock Glauber Costa
2011-08-14 15:13     ` Glauber Costa
     [not found]     ` <1313334832-1150-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-08-15  7:03       ` Dave Chinner
2011-08-15  7:12       ` Pekka Enberg
2011-08-15  7:03     ` Dave Chinner
2011-08-15  7:12     ` Pekka Enberg
     [not found]       ` <CAOJsxLFYD9DkEU5R+9fZr-uaznteP8-BUC_Ti+FnNFqtbOCHSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-08-15 10:46         ` Dave Chinner
2011-08-15 10:46       ` Dave Chinner
2011-08-15 10:58         ` Pekka Enberg
     [not found]           ` <CAOJsxLGFJmqO-W=itQbO4Mh4DxSD4wrHOC8gQ5bWL5aE1YYeQw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-08-15 11:05             ` Pavel Emelyanov
2011-08-15 11:05           ` Pavel Emelyanov
2011-08-15 11:05             ` Pavel Emelyanov
2011-08-15 11:14             ` Pekka Enberg
2011-08-15 11:14               ` Pekka Enberg
2011-08-15 11:32               ` Pavel Emelyanov
2011-08-15 11:32                 ` Pavel Emelyanov
2011-08-15 11:55                 ` Pekka Enberg
2011-08-15 11:55                   ` Pekka Enberg
2011-08-15 12:12                   ` Pavel Emelyanov
2011-08-15 12:12                     ` Pavel Emelyanov
2011-08-15 12:23                     ` Pekka Enberg
2011-08-15 12:23                       ` Pekka Enberg
2011-08-15 12:37                       ` Pavel Emelyanov
2011-08-15 12:37                         ` Pavel Emelyanov
     [not found]                       ` <CAOJsxLEVfGrzUQv0fOpOyw3AaOLOHcWbvLJL1NdrHS6M2j5o1g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-08-15 12:37                         ` Pavel Emelyanov
     [not found]                     ` <4E490D47.8050105-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-08-15 12:23                       ` Pekka Enberg
     [not found]                   ` <CAOJsxLErFcxuuqnWkRbOAHEFbeGrKp3YMZ-144=m16oBXCHJ2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-08-15 12:12                     ` Pavel Emelyanov
     [not found]                 ` <4E4903C1.9050207-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-08-15 11:55                   ` Pekka Enberg
     [not found]               ` <CAOJsxLFjZffC9i5kQsBWjGy3eWdj3VMB5awz=yPcwgeKb5BG0Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-08-15 11:32                 ` Pavel Emelyanov
2011-08-16  2:11                 ` Dave Chinner
2011-08-16  2:11               ` Dave Chinner
2011-08-16  2:11                 ` Dave Chinner
     [not found]             ` <4E48FD8A.90401-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-08-15 11:14               ` Pekka Enberg
2011-08-15 10:58         ` Pekka Enberg
2011-08-14 15:13   ` [PATCH v3 4/4] parse options in the vfs level Glauber Costa
2011-08-14 15:13     ` Glauber Costa
2011-08-14 15:39     ` Vasiliy Kulikov
2011-08-15  0:03       ` Glauber Costa
2011-08-15  0:03       ` Glauber Costa
     [not found]     ` <1313334832-1150-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-08-14 15:39       ` Vasiliy Kulikov
2011-08-15  7:09       ` Dave Chinner
2011-08-15  7:09     ` Dave Chinner
2011-08-24  2:19       ` Glauber Costa
2011-08-24  2:19       ` Glauber Costa
2011-08-17  5:43   ` [PATCH v3 0/4] Per-container dcache limitation Dave Chinner
2011-08-17  5:43 ` Dave Chinner
2011-08-17 18:44   ` Glauber Costa
2011-08-17 18:44   ` Glauber Costa
2011-08-18  1:27     ` Dave Chinner
2011-08-18  1:27       ` Dave Chinner
2011-08-22 11:42       ` Glauber Costa
2011-08-22 11:42       ` Glauber Costa
2011-08-22 11:42         ` Glauber Costa
     [not found]     ` <4E4C0C25.3-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2011-08-18  1:27       ` Dave Chinner
