Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 00/11] fs/dcache: Limit # of negative dentries
@ 2020-02-26 16:13 Waiman Long
  2020-02-26 16:13 ` [PATCH 01/11] fs/dcache: Fix incorrect accounting " Waiman Long
                   ` (13 more replies)
  0 siblings, 14 replies; 34+ messages in thread
From: Waiman Long @ 2020-02-26 16:13 UTC (permalink / raw)
  To: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: linux-kernel, linux-fsdevel, linux-doc, Mauro Carvalho Chehab,
	Eric Biggers, Dave Chinner, Eric Sandeen, Waiman Long

As there is no limit for negative dentries, it is possible that a sizeable
portion of system memory can be tied up in dentry cache slabs. Dentry slabs
are generally recalimable if the dentries are in the LRUs. Still having
too much memory used up by dentries can be problematic:

 1) When a filesystem with too many negative dentries is being unmounted,
    the process of draining the dentries associated with the filesystem
    can take some time. To users, the system may seem to hang for
    a while.  The long wait may also cause unexpected timeout error or
    other warnings.  This can happen when a long running container with
    many negative dentries is being destroyed, for instance.

 2) Tying up too much memory in unused negative dentries means there
    are less memory available for other use. Even though the kernel is
    able to reclaim unused dentries when running out of free memory,
    it will still introduce additional latency to the application
    reducing its performance.

There are two different approaches to limit negative dentries.

  1) Global reclaim
     Based on the total number of negative dentries as tracked by the
     nr_dentry_negative percpu count, a function can be activated to
     scan the various LRU lists to trim out excess negative dentries.

  2) Local reclaim
     By tracking the number of negative dentries under each directory,
     we can start the reclaim process if the number exceeds a certain
     limit.

The problem with global reclaim is that there are just too many LRU lists
present that may need to be scanned for each filesystem. Especially
problematic is the fact that each memory cgroup can have its own LRU
lists. As memory cgroup can come and go at any time, scanning its LRUs
can be tricky.

Local reclaim does not have this problem. So it is used as the basis
for negative dentry reclaim for this patchset. Accurately tracking the
number of negative dentries in each directory can be costly in term of
performance hit. As a result, this patchset estimates the number of
negative dentries present in a directory by looking at a newly added
children count and an opportunistically stored positive dentry count.

A new sysctl parameter "dentry-dir-max" is introduced which accepts a
value of 0 (default) for no limit or a positive integer 256 and up. Small
dentry-dir-max numbers are forbidden to avoid excessive dentry count
checking which can impact system performance.

The actual negative dentry reclaim is delegated to the system workqueue
to avoid adding excessive latency to normal filesystem operation.

Waiman Long (11):
  fs/dcache: Fix incorrect accounting of negative dentries
  fs/dcache: Simplify __dentry_kill()
  fs/dcache: Add a counter to track number of children
  fs/dcache: Add sysctl parameter dentry-dir-max
  fs/dcache: Reclaim excessive negative dentries in directories
  fs/dcache: directory opportunistically stores # of positive dentries
  fs/dcache: Add static key negative_reclaim_enable
  fs/dcache: Limit dentry reclaim count in negative_reclaim_workfn()
  fs/dcache: Don't allow small values for dentry-dir-max
  fs/dcache: Kill off dentry as last resort
  fs/dcache: Track # of negative dentries reclaimed & killed

 Documentation/admin-guide/sysctl/fs.rst |  18 +
 fs/dcache.c                             | 428 +++++++++++++++++++++++-
 include/linux/dcache.h                  |  18 +-
 kernel/sysctl.c                         |  11 +
 4 files changed, 457 insertions(+), 18 deletions(-)

-- 
2.18.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 01/11] fs/dcache: Fix incorrect accounting of negative dentries
  2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
@ 2020-02-26 16:13 ` " Waiman Long
  2020-02-26 16:13 ` [PATCH 02/11] fs/dcache: Simplify __dentry_kill() Waiman Long
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Waiman Long @ 2020-02-26 16:13 UTC (permalink / raw)
  To: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: linux-kernel, linux-fsdevel, linux-doc, Mauro Carvalho Chehab,
	Eric Biggers, Dave Chinner, Eric Sandeen, Waiman Long

The nr_dentry_negative counter only tracks the number of negative
dentries in lru lists, not when they are in shrink lists. In
both __d_clear_type_and_inode() and __d_instantiate(), only the
DCACHE_LRU_LIST flag is checked. Though it is highly unlikely that
the DCACHE_SHRINK_LIST flag may be set, it is still possible. Fix that
by checking the DCACHE_SHRINK_LIST flag as well to make sure that the
accounting is correct.

The negative dentry test is also moved from __d_instantiate() to
__d_set_inode_and_type() to cover more cases.

Fixes: af0c9af1b3f6 ("fs/dcache: Track & report number of negative dentries")

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/dcache.c            | 13 +++++++------
 include/linux/dcache.h |  9 +++++++++
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index b280e07e162b..c17b538bf41c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -315,6 +315,12 @@ static inline void __d_set_inode_and_type(struct dentry *dentry,
 {
 	unsigned flags;
 
+	/*
+	 * Decrement negative dentry count if it was in the LRU list.
+	 */
+	if (unlikely(d_in_lru(dentry) && d_is_negative(dentry)))
+		this_cpu_dec(nr_dentry_negative);
+
 	dentry->d_inode = inode;
 	flags = READ_ONCE(dentry->d_flags);
 	flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
@@ -329,7 +335,7 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)
 	flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
 	WRITE_ONCE(dentry->d_flags, flags);
 	dentry->d_inode = NULL;
-	if (dentry->d_flags & DCACHE_LRU_LIST)
+	if (d_in_lru(dentry))
 		this_cpu_inc(nr_dentry_negative);
 }
 
@@ -1919,11 +1925,6 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
 	WARN_ON(d_in_lookup(dentry));
 
 	spin_lock(&dentry->d_lock);
-	/*
-	 * Decrement negative dentry count if it was in the LRU list.
-	 */
-	if (dentry->d_flags & DCACHE_LRU_LIST)
-		this_cpu_dec(nr_dentry_negative);
 	hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
 	raw_write_seqcount_begin(&dentry->d_seq);
 	__d_set_inode_and_type(dentry, inode, add_flags);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index c1488cc84fd9..2762ca2508f9 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -369,6 +369,15 @@ static inline void d_lookup_done(struct dentry *dentry)
 	}
 }
 
+/*
+ * Dentry is in a LRU list, not a shrink list.
+ */
+static inline bool d_in_lru(struct dentry *dentry)
+{
+	return (dentry->d_flags & (DCACHE_SHRINK_LIST | DCACHE_LRU_LIST))
+		== DCACHE_LRU_LIST;
+}
+
 extern void dput(struct dentry *);
 
 static inline bool d_managed(const struct dentry *dentry)
-- 
2.18.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 02/11] fs/dcache: Simplify __dentry_kill()
  2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
  2020-02-26 16:13 ` [PATCH 01/11] fs/dcache: Fix incorrect accounting " Waiman Long
@ 2020-02-26 16:13 ` Waiman Long
  2020-02-26 16:13 ` [PATCH 03/11] fs/dcache: Add a counter to track number of children Waiman Long
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Waiman Long @ 2020-02-26 16:13 UTC (permalink / raw)
  To: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: linux-kernel, linux-fsdevel, linux-doc, Mauro Carvalho Chehab,
	Eric Biggers, Dave Chinner, Eric Sandeen, Waiman Long

Use the new d_in_lru() helper function to simplify __dentry_kill() a bit.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/dcache.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c17b538bf41c..a977f9e05840 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -572,10 +572,9 @@ static void __dentry_kill(struct dentry *dentry)
 	if (dentry->d_flags & DCACHE_OP_PRUNE)
 		dentry->d_op->d_prune(dentry);
 
-	if (dentry->d_flags & DCACHE_LRU_LIST) {
-		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
-			d_lru_del(dentry);
-	}
+	if (d_in_lru(dentry))
+		d_lru_del(dentry);
+
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
 	dentry_unlist(dentry, parent);
-- 
2.18.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 03/11] fs/dcache: Add a counter to track number of children
  2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
  2020-02-26 16:13 ` [PATCH 01/11] fs/dcache: Fix incorrect accounting " Waiman Long
  2020-02-26 16:13 ` [PATCH 02/11] fs/dcache: Simplify __dentry_kill() Waiman Long
@ 2020-02-26 16:13 ` Waiman Long
  2020-02-26 16:13 ` [PATCH 04/11] fs/dcache: Add sysctl parameter dentry-dir-max Waiman Long
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Waiman Long @ 2020-02-26 16:13 UTC (permalink / raw)
  To: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: linux-kernel, linux-fsdevel, linux-doc, Mauro Carvalho Chehab,
	Eric Biggers, Dave Chinner, Eric Sandeen, Waiman Long

Add a new field d_nchildren to struct dentry to track the number of
children in a directory.

Theoretically, we could use reference count (d_lockref.count) as a
proxy for the number of children. Normally the reference count should
be quite close to the number of children. However, when the directory
dentry is heavily contended, the reference count can differ from the
number of children by quite a bit.

The d_nchildren field is updated only when d_lock has already been held,
so the performance cost of this tracking should be negligible.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/dcache.c            | 16 ++++++++++++----
 include/linux/dcache.h |  7 ++++---
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index a977f9e05840..0ee5aa2c31cf 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -526,6 +526,8 @@ static inline void dentry_unlist(struct dentry *dentry, struct dentry *parent)
 	if (unlikely(list_empty(&dentry->d_child)))
 		return;
 	__list_del_entry(&dentry->d_child);
+	parent->d_nchildren--;
+
 	/*
 	 * Cursors can move around the list of children.  While we'd been
 	 * a normal list member, it didn't matter - ->d_child.next would've
@@ -1738,6 +1740,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	dentry->d_sb = sb;
 	dentry->d_op = NULL;
 	dentry->d_fsdata = NULL;
+	dentry->d_nchildren = 0;
 	INIT_HLIST_BL_NODE(&dentry->d_hash);
 	INIT_LIST_HEAD(&dentry->d_lru);
 	INIT_LIST_HEAD(&dentry->d_subdirs);
@@ -1782,6 +1785,7 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	__dget_dlock(parent);
 	dentry->d_parent = parent;
 	list_add(&dentry->d_child, &parent->d_subdirs);
+	parent->d_nchildren++;
 	spin_unlock(&parent->d_lock);
 
 	return dentry;
@@ -2762,10 +2766,10 @@ static void swap_names(struct dentry *dentry, struct dentry *target)
 			 * Both are internal.
 			 */
 			unsigned int i;
-			BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
-			for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
-				swap(((long *) &dentry->d_iname)[i],
-				     ((long *) &target->d_iname)[i]);
+			BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(int)));
+			for (i = 0; i < DNAME_INLINE_LEN / sizeof(int); i++) {
+				swap(((int *) &dentry->d_iname)[i],
+				     ((int *) &target->d_iname)[i]);
 			}
 		}
 	}
@@ -2855,6 +2859,10 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 		dentry->d_parent->d_lockref.count++;
 		if (dentry != old_parent) /* wasn't IS_ROOT */
 			WARN_ON(!--old_parent->d_lockref.count);
+
+		/* Adjust d_nchildren */
+		dentry->d_parent->d_nchildren++;
+		old_parent->d_nchildren--;
 	} else {
 		target->d_parent = old_parent;
 		swap_names(dentry, target);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 2762ca2508f9..e9e66eb50d1a 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -75,12 +75,12 @@ extern struct dentry_stat_t dentry_stat;
  * large memory footprint increase).
  */
 #ifdef CONFIG_64BIT
-# define DNAME_INLINE_LEN 32 /* 192 bytes */
+# define DNAME_INLINE_LEN 28 /* 192 bytes */
 #else
 # ifdef CONFIG_SMP
-#  define DNAME_INLINE_LEN 36 /* 128 bytes */
+#  define DNAME_INLINE_LEN 32 /* 128 bytes */
 # else
-#  define DNAME_INLINE_LEN 40 /* 128 bytes */
+#  define DNAME_INLINE_LEN 36 /* 128 bytes */
 # endif
 #endif
 
@@ -96,6 +96,7 @@ struct dentry {
 	struct inode *d_inode;		/* Where the name belongs to - NULL is
 					 * negative */
 	unsigned char d_iname[DNAME_INLINE_LEN];	/* small names */
+	unsigned int d_nchildren;	/* # of children (directory only) */
 
 	/* Ref lookup also touches following */
 	struct lockref d_lockref;	/* per-dentry lock and refcount */
-- 
2.18.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 04/11] fs/dcache: Add sysctl parameter dentry-dir-max
  2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
                   ` (2 preceding siblings ...)
  2020-02-26 16:13 ` [PATCH 03/11] fs/dcache: Add a counter to track number of children Waiman Long
@ 2020-02-26 16:13 ` Waiman Long
  2020-02-26 16:13 ` [PATCH 05/11] fs/dcache: Reclaim excessive negative dentries in directories Waiman Long
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Waiman Long @ 2020-02-26 16:13 UTC (permalink / raw)
  To: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: linux-kernel, linux-fsdevel, linux-doc, Mauro Carvalho Chehab,
	Eric Biggers, Dave Chinner, Eric Sandeen, Waiman Long

The number of positive dentries in a system is limited by the actual
number of files in the system. The number of negative dentries, however,
has no such limit. As a result, it is possible that a system can have a
significant amount of memory tied up in negative dentries. For example,
running a negative dentry generator on a 4-socket 256GB x86-64 system,
the system can use up more than 150GB of memory in dentries and more
than 200GB in slabs and almost running out of free memory before memory
reclaim kicks in.

There are two major problems with having too many negative dentries:

 1) When a filesystem with too many negative dentries is being unmounted,
    the process of draining the dentries associated with the filesystem
    can take some time. To users, the system may seem to hang for
    a while.  The long wait may also cause unexpected timeout error or
    other warnings.  This can happen when a long running container with
    many negative dentries is being destroyed, for instance.

 2) Tying up too much memory in unused negative dentries means there
    are less memory available for other use. Even though the kernel is
    able to reclaim unused dentries when running out of free memory,
    it will still introduce additional latency to the application
    reducing its performance.

This patch introduces a new sysctl parameter "dentry-dir-max" to
/proc/sys/fs. This syctl parameter represents a soft limit on the
total number of negative dentries allowable under any given directory.
Any non-negative integer is allowed. The default is 0 which means there
is no limit.

The actual negative dentry reclaim process to enforce the limit will
be done in a later patch.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/sysctl/fs.rst | 18 ++++++++++++++++++
 fs/dcache.c                             | 10 ++++++++++
 kernel/sysctl.c                         | 10 ++++++++++
 3 files changed, 38 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 2a45119e3331..7274a7b34ee4 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/fs:
 
 - aio-max-nr
 - aio-nr
+- dentry-dir-max
 - dentry-state
 - dquot-max
 - dquot-nr
@@ -60,6 +61,23 @@ raising aio-max-nr does not result in the pre-allocation or re-sizing
 of any kernel data structures.
 
 
+dentry-dir-max
+--------------
+
+This integer value specifies a soft limit on the maximum number
+of negative dentries that are allowed under any given directory.
+A negative dentry contains filename that is known to be nonexistent
+in the directory.  No restriction is placed on the number of positive
+dentries as it is naturally limited by the number of files in the
+directory.
+
+The default value is 0 which means there is no limit.  Any non-negative
+value is allowed.  However, internal tracking is done on all dentry
+types. So the value given, if non-zero, should be larger than the
+number of files in a typical large directory in order to reduce the
+tracking overhead.
+
+
 dentry-state
 ------------
 
diff --git a/fs/dcache.c b/fs/dcache.c
index 0ee5aa2c31cf..8f3ac31a582b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -123,6 +123,16 @@ static DEFINE_PER_CPU(long, nr_dentry);
 static DEFINE_PER_CPU(long, nr_dentry_unused);
 static DEFINE_PER_CPU(long, nr_dentry_negative);
 
+/*
+ * dcache_dentry_dir_max_sysctl:
+ *
+ * This is sysctl parameter "dentry-dir-max" which specifies a limit on
+ * the maximum number of negative dentries that are allowed under any
+ * given directory.
+ */
+int dcache_dentry_dir_max_sysctl __read_mostly;
+EXPORT_SYMBOL_GPL(dcache_dentry_dir_max_sysctl);
+
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 
 /*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d396aaaf19a3..cd0a83ff1029 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -118,6 +118,7 @@ extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
 #endif
+extern int dcache_dentry_dir_max_sysctl;
 
 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
@@ -127,6 +128,7 @@ static int sixty = 60;
 static int __maybe_unused neg_one = -1;
 static int __maybe_unused two = 2;
 static int __maybe_unused four = 4;
+static int __maybe_unused zero;
 static unsigned long zero_ul;
 static unsigned long one_ul = 1;
 static unsigned long long_max = LONG_MAX;
@@ -1949,6 +1951,14 @@ static struct ctl_table fs_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "dentry-dir-max",
+		.data		= &dcache_dentry_dir_max_sysctl,
+		.maxlen		= sizeof(dcache_dentry_dir_max_sysctl),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
 	{ }
 };
 
-- 
2.18.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 05/11] fs/dcache: Reclaim excessive negative dentries in directories
  2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
                   ` (3 preceding siblings ...)
  2020-02-26 16:13 ` [PATCH 04/11] fs/dcache: Add sysctl parameter dentry-dir-max Waiman Long
@ 2020-02-26 16:13 ` Waiman Long
  2020-02-26 16:13 ` [PATCH 06/11] fs/dcache: directory opportunistically stores # of positive dentries Waiman Long
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Waiman Long @ 2020-02-26 16:13 UTC (permalink / raw)
  To: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: linux-kernel, linux-fsdevel, linux-doc, Mauro Carvalho Chehab,
	Eric Biggers, Dave Chinner, Eric Sandeen, Waiman Long

When the "dentry-dir-max" sysctl parameter is set, it will enable the
checking of dentry count in the parent directory when a negative dentry
is being retained. If the count exceeds the given limit, it will schedule
a work function to scan the children of that parent directory to find
negative dentries to be reclaimed. Positive dentries will not be touched.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/dcache.c            | 207 +++++++++++++++++++++++++++++++++++++++++
 include/linux/dcache.h |   2 +
 2 files changed, 209 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 8f3ac31a582b..01c6d7277244 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -133,6 +133,11 @@ static DEFINE_PER_CPU(long, nr_dentry_negative);
 int dcache_dentry_dir_max_sysctl __read_mostly;
 EXPORT_SYMBOL_GPL(dcache_dentry_dir_max_sysctl);
 
+static LLIST_HEAD(negative_reclaim_list);
+static void negative_reclaim_check(struct dentry *parent);
+static void negative_reclaim_workfn(struct work_struct *work);
+static DECLARE_WORK(negative_reclaim_work, negative_reclaim_workfn);
+
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 
 /*
@@ -869,7 +874,20 @@ void dput(struct dentry *dentry)
 		rcu_read_unlock();
 
 		if (likely(retain_dentry(dentry))) {
+			struct dentry *neg_parent = NULL;
+
+			if (d_is_negative(dentry)) {
+				neg_parent = dentry->d_parent;
+				rcu_read_lock();
+			}
 			spin_unlock(&dentry->d_lock);
+
+			/*
+			 * Negative dentry reclaim check is only done when
+			 * a negative dentry is being put into a LRU list.
+			 */
+			if (neg_parent)
+				negative_reclaim_check(neg_parent);
 			return;
 		}
 
@@ -1261,6 +1279,195 @@ void shrink_dcache_sb(struct super_block *sb)
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
+/*
+ * Return true if reclaiming negative dentry can happen.
+ */
+static inline bool can_reclaim_dentry(unsigned int flags)
+{
+	return !(flags & (DCACHE_SHRINK_LIST | DCACHE_GENOCIDE |
+			  DCACHE_DENTRY_KILLED));
+}
+
+struct reclaim_dentry
+{
+	struct llist_node reclaim_node;
+	struct dentry *parent_dir;
+};
+
+/*
+ * Reclaim excess negative dentries in a directory
+ */
+static void reclaim_negative_dentry(struct dentry *parent,
+				    struct list_head *dispose)
+{
+	struct dentry *child;
+	int limit = READ_ONCE(dcache_dentry_dir_max_sysctl);
+	int cnt;
+
+	lockdep_assert_held(&parent->d_lock);
+
+	cnt = parent->d_nchildren;
+
+	/*
+	 * Compute # of negative dentries to be reclaimed
+	 * An extra 1/8 of dcache_dentry_dir_max_sysctl is added.
+	 */
+	if (cnt <= limit)
+		return;
+	cnt -= limit;
+	cnt += (limit >> 3);
+
+	/*
+	 * The d_subdirs is treated like a kind of LRU where
+	 * non-negative dentries are skipped. Negative dentries
+	 * with DCACHE_REFERENCED bit set are also skipped but
+	 * with DCACHE_REFERENCED cleared.
+	 */
+	list_for_each_entry(child, &parent->d_subdirs, d_child) {
+		/*
+		 * This check is racy and so it may not be accurate.
+		 */
+		if (!d_is_negative(child))
+			continue;
+
+		if (!spin_trylock(&child->d_lock))
+			continue;
+
+		/*
+		 * Only reclaim zero-refcnt negative dentries in the
+		 * LRU, but not in shrink list.
+		 */
+		if (can_reclaim_dentry(child->d_flags) &&
+		    d_is_negative(child) && d_in_lru(child) &&
+		    !child->d_lockref.count) {
+			if (child->d_flags & DCACHE_REFERENCED) {
+				child->d_flags &= ~DCACHE_REFERENCED;
+			} else {
+				cnt--;
+				d_lru_del(child);
+				d_shrink_add(child, dispose);
+			}
+		}
+		spin_unlock(&child->d_lock);
+		if (!cnt) {
+			child = list_next_entry(child, d_child);
+			break;
+		}
+	}
+
+	if (&child->d_child != &parent->d_subdirs) {
+		/*
+		 * Split out the childs from the head up to just
+		 * before the current entry into a separate list and
+		 * then splice it to the end of the child list so
+		 * that the unscanned entries will be in the front.
+		 * This simulates LRU.
+		 */
+		struct list_head list;
+
+		list_cut_before(&list, &parent->d_subdirs,
+				&child->d_child);
+		list_splice_tail(&list, &parent->d_subdirs);
+	}
+}
+
+/*
+ * Excess negative dentry reclaim work function.
+ */
+static void negative_reclaim_workfn(struct work_struct *work)
+{
+	struct llist_node *nodes, *next;
+	struct dentry *parent;
+	struct reclaim_dentry *dentry_node;
+
+	/*
+	 * Collect excess negative dentries in dispose list & shrink them.
+	 */
+	for (nodes = llist_del_all(&negative_reclaim_list);
+	     nodes; nodes = next) {
+		LIST_HEAD(dispose);
+
+		next = llist_next(nodes);
+		dentry_node = container_of(nodes, struct reclaim_dentry,
+					   reclaim_node);
+		parent = dentry_node->parent_dir;
+		spin_lock(&parent->d_lock);
+
+		if (d_is_dir(parent) &&
+		    can_reclaim_dentry(parent->d_flags) &&
+		    (parent->d_flags & DCACHE_RECLAIMING))
+			reclaim_negative_dentry(parent, &dispose);
+
+		if (!list_empty(&dispose)) {
+			spin_unlock(&parent->d_lock);
+			shrink_dentry_list(&dispose);
+			spin_lock(&parent->d_lock);
+		}
+
+		parent->d_flags &= ~DCACHE_RECLAIMING;
+		spin_unlock(&parent->d_lock);
+		dput(parent);
+		kfree(dentry_node);
+		cond_resched();
+	}
+}
+
+/*
+ * Check the parent to see if it has too many negative dentries and queue
+ * it up for the negative dentry reclaim work function to handle it.
+ */
+static void negative_reclaim_check(struct dentry *parent)
+	__releases(rcu)
+{
+	int limit = dcache_dentry_dir_max_sysctl;
+	struct reclaim_dentry *dentry_node;
+
+	if (!limit)
+		goto rcu_unlock_out;
+
+	/*
+	 * These checks are racy before spin_lock().
+	 */
+	if (!can_reclaim_dentry(parent->d_flags) ||
+	    (parent->d_flags & DCACHE_RECLAIMING) ||
+	    (READ_ONCE(parent->d_nchildren) <= limit))
+		goto rcu_unlock_out;
+
+	spin_lock(&parent->d_lock);
+	if (!can_reclaim_dentry(parent->d_flags) ||
+	    (parent->d_flags & DCACHE_RECLAIMING) ||
+	    (READ_ONCE(parent->d_nchildren) <= limit))
+		goto unlock_out;
+
+	if (!d_is_dir(parent))
+		goto unlock_out;
+
+	dentry_node = kzalloc(sizeof(*dentry_node), GFP_NOWAIT);
+	if (!dentry_node)
+		goto unlock_out;
+
+	rcu_read_unlock();
+	__dget_dlock(parent);
+	dentry_node->parent_dir = parent;
+	parent->d_flags |= DCACHE_RECLAIMING;
+	spin_unlock(&parent->d_lock);
+
+	/*
+	 * Only call schedule_work() if negative_reclaim_list is previously
+	 * empty. Otherwise, schedule_work() had been called but the workfn
+	 * workfn hasn't retrieved the list yet.
+	 */
+	if (llist_add(&dentry_node->reclaim_node, &negative_reclaim_list))
+		schedule_work(&negative_reclaim_work);
+	return;
+
+unlock_out:
+	spin_unlock(&parent->d_lock);
+rcu_unlock_out:
+	rcu_read_unlock();
+	return;
+}
+
 /**
  * enum d_walk_ret - action to talke during tree walk
  * @D_WALK_CONTINUE:	contrinue walk
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index e9e66eb50d1a..f14d72738903 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -13,6 +13,7 @@
 #include <linux/lockref.h>
 #include <linux/stringhash.h>
 #include <linux/wait.h>
+#include <linux/llist.h>
 
 struct path;
 struct vfsmount;
@@ -214,6 +215,7 @@ struct dentry_operations {
 #define DCACHE_FALLTHRU			0x01000000 /* Fall through to lower layer */
 #define DCACHE_ENCRYPTED_NAME		0x02000000 /* Encrypted name (dir key was unavailable) */
 #define DCACHE_OP_REAL			0x04000000
+#define DCACHE_RECLAIMING		0x08000000 /* Reclaiming negative dentries */
 
 #define DCACHE_PAR_LOOKUP		0x10000000 /* being looked up (with parent locked shared) */
 #define DCACHE_DENTRY_CURSOR		0x20000000
-- 
2.18.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 06/11] fs/dcache: directory opportunistically stores # of positive dentries
  2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
                   ` (4 preceding siblings ...)
  2020-02-26 16:13 ` [PATCH 05/11] fs/dcache: Reclaim excessive negative dentries in directories Waiman Long
@ 2020-02-26 16:13 ` Waiman Long
  2020-02-26 16:14 ` [PATCH 07/11] fs/dcache: Add static key negative_reclaim_enable Waiman Long
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Waiman Long @ 2020-02-26 16:13 UTC (permalink / raw)
  To: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: linux-kernel, linux-fsdevel, linux-doc, Mauro Carvalho Chehab,
	Eric Biggers, Dave Chinner, Eric Sandeen, Waiman Long

For directories that contain large number of files (e.g. in thousands),
the number of positive dentries that will never be reclaimed by
the negative dentry reclaiming process may approach or even exceed
"dentry-dir-max". That will unnecessary cause the triggering of the
reclaim process even if there aren't that many negative dentries that
can be reclaimed.

This can impact system performance. One possible way to solve this
problem is to somehow store the number of positive or negative dentries
in the directory dentry itself. The negative dentry count can change
frequently, whereas the positive dentry count is relatively more stable,

Keeping an accurate count of positive or negative dentries can be
costly. Instead, an estimate of the positive dentry count is computed
in the scan loop of the negative dentry reclaim work function.

That computed value is then stored in the trailing end of the d_iname[]
array if there is enough space for it. This value is then used to
estimate the number of negative dentries in the directory to be compare
against the "dentry-dir-max" value.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/dcache.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 81 insertions(+), 9 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 01c6d7277244..c5364c2ed5d8 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1294,6 +1294,60 @@ struct reclaim_dentry
 	struct dentry *parent_dir;
 };
 
+/*
+ * If there is enough space at the end of d_iname[] of a directory dentry
+ * to hold an integer. The space will be used to hold an estimate of the
+ * number of positive dentries in the directory. That number will be
+ * subtracted from d_nchildren to see if the limit has been exceeded and
+ * the number of excess negative dentries to be reclaimed.
+ */
+struct d_iname_count {
+	unsigned char d_dummy[DNAME_INLINE_LEN - sizeof(int)];
+	unsigned int d_npositive;
+};
+
+static inline bool dentry_has_npositive(struct dentry *dentry)
+{
+	int len = dentry->d_name.len;
+
+	BUILD_BUG_ON(sizeof(struct d_iname_count) != sizeof(dentry->d_iname));
+
+	return (len >= DNAME_INLINE_LEN) ||
+	       (len < DNAME_INLINE_LEN - sizeof(int));
+}
+
+static inline unsigned int read_dentry_npositive(struct dentry *dentry)
+{
+	struct d_iname_count *p = (struct d_iname_count *)dentry->d_iname;
+
+	return p->d_npositive;
+}
+
+static inline void set_dentry_npositive(struct dentry *dentry,
+					unsigned int npositive)
+{
+	struct d_iname_count *p = (struct d_iname_count *)dentry->d_iname;
+
+	p->d_npositive = npositive;
+}
+
+/*
+ * Get an estimated negative dentry count
+ */
+static inline unsigned int read_dentry_nnegative(struct dentry *dentry)
+{
+	return dentry->d_nchildren - (dentry_has_npositive(dentry)
+					? read_dentry_npositive(dentry) : 0);
+}
+
+/*
+ * Initialize d_iname[] to have null bytes at the end of the array.
+ */
+static inline void init_dentry_iname(struct dentry *dentry)
+{
+	set_dentry_npositive(dentry, 0);
+}
+
 /*
  * Reclaim excess negative dentries in a directory
  */
@@ -1302,11 +1356,11 @@ static void reclaim_negative_dentry(struct dentry *parent,
 {
 	struct dentry *child;
 	int limit = READ_ONCE(dcache_dentry_dir_max_sysctl);
-	int cnt;
+	int cnt, npositive;
 
 	lockdep_assert_held(&parent->d_lock);
 
-	cnt = parent->d_nchildren;
+	cnt = read_dentry_nnegative(parent);
 
 	/*
 	 * Compute # of negative dentries to be reclaimed
@@ -1316,6 +1370,7 @@ static void reclaim_negative_dentry(struct dentry *parent,
 		return;
 	cnt -= limit;
 	cnt += (limit >> 3);
+	npositive = 0;
 
 	/*
 	 * The d_subdirs is treated like a kind of LRU where
@@ -1327,8 +1382,10 @@ static void reclaim_negative_dentry(struct dentry *parent,
 		/*
 		 * This check is racy and so it may not be accurate.
 		 */
-		if (!d_is_negative(child))
+		if (!d_is_negative(child)) {
+			npositive++;
 			continue;
+		}
 
 		if (!spin_trylock(&child->d_lock))
 			continue;
@@ -1368,7 +1425,17 @@ static void reclaim_negative_dentry(struct dentry *parent,
 		list_cut_before(&list, &parent->d_subdirs,
 				&child->d_child);
 		list_splice_tail(&list, &parent->d_subdirs);
+
+		/*
+		 * Update positive dentry count estimate
+		 * Don't allow npositive to decay by more than 1/2.
+		 */
+		if (dentry_has_npositive(parent) &&
+		   (read_dentry_npositive(parent) > 2 * npositive))
+			npositive = read_dentry_npositive(parent) / 2;
 	}
+	if (dentry_has_npositive(parent))
+		set_dentry_npositive(parent, npositive);
 }
 
 /*
@@ -1430,16 +1497,14 @@ static void negative_reclaim_check(struct dentry *parent)
 	 */
 	if (!can_reclaim_dentry(parent->d_flags) ||
 	    (parent->d_flags & DCACHE_RECLAIMING) ||
-	    (READ_ONCE(parent->d_nchildren) <= limit))
+	    (read_dentry_nnegative(parent) <= limit))
 		goto rcu_unlock_out;
 
 	spin_lock(&parent->d_lock);
 	if (!can_reclaim_dentry(parent->d_flags) ||
 	    (parent->d_flags & DCACHE_RECLAIMING) ||
-	    (READ_ONCE(parent->d_nchildren) <= limit))
-		goto unlock_out;
-
-	if (!d_is_dir(parent))
+	    (read_dentry_nnegative(parent) <= limit) ||
+	    !d_is_dir(parent))
 		goto unlock_out;
 
 	dentry_node = kzalloc(sizeof(*dentry_node), GFP_NOWAIT);
@@ -1921,7 +1986,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	 * will still always have a NUL at the end, even if we might
 	 * be overwriting an internal NUL character
 	 */
-	dentry->d_iname[DNAME_INLINE_LEN-1] = 0;
+	init_dentry_iname(dentry);
 	if (unlikely(!name)) {
 		name = &slash_name;
 		dname = dentry->d_iname;
@@ -2991,11 +3056,18 @@ static void swap_names(struct dentry *dentry, struct dentry *target)
 		}
 	}
 	swap(dentry->d_name.hash_len, target->d_name.hash_len);
+
+	if (dentry_has_npositive(dentry))
+		set_dentry_npositive(dentry, 0);
+	if (dentry_has_npositive(target))
+		set_dentry_npositive(target, 0);
 }
 
 static void copy_name(struct dentry *dentry, struct dentry *target)
 {
 	struct external_name *old_name = NULL;
+
+	init_dentry_iname(dentry);
 	if (unlikely(dname_external(dentry)))
 		old_name = external_name(dentry);
 	if (unlikely(dname_external(target))) {
-- 
2.18.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 07/11] fs/dcache: Add static key negative_reclaim_enable
  2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
                   ` (5 preceding siblings ...)
  2020-02-26 16:13 ` [PATCH 06/11] fs/dcache: directory opportunistically stores # of positive dentries Waiman Long
@ 2020-02-26 16:14 ` Waiman Long
  2020-02-26 16:14 ` [PATCH 08/11] fs/dcache: Limit dentry reclaim count in negative_reclaim_workfn() Waiman Long
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Waiman Long @ 2020-02-26 16:14 UTC (permalink / raw)
  To: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: linux-kernel, linux-fsdevel, linux-doc, Mauro Carvalho Chehab,
	Eric Biggers, Dave Chinner, Eric Sandeen, Waiman Long

Add a static_key negative_reclaim_enable to optimize the default case
where negative dentry directory limit "dentry-dir-max" is not set.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/dcache.c     | 30 ++++++++++++++++++++++++++++--
 kernel/sysctl.c |  3 ++-
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c5364c2ed5d8..149c0a6c1a6e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -32,6 +32,7 @@
 #include <linux/bit_spinlock.h>
 #include <linux/rculist_bl.h>
 #include <linux/list_lru.h>
+#include <linux/jump_label.h>
 #include "internal.h"
 #include "mount.h"
 
@@ -134,11 +135,13 @@ int dcache_dentry_dir_max_sysctl __read_mostly;
 EXPORT_SYMBOL_GPL(dcache_dentry_dir_max_sysctl);
 
 static LLIST_HEAD(negative_reclaim_list);
+static DEFINE_STATIC_KEY_FALSE(negative_reclaim_enable);
 static void negative_reclaim_check(struct dentry *parent);
 static void negative_reclaim_workfn(struct work_struct *work);
 static DECLARE_WORK(negative_reclaim_work, negative_reclaim_workfn);
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+proc_handler proc_dcache_dentry_dir_max;
 
 /*
  * Here we resort to our own counters instead of using generic per-cpu counters
@@ -188,6 +191,27 @@ int proc_nr_dentry(struct ctl_table *table, int write, void __user *buffer,
 	dentry_stat.nr_negative = get_nr_dentry_negative();
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
+
+/*
+ * Sysctl proc handler for dcache_dentry_dir_max_sysctl
+ */
+int proc_dcache_dentry_dir_max(struct ctl_table *ctl, int write,
+			       void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int old = dcache_dentry_dir_max_sysctl;
+	int ret;
+
+	ret = proc_dointvec_minmax(ctl, write, buffer, lenp, ppos);
+
+	if (!write || ret || (dcache_dentry_dir_max_sysctl == old))
+		return ret;
+
+	if (!old && dcache_dentry_dir_max_sysctl)
+		static_branch_enable(&negative_reclaim_enable);
+	else if (old && !dcache_dentry_dir_max_sysctl)
+		static_branch_disable(&negative_reclaim_enable);
+	return 0;
+}
 #endif
 
 /*
@@ -876,7 +900,8 @@ void dput(struct dentry *dentry)
 		if (likely(retain_dentry(dentry))) {
 			struct dentry *neg_parent = NULL;
 
-			if (d_is_negative(dentry)) {
+			if (static_branch_unlikely(&negative_reclaim_enable) &&
+			    d_is_negative(dentry)) {
 				neg_parent = dentry->d_parent;
 				rcu_read_lock();
 			}
@@ -886,7 +911,8 @@ void dput(struct dentry *dentry)
 			 * Negative dentry reclaim check is only done when
 			 * a negative dentry is being put into a LRU list.
 			 */
-			if (neg_parent)
+			if (static_branch_unlikely(&negative_reclaim_enable) &&
+			    neg_parent)
 				negative_reclaim_check(neg_parent);
 			return;
 		}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index cd0a83ff1029..9a4b0a1376e8 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -119,6 +119,7 @@ extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
 extern int sysctl_nr_trim_pages;
 #endif
 extern int dcache_dentry_dir_max_sysctl;
+extern proc_handler proc_dcache_dentry_dir_max;
 
 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
@@ -1956,7 +1957,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &dcache_dentry_dir_max_sysctl,
 		.maxlen		= sizeof(dcache_dentry_dir_max_sysctl),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= proc_dcache_dentry_dir_max,
 		.extra1		= &zero,
 	},
 	{ }
-- 
2.18.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 08/11] fs/dcache: Limit dentry reclaim count in negative_reclaim_workfn()
  2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
                   ` (6 preceding siblings ...)
  2020-02-26 16:14 ` [PATCH 07/11] fs/dcache: Add static key negative_reclaim_enable Waiman Long
@ 2020-02-26 16:14 ` Waiman Long
  2020-02-26 16:14 ` [PATCH 09/11] fs/dcache: Don't allow small values for dentry-dir-max Waiman Long
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Waiman Long @ 2020-02-26 16:14 UTC (permalink / raw)
  To: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: linux-kernel, linux-fsdevel, linux-doc, Mauro Carvalho Chehab,
	Eric Biggers, Dave Chinner, Eric Sandeen, Waiman Long

To limit the d_lock hold time of directory dentry in the negative dentry
reclaim process, a quota (currently 64k) is added to limit the amount of
work that the work function can do and hence its execution time. This is
done to minimize impact on other processes that depend on that d_lock or
other work functions in the same work queue from excessive delay.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/dcache.c | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 149c0a6c1a6e..0bd5d6974f75 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1374,10 +1374,25 @@ static inline void init_dentry_iname(struct dentry *dentry)
 	set_dentry_npositive(dentry, 0);
 }
 
+/*
+ * In the pathological case where a large number of negative dentries are
+ * generated in a short time in a given directory, there is a possibility
+ * that negative dentries reclaiming process will have many dentries to
+ * be dispose of. Thus the d_lock lock can be hold for too long impacting
+ * other running processes that need it.
+ *
+ * There is also the consideration that a long runtime will impact other
+ * work functions that need to be run in the same work queue. As a result,
+ * we have to limit the number of dentries that can be reclaimed in each
+ * invocation of the work function.
+ */
+#define MAX_DENTRY_RECLAIM	(1 << 16)
+
 /*
  * Reclaim excess negative dentries in a directory
+ * Return: true if the work function needs to be rescheduled, false otherwise
  */
-static void reclaim_negative_dentry(struct dentry *parent,
+static void reclaim_negative_dentry(struct dentry *parent, int *quota,
 				    struct list_head *dispose)
 {
 	struct dentry *child;
@@ -1394,9 +1409,16 @@ static void reclaim_negative_dentry(struct dentry *parent,
 	 */
 	if (cnt <= limit)
 		return;
+
+	npositive = 0;
 	cnt -= limit;
 	cnt += (limit >> 3);
-	npositive = 0;
+	if (cnt >= *quota) {
+		cnt = *quota;
+		*quota = 0;
+	} else {
+		*quota -= cnt;
+	}
 
 	/*
 	 * The d_subdirs is treated like a kind of LRU where
@@ -1462,6 +1484,8 @@ static void reclaim_negative_dentry(struct dentry *parent,
 	}
 	if (dentry_has_npositive(parent))
 		set_dentry_npositive(parent, npositive);
+
+	*quota += cnt;
 }
 
 /*
@@ -1472,6 +1496,7 @@ static void negative_reclaim_workfn(struct work_struct *work)
 	struct llist_node *nodes, *next;
 	struct dentry *parent;
 	struct reclaim_dentry *dentry_node;
+	int quota = MAX_DENTRY_RECLAIM;
 
 	/*
 	 * Collect excess negative dentries in dispose list & shrink them.
@@ -1486,10 +1511,10 @@ static void negative_reclaim_workfn(struct work_struct *work)
 		parent = dentry_node->parent_dir;
 		spin_lock(&parent->d_lock);
 
-		if (d_is_dir(parent) &&
+		if (d_is_dir(parent) && quota &&
 		    can_reclaim_dentry(parent->d_flags) &&
 		    (parent->d_flags & DCACHE_RECLAIMING))
-			reclaim_negative_dentry(parent, &dispose);
+			reclaim_negative_dentry(parent, &quota, &dispose);
 
 		if (!list_empty(&dispose)) {
 			spin_unlock(&parent->d_lock);
-- 
2.18.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 09/11] fs/dcache: Don't allow small values for dentry-dir-max
  2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
                   ` (7 preceding siblings ...)
  2020-02-26 16:14 ` [PATCH 08/11] fs/dcache: Limit dentry reclaim count in negative_reclaim_workfn() Waiman Long
@ 2020-02-26 16:14 ` Waiman Long
  2020-02-26 16:14 ` [PATCH 10/11] fs/dcache: Kill off dentry as last resort Waiman Long
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Waiman Long @ 2020-02-26 16:14 UTC (permalink / raw)
  To: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: linux-kernel, linux-fsdevel, linux-doc, Mauro Carvalho Chehab,
	Eric Biggers, Dave Chinner, Eric Sandeen, Waiman Long

A small value for "dentry-dir-max", e.g. < 10, will cause excessive
dentry count checking leading to noticeable performance degradation. In
order to make this sysctl parameter more foolproof, we are not going
to allow any positive integer value less than 256.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/sysctl/fs.rst | 10 +++++-----
 fs/dcache.c                             | 24 +++++++++++++++++++-----
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 7274a7b34ee4..e09d851f9d42 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -71,11 +71,11 @@ in the directory.  No restriction is placed on the number of positive
 dentries as it is naturally limited by the number of files in the
 directory.
 
-The default value is 0 which means there is no limit.  Any non-negative
-value is allowed.  However, internal tracking is done on all dentry
-types. So the value given, if non-zero, should be larger than the
-number of files in a typical large directory in order to reduce the
-tracking overhead.
+The default value is 0 which means there is no limit.  Any positive
+integer value not less than 256 is also allowed.  However, internal
+tracking is done on all dentry types. So the value given, if non-zero,
+should be larger than the number of files in a typical large directory
+in order to reduce the tracking overhead.
 
 
 dentry-state
diff --git a/fs/dcache.c b/fs/dcache.c
index 0bd5d6974f75..f470763e7fb8 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -129,10 +129,14 @@ static DEFINE_PER_CPU(long, nr_dentry_negative);
  *
  * This is sysctl parameter "dentry-dir-max" which specifies a limit on
  * the maximum number of negative dentries that are allowed under any
- * given directory.
+ * given directory. The allowable value of "dentry-dir-max" is either
+ * 0, which means no limit, or 256 and up. A low value of "dentry-dir-max"
+ * will cause excessive dentry count checking affecting system performance.
  */
-int dcache_dentry_dir_max_sysctl __read_mostly;
+int dcache_dentry_dir_max_sysctl;
 EXPORT_SYMBOL_GPL(dcache_dentry_dir_max_sysctl);
+static int negative_dentry_dir_max __read_mostly;
+#define	DENTRY_DIR_MAX_MIN	0x100
 
 static LLIST_HEAD(negative_reclaim_list);
 static DEFINE_STATIC_KEY_FALSE(negative_reclaim_enable);
@@ -206,6 +210,16 @@ int proc_dcache_dentry_dir_max(struct ctl_table *ctl, int write,
 	if (!write || ret || (dcache_dentry_dir_max_sysctl == old))
 		return ret;
 
+	/*
+	 * A non-zero value must be >= DENTRY_DIR_MAX_MIN.
+	 */
+	if (dcache_dentry_dir_max_sysctl &&
+	   (dcache_dentry_dir_max_sysctl < DENTRY_DIR_MAX_MIN)) {
+		dcache_dentry_dir_max_sysctl = old;
+		return -EINVAL;
+	}
+
+	negative_dentry_dir_max = dcache_dentry_dir_max_sysctl;
 	if (!old && dcache_dentry_dir_max_sysctl)
 		static_branch_enable(&negative_reclaim_enable);
 	else if (old && !dcache_dentry_dir_max_sysctl)
@@ -1396,7 +1410,7 @@ static void reclaim_negative_dentry(struct dentry *parent, int *quota,
 				    struct list_head *dispose)
 {
 	struct dentry *child;
-	int limit = READ_ONCE(dcache_dentry_dir_max_sysctl);
+	int limit = READ_ONCE(negative_dentry_dir_max);
 	int cnt, npositive;
 
 	lockdep_assert_held(&parent->d_lock);
@@ -1405,7 +1419,7 @@ static void reclaim_negative_dentry(struct dentry *parent, int *quota,
 
 	/*
 	 * Compute # of negative dentries to be reclaimed
-	 * An extra 1/8 of dcache_dentry_dir_max_sysctl is added.
+	 * An extra 1/8 of negative_dentry_dir_max is added.
 	 */
 	if (cnt <= limit)
 		return;
@@ -1537,7 +1551,7 @@ static void negative_reclaim_workfn(struct work_struct *work)
 static void negative_reclaim_check(struct dentry *parent)
 	__releases(rcu)
 {
-	int limit = dcache_dentry_dir_max_sysctl;
+	int limit = negative_dentry_dir_max;
 	struct reclaim_dentry *dentry_node;
 
 	if (!limit)
-- 
2.18.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 10/11] fs/dcache: Kill off dentry as last resort
  2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
                   ` (8 preceding siblings ...)
  2020-02-26 16:14 ` [PATCH 09/11] fs/dcache: Don't allow small values for dentry-dir-max Waiman Long
@ 2020-02-26 16:14 ` Waiman Long
  2020-02-26 16:14 ` [PATCH 11/11] fs/dcache: Track # of negative dentries reclaimed & killed Waiman Long
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Waiman Long @ 2020-02-26 16:14 UTC (permalink / raw)
  To: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: linux-kernel, linux-fsdevel, linux-doc, Mauro Carvalho Chehab,
	Eric Biggers, Dave Chinner, Eric Sandeen, Waiman Long

In the unlikely case that an out-of-control application is generating
negative dentries faster than what the negative dentry reclaim process
can get rid of, we will have to kill the negative dentry directly as
the last resort.

The current threshold for killing negative dentry is the maximum of 4
times dentry-dir-max and 10,000 within a directory.

On a 32-vcpu VM, a 30-thread parallel negative dentry generation problem
was run. Without this patch, the negative dentry reclaim process was
overwhelmed by the negative dentry generator and the number of negative
dentries kept growing. With this patch applied with a "dentry-dir-max"
of 10,000. The number of negative dentries never went beyond 40,000.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/dcache.c | 37 +++++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f470763e7fb8..fe48e00349c9 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -140,7 +140,7 @@ static int negative_dentry_dir_max __read_mostly;
 
 static LLIST_HEAD(negative_reclaim_list);
 static DEFINE_STATIC_KEY_FALSE(negative_reclaim_enable);
-static void negative_reclaim_check(struct dentry *parent);
+static void negative_reclaim_check(struct dentry *parent, struct dentry *child);
 static void negative_reclaim_workfn(struct work_struct *work);
 static DECLARE_WORK(negative_reclaim_work, negative_reclaim_workfn);
 
@@ -927,7 +927,7 @@ void dput(struct dentry *dentry)
 			 */
 			if (static_branch_unlikely(&negative_reclaim_enable) &&
 			    neg_parent)
-				negative_reclaim_check(neg_parent);
+				negative_reclaim_check(neg_parent, dentry);
 			return;
 		}
 
@@ -1548,10 +1548,12 @@ static void negative_reclaim_workfn(struct work_struct *work)
  * Check the parent to see if it has too many negative dentries and queue
  * it up for the negative dentry reclaim work function to handle it.
  */
-static void negative_reclaim_check(struct dentry *parent)
+static void negative_reclaim_check(struct dentry *parent, struct dentry *child)
 	__releases(rcu)
 {
 	int limit = negative_dentry_dir_max;
+	int kill_threshold = max(4 * limit, 10000);
+	int ncnt = read_dentry_nnegative(parent);
 	struct reclaim_dentry *dentry_node;
 
 	if (!limit)
@@ -1560,16 +1562,16 @@ static void negative_reclaim_check(struct dentry *parent)
 	/*
 	 * These checks are racy before spin_lock().
 	 */
-	if (!can_reclaim_dentry(parent->d_flags) ||
-	    (parent->d_flags & DCACHE_RECLAIMING) ||
-	    (read_dentry_nnegative(parent) <= limit))
+	if ((!can_reclaim_dentry(parent->d_flags) ||
+	    (parent->d_flags & DCACHE_RECLAIMING) || (ncnt <= limit)) &&
+	    (ncnt < kill_threshold))
 		goto rcu_unlock_out;
 
 	spin_lock(&parent->d_lock);
+	ncnt = read_dentry_nnegative(parent);
 	if (!can_reclaim_dentry(parent->d_flags) ||
 	    (parent->d_flags & DCACHE_RECLAIMING) ||
-	    (read_dentry_nnegative(parent) <= limit) ||
-	    !d_is_dir(parent))
+	    (ncnt <= limit) || !d_is_dir(parent))
 		goto unlock_out;
 
 	dentry_node = kzalloc(sizeof(*dentry_node), GFP_NOWAIT);
@@ -1592,6 +1594,25 @@ static void negative_reclaim_check(struct dentry *parent)
 	return;
 
 unlock_out:
+	/*
+	 * In the unlikely case that an out-of-control application is
+	 * generating negative dentries faster than what the negative
+	 * dentry reclaim process can get rid of, we will have to kill
+	 * the negative dentry directly as the last resort.
+	 *
+	 * N.B. __dentry_kill() releases both the parent and child's d_lock.
+	 */
+	if (unlikely(ncnt >= kill_threshold)) {
+		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+		if (can_reclaim_dentry(child->d_flags) &&
+		    !child->d_lockref.count && (child->d_parent == parent)) {
+			rcu_read_unlock();
+			__dentry_kill(child);
+			dput(parent);
+			return;
+		}
+		spin_unlock(&child->d_lock);
+	}
 	spin_unlock(&parent->d_lock);
 rcu_unlock_out:
 	rcu_read_unlock();
-- 
2.18.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 11/11] fs/dcache: Track # of negative dentries reclaimed & killed
  2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
                   ` (9 preceding siblings ...)
  2020-02-26 16:14 ` [PATCH 10/11] fs/dcache: Kill off dentry as last resort Waiman Long
@ 2020-02-26 16:14 ` Waiman Long
  2020-02-26 16:29 ` [PATCH 00/11] fs/dcache: Limit # of negative dentries Matthew Wilcox
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Waiman Long @ 2020-02-26 16:14 UTC (permalink / raw)
  To: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: linux-kernel, linux-fsdevel, linux-doc, Mauro Carvalho Chehab,
	Eric Biggers, Dave Chinner, Eric Sandeen, Waiman Long

The negative dentry reclaim process gave no visible indication that
it was being activated. In order to allow system administrator to
see if it is being activated as expected, two new debugfs variables
"negative_dentry_reclaimed" and "negative_dentry_killed" are now added
to report the total number of negative dentries that have been reclaimed
and killed. These debugfs variables are only added after the negative
dentry reclaim mechanism is activated for the first time.

In reality, the actual number may be slightly less than the reported
number as not all the negative dentries passed to shrink_dentry_list()
and __dentry_kill() can be successfully reclaimed.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/dcache.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index fe48e00349c9..471b51316506 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -33,6 +33,7 @@
 #include <linux/rculist_bl.h>
 #include <linux/list_lru.h>
 #include <linux/jump_label.h>
+#include <linux/debugfs.h>
 #include "internal.h"
 #include "mount.h"
 
@@ -136,6 +137,8 @@ static DEFINE_PER_CPU(long, nr_dentry_negative);
 int dcache_dentry_dir_max_sysctl;
 EXPORT_SYMBOL_GPL(dcache_dentry_dir_max_sysctl);
 static int negative_dentry_dir_max __read_mostly;
+static unsigned long negative_dentry_reclaim_count;
+static atomic_t negative_dentry_kill_count;
 #define	DENTRY_DIR_MAX_MIN	0x100
 
 static LLIST_HEAD(negative_reclaim_list);
@@ -204,6 +207,7 @@ int proc_dcache_dentry_dir_max(struct ctl_table *ctl, int write,
 {
 	int old = dcache_dentry_dir_max_sysctl;
 	int ret;
+	static bool debugfs_file_created;
 
 	ret = proc_dointvec_minmax(ctl, write, buffer, lenp, ppos);
 
@@ -219,6 +223,14 @@ int proc_dcache_dentry_dir_max(struct ctl_table *ctl, int write,
 		return -EINVAL;
 	}
 
+	if (!debugfs_file_created) {
+		debugfs_create_ulong("negative_dentry_reclaimed", 0400, NULL,
+				     &negative_dentry_reclaim_count);
+		debugfs_create_u32("negative_dentry_killed", 0400, NULL,
+				   (u32 *)&negative_dentry_kill_count.counter);
+		debugfs_file_created = true;
+	}
+
 	negative_dentry_dir_max = dcache_dentry_dir_max_sysctl;
 	if (!old && dcache_dentry_dir_max_sysctl)
 		static_branch_enable(&negative_reclaim_enable);
@@ -1542,6 +1554,8 @@ static void negative_reclaim_workfn(struct work_struct *work)
 		kfree(dentry_node);
 		cond_resched();
 	}
+	if (quota < MAX_DENTRY_RECLAIM)
+		negative_dentry_reclaim_count += MAX_DENTRY_RECLAIM - quota;
 }
 
 /*
@@ -1609,6 +1623,7 @@ static void negative_reclaim_check(struct dentry *parent, struct dentry *child)
 			rcu_read_unlock();
 			__dentry_kill(child);
 			dput(parent);
+			atomic_inc(&negative_dentry_kill_count);
 			return;
 		}
 		spin_unlock(&child->d_lock);
-- 
2.18.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
                   ` (10 preceding siblings ...)
  2020-02-26 16:14 ` [PATCH 11/11] fs/dcache: Track # of negative dentries reclaimed & killed Waiman Long
@ 2020-02-26 16:29 ` Matthew Wilcox
  2020-02-26 19:19   ` Waiman Long
                     ` (2 more replies)
  2020-02-27  8:30 ` Dave Chinner
  2020-03-15  3:46 ` Matthew Wilcox
  13 siblings, 3 replies; 34+ messages in thread
From: Matthew Wilcox @ 2020-02-26 16:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, linux-kernel, linux-fsdevel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Dave Chinner, Eric Sandeen

On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
> A new sysctl parameter "dentry-dir-max" is introduced which accepts a
> value of 0 (default) for no limit or a positive integer 256 and up. Small
> dentry-dir-max numbers are forbidden to avoid excessive dentry count
> checking which can impact system performance.

This is always the wrong approach.  A sysctl is just a way of blaming
the sysadmin for us not being very good at programming.

I agree that we need a way to limit the number of negative dentries.
But that limit needs to be dynamic and depend on how the system is being
used, not on how some overworked sysadmin has configured it.

So we need an initial estimate for the number of negative dentries that
we need for good performance.  Maybe it's 1000.  It doesn't really matter;
it's going to change dynamically.

Then we need a metric to let us know whether it needs to be increased.
Perhaps that's "number of new negative dentries created in the last
second".  And we need to decide how much to increase it; maybe it's by
50% or maybe by 10%.  Perhaps somewhere between 10-100% depending on
how high the recent rate of negative dentry creation has been.

We also need a metric to let us know whether it needs to be decreased.
I'm reluctant to say that memory pressure should be that metric because
very large systems can let the number of dentries grow in an unbounded
way.  Perhaps that metric is "number of hits in the negative dentry
cache in the last ten seconds".  Again, we'll need to decide how much
to shrink the target number by.

If the number of negative dentries is at or above the target, then
creating a new negative dentry means evicting an existing negative dentry.
If the number of negative dentries is lower than the target, then we
can just create a new one.

Of course, memory pressure (and shrinking the target number) should
cause negative dentries to be evicted from the old end of the LRU list.
But memory pressure shouldn't cause us to change the target number;
the target number is what we think we need to keep the system running
smoothly.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-26 16:29 ` [PATCH 00/11] fs/dcache: Limit # of negative dentries Matthew Wilcox
@ 2020-02-26 19:19   ` Waiman Long
  2020-02-26 21:28     ` Matthew Wilcox
  2020-02-26 21:28   ` Andreas Dilger
  2020-02-27 19:04   ` Eric Sandeen
  2 siblings, 1 reply; 34+ messages in thread
From: Waiman Long @ 2020-02-26 19:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, linux-kernel, linux-fsdevel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Dave Chinner, Eric Sandeen

On 2/26/20 11:29 AM, Matthew Wilcox wrote:
> On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
>> A new sysctl parameter "dentry-dir-max" is introduced which accepts a
>> value of 0 (default) for no limit or a positive integer 256 and up. Small
>> dentry-dir-max numbers are forbidden to avoid excessive dentry count
>> checking which can impact system performance.
> This is always the wrong approach.  A sysctl is just a way of blaming
> the sysadmin for us not being very good at programming.
>
> I agree that we need a way to limit the number of negative dentries.
> But that limit needs to be dynamic and depend on how the system is being
> used, not on how some overworked sysadmin has configured it.
>
> So we need an initial estimate for the number of negative dentries that
> we need for good performance.  Maybe it's 1000.  It doesn't really matter;
> it's going to change dynamically.
>
> Then we need a metric to let us know whether it needs to be increased.
> Perhaps that's "number of new negative dentries created in the last
> second".  And we need to decide how much to increase it; maybe it's by
> 50% or maybe by 10%.  Perhaps somewhere between 10-100% depending on
> how high the recent rate of negative dentry creation has been.
>
> We also need a metric to let us know whether it needs to be decreased.
> I'm reluctant to say that memory pressure should be that metric because
> very large systems can let the number of dentries grow in an unbounded
> way.  Perhaps that metric is "number of hits in the negative dentry
> cache in the last ten seconds".  Again, we'll need to decide how much
> to shrink the target number by.
>
> If the number of negative dentries is at or above the target, then
> creating a new negative dentry means evicting an existing negative dentry.
> If the number of negative dentries is lower than the target, then we
> can just create a new one.
>
> Of course, memory pressure (and shrinking the target number) should
> cause negative dentries to be evicted from the old end of the LRU list.
> But memory pressure shouldn't cause us to change the target number;
> the target number is what we think we need to keep the system running
> smoothly.
>
Thanks for the quick response.

I agree that auto-tuning so that the system administrator don't have to
worry about it will be the best approach if it is implemented in the
right way. However, it is hard to do it right.

How about letting users specify a cap on the amount of total system
memory allowed for negative dentries like one of my previous patchs.
Internally, there is a predefined minimum and maximum for
dentry-dir-max. We sample the total negative dentry counts periodically
and adjust the dentry-dir-max accordingly.

Specifying a percentage of total system memory is more intuitive than
just specifying a hard number for dentry-dir-max. Still some user input
is required.

What do you think?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-26 19:19   ` Waiman Long
@ 2020-02-26 21:28     ` Matthew Wilcox
  0 siblings, 0 replies; 34+ messages in thread
From: Matthew Wilcox @ 2020-02-26 21:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, linux-kernel, linux-fsdevel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Dave Chinner, Eric Sandeen

On Wed, Feb 26, 2020 at 02:19:59PM -0500, Waiman Long wrote:
> On 2/26/20 11:29 AM, Matthew Wilcox wrote:
> > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
> >> A new sysctl parameter "dentry-dir-max" is introduced which accepts a
> >> value of 0 (default) for no limit or a positive integer 256 and up. Small
> >> dentry-dir-max numbers are forbidden to avoid excessive dentry count
> >> checking which can impact system performance.
> > This is always the wrong approach.  A sysctl is just a way of blaming
> > the sysadmin for us not being very good at programming.
> >
> > I agree that we need a way to limit the number of negative dentries.
> > But that limit needs to be dynamic and depend on how the system is being
> > used, not on how some overworked sysadmin has configured it.
> >
> > So we need an initial estimate for the number of negative dentries that
> > we need for good performance.  Maybe it's 1000.  It doesn't really matter;
> > it's going to change dynamically.
> >
> > Then we need a metric to let us know whether it needs to be increased.
> > Perhaps that's "number of new negative dentries created in the last
> > second".  And we need to decide how much to increase it; maybe it's by
> > 50% or maybe by 10%.  Perhaps somewhere between 10-100% depending on
> > how high the recent rate of negative dentry creation has been.
> >
> > We also need a metric to let us know whether it needs to be decreased.
> > I'm reluctant to say that memory pressure should be that metric because
> > very large systems can let the number of dentries grow in an unbounded
> > way.  Perhaps that metric is "number of hits in the negative dentry
> > cache in the last ten seconds".  Again, we'll need to decide how much
> > to shrink the target number by.
> >
> > If the number of negative dentries is at or above the target, then
> > creating a new negative dentry means evicting an existing negative dentry.
> > If the number of negative dentries is lower than the target, then we
> > can just create a new one.
> >
> > Of course, memory pressure (and shrinking the target number) should
> > cause negative dentries to be evicted from the old end of the LRU list.
> > But memory pressure shouldn't cause us to change the target number;
> > the target number is what we think we need to keep the system running
> > smoothly.
> >
> Thanks for the quick response.
> 
> I agree that auto-tuning so that the system administrator don't have to
> worry about it will be the best approach if it is implemented in the
> right way. However, it is hard to do it right.
> 
> How about letting users specify a cap on the amount of total system
> memory allowed for negative dentries like one of my previous patchs.
> Internally, there is a predefined minimum and maximum for
> dentry-dir-max. We sample the total negative dentry counts periodically
> and adjust the dentry-dir-max accordingly.
> 
> Specifying a percentage of total system memory is more intuitive than
> just specifying a hard number for dentry-dir-max. Still some user input
> is required.

If you want to base the whole thing on a per-directory target number,
or a percentage of the system memory target (rather than my suggestion
of a total # of negative dentries), that seems reasonable.  What's not
reasonable is expecting the sysadmin to be able to either predict the
workload, or react to a changing workload in sufficient time.  The system
has to be self-tuning.

Just look how long stale information stays around about how to tune your
Linux system.  Here's an article from 2018 suggesting using the 'intr'
option for NFS mounts:
https://kb.netapp.com/app/answers/answer_view/a_id/1004893/~/hard-mount-vs-soft-mount-
I made that a no-op in 2007.  Any tunable you add to Linux immediately
becomes a cargo-cult solution to any problem people are having.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-26 16:29 ` [PATCH 00/11] fs/dcache: Limit # of negative dentries Matthew Wilcox
  2020-02-26 19:19   ` Waiman Long
@ 2020-02-26 21:28   ` Andreas Dilger
  2020-02-26 21:45     ` Matthew Wilcox
  2020-02-27  9:55     ` Ian Kent
  2020-02-27 19:04   ` Eric Sandeen
  2 siblings, 2 replies; 34+ messages in thread
From: Andreas Dilger @ 2020-02-26 21:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Waiman Long, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Linux Kernel Mailing List,
	Linux FS Devel, linux-doc, Mauro Carvalho Chehab, Eric Biggers,
	Dave Chinner, Eric Sandeen

[-- Attachment #1: Type: text/plain, Size: 4345 bytes --]

On Feb 26, 2020, at 9:29 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
>> A new sysctl parameter "dentry-dir-max" is introduced which accepts a
>> value of 0 (default) for no limit or a positive integer 256 and up. Small
>> dentry-dir-max numbers are forbidden to avoid excessive dentry count
>> checking which can impact system performance.
> 
> This is always the wrong approach.  A sysctl is just a way of blaming
> the sysadmin for us not being very good at programming.
> 
> I agree that we need a way to limit the number of negative dentries.
> But that limit needs to be dynamic and depend on how the system is being
> used, not on how some overworked sysadmin has configured it.
> 
> So we need an initial estimate for the number of negative dentries that
> we need for good performance.  Maybe it's 1000.  It doesn't really matter;
> it's going to change dynamically.
> 
> Then we need a metric to let us know whether it needs to be increased.
> Perhaps that's "number of new negative dentries created in the last
> second".  And we need to decide how much to increase it; maybe it's by
> 50% or maybe by 10%.  Perhaps somewhere between 10-100% depending on
> how high the recent rate of negative dentry creation has been.
> 
> We also need a metric to let us know whether it needs to be decreased.
> I'm reluctant to say that memory pressure should be that metric because
> very large systems can let the number of dentries grow in an unbounded
> way.  Perhaps that metric is "number of hits in the negative dentry
> cache in the last ten seconds".  Again, we'll need to decide how much
> to shrink the target number by.

OK, so now instead of a single tunable parameter we need three, because
these numbers are totally made up and nobody knows the right values. :-)
Defaulting the limit to "disabled/no limit" also has the problem that
99.99% of users won't even know this tunable exists, let alone how to
set it correctly, so they will continue to see these problems, and the
code may as well not exist (i.e. pure overhead), while Waiman has a
better idea today of what would be reasonable defaults.

I definitely agree that a single fixed value will be wrong for every
system except the original developer's.  Making the maximum default to
some reasonable fraction of the system size, rather than a fixed value,
is probably best to start.  Something like this as a starting point:

	/* Allow a reasonable minimum number of negative entries,
	 * but proportionately more if the directory/dcache is large.
	 */
	dir_negative_max = max(num_dir_entries / 16, 1024);
        total_negative_max = max(totalram_pages / 32, total_dentries / 8);

(Waiman should decide actual values based on where the problem was hit
previously), and include tunables to change the limits for testing.

Ideally there would also be a dir ioctl that allows fetching the current
positive/negative entry count on a directory (e.g. /usr/bin, /usr/lib64,
/usr/share/man/man*) to see what these values are.  Otherwise there is
no way to determine whether the limits used are any good or not.

Dynamic limits are hard to get right, and incorrect state machines can lead
to wild swings in behaviour due to unexpected feedback.  It isn't clear to
me that adjusting the limit based on the current rate of negative dentry
creation even makes sense.  If there are a lot of negative entries being
created, that is when you'd want to _stop_ allowing more to be added.

We don't have any limit today, so imposing some large-but-still-reasonable
upper limit on negative entries will catch the runaway negative dcache case
that was the original need of this functionality without adding a lot of
complexity that we may not need at all.

> If the number of negative dentries is at or above the target, then
> creating a new negative dentry means evicting an existing negative dentry.
> If the number of negative dentries is lower than the target, then we
> can just create a new one.
> 
> Of course, memory pressure (and shrinking the target number) should
> cause negative dentries to be evicted from the old end of the LRU list.
> But memory pressure shouldn't cause us to change the target number;
> the target number is what we think we need to keep the system running
> smoothly.


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-26 21:28   ` Andreas Dilger
@ 2020-02-26 21:45     ` Matthew Wilcox
  2020-02-27  8:07       ` Dave Chinner
  2020-02-27  9:55     ` Ian Kent
  1 sibling, 1 reply; 34+ messages in thread
From: Matthew Wilcox @ 2020-02-26 21:45 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Waiman Long, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Linux Kernel Mailing List,
	Linux FS Devel, linux-doc, Mauro Carvalho Chehab, Eric Biggers,
	Dave Chinner, Eric Sandeen

On Wed, Feb 26, 2020 at 02:28:50PM -0700, Andreas Dilger wrote:
> On Feb 26, 2020, at 9:29 AM, Matthew Wilcox <willy@infradead.org> wrote:
> > This is always the wrong approach.  A sysctl is just a way of blaming
> > the sysadmin for us not being very good at programming.
> > 
> > I agree that we need a way to limit the number of negative dentries.
> > But that limit needs to be dynamic and depend on how the system is being
> > used, not on how some overworked sysadmin has configured it.
> > 
> > So we need an initial estimate for the number of negative dentries that
> > we need for good performance.  Maybe it's 1000.  It doesn't really matter;
> > it's going to change dynamically.
> > 
> > Then we need a metric to let us know whether it needs to be increased.
> > Perhaps that's "number of new negative dentries created in the last
> > second".  And we need to decide how much to increase it; maybe it's by
> > 50% or maybe by 10%.  Perhaps somewhere between 10-100% depending on
> > how high the recent rate of negative dentry creation has been.
> > 
> > We also need a metric to let us know whether it needs to be decreased.
> > I'm reluctant to say that memory pressure should be that metric because
> > very large systems can let the number of dentries grow in an unbounded
> > way.  Perhaps that metric is "number of hits in the negative dentry
> > cache in the last ten seconds".  Again, we'll need to decide how much
> > to shrink the target number by.
> 
> OK, so now instead of a single tunable parameter we need three, because
> these numbers are totally made up and nobody knows the right values. :-)
> Defaulting the limit to "disabled/no limit" also has the problem that
> 99.99% of users won't even know this tunable exists, let alone how to
> set it correctly, so they will continue to see these problems, and the
> code may as well not exist (i.e. pure overhead), while Waiman has a
> better idea today of what would be reasonable defaults.

I never said "no limit".  I just said to start at some fairly random
value and not worry about where you start because it'll correct to where
this system needs it to be.  As long as it converges like loadavg does,
it'll be fine.  It needs a fairly large "don't change the target" area,
and it needs to react quickly to real changes in a system's workload.

> I definitely agree that a single fixed value will be wrong for every
> system except the original developer's.  Making the maximum default to
> some reasonable fraction of the system size, rather than a fixed value,
> is probably best to start.  Something like this as a starting point:
> 
> 	/* Allow a reasonable minimum number of negative entries,
> 	 * but proportionately more if the directory/dcache is large.
> 	 */
> 	dir_negative_max = max(num_dir_entries / 16, 1024);
>         total_negative_max = max(totalram_pages / 32, total_dentries / 8);

Those kinds of things are garbage on large machines.  With a terabyte
of RAM, you can end up with tens of millions of dentries clogging up
the system.  There _is_ an upper limit on the useful number of dentries
to keep around.

> (Waiman should decide actual values based on where the problem was hit
> previously), and include tunables to change the limits for testing.
> 
> Ideally there would also be a dir ioctl that allows fetching the current
> positive/negative entry count on a directory (e.g. /usr/bin, /usr/lib64,
> /usr/share/man/man*) to see what these values are.  Otherwise there is
> no way to determine whether the limits used are any good or not.

It definitely needs to be instrumented for testing, but no, not ioctls.
tracepoints, perhaps.

> Dynamic limits are hard to get right, and incorrect state machines can lead
> to wild swings in behaviour due to unexpected feedback.  It isn't clear to
> me that adjusting the limit based on the current rate of negative dentry
> creation even makes sense.  If there are a lot of negative entries being
> created, that is when you'd want to _stop_ allowing more to be added.

That doesn't make sense.  What you really want to know is "If my dcache
had twice as many entries in it, would that significantly reduce the
thrash of new entries being created".  In the page cache, we end up
with a double LRU where once-used entries fall off the list quickly
but twice-or-more used entries get to stay around for a bit longer.
Maybe we could do something like that; keep a victim cache for recently
evicted dentries, and if we get a large hit rate in the victim cache,
expand the size of the primary cache.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-26 21:45     ` Matthew Wilcox
@ 2020-02-27  8:07       ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2020-02-27  8:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andreas Dilger, Waiman Long, Alexander Viro, Jonathan Corbet,
	Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Linux Kernel Mailing List, Linux FS Devel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Eric Sandeen

On Wed, Feb 26, 2020 at 01:45:07PM -0800, Matthew Wilcox wrote:
> had twice as many entries in it, would that significantly reduce the
> thrash of new entries being created".  In the page cache, we end up
> with a double LRU where once-used entries fall off the list quickly
> but twice-or-more used entries get to stay around for a bit longer.
> Maybe we could do something like that; keep a victim cache for recently
> evicted dentries, and if we get a large hit rate in the victim cache,
> expand the size of the primary cache.

You know, I've been saying exactly the same thing about the inode
LRU in response to people trying to hack behaviour out of the
shrinker that is triggered by the working set getting trashed by
excessive creation of single use inodes (i.e. large scale directory
traversal).

IOWs, both have the same problem with working set retention in the
face of excessive growth pressure.

So, you know, perhaps two caches with the same problem, that use the
same LRU implementation, could solve the same problem by enhancing
the generic LRU code they use to an active/inactive style clocking
LRU like the page LRUs?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
                   ` (11 preceding siblings ...)
  2020-02-26 16:29 ` [PATCH 00/11] fs/dcache: Limit # of negative dentries Matthew Wilcox
@ 2020-02-27  8:30 ` Dave Chinner
  2020-02-28 15:47   ` Waiman Long
  2020-03-15  3:46 ` Matthew Wilcox
  13 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2020-02-27  8:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, linux-kernel, linux-fsdevel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Eric Sandeen

On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
> As there is no limit for negative dentries, it is possible that a sizeable
> portion of system memory can be tied up in dentry cache slabs. Dentry slabs
> are generally recalimable if the dentries are in the LRUs. Still having
> too much memory used up by dentries can be problematic:

I don't get it.

Why isn't the solution simply "constrain the application generating
unbound numbers of dentries to a memcg"?

Then when the memcg runs out of memory, it will start reclaiming the
dentries that were allocated inside the memcg that are using all
it's resources, thereby preventing unbound growth of the dentry
cache.

I mean, this sort of resource control is exactly what memcgs are
supposed to be used for and are already used for. I don't see why we
need all this complexity for global dentry resource management when
memcgs should already provide an effective means of managing and
placing bounds on the amount of memory any specific application can
use...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-26 21:28   ` Andreas Dilger
  2020-02-26 21:45     ` Matthew Wilcox
@ 2020-02-27  9:55     ` Ian Kent
  2020-02-28  3:34       ` Matthew Wilcox
  1 sibling, 1 reply; 34+ messages in thread
From: Ian Kent @ 2020-02-27  9:55 UTC (permalink / raw)
  To: Andreas Dilger, Matthew Wilcox
  Cc: Waiman Long, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Linux Kernel Mailing List,
	Linux FS Devel, linux-doc, Mauro Carvalho Chehab, Eric Biggers,
	Dave Chinner, Eric Sandeen

On Wed, 2020-02-26 at 14:28 -0700, Andreas Dilger wrote:
> On Feb 26, 2020, at 9:29 AM, Matthew Wilcox <willy@infradead.org>
> wrote:
> > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
> > > A new sysctl parameter "dentry-dir-max" is introduced which
> > > accepts a
> > > value of 0 (default) for no limit or a positive integer 256 and
> > > up. Small
> > > dentry-dir-max numbers are forbidden to avoid excessive dentry
> > > count
> > > checking which can impact system performance.
> > 
> > This is always the wrong approach.  A sysctl is just a way of
> > blaming
> > the sysadmin for us not being very good at programming.
> > 
> > I agree that we need a way to limit the number of negative
> > dentries.
> > But that limit needs to be dynamic and depend on how the system is
> > being
> > used, not on how some overworked sysadmin has configured it.
> > 
> > So we need an initial estimate for the number of negative dentries
> > that
> > we need for good performance.  Maybe it's 1000.  It doesn't really
> > matter;
> > it's going to change dynamically.
> > 
> > Then we need a metric to let us know whether it needs to be
> > increased.
> > Perhaps that's "number of new negative dentries created in the last
> > second".  And we need to decide how much to increase it; maybe it's
> > by
> > 50% or maybe by 10%.  Perhaps somewhere between 10-100% depending
> > on
> > how high the recent rate of negative dentry creation has been.
> > 
> > We also need a metric to let us know whether it needs to be
> > decreased.
> > I'm reluctant to say that memory pressure should be that metric
> > because
> > very large systems can let the number of dentries grow in an
> > unbounded
> > way.  Perhaps that metric is "number of hits in the negative dentry
> > cache in the last ten seconds".  Again, we'll need to decide how
> > much
> > to shrink the target number by.
> 
> OK, so now instead of a single tunable parameter we need three,
> because
> these numbers are totally made up and nobody knows the right values.
> :-)
> Defaulting the limit to "disabled/no limit" also has the problem that
> 99.99% of users won't even know this tunable exists, let alone how to
> set it correctly, so they will continue to see these problems, and
> the
> code may as well not exist (i.e. pure overhead), while Waiman has a
> better idea today of what would be reasonable defaults.

Why have these at all.

Not all file systems even produce negative hashed dentries.

The most beneficial use of them is to improve performance of rapid
fire lookups for non-existent names. Longer lived negative hashed
dentries don't give much benefit at all unless they suddenly have
lots of hits and that would cost a single allocation on the first
lookup if the dentry ttl expired and the dentry discarded.

A ttl (say jiffies) set at appropriate times could be a better
choice all round, no sysctl values at all.

Ian


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-26 16:29 ` [PATCH 00/11] fs/dcache: Limit # of negative dentries Matthew Wilcox
  2020-02-26 19:19   ` Waiman Long
  2020-02-26 21:28   ` Andreas Dilger
@ 2020-02-27 19:04   ` Eric Sandeen
  2020-02-27 22:39     ` Dave Chinner
  2 siblings, 1 reply; 34+ messages in thread
From: Eric Sandeen @ 2020-02-27 19:04 UTC (permalink / raw)
  To: Matthew Wilcox, Waiman Long
  Cc: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, linux-kernel, linux-fsdevel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Dave Chinner

On 2/26/20 8:29 AM, Matthew Wilcox wrote:
> On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
>> A new sysctl parameter "dentry-dir-max" is introduced which accepts a
>> value of 0 (default) for no limit or a positive integer 256 and up. Small
>> dentry-dir-max numbers are forbidden to avoid excessive dentry count
>> checking which can impact system performance.
> 
> This is always the wrong approach.  A sysctl is just a way of blaming
> the sysadmin for us not being very good at programming.
> 
> I agree that we need a way to limit the number of negative dentries.
> But that limit needs to be dynamic and depend on how the system is being
> used, not on how some overworked sysadmin has configured it.
> 
> So we need an initial estimate for the number of negative dentries that
> we need for good performance.  Maybe it's 1000.  It doesn't really matter;
> it's going to change dynamically.
> 
> Then we need a metric to let us know whether it needs to be increased.
> Perhaps that's "number of new negative dentries created in the last
> second".  And we need to decide how much to increase it; maybe it's by
> 50% or maybe by 10%.  Perhaps somewhere between 10-100% depending on
> how high the recent rate of negative dentry creation has been.

There are pitfalls to this approach as well.  Consider what libnss
does every time it starts up (via curl in this case)

# cat /proc/sys/fs/dentry-state
3154271	3131421	45	0	2863333	0
# for I in `seq 1 10`; do curl https://sandeen.net/ &>/dev/null; done
# cat /proc/sys/fs/dentry-state
3170738	3147844	45	0	2879882	0

voila, 16k more negative dcache entries, thanks to:

https://github.com/nss-dev/nss/blob/317cb06697d5b953d825e050c1d8c1ee0d647010/lib/softoken/sdb.c#L390

i.e. each time it inits, it will intentionally create up to 10,000 negative
dentries which will never be looked up again.  I /think/ the original intent
of this work was to limit such rogue applications, so scaling with use probably
isn't the way to go.

-Eric


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-27 19:04   ` Eric Sandeen
@ 2020-02-27 22:39     ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2020-02-27 22:39 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Matthew Wilcox, Waiman Long, Alexander Viro, Jonathan Corbet,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, linux-kernel,
	linux-fsdevel, linux-doc, Mauro Carvalho Chehab, Eric Biggers

On Thu, Feb 27, 2020 at 11:04:40AM -0800, Eric Sandeen wrote:
> On 2/26/20 8:29 AM, Matthew Wilcox wrote:
> > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
> >> A new sysctl parameter "dentry-dir-max" is introduced which accepts a
> >> value of 0 (default) for no limit or a positive integer 256 and up. Small
> >> dentry-dir-max numbers are forbidden to avoid excessive dentry count
> >> checking which can impact system performance.
> > 
> > This is always the wrong approach.  A sysctl is just a way of blaming
> > the sysadmin for us not being very good at programming.
> > 
> > I agree that we need a way to limit the number of negative dentries.
> > But that limit needs to be dynamic and depend on how the system is being
> > used, not on how some overworked sysadmin has configured it.
> > 
> > So we need an initial estimate for the number of negative dentries that
> > we need for good performance.  Maybe it's 1000.  It doesn't really matter;
> > it's going to change dynamically.
> > 
> > Then we need a metric to let us know whether it needs to be increased.
> > Perhaps that's "number of new negative dentries created in the last
> > second".  And we need to decide how much to increase it; maybe it's by
> > 50% or maybe by 10%.  Perhaps somewhere between 10-100% depending on
> > how high the recent rate of negative dentry creation has been.
> 
> There are pitfalls to this approach as well.  Consider what libnss
> does every time it starts up (via curl in this case)
> 
> # cat /proc/sys/fs/dentry-state
> 3154271	3131421	45	0	2863333	0
> # for I in `seq 1 10`; do curl https://sandeen.net/ &>/dev/null; done
> # cat /proc/sys/fs/dentry-state
> 3170738	3147844	45	0	2879882	0
> 
> voila, 16k more negative dcache entries, thanks to:
> 
> https://github.com/nss-dev/nss/blob/317cb06697d5b953d825e050c1d8c1ee0d647010/lib/softoken/sdb.c#L390
> 
> i.e. each time it inits, it will intentionally create up to 10,000 negative
> dentries which will never be looked up again.

Sandboxing via memcg restricted cgroups means users and/or
applications cannot create unbound numbers of negative dentries, and
that largely solves this problem.

For a system daemons whose environment is controlled by a
systemd unit file, this should be pretty trivial to do, and memcg
directed memory reclaim will control negative dentry buildup.

For short-lived applications, teardown of the cgroup will free
all the negative dentries it created - they don't hang around
forever.

For long lived applications, negative dentries are bound by the
application memcg limits, and buildup will only affect the
applications own performance, not that of the whole system.

IOWs, I'd expect this sort of resource control problem to be solved
at the user, application and/or distro level, not with a huge kernel
hammer.

> I /think/ the original intent of this work was to limit such rogue
> applications, so scaling with use probably isn't the way to go.

The original intent was to prevent problems on old kernels that
supported terabytes of memory but could not use cgroup/memcg
infrastructure to isolate and contain negative dentry growth.
That was a much simpler, targeted negative dentry limiting solution,
not the ... craziness that can be found in this patchset.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-27  9:55     ` Ian Kent
@ 2020-02-28  3:34       ` Matthew Wilcox
  2020-02-28  4:16         ` Ian Kent
                           ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Matthew Wilcox @ 2020-02-28  3:34 UTC (permalink / raw)
  To: Ian Kent
  Cc: Andreas Dilger, Waiman Long, Alexander Viro, Jonathan Corbet,
	Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Linux Kernel Mailing List, Linux FS Devel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Dave Chinner, Eric Sandeen

On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote:
> Not all file systems even produce negative hashed dentries.
> 
> The most beneficial use of them is to improve performance of rapid
> fire lookups for non-existent names. Longer lived negative hashed
> dentries don't give much benefit at all unless they suddenly have
> lots of hits and that would cost a single allocation on the first
> lookup if the dentry ttl expired and the dentry discarded.
> 
> A ttl (say jiffies) set at appropriate times could be a better
> choice all round, no sysctl values at all.

The canonical argument in favour of negative dentries is to improve
application startup time as every application searches the library path
for the same libraries.  Only they don't do that any more:

$ strace -e file cat /dev/null
execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars */) = 0
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3

So, are they still useful?  Or should we, say, keep at most 100 around?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-28  3:34       ` Matthew Wilcox
@ 2020-02-28  4:16         ` Ian Kent
  2020-02-28  4:36           ` Ian Kent
  2020-02-28  4:22         ` Al Viro
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Ian Kent @ 2020-02-28  4:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andreas Dilger, Waiman Long, Alexander Viro, Jonathan Corbet,
	Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Linux Kernel Mailing List, Linux FS Devel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Dave Chinner, Eric Sandeen

On Thu, 2020-02-27 at 19:34 -0800, Matthew Wilcox wrote:
> On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote:
> > Not all file systems even produce negative hashed dentries.
> > 
> > The most beneficial use of them is to improve performance of rapid
> > fire lookups for non-existent names. Longer lived negative hashed
> > dentries don't give much benefit at all unless they suddenly have
> > lots of hits and that would cost a single allocation on the first
> > lookup if the dentry ttl expired and the dentry discarded.
> > 
> > A ttl (say jiffies) set at appropriate times could be a better
> > choice all round, no sysctl values at all.
> 
> The canonical argument in favour of negative dentries is to improve
> application startup time as every application searches the library
> path
> for the same libraries.  Only they don't do that any more:
> 
> $ strace -e file cat /dev/null
> execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars
> */) = 0
> access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or
> directory)
> openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6",
> O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib/locale/locale-archive",
> O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3
> 
> So, are they still useful?  Or should we, say, keep at most 100
> around?

Who knows how old apps will be on distros., ;)

But I don't think it matters.

The VFS will (should) work fine without a minimum negative hashed
dentry count (and hashed since unhashed negative dentries are
summarily executed on final dput()) and a ttl should keep frequently
used ones around long enough to satisfy this sort of thing should it
be needed.

Even the ttl value should be resilient to a large range of values,
just not so much very small ones.

Ian


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-28  3:34       ` Matthew Wilcox
  2020-02-28  4:16         ` Ian Kent
@ 2020-02-28  4:22         ` Al Viro
  2020-02-28  4:52           ` Ian Kent
  2020-02-28 15:32         ` Waiman Long
  2020-02-28 19:32         ` Theodore Y. Ts'o
  3 siblings, 1 reply; 34+ messages in thread
From: Al Viro @ 2020-02-28  4:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ian Kent, Andreas Dilger, Waiman Long, Jonathan Corbet,
	Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Linux Kernel Mailing List, Linux FS Devel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Dave Chinner, Eric Sandeen

On Thu, Feb 27, 2020 at 07:34:12PM -0800, Matthew Wilcox wrote:
> On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote:
> > Not all file systems even produce negative hashed dentries.
> > 
> > The most beneficial use of them is to improve performance of rapid
> > fire lookups for non-existent names. Longer lived negative hashed
> > dentries don't give much benefit at all unless they suddenly have
> > lots of hits and that would cost a single allocation on the first
> > lookup if the dentry ttl expired and the dentry discarded.
> > 
> > A ttl (say jiffies) set at appropriate times could be a better
> > choice all round, no sysctl values at all.
> 
> The canonical argument in favour of negative dentries is to improve
> application startup time as every application searches the library path
> for the same libraries.  Only they don't do that any more:

	Tell that to scripts that keep looking through $PATH for
binaries each time they are run.  Tell that to cc(1) looking through
include path, etc.

	Ian, autofs is deeply pathological in that respect; that's OK,
since it has very unusual needs, but please don't use it as a model
for anything else - its needs *are* unusual.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-28  4:16         ` Ian Kent
@ 2020-02-28  4:36           ` Ian Kent
  2020-02-28  4:52             ` Al Viro
  0 siblings, 1 reply; 34+ messages in thread
From: Ian Kent @ 2020-02-28  4:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andreas Dilger, Waiman Long, Alexander Viro, Jonathan Corbet,
	Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Linux Kernel Mailing List, Linux FS Devel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Dave Chinner, Eric Sandeen

On Fri, 2020-02-28 at 12:16 +0800, Ian Kent wrote:
> On Thu, 2020-02-27 at 19:34 -0800, Matthew Wilcox wrote:
> > On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote:
> > > Not all file systems even produce negative hashed dentries.
> > > 
> > > The most beneficial use of them is to improve performance of
> > > rapid
> > > fire lookups for non-existent names. Longer lived negative hashed
> > > dentries don't give much benefit at all unless they suddenly have
> > > lots of hits and that would cost a single allocation on the first
> > > lookup if the dentry ttl expired and the dentry discarded.
> > > 
> > > A ttl (say jiffies) set at appropriate times could be a better
> > > choice all round, no sysctl values at all.
> > 
> > The canonical argument in favour of negative dentries is to improve
> > application startup time as every application searches the library
> > path
> > for the same libraries.  Only they don't do that any more:
> > 
> > $ strace -e file cat /dev/null
> > execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars
> > */) = 0
> > access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file
> > or
> > directory)
> > openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6",
> > O_RDONLY|O_CLOEXEC) = 3
> > openat(AT_FDCWD, "/usr/lib/locale/locale-archive",
> > O_RDONLY|O_CLOEXEC) = 3
> > openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3
> > 
> > So, are they still useful?  Or should we, say, keep at most 100
> > around?
> 
> Who knows how old apps will be on distros., ;)

Or what admins put in the PATH, I've seen oddness in that
a lot.

> 
> But I don't think it matters.

And I don't think I made my answer to the question clear.

I don't think setting a minimum matters but there are other
sources of a possibly significant number of lookups on
paths that don't exist. I've seen evidence recently
(although I suspect unfounded) that systemd can generate
lots of these lookups at times.

And let's not forget that file systems are the primary
source of these and not all create them on lookups.
I may be mistaken, but I think ext4 does not while xfs
definitely does.

The more important metric I think is calculating a sensible
maximum to be pruned to prevent getting bogged down as there
could be times when a lot of these are present. After all this
is meant to be an iterative pruning measure.

Ian


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-28  4:22         ` Al Viro
@ 2020-02-28  4:52           ` Ian Kent
  0 siblings, 0 replies; 34+ messages in thread
From: Ian Kent @ 2020-02-28  4:52 UTC (permalink / raw)
  To: Al Viro, Matthew Wilcox
  Cc: Andreas Dilger, Waiman Long, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Linux Kernel Mailing List,
	Linux FS Devel, linux-doc, Mauro Carvalho Chehab, Eric Biggers,
	Dave Chinner, Eric Sandeen

On Fri, 2020-02-28 at 04:22 +0000, Al Viro wrote:
> On Thu, Feb 27, 2020 at 07:34:12PM -0800, Matthew Wilcox wrote:
> > On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote:
> > > Not all file systems even produce negative hashed dentries.
> > > 
> > > The most beneficial use of them is to improve performance of
> > > rapid
> > > fire lookups for non-existent names. Longer lived negative hashed
> > > dentries don't give much benefit at all unless they suddenly have
> > > lots of hits and that would cost a single allocation on the first
> > > lookup if the dentry ttl expired and the dentry discarded.
> > > 
> > > A ttl (say jiffies) set at appropriate times could be a better
> > > choice all round, no sysctl values at all.
> > 
> > The canonical argument in favour of negative dentries is to improve
> > application startup time as every application searches the library
> > path
> > for the same libraries.  Only they don't do that any more:
> 
> 	Tell that to scripts that keep looking through $PATH for
> binaries each time they are run.  Tell that to cc(1) looking through
> include path, etc.
> 
> 	Ian, autofs is deeply pathological in that respect; that's OK,
> since it has very unusual needs, but please don't use it as a model
> for anything else - its needs *are* unusual.

Ok, but my thoughts aren't based on autofs behaviours.

But it sounds like you don't believe this is a sensible suggestion
and you would know best so ...

Ian


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-28  4:36           ` Ian Kent
@ 2020-02-28  4:52             ` Al Viro
  0 siblings, 0 replies; 34+ messages in thread
From: Al Viro @ 2020-02-28  4:52 UTC (permalink / raw)
  To: Ian Kent
  Cc: Matthew Wilcox, Andreas Dilger, Waiman Long, Jonathan Corbet,
	Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Linux Kernel Mailing List, Linux FS Devel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Dave Chinner, Eric Sandeen

On Fri, Feb 28, 2020 at 12:36:09PM +0800, Ian Kent wrote:

> And let's not forget that file systems are the primary
> source of these and not all create them on lookups.
> I may be mistaken, but I think ext4 does not while xfs
> definitely does.

Both ext4 and xfs bloody well *DO* create hashed negative
dentries on lookups.  There is a pathological case when
they are trying to be case-insensitive (and in that situation
we are SOL - if somebody insists upon mounting with
-o make-it-suck, that's what they bloody well get).

Casefondling idiocy aside, negative lookups are hashed.
On all normal filesystems.  Look for d_splice_alias()
getting passed NULL inode - that's where ->lookup()
instances normally create those.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-28  3:34       ` Matthew Wilcox
  2020-02-28  4:16         ` Ian Kent
  2020-02-28  4:22         ` Al Viro
@ 2020-02-28 15:32         ` Waiman Long
  2020-02-28 15:39           ` Matthew Wilcox
  2020-02-28 19:32         ` Theodore Y. Ts'o
  3 siblings, 1 reply; 34+ messages in thread
From: Waiman Long @ 2020-02-28 15:32 UTC (permalink / raw)
  To: Matthew Wilcox, Ian Kent
  Cc: Andreas Dilger, Alexander Viro, Jonathan Corbet,
	Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Linux Kernel Mailing List, Linux FS Devel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Dave Chinner, Eric Sandeen

On 2/27/20 10:34 PM, Matthew Wilcox wrote:
> On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote:
>> Not all file systems even produce negative hashed dentries.
>>
>> The most beneficial use of them is to improve performance of rapid
>> fire lookups for non-existent names. Longer lived negative hashed
>> dentries don't give much benefit at all unless they suddenly have
>> lots of hits and that would cost a single allocation on the first
>> lookup if the dentry ttl expired and the dentry discarded.
>>
>> A ttl (say jiffies) set at appropriate times could be a better
>> choice all round, no sysctl values at all.
> The canonical argument in favour of negative dentries is to improve
> application startup time as every application searches the library path
> for the same libraries.  Only they don't do that any more:
>
> $ strace -e file cat /dev/null
> execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars */) = 0
> access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3
>
> So, are they still useful?  Or should we, say, keep at most 100 around?
>
It is the shell that does the path search, not the command itself.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-28 15:32         ` Waiman Long
@ 2020-02-28 15:39           ` Matthew Wilcox
  0 siblings, 0 replies; 34+ messages in thread
From: Matthew Wilcox @ 2020-02-28 15:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ian Kent, Andreas Dilger, Alexander Viro, Jonathan Corbet,
	Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Linux Kernel Mailing List, Linux FS Devel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Dave Chinner, Eric Sandeen

On Fri, Feb 28, 2020 at 10:32:02AM -0500, Waiman Long wrote:
> On 2/27/20 10:34 PM, Matthew Wilcox wrote:
> > The canonical argument in favour of negative dentries is to improve
> > application startup time as every application searches the library path
                                                               ^^^^^^^
> > for the same libraries.  Only they don't do that any more:
                 ^^^^^^^^^
>
> It is the shell that does the path search, not the command itself.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-27  8:30 ` Dave Chinner
@ 2020-02-28 15:47   ` Waiman Long
  0 siblings, 0 replies; 34+ messages in thread
From: Waiman Long @ 2020-02-28 15:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, linux-kernel, linux-fsdevel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Eric Sandeen

On 2/27/20 3:30 AM, Dave Chinner wrote:
> On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
>> As there is no limit for negative dentries, it is possible that a sizeable
>> portion of system memory can be tied up in dentry cache slabs. Dentry slabs
>> are generally recalimable if the dentries are in the LRUs. Still having
>> too much memory used up by dentries can be problematic:
> I don't get it.
>
> Why isn't the solution simply "constrain the application generating
> unbound numbers of dentries to a memcg"?
>
> Then when the memcg runs out of memory, it will start reclaiming the
> dentries that were allocated inside the memcg that are using all
> it's resources, thereby preventing unbound growth of the dentry
> cache.
>
> I mean, this sort of resource control is exactly what memcgs are
> supposed to be used for and are already used for. I don't see why we
> need all this complexity for global dentry resource management when
> memcgs should already provide an effective means of managing and
> placing bounds on the amount of memory any specific application can
> use...

Using memcg is one way to limit the damage. The argument that excessive
negative dentries can push out existing memory objects that can be more
useful if left alone still applies. Daemons that run in the root memcg
has no limitation on how much memory that they can use.

There can also be memcgs with high memory limits and long running
applications. memcg is certainly a useful tool in this regards, but it
doesn't solve all the problem.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-28  3:34       ` Matthew Wilcox
                           ` (2 preceding siblings ...)
  2020-02-28 15:32         ` Waiman Long
@ 2020-02-28 19:32         ` Theodore Y. Ts'o
  3 siblings, 0 replies; 34+ messages in thread
From: Theodore Y. Ts'o @ 2020-02-28 19:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ian Kent, Andreas Dilger, Waiman Long, Alexander Viro,
	Jonathan Corbet, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Linux Kernel Mailing List, Linux FS Devel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Dave Chinner, Eric Sandeen

On Thu, Feb 27, 2020 at 07:34:12PM -0800, Matthew Wilcox wrote:
> 
> The canonical argument in favour of negative dentries is to improve
> application startup time as every application searches the library path
> for the same libraries.

The other canonical example is C compilers that need to search for
header files along the include search path:

% strace  -o /tmp/st -f gcc -o /tmp/hello /tmp/hello.c -I.. -I../..
% grep open /tmp/st | grep stdio.h | grep ENOENT | wc -l
6

						- Ted

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
                   ` (12 preceding siblings ...)
  2020-02-27  8:30 ` Dave Chinner
@ 2020-03-15  3:46 ` Matthew Wilcox
  2020-03-21 10:17   ` Konstantin Khlebnikov
  13 siblings, 1 reply; 34+ messages in thread
From: Matthew Wilcox @ 2020-03-15  3:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, linux-kernel, linux-fsdevel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Dave Chinner, Eric Sandeen

On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
> As there is no limit for negative dentries, it is possible that a sizeable
> portion of system memory can be tied up in dentry cache slabs. Dentry slabs
> are generally recalimable if the dentries are in the LRUs. Still having
> too much memory used up by dentries can be problematic:
> 
>  1) When a filesystem with too many negative dentries is being unmounted,
>     the process of draining the dentries associated with the filesystem
>     can take some time. To users, the system may seem to hang for
>     a while.  The long wait may also cause unexpected timeout error or
>     other warnings.  This can happen when a long running container with
>     many negative dentries is being destroyed, for instance.
> 
>  2) Tying up too much memory in unused negative dentries means there
>     are less memory available for other use. Even though the kernel is
>     able to reclaim unused dentries when running out of free memory,
>     it will still introduce additional latency to the application
>     reducing its performance.

There's a third problem, which is that having a lot of negative dentries
can clog the hash chains.  I tried to quantify this, and found a weird result:

root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real	0m5.402s
user	0m4.361s
sys	0m1.230s
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real	0m5.572s
user	0m4.337s
sys	0m1.407s
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real	0m5.607s
user	0m4.522s
sys	0m1.342s
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real	0m5.599s
user	0m4.472s
sys	0m1.369s
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real	0m5.574s
user	0m4.498s
sys	0m1.300s

Pretty consistent system time, between about 1.3 and 1.4 seconds.

root@bobo-kvm:~# grep dentry /proc/slabinfo 
dentry             20394  21735    192   21    1 : tunables    0    0    0 : slabdata   1035   1035      0
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real	0m5.515s
user	0m4.353s
sys	0m1.359s

At this point, we have 20k dentries allocated.

Now, pollute the dcache with names that don't exist:

root@bobo-kvm:~# for i in `seq 1 100000`; do cat /dev/null$i >/dev/zero; done 2>/dev/null
root@bobo-kvm:~# grep dentry /proc/slabinfo 
dentry             20605  21735    192   21    1 : tunables    0    0    0 : slabdata   1035   1035      0

Huh.  We've kept the number of dentries pretty constant.  Still, maybe the
bad dentries have pushed out the good ones.

root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real	0m6.644s
user	0m4.921s
sys	0m1.946s
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real	0m6.676s
user	0m5.004s
sys	0m1.909s
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real	0m6.662s
user	0m4.980s
sys	0m1.916s
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real	0m6.714s
user	0m4.973s
sys	0m1.986s

Well, we certainly made it suck.  Up to a pretty consistent 1.9-2.0 seconds
of kernel time, or 50% worse.  We've also made user time worse, somehow.

Anyhow, I should write a proper C program to measure this.  But I thought
I'd share this raw data with you now to demonstrate that dcache pollution
is a real problem today, even on a machine with 2GB.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries
  2020-03-15  3:46 ` Matthew Wilcox
@ 2020-03-21 10:17   ` Konstantin Khlebnikov
  0 siblings, 0 replies; 34+ messages in thread
From: Konstantin Khlebnikov @ 2020-03-21 10:17 UTC (permalink / raw)
  To: Matthew Wilcox, Waiman Long
  Cc: Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, linux-kernel, linux-fsdevel, linux-doc,
	Mauro Carvalho Chehab, Eric Biggers, Dave Chinner, Eric Sandeen

On 15/03/2020 06.46, Matthew Wilcox wrote:
> On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
>> As there is no limit for negative dentries, it is possible that a sizeable
>> portion of system memory can be tied up in dentry cache slabs. Dentry slabs
>> are generally recalimable if the dentries are in the LRUs. Still having
>> too much memory used up by dentries can be problematic:
>>
>>   1) When a filesystem with too many negative dentries is being unmounted,
>>      the process of draining the dentries associated with the filesystem
>>      can take some time. To users, the system may seem to hang for
>>      a while.  The long wait may also cause unexpected timeout error or
>>      other warnings.  This can happen when a long running container with
>>      many negative dentries is being destroyed, for instance.
>>
>>   2) Tying up too much memory in unused negative dentries means there
>>      are less memory available for other use. Even though the kernel is
>>      able to reclaim unused dentries when running out of free memory,
>>      it will still introduce additional latency to the application
>>      reducing its performance.
> 
> There's a third problem, which is that having a lot of negative dentries
> can clog the hash chains.  I tried to quantify this, and found a weird result:

Yep. I've seen this in the wild. Server hard too much unused memory.

https://lore.kernel.org/lkml/ff0993a2-9825-304c-6a5b-2e9d4b940032@yandex-team.ru/T/#u

---quote---

I've seen problem on large server where horde of negative dentries
slowed down all lookups significantly:

watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [atop:968884] at __d_lookup_rcu+0x6f/0x190

slabtop:

    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
85118166 85116916   0%    0.19K 2026623       42  16212984K dentry
16577106 16371723   0%    0.10K 425054       39   1700216K buffer_head
935850 934379   0%    1.05K  31195       30    998240K ext4_inode_cache
663740 654967   0%    0.57K  23705       28    379280K radix_tree_node
399987 380055   0%    0.65K   8163       49    261216K proc_inode_cache
226380 168813   0%    0.19K   5390       42     43120K cred_jar
   70345  65721   0%    0.58K   1279       55     40928K inode_cache
105927  43314   0%    0.31K   2077       51     33232K filp
630972 601503   0%    0.04K   6186      102     24744K ext4_extent_status
    5848   4269   0%    3.56K    731        8     23392K task_struct
   16224  11531   0%    1.00K    507       32     16224K kmalloc-1024
    6752   5833   0%    2.00K    422       16     13504K kmalloc-2048
199680 158086   0%    0.06K   3120       64     12480K anon_vma_chain
156128 154751   0%    0.07K   2788       56     11152K Acpi-Operand

Total RAM is 256 GB

These dentries came from temporary files created and deleted by postgres.
But this could be easily reproduced by lookup of non-existent files.

Of course, memory pressure easily washes them away.

Similar problem happened before around proc sysctl entries:
https://lkml.org/lkml/2017/2/10/47

This one does not concentrate in one bucket and needs much more memory.

Looks like dcache needs some kind of background shrinker started
when dcache size or fraction of negative dentries exceeds some threshold.

---end---

> > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real	0m5.402s
> user	0m4.361s
> sys	0m1.230s
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real	0m5.572s
> user	0m4.337s
> sys	0m1.407s
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real	0m5.607s
> user	0m4.522s
> sys	0m1.342s
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real	0m5.599s
> user	0m4.472s
> sys	0m1.369s
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real	0m5.574s
> user	0m4.498s
> sys	0m1.300s
> 
> Pretty consistent system time, between about 1.3 and 1.4 seconds.
> 
> root@bobo-kvm:~# grep dentry /proc/slabinfo
> dentry             20394  21735    192   21    1 : tunables    0    0    0 : slabdata   1035   1035      0
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real	0m5.515s
> user	0m4.353s
> sys	0m1.359s
> 
> At this point, we have 20k dentries allocated.
> 
> Now, pollute the dcache with names that don't exist:
> 
> root@bobo-kvm:~# for i in `seq 1 100000`; do cat /dev/null$i >/dev/zero; done 2>/dev/null
> root@bobo-kvm:~# grep dentry /proc/slabinfo
> dentry             20605  21735    192   21    1 : tunables    0    0    0 : slabdata   1035   1035      0
> 
> Huh.  We've kept the number of dentries pretty constant.  Still, maybe the
> bad dentries have pushed out the good ones.
> 
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real	0m6.644s
> user	0m4.921s
> sys	0m1.946s
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real	0m6.676s
> user	0m5.004s
> sys	0m1.909s
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real	0m6.662s
> user	0m4.980s
> sys	0m1.916s
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real	0m6.714s
> user	0m4.973s
> sys	0m1.986s
> 
> Well, we certainly made it suck.  Up to a pretty consistent 1.9-2.0 seconds
> of kernel time, or 50% worse.  We've also made user time worse, somehow.
> 
> Anyhow, I should write a proper C program to measure this.  But I thought
> I'd share this raw data with you now to demonstrate that dcache pollution
> is a real problem today, even on a machine with 2GB.
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, back to index

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-26 16:13 [PATCH 00/11] fs/dcache: Limit # of negative dentries Waiman Long
2020-02-26 16:13 ` [PATCH 01/11] fs/dcache: Fix incorrect accounting " Waiman Long
2020-02-26 16:13 ` [PATCH 02/11] fs/dcache: Simplify __dentry_kill() Waiman Long
2020-02-26 16:13 ` [PATCH 03/11] fs/dcache: Add a counter to track number of children Waiman Long
2020-02-26 16:13 ` [PATCH 04/11] fs/dcache: Add sysctl parameter dentry-dir-max Waiman Long
2020-02-26 16:13 ` [PATCH 05/11] fs/dcache: Reclaim excessive negative dentries in directories Waiman Long
2020-02-26 16:13 ` [PATCH 06/11] fs/dcache: directory opportunistically stores # of positive dentries Waiman Long
2020-02-26 16:14 ` [PATCH 07/11] fs/dcache: Add static key negative_reclaim_enable Waiman Long
2020-02-26 16:14 ` [PATCH 08/11] fs/dcache: Limit dentry reclaim count in negative_reclaim_workfn() Waiman Long
2020-02-26 16:14 ` [PATCH 09/11] fs/dcache: Don't allow small values for dentry-dir-max Waiman Long
2020-02-26 16:14 ` [PATCH 10/11] fs/dcache: Kill off dentry as last resort Waiman Long
2020-02-26 16:14 ` [PATCH 11/11] fs/dcache: Track # of negative dentries reclaimed & killed Waiman Long
2020-02-26 16:29 ` [PATCH 00/11] fs/dcache: Limit # of negative dentries Matthew Wilcox
2020-02-26 19:19   ` Waiman Long
2020-02-26 21:28     ` Matthew Wilcox
2020-02-26 21:28   ` Andreas Dilger
2020-02-26 21:45     ` Matthew Wilcox
2020-02-27  8:07       ` Dave Chinner
2020-02-27  9:55     ` Ian Kent
2020-02-28  3:34       ` Matthew Wilcox
2020-02-28  4:16         ` Ian Kent
2020-02-28  4:36           ` Ian Kent
2020-02-28  4:52             ` Al Viro
2020-02-28  4:22         ` Al Viro
2020-02-28  4:52           ` Ian Kent
2020-02-28 15:32         ` Waiman Long
2020-02-28 15:39           ` Matthew Wilcox
2020-02-28 19:32         ` Theodore Y. Ts'o
2020-02-27 19:04   ` Eric Sandeen
2020-02-27 22:39     ` Dave Chinner
2020-02-27  8:30 ` Dave Chinner
2020-02-28 15:47   ` Waiman Long
2020-03-15  3:46 ` Matthew Wilcox
2020-03-21 10:17   ` Konstantin Khlebnikov

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git