* [PATCH RFC 0/8] dcache: increase poison resistance
@ 2020-05-08 12:23 Konstantin Khlebnikov
  2020-05-08 12:23 ` [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state Konstantin Khlebnikov
                   ` (8 more replies)
  0 siblings, 9 replies; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-08 12:23 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro; +Cc: Waiman Long

For most filesystems the result of every negative lookup is cached, and the
contents of directories are usually cached too. Production of negative
dentries isn't limited by disk speed. It's really easy to generate millions
of them if the system has enough memory.

Getting this memory back isn't that easy, because the slab allocator frees
pages only when all related objects are gone, while the dcache shrinker
works in LRU order.

A typical scenario is an idle system where some process periodically creates
temporary files and removes them. After some time, memory will be filled
with negative dentries for these random file names.

A simple lookup of random names also generates negative dentries very fast.
A constant flow of such negative dentries drains all other inactive caches.
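
A minimal illustration of the problem (not part of the patch set; the file
name pattern is made up):

	/*
	 * Every lookup of a name that does not exist fails with ENOENT,
	 * but still leaves a negative dentry behind in the dcache.
	 */
	#include <stdio.h>
	#include <sys/stat.h>

	int main(void)
	{
		struct stat st;
		char name[64];
		long i;

		for (i = 0; i < 1000000; i++) {
			snprintf(name, sizeof(name), "/tmp/no-such-file-%ld", i);
			stat(name, &st);	/* ENOENT, caches a negative dentry */
		}
		return 0;
	}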

Negative dentries are linked into the siblings list along with normal
positive dentries. Some operations walk the dcache tree but look only for
positive dentries; the most important is fsnotify/inotify. Hordes of
negative dentries slow down these operations significantly.

Dentry lookup time is usually unaffected because the hash table grows along
with the size of memory, unless somebody deliberately crafts hash collisions.

This patch set addresses these problems as follows:

Move negative dentries to the end of the siblings list, so that walkers can
skip them at first sight (patches 3-6).

Keep at most three unreferenced negative dentries in a row in each hash
bucket (patches 7-8).

---

Konstantin Khlebnikov (8):
      dcache: show count of hash buckets in sysctl fs.dentry-state
      selftests: add stress testing tool for dcache
      dcache: sweep cached negative dentries to the end of list of siblings
      fsnotify: stop walking child dentries if remaining tail is negative
      dcache: add action D_WALK_SKIP_SIBLINGS to d_walk()
      dcache: stop walking siblings if remaining dentries all negative
      dcache: push releasing dentry lock into sweep_negative
      dcache: prevent flooding with negative dentries


 fs/dcache.c                                   | 144 +++++++++++-
 fs/libfs.c                                    |  10 +-
 fs/notify/fsnotify.c                          |   6 +-
 include/linux/dcache.h                        |   6 +
 tools/testing/selftests/filesystems/Makefile  |   1 +
 .../selftests/filesystems/dcache_stress.c     | 210 ++++++++++++++++++
 6 files changed, 370 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/filesystems/dcache_stress.c

--
Signature

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state
  2020-05-08 12:23 [PATCH RFC 0/8] dcache: increase poison resistance Konstantin Khlebnikov
@ 2020-05-08 12:23 ` Konstantin Khlebnikov
  2020-05-08 14:49   ` Waiman Long
  2020-05-08 12:23 ` [PATCH RFC 2/8] selftests: add stress testing tool for dcache Konstantin Khlebnikov
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-08 12:23 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro; +Cc: Waiman Long

The count of buckets is required for estimating the average length of hash
chains. The size of the hash table depends on the memory size and is printed
only once at boot.

Let's expose nr_buckets as the sixth number in sysctl fs.dentry-state.
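
Purely illustrative example of the resulting output (the numbers are made
up; the field order matches the struct below):

	$ cat /proc/sys/fs/dentry-state
	24619259  24605226  45  0  24600351  8388608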

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 Documentation/admin-guide/sysctl/fs.rst |   12 ++++++------
 fs/dcache.c                             |   12 ++++++++++--
 include/linux/dcache.h                  |    2 +-
 3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 2a45119e3331..b74df4714ddd 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -66,12 +66,12 @@ dentry-state
 From linux/include/linux/dcache.h::
 
   struct dentry_stat_t dentry_stat {
-        int nr_dentry;
-        int nr_unused;
-        int age_limit;         /* age in seconds */
-        int want_pages;        /* pages requested by system */
-        int nr_negative;       /* # of unused negative dentries */
-        int dummy;             /* Reserved for future use */
+        long nr_dentry;
+        long nr_unused;
+        long age_limit;         /* age in seconds */
+        long want_pages;        /* pages requested by system */
+        long nr_negative;       /* # of unused negative dentries */
+        long nr_buckets;        /* count of dcache hash buckets */
   };
 
 Dentries are dynamically allocated and deallocated.
diff --git a/fs/dcache.c b/fs/dcache.c
index b280e07e162b..386f97eaf2ff 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3139,6 +3139,14 @@ static int __init set_dhash_entries(char *str)
 }
 __setup("dhash_entries=", set_dhash_entries);
 
+static void __init dcache_init_hash(void)
+{
+	dentry_stat.nr_buckets = 1l << d_hash_shift;
+
+	/* shift to use higher bits of 32 bit hash value */
+	d_hash_shift = 32 - d_hash_shift;
+}
+
 static void __init dcache_init_early(void)
 {
 	/* If hashes are distributed across NUMA nodes, defer
@@ -3157,7 +3165,7 @@ static void __init dcache_init_early(void)
 					NULL,
 					0,
 					0);
-	d_hash_shift = 32 - d_hash_shift;
+	dcache_init_hash();
 }
 
 static void __init dcache_init(void)
@@ -3185,7 +3193,7 @@ static void __init dcache_init(void)
 					NULL,
 					0,
 					0);
-	d_hash_shift = 32 - d_hash_shift;
+	dcache_init_hash();
 }
 
 /* SLAB cache for __getname() consumers */
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index c1488cc84fd9..082b55068e4d 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -65,7 +65,7 @@ struct dentry_stat_t {
 	long age_limit;		/* age in seconds */
 	long want_pages;	/* pages requested by system */
 	long nr_negative;	/* # of unused negative dentries */
-	long dummy;		/* Reserved for future use */
+	long nr_buckets;	/* count of dcache hash buckets */
 };
 extern struct dentry_stat_t dentry_stat;
 


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH RFC 2/8] selftests: add stress testing tool for dcache
  2020-05-08 12:23 [PATCH RFC 0/8] dcache: increase poison resistance Konstantin Khlebnikov
  2020-05-08 12:23 ` [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state Konstantin Khlebnikov
@ 2020-05-08 12:23 ` Konstantin Khlebnikov
  2020-05-13  1:52   ` Dave Chinner
  2020-05-08 12:23 ` [PATCH RFC 3/8] dcache: sweep cached negative dentries to the end of list of siblings Konstantin Khlebnikov
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-08 12:23 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro; +Cc: Waiman Long

This tool fills the dcache with negative dentries. Between iterations it
prints statistics and measures the time of an inotify operation, which
might degrade.
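
Illustrative invocations (the directory path is made up; the options are
the ones defined by the tool below):

	# defaults: 10 iterations of 2^20 unique, never-reused names
	./dcache_stress

	# mknod+unlink variant, interleaved with lookups in another directory
	./dcache_stress -i 20 -n 2000000 -u -o /some/other/dir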

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 tools/testing/selftests/filesystems/Makefile       |    1 
 .../testing/selftests/filesystems/dcache_stress.c  |  210 ++++++++++++++++++++
 2 files changed, 211 insertions(+)
 create mode 100644 tools/testing/selftests/filesystems/dcache_stress.c

diff --git a/tools/testing/selftests/filesystems/Makefile b/tools/testing/selftests/filesystems/Makefile
index 129880fb42d3..6b5e08617d11 100644
--- a/tools/testing/selftests/filesystems/Makefile
+++ b/tools/testing/selftests/filesystems/Makefile
@@ -3,5 +3,6 @@
 CFLAGS += -I../../../../usr/include/
 TEST_GEN_PROGS := devpts_pts
 TEST_GEN_PROGS_EXTENDED := dnotify_test
+TEST_GEN_FILES += dcache_stress
 
 include ../lib.mk
diff --git a/tools/testing/selftests/filesystems/dcache_stress.c b/tools/testing/selftests/filesystems/dcache_stress.c
new file mode 100644
index 000000000000..770e8876629e
--- /dev/null
+++ b/tools/testing/selftests/filesystems/dcache_stress.c
@@ -0,0 +1,210 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/inotify.h>
+#include <sys/stat.h>
+#include <time.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <string.h>
+#include <err.h>
+
+double now(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return ts.tv_sec + ts.tv_nsec * 1e-9;
+}
+
+struct dentry_stat {
+	long nr_dentry;
+	long nr_unused;
+	long age_limit;		/* age in seconds */
+	long want_pages;	/* pages requested by system */
+	long nr_negative;	/* # of unused negative dentries */
+	long nr_buckets;	/* count of dcache hash buckets */
+};
+
+void show_dentry_state(void)
+{
+	struct dentry_stat stat;
+	ssize_t len;
+	FILE *f;
+
+	f = fopen("/proc/sys/fs/dentry-state", "r");
+	if (!f)
+		err(2, "open fs.dentry-state");
+
+	if (fscanf(f, "%ld %ld %ld %ld %ld %ld",
+		   &stat.nr_dentry,
+		   &stat.nr_unused,
+		   &stat.age_limit,
+		   &stat.want_pages,
+		   &stat.nr_negative,
+		   &stat.nr_buckets) != 6)
+		err(2, "read fs.dentry-state");
+	fclose(f);
+
+	if (!stat.nr_buckets)
+		stat.nr_buckets = 1 << 20;	// for 8Gb ram
+
+	printf("nr_dentry = %ld\t%.1fM\n", stat.nr_dentry, stat.nr_dentry / 1e6);
+	printf("nr_buckets = %ld\t%.1f avg\n", stat.nr_buckets, (double)stat.nr_dentry / stat.nr_buckets);
+	printf("nr_unused = %ld\t%.1f%%\n", stat.nr_unused, stat.nr_unused * 100. / stat.nr_dentry);
+	printf("nr_negative = %ld\t%.1f%%\n\n", stat.nr_negative, stat.nr_negative * 100. / stat.nr_dentry);
+}
+
+void test_inotify(const char *path)
+{
+	double tm;
+	int fd;
+
+	fd = inotify_init1(0);
+
+	tm = now();
+	inotify_add_watch(fd, path, IN_OPEN);
+	tm = now() - tm;
+
+	printf("inotify time: %f seconds\n\n", tm);
+
+	close(fd);
+}
+
+int main(int argc, char **argv)
+{
+	char dir_name[] = "dcache_stress.XXXXXX";
+	char name[4096];
+	char *suffix = name;
+	int nr_iterations = 10;
+	int nr_names = 1 << 20;
+	int iteration, index;
+	int other_dir = -1;
+	int mknod_unlink = 0;
+	int mkdir_chdir = 0;
+	int second_access = 0;
+	long long total_names = 0;
+	double tm;
+	int opt;
+
+	while ((opt = getopt(argc, argv, "i:n:p:o:usdh")) != -1) {
+		switch (opt) {
+		case 'i':
+			nr_iterations = atoi(optarg);
+			break;
+		case 'n':
+			nr_names = atoi(optarg);
+			break;
+		case 'p':
+			strcpy(suffix, optarg);
+			suffix += strlen(suffix);
+			break;
+		case 'o':
+			other_dir = open(optarg, O_RDONLY | O_DIRECTORY);
+			if (other_dir < 0)
+				err(2, "open %s", optarg);
+			break;
+		case 'u':
+			mknod_unlink = 1;
+			break;
+		case 'd':
+			mkdir_chdir = 1;
+			break;
+		case 's':
+			second_access = 1;
+			break;
+		case '?':
+		case 'h':
+			printf("usage: %s [-i <iterations>] [-n <names>] [-p <prefix>] [-o <dir>] [-u] [-s]\n"
+			       "  -i  test iterations, default %d\n"
+			       "  -n  names at each iteration, default %d\n"
+			       "  -p  prefix for names\n"
+			       "  -o  interleave with other dir\n"
+			       "  -s  touch twice\n"
+			       "  -u  mknod-unlink sequence\n"
+			       "  -d  mkdir-chdir sequence (leaves garbage)\n",
+			       argv[0], nr_iterations, nr_names);
+			return 1;
+		}
+	}
+
+
+	if (!mkdtemp(dir_name))
+		err(2, "mkdtemp");
+
+	if (chdir(dir_name))
+		err(2, "chdir");
+
+	show_dentry_state();
+
+	if (!mkdir_chdir)
+		test_inotify(".");
+
+	printf("working in temporary directory %s\n\n", dir_name);
+
+	for (iteration = 1; iteration <= nr_iterations; iteration++) {
+
+		printf("start iteration %d, %d names\n", iteration, nr_names);
+
+		tm = now();
+
+		sprintf(suffix, "%08x", iteration);
+
+		for (index = 0; index < nr_names; index++) {
+			sprintf(suffix + 8, "%08x", index);
+
+			if (mknod_unlink) {
+				if (mknod(name, S_IFREG, 0))
+					err(2, "mknod %s", name);
+				if (unlink(name))
+					err(2, "unlink %s", name);
+			} else if (mkdir_chdir) {
+				if (mkdir(name, 0775))
+					err(2, "mkdir %s", name);
+				if (chdir(name))
+					err(2, "chdir %s", name);
+			} else
+				access(name, 0);
+
+			if (second_access)
+				access(name, 0);
+
+			if (other_dir >= 0) {
+				faccessat(other_dir, name, 0, 0);
+				if (second_access)
+					faccessat(other_dir, name, 0, 0);
+			}
+		}
+
+		total_names += nr_names;
+
+		tm = now() - tm;
+		printf("iteration %d complete in %f seconds, total names %lld\n\n", iteration, tm, total_names);
+
+		show_dentry_state();
+
+		if (!mkdir_chdir)
+			test_inotify(".");
+	}
+
+	if (chdir(".."))
+		err(2, "chdir");
+
+	if (mkdir_chdir) {
+		printf("leave temporary directory %s\n", dir_name);
+		return 0;
+	}
+
+	printf("removing temporary directory %s\n", dir_name);
+	tm = now();
+	if (rmdir(dir_name))
+		err(2, "rmdir");
+	tm = now() - tm;
+	printf("remove complete in %f seconds\n\n", tm);
+
+	show_dentry_state();
+
+	return 0;
+}


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH RFC 3/8] dcache: sweep cached negative dentries to the end of list of siblings
  2020-05-08 12:23 [PATCH RFC 0/8] dcache: increase poison resistance Konstantin Khlebnikov
  2020-05-08 12:23 ` [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state Konstantin Khlebnikov
  2020-05-08 12:23 ` [PATCH RFC 2/8] selftests: add stress testing tool for dcache Konstantin Khlebnikov
@ 2020-05-08 12:23 ` Konstantin Khlebnikov
  2020-05-08 19:38   ` Waiman Long
  2020-05-08 12:23 ` [PATCH RFC 4/8] fsnotify: stop walking child dentries if remaining tail is negative Konstantin Khlebnikov
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-08 12:23 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro; +Cc: Waiman Long

For disk filesystems the result of every negative lookup is cached, and the
contents of directories are usually cached too. Production of negative
dentries isn't limited by disk speed. It's really easy to generate millions
of them if the system has enough memory. Negative dentries are linked into
the siblings list along with normal positive dentries. Some operations walk
the dcache tree but look only for positive dentries; the most important is
fsnotify/inotify.

This patch moves negative dentries to the end of the list at final dput()
and marks them with a flag which tells that all following dentries are
negative too. The reverse operation is required before instantiating a
negative dentry.
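
A minimal sketch of the walker pattern this enables (hypothetical helper;
locking and cursor handling omitted; compare the fsnotify change later in
this series):

	/*
	 * Once the first tail-negative child is seen, every following
	 * sibling is negative too, so a walker that only cares about
	 * positive children can stop early.
	 */
	static void for_each_positive_child(struct dentry *parent,
					    void (*fn)(struct dentry *))
	{
		struct dentry *child;

		list_for_each_entry(child, &parent->d_subdirs, d_child) {
			if (d_is_tail_negative(child))
				break;		/* all remaining siblings are negative */
			if (!child->d_inode)
				continue;	/* negative dentry not yet swept to the tail */
			fn(child);
		}
	}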

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/dcache.c            |   63 ++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dcache.h |    6 +++++
 2 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 386f97eaf2ff..743255773cc7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -632,6 +632,48 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
 	return __lock_parent(dentry);
 }
 
+/*
+ * Move cached negative dentry to the tail of parent->d_subdirs.
+ * This lets walkers skip them all together at first sight.
+ * Must be called at dput of negative dentry.
+ */
+static void sweep_negative(struct dentry *dentry)
+{
+	struct dentry *parent;
+
+	if (!d_is_tail_negative(dentry)) {
+		parent = lock_parent(dentry);
+		if (!parent)
+			return;
+
+		if (!d_count(dentry) && d_is_negative(dentry) &&
+		    !d_is_tail_negative(dentry)) {
+			dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
+			list_move_tail(&dentry->d_child, &parent->d_subdirs);
+		}
+
+		spin_unlock(&parent->d_lock);
+	}
+}
+
+/*
+ * Undo sweep_negative() and move to the head of parent->d_subdirs.
+ * Must be called before converting negative dentry into positive.
+ */
+static void recycle_negative(struct dentry *dentry)
+{
+	struct dentry *parent;
+
+	spin_lock(&dentry->d_lock);
+	parent = lock_parent(dentry);
+	if (parent) {
+		list_move(&dentry->d_child, &parent->d_subdirs);
+		spin_unlock(&parent->d_lock);
+	}
+	dentry->d_flags &= ~DCACHE_TAIL_NEGATIVE;
+	spin_unlock(&dentry->d_lock);
+}
+
 static inline bool retain_dentry(struct dentry *dentry)
 {
 	WARN_ON(d_in_lookup(dentry));
@@ -703,6 +745,8 @@ static struct dentry *dentry_kill(struct dentry *dentry)
 		spin_unlock(&inode->i_lock);
 	if (parent)
 		spin_unlock(&parent->d_lock);
+	if (d_is_negative(dentry))
+		sweep_negative(dentry);
 	spin_unlock(&dentry->d_lock);
 	return NULL;
 }
@@ -718,7 +762,7 @@ static struct dentry *dentry_kill(struct dentry *dentry)
 static inline bool fast_dput(struct dentry *dentry)
 {
 	int ret;
-	unsigned int d_flags;
+	unsigned int d_flags, required;
 
 	/*
 	 * If we have a d_op->d_delete() operation, we sould not
@@ -766,6 +810,8 @@ static inline bool fast_dput(struct dentry *dentry)
 	 * a 'delete' op, and it's referenced and already on
 	 * the LRU list.
 	 *
+	 * Cached negative dentry must be swept to the tail.
+	 *
 	 * NOTE! Since we aren't locked, these values are
 	 * not "stable". However, it is sufficient that at
 	 * some point after we dropped the reference the
@@ -777,10 +823,15 @@ static inline bool fast_dput(struct dentry *dentry)
 	 */
 	smp_rmb();
 	d_flags = READ_ONCE(dentry->d_flags);
-	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST | DCACHE_DISCONNECTED;
+
+	required = DCACHE_REFERENCED | DCACHE_LRU_LIST |
+		   (d_flags_negative(d_flags) ? DCACHE_TAIL_NEGATIVE : 0);
+
+	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST |
+		   DCACHE_DISCONNECTED | DCACHE_TAIL_NEGATIVE;
 
 	/* Nothing to do? Dropping the reference was all we needed? */
-	if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && !d_unhashed(dentry))
+	if (d_flags == required && !d_unhashed(dentry))
 		return true;
 
 	/*
@@ -852,6 +903,8 @@ void dput(struct dentry *dentry)
 		rcu_read_unlock();
 
 		if (likely(retain_dentry(dentry))) {
+			if (d_is_negative(dentry))
+				sweep_negative(dentry);
 			spin_unlock(&dentry->d_lock);
 			return;
 		}
@@ -1951,6 +2004,8 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
 {
 	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
 	if (inode) {
+		if (d_is_tail_negative(entry))
+			recycle_negative(entry);
 		security_d_instantiate(entry, inode);
 		spin_lock(&inode->i_lock);
 		__d_instantiate(entry, inode);
@@ -1970,6 +2025,8 @@ void d_instantiate_new(struct dentry *entry, struct inode *inode)
 	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
 	BUG_ON(!inode);
 	lockdep_annotate_inode_mutex_key(inode);
+	if (d_is_tail_negative(entry))
+		recycle_negative(entry);
 	security_d_instantiate(entry, inode);
 	spin_lock(&inode->i_lock);
 	__d_instantiate(entry, inode);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 082b55068e4d..1127a394b931 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -217,6 +217,7 @@ struct dentry_operations {
 #define DCACHE_PAR_LOOKUP		0x10000000 /* being looked up (with parent locked shared) */
 #define DCACHE_DENTRY_CURSOR		0x20000000
 #define DCACHE_NORCU			0x40000000 /* No RCU delay for freeing */
+#define DCACHE_TAIL_NEGATIVE		0x80000000 /* All following siblings are negative */
 
 extern seqlock_t rename_lock;
 
@@ -493,6 +494,11 @@ static inline int simple_positive(const struct dentry *dentry)
 	return d_really_is_positive(dentry) && !d_unhashed(dentry);
 }
 
+static inline bool d_is_tail_negative(const struct dentry *dentry)
+{
+	return unlikely(dentry->d_flags & DCACHE_TAIL_NEGATIVE);
+}
+
 extern void d_set_fallthru(struct dentry *dentry);
 
 static inline bool d_is_fallthru(const struct dentry *dentry)


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH RFC 4/8] fsnotify: stop walking child dentries if remaining tail is negative
  2020-05-08 12:23 [PATCH RFC 0/8] dcache: increase poison resistance Konstantin Khlebnikov
                   ` (2 preceding siblings ...)
  2020-05-08 12:23 ` [PATCH RFC 3/8] dcache: sweep cached negative dentries to the end of list of siblings Konstantin Khlebnikov
@ 2020-05-08 12:23 ` Konstantin Khlebnikov
  2020-05-08 12:23 ` [PATCH RFC 5/8] dcache: add action D_WALK_SKIP_SIBLINGS to d_walk() Konstantin Khlebnikov
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-08 12:23 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro; +Cc: Waiman Long

When notification starts/stops listening for events from an inode's children
it has to update dentry->d_flags of all positive child dentries. Scanning
may take a long time if the directory has a lot of negative child dentries.

This is the main beneficiary of sweeping cached negative dentries to the end.

Before patch:

nr_dentry = 24172597	24.2M
nr_buckets = 8388608	2.9 avg
nr_unused = 24158110	99.9%
nr_negative = 24142810	99.9%

inotify time: 0.507182 seconds

After patch:

nr_dentry = 24562747    24.6M
nr_buckets = 8388608    2.9 avg
nr_unused = 24548714    99.9%
nr_negative = 24543867  99.9%

inotify time: 0.000010 seconds

Negative dentries no longer slow down the inotify operation on the parent
directory.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/notify/fsnotify.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 72d332ce8e12..072974302950 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -127,8 +127,12 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
 		 * original inode) */
 		spin_lock(&alias->d_lock);
 		list_for_each_entry(child, &alias->d_subdirs, d_child) {
-			if (!child->d_inode)
+			if (!child->d_inode) {
+				/* all remaining children are negative */
+				if (d_is_tail_negative(child))
+					break;
 				continue;
+			}
 
 			spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
 			if (watched)


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH RFC 5/8] dcache: add action D_WALK_SKIP_SIBLINGS to d_walk()
  2020-05-08 12:23 [PATCH RFC 0/8] dcache: increase poison resistance Konstantin Khlebnikov
                   ` (3 preceding siblings ...)
  2020-05-08 12:23 ` [PATCH RFC 4/8] fsnotify: stop walking child dentries if remaining tail is negative Konstantin Khlebnikov
@ 2020-05-08 12:23 ` Konstantin Khlebnikov
  2020-05-08 12:23 ` [PATCH RFC 6/8] dcache: stop walking siblings if remaining dentries all negative Konstantin Khlebnikov
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-08 12:23 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro; +Cc: Waiman Long

This lets walkers skip the remaining siblings as soon as they see
d_is_tail_negative().
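
A sketch of a d_walk() callback that cares only about positive dentries
(hypothetical callback; compare the path_check_mount() change in the next
patch):

	static enum d_walk_ret skip_negative_tail(void *data, struct dentry *dentry)
	{
		if (d_is_tail_negative(dentry))
			return D_WALK_SKIP_SIBLINGS;	/* rest of this level is negative */
		if (d_is_negative(dentry))
			return D_WALK_SKIP;		/* skip this negative dentry */
		return D_WALK_CONTINUE;			/* positive: keep walking */
	}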

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/dcache.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 743255773cc7..44c6832d21d6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1303,12 +1303,14 @@ EXPORT_SYMBOL(shrink_dcache_sb);
  * @D_WALK_QUIT:	quit walk
  * @D_WALK_NORETRY:	quit when retry is needed
  * @D_WALK_SKIP:	skip this dentry and its children
+ * @D_WALK_SKIP_SIBLINGS: skip siblings and their children
  */
 enum d_walk_ret {
 	D_WALK_CONTINUE,
 	D_WALK_QUIT,
 	D_WALK_NORETRY,
 	D_WALK_SKIP,
+	D_WALK_SKIP_SIBLINGS,
 };
 
 /**
@@ -1339,6 +1341,7 @@ static void d_walk(struct dentry *parent, void *data,
 		break;
 	case D_WALK_QUIT:
 	case D_WALK_SKIP:
+	case D_WALK_SKIP_SIBLINGS:
 		goto out_unlock;
 	case D_WALK_NORETRY:
 		retry = false;
@@ -1370,6 +1373,9 @@ static void d_walk(struct dentry *parent, void *data,
 		case D_WALK_SKIP:
 			spin_unlock(&dentry->d_lock);
 			continue;
+		case D_WALK_SKIP_SIBLINGS:
+			spin_unlock(&dentry->d_lock);
+			goto skip_siblings;
 		}
 
 		if (!list_empty(&dentry->d_subdirs)) {
@@ -1381,6 +1387,7 @@ static void d_walk(struct dentry *parent, void *data,
 		}
 		spin_unlock(&dentry->d_lock);
 	}
+skip_siblings:
 	/*
 	 * All done at this level ... ascend and resume the search.
 	 */


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH RFC 6/8] dcache: stop walking siblings if remaining dentries all negative
  2020-05-08 12:23 [PATCH RFC 0/8] dcache: increase poison resistance Konstantin Khlebnikov
                   ` (4 preceding siblings ...)
  2020-05-08 12:23 ` [PATCH RFC 5/8] dcache: add action D_WALK_SKIP_SIBLINGS to d_walk() Konstantin Khlebnikov
@ 2020-05-08 12:23 ` Konstantin Khlebnikov
  2020-05-08 12:23 ` [PATCH RFC 7/8] dcache: push releasing dentry lock into sweep_negative Konstantin Khlebnikov
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-08 12:23 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro; +Cc: Waiman Long

Most walkers are interested only in positive dentries.

The changes in the simple_* libfs helpers are mostly cosmetic: such
filesystems shouldn't cache negative dentries unless they use a d_delete
other than always_delete_dentry().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/dcache.c |   10 ++++++++++
 fs/libfs.c  |   10 +++++++++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 44c6832d21d6..0fd2e02e507b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1442,6 +1442,9 @@ static enum d_walk_ret path_check_mount(void *data, struct dentry *dentry)
 	struct check_mount *info = data;
 	struct path path = { .mnt = info->mnt, .dentry = dentry };
 
+	if (d_is_tail_negative(dentry))
+		return D_WALK_SKIP_SIBLINGS;
+
 	if (likely(!d_mountpoint(dentry)))
 		return D_WALK_CONTINUE;
 	if (__path_is_mountpoint(&path)) {
@@ -1688,6 +1691,10 @@ void shrink_dcache_for_umount(struct super_block *sb)
 static enum d_walk_ret find_submount(void *_data, struct dentry *dentry)
 {
 	struct dentry **victim = _data;
+
+	if (d_is_tail_negative(dentry))
+		return D_WALK_SKIP_SIBLINGS;
+
 	if (d_mountpoint(dentry)) {
 		__dget_dlock(dentry);
 		*victim = dentry;
@@ -3159,6 +3166,9 @@ static enum d_walk_ret d_genocide_kill(void *data, struct dentry *dentry)
 {
 	struct dentry *root = data;
 	if (dentry != root) {
+		if (d_is_tail_negative(dentry))
+			return D_WALK_SKIP_SIBLINGS;
+
 		if (d_unhashed(dentry) || !dentry->d_inode)
 			return D_WALK_SKIP;
 
diff --git a/fs/libfs.c b/fs/libfs.c
index 3759fbacf522..de944c241cf0 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -106,6 +106,10 @@ static struct dentry *scan_positives(struct dentry *cursor,
 	spin_lock(&dentry->d_lock);
 	while ((p = p->next) != &dentry->d_subdirs) {
 		struct dentry *d = list_entry(p, struct dentry, d_child);
+
+		if (d_is_tail_negative(d))
+			break;
+
 		// we must at least skip cursors, to avoid livelocks
 		if (d->d_flags & DCACHE_DENTRY_CURSOR)
 			continue;
@@ -255,7 +259,8 @@ static struct dentry *find_next_child(struct dentry *parent, struct dentry *prev
 			spin_unlock(&d->d_lock);
 			if (likely(child))
 				break;
-		}
+		} else if (d_is_tail_negative(d))
+			break;
 	}
 	spin_unlock(&parent->d_lock);
 	dput(prev);
@@ -408,6 +413,9 @@ int simple_empty(struct dentry *dentry)
 
 	spin_lock(&dentry->d_lock);
 	list_for_each_entry(child, &dentry->d_subdirs, d_child) {
+		if (d_is_tail_negative(child))
+			break;
+
 		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
 		if (simple_positive(child)) {
 			spin_unlock(&child->d_lock);


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH RFC 7/8] dcache: push releasing dentry lock into sweep_negative
  2020-05-08 12:23 [PATCH RFC 0/8] dcache: increase poison resistance Konstantin Khlebnikov
                   ` (5 preceding siblings ...)
  2020-05-08 12:23 ` [PATCH RFC 6/8] dcache: stop walking siblings if remaining dentries all negative Konstantin Khlebnikov
@ 2020-05-08 12:23 ` Konstantin Khlebnikov
  2020-05-08 12:23 ` [PATCH RFC 8/8] dcache: prevent flooding with negative dentries Konstantin Khlebnikov
  2020-12-09 23:01 ` [PATCH RFC 0/8] dcache: increase poison resistance Junxiao Bi
  8 siblings, 0 replies; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-08 12:23 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro; +Cc: Waiman Long

This is preparation for the next patch.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/dcache.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 0fd2e02e507b..60158065891e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -636,15 +636,17 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
  * Move cached negative dentry to the tail of parent->d_subdirs.
  * This lets walkers skip them all together at first sight.
  * Must be called at dput of negative dentry.
+ * dentry->d_lock must be held, returns with it unlocked.
  */
 static void sweep_negative(struct dentry *dentry)
+	__releases(dentry->d_lock)
 {
 	struct dentry *parent;
 
 	if (!d_is_tail_negative(dentry)) {
 		parent = lock_parent(dentry);
 		if (!parent)
-			return;
+			goto out;
 
 		if (!d_count(dentry) && d_is_negative(dentry) &&
 		    !d_is_tail_negative(dentry)) {
@@ -654,6 +656,8 @@ static void sweep_negative(struct dentry *dentry)
 
 		spin_unlock(&parent->d_lock);
 	}
+out:
+	spin_unlock(&dentry->d_lock);
 }
 
 /*
@@ -747,7 +751,8 @@ static struct dentry *dentry_kill(struct dentry *dentry)
 		spin_unlock(&parent->d_lock);
 	if (d_is_negative(dentry))
 		sweep_negative(dentry);
-	spin_unlock(&dentry->d_lock);
+	else
+		spin_unlock(&dentry->d_lock);
 	return NULL;
 }
 
@@ -905,7 +910,8 @@ void dput(struct dentry *dentry)
 		if (likely(retain_dentry(dentry))) {
 			if (d_is_negative(dentry))
 				sweep_negative(dentry);
-			spin_unlock(&dentry->d_lock);
+			else
+				spin_unlock(&dentry->d_lock);
 			return;
 		}
 


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH RFC 8/8] dcache: prevent flooding with negative dentries
  2020-05-08 12:23 [PATCH RFC 0/8] dcache: increase poison resistance Konstantin Khlebnikov
                   ` (6 preceding siblings ...)
  2020-05-08 12:23 ` [PATCH RFC 7/8] dcache: push releasing dentry lock into sweep_negative Konstantin Khlebnikov
@ 2020-05-08 12:23 ` Konstantin Khlebnikov
  2020-05-08 14:56   ` Matthew Wilcox
  2020-05-08 21:07   ` Waiman Long
  2020-12-09 23:01 ` [PATCH RFC 0/8] dcache: increase poison resistance Junxiao Bi
  8 siblings, 2 replies; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-08 12:23 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro; +Cc: Waiman Long

Without memory pressure the count of negative dentries isn't bounded. They
can consume all memory and drain all other inactive caches.

A typical scenario is an idle system where some process periodically creates
temporary files and removes them. After some time, memory will be filled
with negative dentries for these random file names. Reclaiming them takes
some time because the slab allocator frees pages only when all related
objects are gone. Dentry lookup time is usually unaffected because the hash
table grows along with the size of memory, unless somebody deliberately
crafts hash collisions. A simple lookup of random names also generates
negative dentries very fast.

This patch implements a heuristic which detects such scenarios and prevents
unbounded growth of completely unneeded negative dentries. It keeps up to
three of the latest negative dentries in each bucket unless they were
referenced.

At the first dput of a negative dentry, when it is swept to the tail of its
siblings, we also clear its reference flag and look at the next dentries in
the hash chain. Then we kill the third in a series of negative, unused and
unreferenced dentries.

This way each hash bucket preserves three negative dentries, giving them a
chance to be referenced and survive. Adding a positive or used dentry into a
hash chain also protects a few recent negative dentries. As a result, the
total size of the dcache is asymptotically limited by the count of buckets
and of positive or used dentries.

Before the patch the 'dcache_stress' tool could fill the entire memory with
dentries.

nr_dentry = 104913261   104.9M
nr_buckets = 8388608    12.5 avg
nr_unused = 104898729   100.0%
nr_negative = 104883218 100.0%

After this patch the count of dentries saturates at around 3 per bucket:

nr_dentry = 24619259    24.6M
nr_buckets = 8388608    2.9 avg
nr_unused = 24605226    99.9%
nr_negative = 24600351  99.9%
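
This matches the per-bucket limit: keeping roughly 3 negative dentries in
each of the 8388608 buckets gives 3 * 8388608 ~= 25.2M, close to the
observed 24.6M.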

This heuristic isn't bulletproof and solves only the most practical case.
It's easy to deceive: just touch the same random name twice.
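(For example, the dcache_stress tool added earlier in this series does
exactly that with its -s option, which accesses each name twice.)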

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/dcache.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 60158065891e..9f3d331b4978 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -632,6 +632,58 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
 	return __lock_parent(dentry);
 }
 
+/*
+ * Called at first dput of each negative dentry.
+ * Prevents filling cache with never reused negative dentries.
+ *
+ * This clears reference and then looks at following dentries in hash chain.
+ * If they are negative, unused and unreferenced then keep two and kill third.
+ */
+static void trim_negative(struct dentry *dentry)
+	__releases(dentry->d_lock)
+{
+	struct dentry *victim, *parent;
+	struct hlist_bl_node *next;
+	int keep = 2;
+
+	rcu_read_lock();
+
+	dentry->d_flags &= ~DCACHE_REFERENCED;
+	spin_unlock(&dentry->d_lock);
+
+	next = rcu_dereference_raw(dentry->d_hash.next);
+	while (1) {
+		victim = hlist_bl_entry(next, struct dentry, d_hash);
+
+		if (!next || d_count(victim) || !d_is_negative(victim) ||
+		    (victim->d_flags & DCACHE_REFERENCED)) {
+			rcu_read_unlock();
+			return;
+		}
+
+		if (!keep--)
+			break;
+
+		next = rcu_dereference_raw(next->next);
+	}
+
+	spin_lock(&victim->d_lock);
+	parent = lock_parent(victim);
+
+	rcu_read_unlock();
+
+	if (d_count(victim) || !d_is_negative(victim) ||
+	    (victim->d_flags & DCACHE_REFERENCED)) {
+		if (parent)
+			spin_unlock(&parent->d_lock);
+		spin_unlock(&victim->d_lock);
+		return;
+	}
+
+	__dentry_kill(victim);
+	dput(parent);
+}
+
 /*
  * Move cached negative dentry to the tail of parent->d_subdirs.
  * This lets walkers skip them all together at first sight.
@@ -655,6 +707,8 @@ static void sweep_negative(struct dentry *dentry)
 		}
 
 		spin_unlock(&parent->d_lock);
+
+		return trim_negative(dentry);
 	}
 out:
 	spin_unlock(&dentry->d_lock);


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state
  2020-05-08 12:23 ` [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state Konstantin Khlebnikov
@ 2020-05-08 14:49   ` Waiman Long
  2020-05-08 16:16     ` Konstantin Khlebnikov
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2020-05-08 14:49 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-kernel, linux-fsdevel, linux-mm,
	Alexander Viro

On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
> Count of buckets is required for estimating average length of hash chains.
> Size of hash table depends on memory size and printed once at boot.
>
> Let's expose nr_buckets as sixth number in sysctl fs.dentry-state

The hash bucket count is a constant determined at boot time. Is there a
need to use up one dentry_stat entry for that? Besides, one can get it by
looking up the kernel dmesg log like:

[    0.055212] Dentry cache hash table entries: 8388608 (order: 14, 
67108864 bytes)

Cheers,
Longman


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 8/8] dcache: prevent flooding with negative dentries
  2020-05-08 12:23 ` [PATCH RFC 8/8] dcache: prevent flooding with negative dentries Konstantin Khlebnikov
@ 2020-05-08 14:56   ` Matthew Wilcox
  2020-05-08 16:29     ` Konstantin Khlebnikov
  2020-05-08 21:07   ` Waiman Long
  1 sibling, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2020-05-08 14:56 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro, Waiman Long

On Fri, May 08, 2020 at 03:23:33PM +0300, Konstantin Khlebnikov wrote:
> This patch implements heuristic which detects such scenarios and prevents
> unbounded growth of completely unneeded negative dentries. It keeps up to
> three latest negative dentry in each bucket unless they were referenced.
> 
> At first dput of negative dentry when it swept to the tail of siblings
> we'll also clear it's reference flag and look at next dentries in chain.
> Then kill third in series of negative, unused and unreferenced denries.
> 
> This way each hash bucket will preserve three negative dentry to let them
> get reference and survive. Adding positive or used dentry into hash chain
> also protects few recent negative dentries. In result total size of dcache
> asymptotically limited by count of buckets and positive or used dentries.
> 
> This heuristic isn't bulletproof and solves only most practical case.
> It's easy to deceive: just touch same random name twice.

I'm not sure if that's "easy to deceive" ... My concern with limiting
negative dentries is something like a kernel compilation, where there are
many include paths (11 for mm/mmap.c, 9 in general) and there will be a lot
of places where <linux/fs.h> does not exist:

-isystem /usr/lib/gcc/x86_64-linux-gnu/9/include
-I../arch/x86/include
-I./arch/x86/include/generated
-I../include
-I./include
-I../arch/x86/include/uapi
-I./arch/x86/include/generated/uapi
-I../include/uapi
-I./include/generated/uapi
-I ../mm
-I ./mm

So it'd be good to know that kernel compilation times are unaffected by
this patch.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state
  2020-05-08 14:49   ` Waiman Long
@ 2020-05-08 16:16     ` Konstantin Khlebnikov
  2020-05-08 19:05       ` Waiman Long
  0 siblings, 1 reply; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-08 16:16 UTC (permalink / raw)
  To: Waiman Long, linux-kernel, linux-fsdevel, linux-mm, Alexander Viro

On 08/05/2020 17.49, Waiman Long wrote:
> On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
>> Count of buckets is required for estimating average length of hash chains.
>> Size of hash table depends on memory size and printed once at boot.
>>
>> Let's expose nr_buckets as sixth number in sysctl fs.dentry-state
> 
> The hash bucket count is a constant determined at boot time. Is there a need to use up one dentry_stat entry for that? Besides one can get 
> it by looking up the kernel dmesg log like:
> 
> [    0.055212] Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes)

Grepping logs since boot time is the worst API ever.

dentry-state shows the count of dentries in various states.
It's very convenient to show the count of buckets next to it,
because this number defines the overall scale.

> 
> Cheers,
> Longman
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 8/8] dcache: prevent flooding with negative dentries
  2020-05-08 14:56   ` Matthew Wilcox
@ 2020-05-08 16:29     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-08 16:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro, Waiman Long

On 08/05/2020 17.56, Matthew Wilcox wrote:
> On Fri, May 08, 2020 at 03:23:33PM +0300, Konstantin Khlebnikov wrote:
>> This patch implements heuristic which detects such scenarios and prevents
>> unbounded growth of completely unneeded negative dentries. It keeps up to
>> three latest negative dentry in each bucket unless they were referenced.
>>
>> At first dput of negative dentry when it swept to the tail of siblings
>> we'll also clear it's reference flag and look at next dentries in chain.
>> Then kill third in series of negative, unused and unreferenced denries.
>>
>> This way each hash bucket will preserve three negative dentry to let them
>> get reference and survive. Adding positive or used dentry into hash chain
>> also protects few recent negative dentries. In result total size of dcache
>> asymptotically limited by count of buckets and positive or used dentries.
>>
>> This heuristic isn't bulletproof and solves only most practical case.
>> It's easy to deceive: just touch same random name twice.
> 
> I'm not sure if that's "easy to deceive" ... My concern with limiting
> negative dentries is something like a kernel compilation where there
> are many (11 for mm/mmap.c, 9 in general) and there will be a lot of
> places where <linux/fs.h> does not exist
> 
> -isystem /usr/lib/gcc/x86_64-linux-gnu/9/include
> -I../arch/x86/include
> -I./arch/x86/include/generated
> -I../include
> -I./include
> -I../arch/x86/include/uapi
> -I./arch/x86/include/generated/uapi
> -I../include/uapi
> -I./include/generated/uapi
> -I ../mm
> -I ./mm
> 
> So it'd be good to know that kernel compilation times are unaffected by
> this patch.
> 

It's very unlikely that this patch set changes anything for compilation,
or for any other scenario with a sane amount and rate of appearing new names.

This trims only dentries which have never been accessed twice.
Keeping 3 dentries per bucket gives high chances that all of them get the
reference bit and stay in the cache until the shrinker buries them.

To get a false positive in this heuristic, multiple newly created
negative dentries must hit one bucket in a short period of time,
i.e. at least three hash collisions are required.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state
  2020-05-08 16:16     ` Konstantin Khlebnikov
@ 2020-05-08 19:05       ` Waiman Long
  2020-05-08 19:38         ` Konstantin Khlebnikov
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2020-05-08 19:05 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-kernel, linux-fsdevel, linux-mm,
	Alexander Viro

On 5/8/20 12:16 PM, Konstantin Khlebnikov wrote:
> On 08/05/2020 17.49, Waiman Long wrote:
>> On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
>>> Count of buckets is required for estimating average length of hash 
>>> chains.
>>> Size of hash table depends on memory size and printed once at boot.
>>>
>>> Let's expose nr_buckets as sixth number in sysctl fs.dentry-state
>>
>> The hash bucket count is a constant determined at boot time. Is there 
>> a need to use up one dentry_stat entry for that? Besides one can get 
>> it by looking up the kernel dmesg log like:
>>
>> [    0.055212] Dentry cache hash table entries: 8388608 (order: 14, 
>> 67108864 bytes)
>
> Grepping logs since boot time is a worst API ever.
>
> dentry-state shows count of dentries in various states.
> It's very convenient to show count of buckets next to it,
> because this number defines overall scale. 

I am not against using the last free entry for that. My only concern is
that when we want to expose another internal dcache data point via
dentry-state, we will have to add one more number to the array, which can
cause all sorts of compatibility problems. So do we want to use the last
free slot for a constant that can be retrieved from somewhere else?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state
  2020-05-08 19:05       ` Waiman Long
@ 2020-05-08 19:38         ` Konstantin Khlebnikov
  2020-05-08 20:00           ` Waiman Long
  0 siblings, 1 reply; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-08 19:38 UTC (permalink / raw)
  To: Waiman Long, linux-kernel, linux-fsdevel, linux-mm, Alexander Viro



On 08/05/2020 22.05, Waiman Long wrote:
> On 5/8/20 12:16 PM, Konstantin Khlebnikov wrote:
>> On 08/05/2020 17.49, Waiman Long wrote:
>>> On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
>>>> Count of buckets is required for estimating average length of hash chains.
>>>> Size of hash table depends on memory size and printed once at boot.
>>>>
>>>> Let's expose nr_buckets as sixth number in sysctl fs.dentry-state
>>>
>>> The hash bucket count is a constant determined at boot time. Is there a need to use up one dentry_stat entry for that? Besides one can 
>>> get it by looking up the kernel dmesg log like:
>>>
>>> [    0.055212] Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes)
>>
>> Grepping logs since boot time is a worst API ever.
>>
>> dentry-state shows count of dentries in various states.
>> It's very convenient to show count of buckets next to it,
>> because this number defines overall scale. 
> 
> I am not against using the last free entry for that. My only concern is when we want to expose another internal dcache data point via 
> dentry-state, we will have to add one more number to the array which can cause all sort of compatibility problem. So do we want to use the 
> last free slot for a constant that can be retrieved from somewhere else?

I see no problem in adding more numbers to this sysctl,
especially to such a rarely used one.
This interface is designed for that.

Also, the fields 'age_limit' and 'want_pages' have been unused since kernel 2.2.0.

> 
> Cheers,
> Longman
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 3/8] dcache: sweep cached negative dentries to the end of list of siblings
  2020-05-08 12:23 ` [PATCH RFC 3/8] dcache: sweep cached negative dentries to the end of list of siblings Konstantin Khlebnikov
@ 2020-05-08 19:38   ` Waiman Long
  0 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2020-05-08 19:38 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-kernel, linux-fsdevel, linux-mm,
	Alexander Viro

On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
> For disk filesystems result of every negative lookup is cached, content of
> directories is usually cached too. Production of negative dentries isn't
> limited with disk speed. It's really easy to generate millions of them if
> system has enough memory. Negative dentries are linked into siblings list
> along with normal positive dentries. Some operations walks dcache tree but
> looks only for positive dentries: most important is fsnotify/inotify.
>
> This patch moves negative dentries to the end of list at final dput() and
> marks with flag which tells that all following dentries are negative too.
> Reverse operation is required before instantiating negative dentry.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>   fs/dcache.c            |   63 ++++++++++++++++++++++++++++++++++++++++++++++--
>   include/linux/dcache.h |    6 +++++
>   2 files changed, 66 insertions(+), 3 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 386f97eaf2ff..743255773cc7 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -632,6 +632,48 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
>   	return __lock_parent(dentry);
>   }
>   
> +/*
> + * Move cached negative dentry to the tail of parent->d_subdirs.
> + * This lets walkers skip them all together at first sight.
> + * Must be called at dput of negative dentry.
> + */
> +static void sweep_negative(struct dentry *dentry)
> +{
> +	struct dentry *parent;
> +
> +	if (!d_is_tail_negative(dentry)) {
> +		parent = lock_parent(dentry);
> +		if (!parent)
> +			return;
> +
> +		if (!d_count(dentry) && d_is_negative(dentry) &&
> +		    !d_is_tail_negative(dentry)) {
> +			dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
> +			list_move_tail(&dentry->d_child, &parent->d_subdirs);
> +		}
> +
> +		spin_unlock(&parent->d_lock);
> +	}
> +}
> +
> +/*
> + * Undo sweep_negative() and move to the head of parent->d_subdirs.
> + * Must be called before converting negative dentry into positive.
> + */
> +static void recycle_negative(struct dentry *dentry)
> +{
> +	struct dentry *parent;
> +
> +	spin_lock(&dentry->d_lock);
> +	parent = lock_parent(dentry);
> +	if (parent) {
> +		list_move(&dentry->d_child, &parent->d_subdirs);
> +		spin_unlock(&parent->d_lock);
> +	}
> +	dentry->d_flags &= ~DCACHE_TAIL_NEGATIVE;
> +	spin_unlock(&dentry->d_lock);
> +}
> +

The names sweep_negative and recycle_negative are not very descriptive of
what the functions are doing (especially the latter one). I am not good at
naming, but some kind of naming scheme that can clearly show that one is
the opposite of the other would be better.


>   static inline bool retain_dentry(struct dentry *dentry)
>   {
>   	WARN_ON(d_in_lookup(dentry));
> @@ -703,6 +745,8 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>   		spin_unlock(&inode->i_lock);
>   	if (parent)
>   		spin_unlock(&parent->d_lock);
> +	if (d_is_negative(dentry))
> +		sweep_negative(dentry);
We can potentially save an unneeded lock/unlock pair by moving it up
before "if (parent)" and passing in a flag to indicate if the parent lock
is being held.
>   	spin_unlock(&dentry->d_lock);
>   	return NULL;
>   }
> @@ -718,7 +762,7 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>   static inline bool fast_dput(struct dentry *dentry)
>   {
>   	int ret;
> -	unsigned int d_flags;
> +	unsigned int d_flags, required;
>   
>   	/*
>   	 * If we have a d_op->d_delete() operation, we sould not
> @@ -766,6 +810,8 @@ static inline bool fast_dput(struct dentry *dentry)
>   	 * a 'delete' op, and it's referenced and already on
>   	 * the LRU list.
>   	 *
> +	 * Cached negative dentry must be swept to the tail.
> +	 *
>   	 * NOTE! Since we aren't locked, these values are
>   	 * not "stable". However, it is sufficient that at
>   	 * some point after we dropped the reference the
> @@ -777,10 +823,15 @@ static inline bool fast_dput(struct dentry *dentry)
>   	 */
>   	smp_rmb();
>   	d_flags = READ_ONCE(dentry->d_flags);
> -	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST | DCACHE_DISCONNECTED;
> +
> +	required = DCACHE_REFERENCED | DCACHE_LRU_LIST |
> +		   (d_flags_negative(d_flags) ? DCACHE_TAIL_NEGATIVE : 0);
> +
> +	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST |
> +		   DCACHE_DISCONNECTED | DCACHE_TAIL_NEGATIVE;
>   
>   	/* Nothing to do? Dropping the reference was all we needed? */
> -	if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && !d_unhashed(dentry))
> +	if (d_flags == required && !d_unhashed(dentry))
>   		return true;
>   
>   	/*
> @@ -852,6 +903,8 @@ void dput(struct dentry *dentry)
>   		rcu_read_unlock();
>   
>   		if (likely(retain_dentry(dentry))) {
> +			if (d_is_negative(dentry))
> +				sweep_negative(dentry);
>   			spin_unlock(&dentry->d_lock);
>   			return;
>   		}
> @@ -1951,6 +2004,8 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
>   {
>   	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
>   	if (inode) {
> +		if (d_is_tail_negative(entry))
> +			recycle_negative(entry);

Would it be better to call recycle_negative() inside __d_instantiate() under
the d_lock? That reduces the number of lock/unlock operations you need to do
and eliminates the need to change d_instantiate_new().

Cheers,
Longman


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state
  2020-05-08 19:38         ` Konstantin Khlebnikov
@ 2020-05-08 20:00           ` Waiman Long
  2020-05-08 20:03             ` Matthew Wilcox
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2020-05-08 20:00 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-kernel, linux-fsdevel, linux-mm,
	Alexander Viro

On 5/8/20 3:38 PM, Konstantin Khlebnikov wrote:
>
>
> On 08/05/2020 22.05, Waiman Long wrote:
>> On 5/8/20 12:16 PM, Konstantin Khlebnikov wrote:
>>> On 08/05/2020 17.49, Waiman Long wrote:
>>>> On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
>>>>> Count of buckets is required for estimating average length of hash 
>>>>> chains.
>>>>> Size of hash table depends on memory size and printed once at boot.
>>>>>
>>>>> Let's expose nr_buckets as sixth number in sysctl fs.dentry-state
>>>>
>>>> The hash bucket count is a constant determined at boot time. Is 
>>>> there a need to use up one dentry_stat entry for that? Besides one 
>>>> can get it by looking up the kernel dmesg log like:
>>>>
>>>> [    0.055212] Dentry cache hash table entries: 8388608 (order: 14, 
>>>> 67108864 bytes)
>>>
>>> Grepping logs since boot time is a worst API ever.
>>>
>>> dentry-state shows count of dentries in various states.
>>> It's very convenient to show count of buckets next to it,
>>> because this number defines overall scale. 
>>
>> I am not against using the last free entry for that. My only concern 
>> is when we want to expose another internal dcache data point via 
>> dentry-state, we will have to add one more number to the array which 
>> can cause all sort of compatibility problem. So do we want to use the 
>> last free slot for a constant that can be retrieved from somewhere else?
>
> I see no problem in adding more numbers into sysctl.
> Especially into such rarely used.
> This interface is designed for that.
>
> Also fields 'age_limit' and 'want_pages' are unused since kernel 2.2.0

Well, I got rebuked the last time I wanted to reuse one of the
age_limit/want_pages entries for the negative dentry count because of the
potential of breaking some really old applications or tools. Changing
dentry-state to output one more number can potentially break compatibility
too. That is why I am questioning whether it is a good idea to use up the
last free slot.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state
  2020-05-08 20:00           ` Waiman Long
@ 2020-05-08 20:03             ` Matthew Wilcox
  0 siblings, 0 replies; 27+ messages in thread
From: Matthew Wilcox @ 2020-05-08 20:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Konstantin Khlebnikov, linux-kernel, linux-fsdevel, linux-mm,
	Alexander Viro

On Fri, May 08, 2020 at 04:00:08PM -0400, Waiman Long wrote:
> On 5/8/20 3:38 PM, Konstantin Khlebnikov wrote:
> > On 08/05/2020 22.05, Waiman Long wrote:
> > > On 5/8/20 12:16 PM, Konstantin Khlebnikov wrote:
> > > > On 08/05/2020 17.49, Waiman Long wrote:
> > > > > On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
> > > > > > Count of buckets is required for estimating average
> > > > > > length of hash chains.
> > > > > > Size of hash table depends on memory size and printed once at boot.
> > > > > > 
> > > > > > Let's expose nr_buckets as sixth number in sysctl fs.dentry-state
> > > > > 
> > > > > The hash bucket count is a constant determined at boot time.
> > > > > Is there a need to use up one dentry_stat entry for that?
> > > > > Besides one can get it by looking up the kernel dmesg log
> > > > > like:
> > > > > 
> > > > > [    0.055212] Dentry cache hash table entries: 8388608
> > > > > (order: 14, 67108864 bytes)
> > > > 
> > > > Grepping logs since boot time is a worst API ever.
> > > > 
> > > > dentry-state shows count of dentries in various states.
> > > > It's very convenient to show count of buckets next to it,
> > > > because this number defines overall scale.
> > > 
> > > I am not against using the last free entry for that. My only concern
> > > is when we want to expose another internal dcache data point via
> > > dentry-state, we will have to add one more number to the array which
> > > can cause all sort of compatibility problem. So do we want to use
> > > the last free slot for a constant that can be retrieved from
> > > somewhere else?
> > 
> > I see no problem in adding more numbers into sysctl.
> > Especially into such rarely used.
> > This interface is designed for that.
> > 
> > Also fields 'age_limit' and 'want_pages' are unused since kernel 2.2.0
> 
> Well, I got rebuke the last time I want to reuse one of age_limit/want_pages
> entry for negative dentry count because of the potential of breaking some
> really old applications or tools. Changing dentry-state to output one more
> number can potentially break compatibility too. That is why I am questioning
> if it is a good idea to use up the last free slot.

I'd rather see the nr_buckets exposed some other way, leaving this file
for dynamic state.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 8/8] dcache: prevent flooding with negative dentries
  2020-05-08 12:23 ` [PATCH RFC 8/8] dcache: prevent flooding with negative dentries Konstantin Khlebnikov
  2020-05-08 14:56   ` Matthew Wilcox
@ 2020-05-08 21:07   ` Waiman Long
  1 sibling, 0 replies; 27+ messages in thread
From: Waiman Long @ 2020-05-08 21:07 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-kernel, linux-fsdevel, linux-mm,
	Alexander Viro

On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
> Without memory pressure count of negative dentries isn't bounded.
> They could consume all memory and drain all other inactive caches.
>
> Typical scenario is an idle system where some process periodically creates
> temporary files and removes them. After some time, memory will be filled
> with negative dentries for these random file names. Reclaiming them took
> some time because slab frees pages only when all related objects are gone.
> Time of dentry lookup is usually unaffected because hash table grows along
> with size of memory. Unless somebody especially crafts hash collisions.
> Simple lookup of random names also generates negative dentries very fast.
>
> This patch implements a heuristic which detects such scenarios and prevents
> unbounded growth of completely unneeded negative dentries. It keeps up to
> the three latest negative dentries in each bucket unless they were referenced.
>
> At the first dput of a negative dentry, when it is swept to the tail of the
> siblings list, we'll also clear its reference flag and look at the next
> dentries in the hash chain. Then kill the third in a series of negative,
> unused and unreferenced dentries.
>
> This way each hash bucket will preserve three negative dentries, to let them
> get referenced and survive. Adding a positive or used dentry into a hash
> chain also protects a few recent negative dentries. As a result, the total
> size of the dcache is asymptotically limited by the count of buckets and
> positive or used dentries.
>
> Before patch: tool 'dcache_stress' could fill entire memory with dentries.
>
> nr_dentry = 104913261   104.9M
> nr_buckets = 8388608    12.5 avg
> nr_unused = 104898729   100.0%
> nr_negative = 104883218 100.0%
>
> After this patch count of dentries saturates at around 3 per bucket:
>
> nr_dentry = 24619259    24.6M
> nr_buckets = 8388608    2.9 avg
> nr_unused = 24605226    99.9%
> nr_negative = 24600351  99.9%
>
> This heuristic isn't bulletproof and solves only the most practical case.
> It's easy to deceive: just touch the same random name twice.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>   fs/dcache.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 54 insertions(+)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 60158065891e..9f3d331b4978 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -632,6 +632,58 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
>   	return __lock_parent(dentry);
>   }
>   
> +/*
> + * Called at first dput of each negative dentry.
> + * Prevents filling cache with never reused negative dentries.
> + *
> + * This clears reference and then looks at following dentries in hash chain.
> + * If they are negative, unused and unreferenced then keep two and kill third.
> + */
> +static void trim_negative(struct dentry *dentry)
> +	__releases(dentry->d_lock)
> +{
> +	struct dentry *victim, *parent;
> +	struct hlist_bl_node *next;
> +	int keep = 2;
> +
> +	rcu_read_lock();
> +
> +	dentry->d_flags &= ~DCACHE_REFERENCED;
> +	spin_unlock(&dentry->d_lock);
> +
> +	next = rcu_dereference_raw(dentry->d_hash.next);
> +	while (1) {
> +		victim = hlist_bl_entry(next, struct dentry, d_hash);
> +
> +		if (!next || d_count(victim) || !d_is_negative(victim) ||
> +		    (victim->d_flags & DCACHE_REFERENCED)) {
> +			rcu_read_unlock();
> +			return;
> +		}
> +
> +		if (!keep--)
> +			break;
> +
> +		next = rcu_dereference_raw(next->next);
> +	}
> +
> +	spin_lock(&victim->d_lock);
> +	parent = lock_parent(victim);
> +
> +	rcu_read_unlock();
> +
> +	if (d_count(victim) || !d_is_negative(victim) ||
> +	    (victim->d_flags & DCACHE_REFERENCED)) {
> +		if (parent)
> +			spin_unlock(&parent->d_lock);
> +		spin_unlock(&victim->d_lock);
> +		return;
> +	}
> +
> +	__dentry_kill(victim);
> +	dput(parent);
> +}

Since you are picking a victim from the hash list, I think it is better
to kill it only if it is already on the LRU. Otherwise, it could be in
the process of being instantiated or in the middle of some operation.
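
Roughly, I mean adding an LRU-membership test to the victim checks. Just
a sketch, assuming the existing DCACHE_LRU_LIST flag (set once a dentry
has been put on the LRU by dput) is the right signal; this is not part
of the posted patch:

	if (d_count(victim) || !d_is_negative(victim) ||
	    !(victim->d_flags & DCACHE_LRU_LIST) ||	/* not on the LRU yet */
	    (victim->d_flags & DCACHE_REFERENCED)) {
		if (parent)
			spin_unlock(&parent->d_lock);
		spin_unlock(&victim->d_lock);
		return;
	}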

Besides, I feel a bit uneasy about picking a random negative dentry to 
kill like that.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 2/8] selftests: add stress testing tool for dcache
  2020-05-08 12:23 ` [PATCH RFC 2/8] selftests: add stress testing tool for dcache Konstantin Khlebnikov
@ 2020-05-13  1:52   ` Dave Chinner
  0 siblings, 0 replies; 27+ messages in thread
From: Dave Chinner @ 2020-05-13  1:52 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro, Waiman Long

On Fri, May 08, 2020 at 03:23:17PM +0300, Konstantin Khlebnikov wrote:
> This tool fills the dcache with negative dentries. Between iterations it
> prints statistics and measures the time of an inotify operation, which
> might degrade.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>  tools/testing/selftests/filesystems/Makefile       |    1 
>  .../testing/selftests/filesystems/dcache_stress.c  |  210 ++++++++++++++++++++

This sort of thing should go into fstests along with test scripts
that use it to exercise the dentry cache. We already have tools like
this in fstests (dirstress, metaperf, etc) for exercising name-based
operations like this, so it would fit right in.

That way it would get run by just about every filesystem developer
and distro QE department automatically and extremely frequently...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 0/8] dcache: increase poison resistance
  2020-05-08 12:23 [PATCH RFC 0/8] dcache: increase poison resistance Konstantin Khlebnikov
                   ` (7 preceding siblings ...)
  2020-05-08 12:23 ` [PATCH RFC 8/8] dcache: prevent flooding with negative dentries Konstantin Khlebnikov
@ 2020-12-09 23:01 ` Junxiao Bi
  2020-12-12  7:32   ` Konstantin Khlebnikov
  8 siblings, 1 reply; 27+ messages in thread
From: Junxiao Bi @ 2020-12-09 23:01 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-kernel, linux-fsdevel, linux-mm,
	Alexander Viro
  Cc: Waiman Long, Gautham Ananthakrishna, matthew.wilcox

Hi Konstantin,

We tested this patch set recently and found it limits negative dentries
to a small part of total memory. We also don't see any performance
regression with it. Do you have any plan to integrate it into mainline?
It would help a lot with the memory fragmentation issue caused by the
dentry slab; there were a lot of customer cases where sys% was very high
because most CPUs were doing memory compaction, the dentry slab was
taking too much memory, and nearly all dentries there were negative.

The following are test results from two types of servers, one with 256G
of memory and 24 CPUs, the other with 3T of memory and 384 CPUs. The
test case uses a lot of processes to generate negative dentries in
parallel; the numbers below are the results after 72 hours, and the
negative dentry count stays stable around those values even when running
longer. Without the patch set, in less than half an hour 197G was taken
by negative dentries on the 256G system, and in 1 day 2.4T was taken on
the 3T system.

memory    neg-dentry-number    neg-dentry-mem-usage
256G      55259084             10.6G
3T        202306756            38.8G

For the perf test, we ran the following, and no regression was found.

- create 1M negative dentries and then touch them to convert them to
positive dentries

- create 10K/100K/1M files

- remove 10K/100K/1M files

- kernel compile

To verify the fsnotify fix, we used inotifywait to watch file
create/open in a directory containing a lot of negative dentries;
without the patch set the system runs into a soft lockup, with it, no
soft lockup.

We also tried to defeat the limitation by making different processes
generate negative dentries with the same naming scheme, so that one
negative dentry gets accessed a couple of times around the same time;
DCACHE_REFERENCED is then set on it and it can't be trimmed easily.
We did see negative dentries slowly take all the memory on one of our
systems with 120G of memory; on the above two systems the memory usage
increased, but it remained a small part of total memory. This looks ok,
since the common negative dentry use case is creating some temp files
and then removing them, and it will be rare to access the same negative
dentry around the same time.
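
As a rough single-process illustration of that access pattern (our own
sketch, not the dcache_stress tool from patch 2/8), looking up each
nonexistent name twice is the "touch it twice" pattern that keeps its
negative dentry referenced:

	#include <stdio.h>
	#include <sys/stat.h>

	int main(void)
	{
		struct stat st;
		char name[64];
		long i;

		for (i = 0; i < 1000000; i++) {
			snprintf(name, sizeof(name), "./negdent-%ld", i);
			stat(name, &st);	/* fails with ENOENT, leaves a negative dentry */
			stat(name, &st);	/* repeated lookup of the same missing name */
		}
		return 0;
	}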

Thanks,

Junxiao.


On 5/8/20 5:23 AM, Konstantin Khlebnikov wrote:
> For most filesystems result of every negative lookup is cached, content of
> directories is usually cached too. Production of negative dentries isn't
> limited with disk speed. It's really easy to generate millions of them if
> system has enough memory.
>
> Getting this memory back ins't that easy because slab frees pages only when
> all related objects are gone. While dcache shrinker works in LRU order.
>
> Typical scenario is an idle system where some process periodically creates
> temporary files and removes them. After some time, memory will be filled
> with negative dentries for these random file names.
>
> Simple lookup of random names also generates negative dentries very fast.
> Constant flow of such negative denries drains all other inactive caches.
>
> Negative dentries are linked into siblings list along with normal positive
> dentries. Some operations walks dcache tree but looks only for positive
> dentries: most important is fsnotify/inotify. Hordes of negative dentries
> slow down these operations significantly.
>
> Time of dentry lookup is usually unaffected because hash table grows along
> with size of memory. Unless somebody especially crafts hash collisions.
>
> This patch set solves all of these problems:
>
> Move negative denries to the end of sliblings list, thus walkers could
> skip them at first sight (patches 3-6).
>
> Keep in dcache at most three unreferenced negative denties in row in each
> hash bucket (patches 7-8).
>
> ---
>
> Konstantin Khlebnikov (8):
>        dcache: show count of hash buckets in sysctl fs.dentry-state
>        selftests: add stress testing tool for dcache
>        dcache: sweep cached negative dentries to the end of list of siblings
>        fsnotify: stop walking child dentries if remaining tail is negative
>        dcache: add action D_WALK_SKIP_SIBLINGS to d_walk()
>        dcache: stop walking siblings if remaining dentries all negative
>        dcache: push releasing dentry lock into sweep_negative
>        dcache: prevent flooding with negative dentries
>
>
>   fs/dcache.c                                   | 144 +++++++++++-
>   fs/libfs.c                                    |  10 +-
>   fs/notify/fsnotify.c                          |   6 +-
>   include/linux/dcache.h                        |   6 +
>   tools/testing/selftests/filesystems/Makefile  |   1 +
>   .../selftests/filesystems/dcache_stress.c     | 210 ++++++++++++++++++
>   6 files changed, 370 insertions(+), 7 deletions(-)
>   create mode 100644 tools/testing/selftests/filesystems/dcache_stress.c
>
> --
> Signature

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 0/8] dcache: increase poison resistance
  2020-12-09 23:01 ` [PATCH RFC 0/8] dcache: increase poison resistance Junxiao Bi
@ 2020-12-12  7:32   ` Konstantin Khlebnikov
  2020-12-13 18:49     ` Junxiao Bi
  0 siblings, 1 reply; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-12-12  7:32 UTC (permalink / raw)
  To: Junxiao Bi
  Cc: Konstantin Khlebnikov, Linux Kernel Mailing List, linux-fsdevel,
	linux-mm, Alexander Viro, Waiman Long, Gautham Ananthakrishna,
	matthew.wilcox

On Thu, Dec 10, 2020 at 2:01 AM Junxiao Bi <junxiao.bi@oracle.com> wrote:

> Hi Konstantin,
>
> We tested this patch set recently and found it limiting negative dentry
> to a small part of total memory. And also we don't see any performance
> regression on it. Do you have any plan to integrate it into mainline? It
> will help a lot on memory fragmentation issue causing by dentry slab,
> there were a lot of customer cases where sys% was very high since most
> cpu were doing memory compaction, dentry slab was taking too much memory
> and nearly all dentry there were negative.
>

Right now I don't have any plans for this. I suspect such problems will
appear much more often since machines are getting bigger.
So, somebody will take care of it.

The first part, which collects negative dentries at the end of the list
of siblings, could be done in a more obvious way by splitting the list
in two. But this touches much more code.

The last patch isn't very rigid but makes non-trivial changes.
Probably it's better to call some garbage collector thingy periodically.
The LRU list needs pressure to age and reorder entries properly.

GC could be off by default, or thresholds could be set very high (50% of
RAM for example).
The final setup could be left up to the owners of large systems, which
need fine tuning.


>
> The following is test result we run on two types of servers, one is 256G
> memory with 24 CPUS and another is 3T memory with 384 CPUS. The test
> case is using a lot of processes to generate negative dentry in
> parallel, the following is the test result after 72 hours, the negative
> dentry number is stable around that number even running longer time. If
> without the patch set, in less than half an hour 197G was took by
> negative dentry on 256G system, in 1 day 2.4T was took on 3T system.
>
> memory    neg-dentry-number    neg-dentry-mem-usage
> 256G      55259084             10.6G
> 3T        202306756            38.8G
>
> For perf test, we run the following, and no regression found.
>
> - create 1M negative dentry and then touch them to convert them to
> positive dentry
>
> - create 10K/100K/1M files
>
> - remove 10K/100K/1M files
>
> - kernel compile
>
> To verify the fsnotify fix, we used inotifywait to watch file
> create/open in some directory where there is a lot of negative dentry,
> without the patch set, the system will run into soft lockup, with it, no
> soft lockup.
>
> We also try to defeat the limitation by making different processes
> generating negative dentry with the same naming way, that will make one
> negative dentry being accessed couple times around same time,
> DCACHE_REFERENCED will be set on it and then it can't be trimmed easily.
> We do see negative dentry will take all the memory slowly from one of
> our system with 120G memory, for above two system, we see the memory
> usage were increased, but still a small part of total memory. This looks
> ok, since the common negative dentry user case will be create some temp
> files and then remove it, it will be rare to access same negative dentry
> around same time.
>
> Thanks,
>
> Junxiao.
>
>
> On 5/8/20 5:23 AM, Konstantin Khlebnikov wrote:
> > For most filesystems result of every negative lookup is cached, content
> of
> > directories is usually cached too. Production of negative dentries isn't
> > limited with disk speed. It's really easy to generate millions of them if
> > system has enough memory.
> >
> > Getting this memory back ins't that easy because slab frees pages only
> when
> > all related objects are gone. While dcache shrinker works in LRU order.
> >
> > Typical scenario is an idle system where some process periodically
> creates
> > temporary files and removes them. After some time, memory will be filled
> > with negative dentries for these random file names.
> >
> > Simple lookup of random names also generates negative dentries very fast.
> > Constant flow of such negative denries drains all other inactive caches.
> >
> > Negative dentries are linked into siblings list along with normal
> positive
> > dentries. Some operations walks dcache tree but looks only for positive
> > dentries: most important is fsnotify/inotify. Hordes of negative dentries
> > slow down these operations significantly.
> >
> > Time of dentry lookup is usually unaffected because hash table grows
> along
> > with size of memory. Unless somebody especially crafts hash collisions.
> >
> > This patch set solves all of these problems:
> >
> > Move negative denries to the end of sliblings list, thus walkers could
> > skip them at first sight (patches 3-6).
> >
> > Keep in dcache at most three unreferenced negative denties in row in each
> > hash bucket (patches 7-8).
> >
> > ---
> >
> > Konstantin Khlebnikov (8):
> >        dcache: show count of hash buckets in sysctl fs.dentry-state
> >        selftests: add stress testing tool for dcache
> >        dcache: sweep cached negative dentries to the end of list of
> siblings
> >        fsnotify: stop walking child dentries if remaining tail is
> negative
> >        dcache: add action D_WALK_SKIP_SIBLINGS to d_walk()
> >        dcache: stop walking siblings if remaining dentries all negative
> >        dcache: push releasing dentry lock into sweep_negative
> >        dcache: prevent flooding with negative dentries
> >
> >
> >   fs/dcache.c                                   | 144 +++++++++++-
> >   fs/libfs.c                                    |  10 +-
> >   fs/notify/fsnotify.c                          |   6 +-
> >   include/linux/dcache.h                        |   6 +
> >   tools/testing/selftests/filesystems/Makefile  |   1 +
> >   .../selftests/filesystems/dcache_stress.c     | 210 ++++++++++++++++++
> >   6 files changed, 370 insertions(+), 7 deletions(-)
> >   create mode 100644 tools/testing/selftests/filesystems/dcache_stress.c
> >
> > --
> > Signature
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 0/8] dcache: increase poison resistance
  2020-12-12  7:32   ` Konstantin Khlebnikov
@ 2020-12-13 18:49     ` Junxiao Bi
  2020-12-14  7:43       ` Konstantin Khlebnikov
  0 siblings, 1 reply; 27+ messages in thread
From: Junxiao Bi @ 2020-12-13 18:49 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Konstantin Khlebnikov, Linux Kernel Mailing List, linux-fsdevel,
	linux-mm, Alexander Viro, Waiman Long, Gautham Ananthakrishna,
	matthew.wilcox

On 12/11/20 11:32 PM, Konstantin Khlebnikov wrote:

> On Thu, Dec 10, 2020 at 2:01 AM Junxiao Bi <junxiao.bi@oracle.com 
> <mailto:junxiao.bi@oracle.com>> wrote:
>
>     Hi Konstantin,
>
>     We tested this patch set recently and found it limiting negative
>     dentry
>     to a small part of total memory. And also we don't see any
>     performance
>     regression on it. Do you have any plan to integrate it into
>     mainline? It
>     will help a lot on memory fragmentation issue causing by dentry slab,
>     there were a lot of customer cases where sys% was very high since
>     most
>     cpu were doing memory compaction, dentry slab was taking too much
>     memory
>     and nearly all dentry there were negative.
>
>
> Right now I don't have any plans for this. I suspect such problems will
> appear much more often since machines are getting bigger.
> So, somebody will take care of it.
We already had a lot of customer cases. It makes no sense to leave so
many negative dentries in the system; they cause memory fragmentation
and bring little benefit.
>
> First part which collects negative dentries at the end list of 
> siblings could be
> done in a more obvious way by splitting the list in two.
> But this touches much more code.
That would add a new field to struct dentry?
>
> Last patch isn't very rigid but does non-trivial changes.
> Probably it's better to call some garbage collector thingy periodically.
> Lru list needs pressure to age and reorder entries properly.

Swap the negative dentry to the head of the hash list when it gets
accessed? Extra ones could easily be trimmed when swapping; is using GC
meant to reduce the perf impact?

Thanks,

Junxioao.

>
> Gc could be off by default or thresholds set very high (50% of ram for 
> example).
> Final setup could be left up to owners of large systems, which needs 
> fine tuning.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 0/8] dcache: increase poison resistance
  2020-12-13 18:49     ` Junxiao Bi
@ 2020-12-14  7:43       ` Konstantin Khlebnikov
  2020-12-14 23:10         ` Junxiao Bi
  0 siblings, 1 reply; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-12-14  7:43 UTC (permalink / raw)
  To: Junxiao Bi
  Cc: Konstantin Khlebnikov, Linux Kernel Mailing List, linux-fsdevel,
	linux-mm, Alexander Viro, Waiman Long, Gautham Ananthakrishna,
	matthew.wilcox

On Sun, Dec 13, 2020 at 9:52 PM Junxiao Bi <junxiao.bi@oracle.com> wrote:

> On 12/11/20 11:32 PM, Konstantin Khlebnikov wrote:
>
> > On Thu, Dec 10, 2020 at 2:01 AM Junxiao Bi <junxiao.bi@oracle.com
> > <mailto:junxiao.bi@oracle.com>> wrote:
> >
> >     Hi Konstantin,
> >
> >     We tested this patch set recently and found it limiting negative
> >     dentry
> >     to a small part of total memory. And also we don't see any
> >     performance
> >     regression on it. Do you have any plan to integrate it into
> >     mainline? It
> >     will help a lot on memory fragmentation issue causing by dentry slab,
> >     there were a lot of customer cases where sys% was very high since
> >     most
> >     cpu were doing memory compaction, dentry slab was taking too much
> >     memory
> >     and nearly all dentry there were negative.
> >
> >
> > Right now I don't have any plans for this. I suspect such problems will
> > appear much more often since machines are getting bigger.
> > So, somebody will take care of it.
> We already had a lot of customer cases. It made no sense to leave so
> many negative dentry in the system, it caused memory fragmentation and
> not much benefit.
>

Dcache could grow so big only if the system lacks memory pressure.

The simplest solution is a cronjob which provides such pressure by
creating a sparse file on a disk-based fs and then reading it.
This should wash away all inactive caches with no IO and zero chance of OOM.
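
Something like this would do (just a sketch; the path and size are
arbitrary and would need tuning to roughly the machine's RAM size):

	#include <sys/types.h>
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		const off_t size = 64L << 30;		/* assume ~64G of RAM */
		static char buf[1 << 20];
		int fd = open("/var/tmp/washfile", O_CREAT | O_RDWR | O_TRUNC, 0600);

		if (fd < 0 || ftruncate(fd, size) < 0)
			return 1;
		/* the file is sparse: reading the holes fills the page cache with
		 * zero pages, producing memory pressure with almost no real IO */
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		unlink("/var/tmp/washfile");
		close(fd);
		return 0;
	}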


> >
> > First part which collects negative dentries at the end list of
> > siblings could be
> > done in a more obvious way by splitting the list in two.
> > But this touches much more code.
> That would add new field to dentry?
>

Yep. Decision is up to maintainers.

>
> > Last patch isn't very rigid but does non-trivial changes.
> > Probably it's better to call some garbage collector thingy periodically.
> > Lru list needs pressure to age and reorder entries properly.
>
> Swap the negative dentry to the head of hash list when it get accessed?
> Extra ones can be easily trimmed when swapping, using GC is to reduce
> perf impact?
>

The reclaimer/shrinker scans dentries in LRU lists; that is another list.
My patch uses the order within hash lists in a very unusual way. Don't
be confused.

There are four lists
parent - siblings
hashtable - hashchain
LRU
inode - alias
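
In struct dentry terms (an abridged excerpt for orientation, field names
as in current kernels) those four lists correspond to:

	struct hlist_bl_node d_hash;	/* hash chain within a bucket */
	struct list_head d_lru;		/* LRU / shrink list */
	struct list_head d_child;	/* link among the parent's children */
	struct list_head d_subdirs;	/* head of our own children */
	struct hlist_node d_alias;	/* inode alias list, lives in the d_u union */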


>
> Thanks,
>
> Junxioao.
>
> >
> > Gc could be off by default or thresholds set very high (50% of ram for
> > example).
> > Final setup could be left up to owners of large systems, which needs
> > fine tuning.
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 0/8] dcache: increase poison resistance
  2020-12-14  7:43       ` Konstantin Khlebnikov
@ 2020-12-14 23:10         ` Junxiao Bi
  2020-12-16 18:46           ` Junxiao Bi
  0 siblings, 1 reply; 27+ messages in thread
From: Junxiao Bi @ 2020-12-14 23:10 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Konstantin Khlebnikov, Linux Kernel Mailing List, linux-fsdevel,
	linux-mm, Alexander Viro, Waiman Long, Gautham Ananthakrishna,
	matthew.wilcox

On 12/13/20 11:43 PM, Konstantin Khlebnikov wrote:

>
>
> On Sun, Dec 13, 2020 at 9:52 PM Junxiao Bi <junxiao.bi@oracle.com 
> <mailto:junxiao.bi@oracle.com>> wrote:
>
>     On 12/11/20 11:32 PM, Konstantin Khlebnikov wrote:
>
>     > On Thu, Dec 10, 2020 at 2:01 AM Junxiao Bi
>     <junxiao.bi@oracle.com <mailto:junxiao.bi@oracle.com>
>     > <mailto:junxiao.bi@oracle.com <mailto:junxiao.bi@oracle.com>>>
>     wrote:
>     >
>     >     Hi Konstantin,
>     >
>     >     We tested this patch set recently and found it limiting negative
>     >     dentry
>     >     to a small part of total memory. And also we don't see any
>     >     performance
>     >     regression on it. Do you have any plan to integrate it into
>     >     mainline? It
>     >     will help a lot on memory fragmentation issue causing by
>     dentry slab,
>     >     there were a lot of customer cases where sys% was very high
>     since
>     >     most
>     >     cpu were doing memory compaction, dentry slab was taking too
>     much
>     >     memory
>     >     and nearly all dentry there were negative.
>     >
>     >
>     > Right now I don't have any plans for this. I suspect such
>     problems will
>     > appear much more often since machines are getting bigger.
>     > So, somebody will take care of it.
>     We already had a lot of customer cases. It made no sense to leave so
>     many negative dentry in the system, it caused memory fragmentation
>     and
>     not much benefit.
>
>
> Dcache could grow so big only if the system lacks of memory pressure.
>
> Simplest solution is a cronjob which provides such pressure by
> creating sparse file on disk-based fs and then reading it.
> This should wash away all inactive caches with no IO and zero chance 
> of oom.
Sound good, will try.
>
>     >
>     > First part which collects negative dentries at the end list of
>     > siblings could be
>     > done in a more obvious way by splitting the list in two.
>     > But this touches much more code.
>     That would add new field to dentry?
>
>
> Yep. Decision is up to maintainers.
>
>     >
>     > Last patch isn't very rigid but does non-trivial changes.
>     > Probably it's better to call some garbage collector thingy
>     periodically.
>     > Lru list needs pressure to age and reorder entries properly.
>
>     Swap the negative dentry to the head of hash list when it get
>     accessed?
>     Extra ones can be easily trimmed when swapping, using GC is to reduce
>     perf impact?
>
>
> Reclaimer/shrinker scans denties in LRU lists, it's an another list.

Ah, you mean a GC to reclaim from the LRU list. I am not sure it could
keep up with the speed at which negative dentries are generated.

Thanks,

Junxiao.

> My patch used order in hash lists is a very unusual way. Don't be 
> confused.
>
> There are four lists
> parent - siblings
> hashtable - hashchain
> LRU
> inode - alias
>
>
>     Thanks,
>
>     Junxioao.
>
>     >
>     > Gc could be off by default or thresholds set very high (50% of
>     ram for
>     > example).
>     > Final setup could be left up to owners of large systems, which
>     needs
>     > fine tuning.
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 0/8] dcache: increase poison resistance
  2020-12-14 23:10         ` Junxiao Bi
@ 2020-12-16 18:46           ` Junxiao Bi
  2020-12-17 15:47             ` Konstantin Khlebnikov
  0 siblings, 1 reply; 27+ messages in thread
From: Junxiao Bi @ 2020-12-16 18:46 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Konstantin Khlebnikov, Linux Kernel Mailing List, linux-fsdevel,
	linux-mm, Alexander Viro, Waiman Long, Gautham Ananthakrishna,
	matthew.wilcox

Hi Konstantin,

How would you like to proceed with this patch set?

This patchset as it is already fixes the customer issue we faced: it
stops the memory fragmentation caused by negative dentries, and our
tests show no performance regression. In production workloads it is
common that some app keeps creating and removing tmp files; this leaves
a lot of negative dentries over time, and some time later it causes
memory fragmentation and the system gets stuck in memory compaction and
becomes unresponsive. It would be good to push it for an upstream merge.
If you are busy, we can try to push it again.

Thanks,

Junxiao.

On 12/14/20 3:10 PM, Junxiao Bi wrote:
> On 12/13/20 11:43 PM, Konstantin Khlebnikov wrote:
>
>>
>>
>> On Sun, Dec 13, 2020 at 9:52 PM Junxiao Bi <junxiao.bi@oracle.com 
>> <mailto:junxiao.bi@oracle.com>> wrote:
>>
>>     On 12/11/20 11:32 PM, Konstantin Khlebnikov wrote:
>>
>>     > On Thu, Dec 10, 2020 at 2:01 AM Junxiao Bi
>>     <junxiao.bi@oracle.com <mailto:junxiao.bi@oracle.com>
>>     > <mailto:junxiao.bi@oracle.com <mailto:junxiao.bi@oracle.com>>>
>>     wrote:
>>     >
>>     >     Hi Konstantin,
>>     >
>>     >     We tested this patch set recently and found it limiting 
>> negative
>>     >     dentry
>>     >     to a small part of total memory. And also we don't see any
>>     >     performance
>>     >     regression on it. Do you have any plan to integrate it into
>>     >     mainline? It
>>     >     will help a lot on memory fragmentation issue causing by
>>     dentry slab,
>>     >     there were a lot of customer cases where sys% was very high
>>     since
>>     >     most
>>     >     cpu were doing memory compaction, dentry slab was taking too
>>     much
>>     >     memory
>>     >     and nearly all dentry there were negative.
>>     >
>>     >
>>     > Right now I don't have any plans for this. I suspect such
>>     problems will
>>     > appear much more often since machines are getting bigger.
>>     > So, somebody will take care of it.
>>     We already had a lot of customer cases. It made no sense to leave so
>>     many negative dentry in the system, it caused memory fragmentation
>>     and
>>     not much benefit.
>>
>>
>> Dcache could grow so big only if the system lacks of memory pressure.
>>
>> Simplest solution is a cronjob which provides such pressure by
>> creating sparse file on disk-based fs and then reading it.
>> This should wash away all inactive caches with no IO and zero chance 
>> of oom.
> Sound good, will try.
>>
>>     >
>>     > First part which collects negative dentries at the end list of
>>     > siblings could be
>>     > done in a more obvious way by splitting the list in two.
>>     > But this touches much more code.
>>     That would add new field to dentry?
>>
>>
>> Yep. Decision is up to maintainers.
>>
>>     >
>>     > Last patch isn't very rigid but does non-trivial changes.
>>     > Probably it's better to call some garbage collector thingy
>>     periodically.
>>     > Lru list needs pressure to age and reorder entries properly.
>>
>>     Swap the negative dentry to the head of hash list when it get
>>     accessed?
>>     Extra ones can be easily trimmed when swapping, using GC is to 
>> reduce
>>     perf impact?
>>
>>
>> Reclaimer/shrinker scans denties in LRU lists, it's an another list.
>
> Ah, you mean GC to reclaim from LRU list. I am not sure it could catch 
> up the speed of negative dentry generating.
>
> Thanks,
>
> Junxiao.
>
>> My patch used order in hash lists is a very unusual way. Don't be 
>> confused.
>>
>> There are four lists
>> parent - siblings
>> hashtable - hashchain
>> LRU
>> inode - alias
>>
>>
>>     Thanks,
>>
>>     Junxioao.
>>
>>     >
>>     > Gc could be off by default or thresholds set very high (50% of
>>     ram for
>>     > example).
>>     > Final setup could be left up to owners of large systems, which
>>     needs
>>     > fine tuning.
>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 0/8] dcache: increase poison resistance
  2020-12-16 18:46           ` Junxiao Bi
@ 2020-12-17 15:47             ` Konstantin Khlebnikov
  0 siblings, 0 replies; 27+ messages in thread
From: Konstantin Khlebnikov @ 2020-12-17 15:47 UTC (permalink / raw)
  To: Junxiao Bi
  Cc: Konstantin Khlebnikov, Linux Kernel Mailing List, linux-fsdevel,
	linux-mm, Alexander Viro, Waiman Long, Gautham Ananthakrishna,
	matthew.wilcox

On Wed, Dec 16, 2020 at 9:47 PM Junxiao Bi <junxiao.bi@oracle.com> wrote:

> Hi Konstantin,
>
> How would you like to proceed with this patch set?
>
> This patchset as it is already fixed the customer issue we faced, it
> will stop memory fragmentation causing by negative dentry and no
> performance regression through our test. In production workload, it is
> common that some app kept creating and removing tmp files, this will
> leave a lot of negative dentry over time, some time later, it will cause
> memory fragmentation and system run into memory compaction and not
> responsible. It will be good to push it to upstream merge. If you are
> busy, we can try push it again.
>

Feel free to try.

You may drop the first chunk of changes, which fixes slow inotify, and
keep only the dances around the hash chains.


Thanks,
>
> Junxiao.
>
> On 12/14/20 3:10 PM, Junxiao Bi wrote:
> > On 12/13/20 11:43 PM, Konstantin Khlebnikov wrote:
> >
> >>
> >>
> >> On Sun, Dec 13, 2020 at 9:52 PM Junxiao Bi <junxiao.bi@oracle.com
> >> <mailto:junxiao.bi@oracle.com>> wrote:
> >>
> >>     On 12/11/20 11:32 PM, Konstantin Khlebnikov wrote:
> >>
> >>     > On Thu, Dec 10, 2020 at 2:01 AM Junxiao Bi
> >>     <junxiao.bi@oracle.com <mailto:junxiao.bi@oracle.com>
> >>     > <mailto:junxiao.bi@oracle.com <mailto:junxiao.bi@oracle.com>>>
> >>     wrote:
> >>     >
> >>     >     Hi Konstantin,
> >>     >
> >>     >     We tested this patch set recently and found it limiting
> >> negative
> >>     >     dentry
> >>     >     to a small part of total memory. And also we don't see any
> >>     >     performance
> >>     >     regression on it. Do you have any plan to integrate it into
> >>     >     mainline? It
> >>     >     will help a lot on memory fragmentation issue causing by
> >>     dentry slab,
> >>     >     there were a lot of customer cases where sys% was very high
> >>     since
> >>     >     most
> >>     >     cpu were doing memory compaction, dentry slab was taking too
> >>     much
> >>     >     memory
> >>     >     and nearly all dentry there were negative.
> >>     >
> >>     >
> >>     > Right now I don't have any plans for this. I suspect such
> >>     problems will
> >>     > appear much more often since machines are getting bigger.
> >>     > So, somebody will take care of it.
> >>     We already had a lot of customer cases. It made no sense to leave so
> >>     many negative dentry in the system, it caused memory fragmentation
> >>     and
> >>     not much benefit.
> >>
> >>
> >> Dcache could grow so big only if the system lacks of memory pressure.
> >>
> >> Simplest solution is a cronjob which provides such pressure by
> >> creating sparse file on disk-based fs and then reading it.
> >> This should wash away all inactive caches with no IO and zero chance
> >> of oom.
> > Sound good, will try.
> >>
> >>     >
> >>     > First part which collects negative dentries at the end list of
> >>     > siblings could be
> >>     > done in a more obvious way by splitting the list in two.
> >>     > But this touches much more code.
> >>     That would add new field to dentry?
> >>
> >>
> >> Yep. Decision is up to maintainers.
> >>
> >>     >
> >>     > Last patch isn't very rigid but does non-trivial changes.
> >>     > Probably it's better to call some garbage collector thingy
> >>     periodically.
> >>     > Lru list needs pressure to age and reorder entries properly.
> >>
> >>     Swap the negative dentry to the head of hash list when it get
> >>     accessed?
> >>     Extra ones can be easily trimmed when swapping, using GC is to
> >> reduce
> >>     perf impact?
> >>
> >>
> >> Reclaimer/shrinker scans denties in LRU lists, it's an another list.
> >
> > Ah, you mean GC to reclaim from LRU list. I am not sure it could catch
> > up the speed of negative dentry generating.
> >
> > Thanks,
> >
> > Junxiao.
> >
> >> My patch used order in hash lists is a very unusual way. Don't be
> >> confused.
> >>
> >> There are four lists
> >> parent - siblings
> >> hashtable - hashchain
> >> LRU
> >> inode - alias
> >>
> >>
> >>     Thanks,
> >>
> >>     Junxioao.
> >>
> >>     >
> >>     > Gc could be off by default or thresholds set very high (50% of
> >>     ram for
> >>     > example).
> >>     > Final setup could be left up to owners of large systems, which
> >>     needs
> >>     > fine tuning.
> >>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2020-12-17 15:47 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-08 12:23 [PATCH RFC 0/8] dcache: increase poison resistance Konstantin Khlebnikov
2020-05-08 12:23 ` [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state Konstantin Khlebnikov
2020-05-08 14:49   ` Waiman Long
2020-05-08 16:16     ` Konstantin Khlebnikov
2020-05-08 19:05       ` Waiman Long
2020-05-08 19:38         ` Konstantin Khlebnikov
2020-05-08 20:00           ` Waiman Long
2020-05-08 20:03             ` Matthew Wilcox
2020-05-08 12:23 ` [PATCH RFC 2/8] selftests: add stress testing tool for dcache Konstantin Khlebnikov
2020-05-13  1:52   ` Dave Chinner
2020-05-08 12:23 ` [PATCH RFC 3/8] dcache: sweep cached negative dentries to the end of list of siblings Konstantin Khlebnikov
2020-05-08 19:38   ` Waiman Long
2020-05-08 12:23 ` [PATCH RFC 4/8] fsnotify: stop walking child dentries if remaining tail is negative Konstantin Khlebnikov
2020-05-08 12:23 ` [PATCH RFC 5/8] dcache: add action D_WALK_SKIP_SIBLINGS to d_walk() Konstantin Khlebnikov
2020-05-08 12:23 ` [PATCH RFC 6/8] dcache: stop walking siblings if remaining dentries all negative Konstantin Khlebnikov
2020-05-08 12:23 ` [PATCH RFC 7/8] dcache: push releasing dentry lock into sweep_negative Konstantin Khlebnikov
2020-05-08 12:23 ` [PATCH RFC 8/8] dcache: prevent flooding with negative dentries Konstantin Khlebnikov
2020-05-08 14:56   ` Matthew Wilcox
2020-05-08 16:29     ` Konstantin Khlebnikov
2020-05-08 21:07   ` Waiman Long
2020-12-09 23:01 ` [PATCH RFC 0/8] dcache: increase poison resistance Junxiao Bi
2020-12-12  7:32   ` Konstantin Khlebnikov
2020-12-13 18:49     ` Junxiao Bi
2020-12-14  7:43       ` Konstantin Khlebnikov
2020-12-14 23:10         ` Junxiao Bi
2020-12-16 18:46           ` Junxiao Bi
2020-12-17 15:47             ` Konstantin Khlebnikov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.