linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC 0/6] fix the negative dentres bloating system memory usage
@ 2021-01-21 13:19 Gautham Ananthakrishna
  2021-01-21 13:19 ` [PATCH RFC 1/6] dcache: sweep cached negative dentries to the end of list of siblings Gautham Ananthakrishna
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: Gautham Ananthakrishna @ 2021-01-21 13:19 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: viro, matthew.wilcox, khlebnikov, gautham.ananthakrishna

For most filesystems result of every negative lookup is cached, content of
directories is usually cached too. Production of negative dentries isn't
limited with disk speed. It's really easy to generate millions of them if
system has enough memory.

Getting this memory back ins't that easy because slab frees pages only when
all related objects are gone. While dcache shrinker works in LRU order.

Typical scenario is an idle system where some process periodically creates
temporary files and removes them. After some time, memory will be filled
with negative dentries for these random file names.

Simple lookup of random names also generates negative dentries very fast.
Constant flow of such negative denries drains all other inactive caches.
Too many negative dentries in the system can cause memory fragmentation
and memory compaction.

Negative dentries are linked into siblings list along with normal positive
dentries. Some operations walks dcache tree but looks only for positive
dentries: most important is fsnotify/inotify. Hordes of negative dentries
slow down these operations significantly.

Time of dentry lookup is usually unaffected because hash table grows along
with size of memory. Unless somebody especially crafts hash collisions.

This patch set solves all of these problems:

Move negative denries to the end of sliblings list, thus walkers could
skip them at first sight (patches 1-4).

Keep in dcache at most three unreferenced negative denties in row in each
hash bucket (patches 5-6).

We tested this patch set recently and found it limiting negative dentry to a
small part of total memory. The following is the test result we ran on two
types of servers, one is 256G memory with 24 CPUS and another is 3T memory
with 384 CPUS. The test case is using a lot of processes to generate negative
dentry in parallel, the following is the test result after 72 hours, the
negative dentry number is stable around that number even after running longer
for much longer time. Without the patch set, in less than half an hour 197G was
taken by negative dentry on 256G system, in 1 day 2.4T was taken on 3T system.

system memory   neg-dentry-number   neg-dentry-mem-usage
256G            55259084            10.6G
3T              202306756           38.8G

For perf test, we ran the following, and no regression found.

1. create 1M negative dentry and then touch them to convert them to positive
   dentry

2. create 10K/100K/1M files

3. remove 10K/100K/1M files

4. kernel compile

To verify the fsnotify fix, we used inotifywait to watch file create/open in
some directory where there is a lot of negative dentry, without the patch set,
the system would run into soft lockup, with it, no soft lockup was found.

We also tried to defeat the limitation by making different processes generate
negative dentry with the same name, that will make one negative dentry being
accessed couple times around same time, DCACHE_REFERENCED will be set on it
and it can't be trimmed easily.

There were a lot of customer cases on this issue. It makes no sense to leave
so many negative dentry, it just causes memory fragmentation and compaction
and does not help a lot.

Konstantin Khlebnikov (6):
  dcache: sweep cached negative dentries to the end of list of siblings
  fsnotify: stop walking child dentries if remaining tail is negative
  dcache: add action D_WALK_SKIP_SIBLINGS to d_walk()
  dcache: stop walking siblings if remaining dentries all negative
  dcache: push releasing dentry lock into sweep_negative
  dcache: prevent flooding with negative dentries

 fs/dcache.c            | 135 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/libfs.c             |   3 ++
 fs/notify/fsnotify.c   |   6 ++-
 include/linux/dcache.h |   6 +++
 4 files changed, 145 insertions(+), 5 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH RFC 1/6] dcache: sweep cached negative dentries to the end of list of siblings
  2021-01-21 13:19 [PATCH RFC 0/6] fix the negative dentres bloating system memory usage Gautham Ananthakrishna
@ 2021-01-21 13:19 ` Gautham Ananthakrishna
  2021-04-14  3:00   ` Al Viro
  2021-04-14  3:41   ` Al Viro
  2021-01-21 13:19 ` [PATCH RFC 2/6] fsnotify: stop walking child dentries if remaining tail is negative Gautham Ananthakrishna
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 14+ messages in thread
From: Gautham Ananthakrishna @ 2021-01-21 13:19 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: viro, matthew.wilcox, khlebnikov, gautham.ananthakrishna

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

For disk filesystems result of every negative lookup is cached, content of
directories is usually cached too. Production of negative dentries isn't
limited with disk speed. It's really easy to generate millions of them if
system has enough memory. Negative dentries are linked into siblings list
along with normal positive dentries. Some operations walks dcache tree but
looks only for positive dentries: most important is fsnotify/inotify.

This patch moves negative dentries to the end of list at final dput() and
marks with flag which tells that all following dentries are negative too.
Reverse operation is required before instantiating negative dentry.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
---
 fs/dcache.c            | 59 +++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/dcache.h |  6 +++++
 2 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ea04858..a506169 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -632,6 +632,48 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
 	return __lock_parent(dentry);
 }
 
+/*
+ * Move cached negative dentry to the tail of parent->d_subdirs.
+ * This lets walkers skip them all together at first sight.
+ * Must be called at dput of negative dentry.
+ */
+static void sweep_negative(struct dentry *dentry)
+{
+	struct dentry *parent;
+
+	if (!d_is_tail_negative(dentry)) {
+		parent = lock_parent(dentry);
+		if (!parent)
+			return;
+
+		if (!d_count(dentry) && d_is_negative(dentry) &&
+		    !d_is_tail_negative(dentry)) {
+			dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
+			list_move_tail(&dentry->d_child, &parent->d_subdirs);
+		}
+
+		spin_unlock(&parent->d_lock);
+	}
+}
+
+/*
+ * Undo sweep_negative() and move to the head of parent->d_subdirs.
+ * Must be called before converting negative dentry into positive.
+ */
+static void recycle_negative(struct dentry *dentry)
+{
+	struct dentry *parent;
+
+	spin_lock(&dentry->d_lock);
+	parent = lock_parent(dentry);
+	dentry->d_flags &= ~DCACHE_TAIL_NEGATIVE;
+	if (parent) {
+		list_move(&dentry->d_child, &parent->d_subdirs);
+		spin_unlock(&parent->d_lock);
+	}
+	spin_unlock(&dentry->d_lock);
+}
+
 static inline bool retain_dentry(struct dentry *dentry)
 {
 	WARN_ON(d_in_lookup(dentry));
@@ -737,7 +779,7 @@ static struct dentry *dentry_kill(struct dentry *dentry)
 static inline bool fast_dput(struct dentry *dentry)
 {
 	int ret;
-	unsigned int d_flags;
+	unsigned int d_flags, required;
 
 	/*
 	 * If we have a d_op->d_delete() operation, we sould not
@@ -785,6 +827,8 @@ static inline bool fast_dput(struct dentry *dentry)
 	 * a 'delete' op, and it's referenced and already on
 	 * the LRU list.
 	 *
+	 * Cached negative dentry must be swept to the tail.
+	 *
 	 * NOTE! Since we aren't locked, these values are
 	 * not "stable". However, it is sufficient that at
 	 * some point after we dropped the reference the
@@ -796,10 +840,15 @@ static inline bool fast_dput(struct dentry *dentry)
 	 */
 	smp_rmb();
 	d_flags = READ_ONCE(dentry->d_flags);
-	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST | DCACHE_DISCONNECTED;
+
+	required = DCACHE_REFERENCED | DCACHE_LRU_LIST |
+		(d_flags_negative(d_flags) ? DCACHE_TAIL_NEGATIVE : 0);
+
+	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST |
+		DCACHE_DISCONNECTED | DCACHE_TAIL_NEGATIVE;
 
 	/* Nothing to do? Dropping the reference was all we needed? */
-	if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && !d_unhashed(dentry))
+	if (d_flags == required && !d_unhashed(dentry))
 		return true;
 
 	/*
@@ -871,6 +920,8 @@ void dput(struct dentry *dentry)
 		rcu_read_unlock();
 
 		if (likely(retain_dentry(dentry))) {
+			if (d_is_negative(dentry))
+				sweep_negative(dentry);
 			spin_unlock(&dentry->d_lock);
 			return;
 		}
@@ -1970,6 +2021,8 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
 {
 	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
 	if (inode) {
+		if (d_is_tail_negative(entry))
+			recycle_negative(entry);
 		security_d_instantiate(entry, inode);
 		spin_lock(&inode->i_lock);
 		__d_instantiate(entry, inode);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 6f95c33..5f4ce3a 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -219,6 +219,7 @@ struct dentry_operations {
 #define DCACHE_PAR_LOOKUP		0x10000000 /* being looked up (with parent locked shared) */
 #define DCACHE_DENTRY_CURSOR		0x20000000
 #define DCACHE_NORCU			0x40000000 /* No RCU delay for freeing */
+#define DCACHE_TAIL_NEGATIVE		0x80000000 /* All following siblings are negative */
 
 extern seqlock_t rename_lock;
 
@@ -495,6 +496,11 @@ static inline int simple_positive(const struct dentry *dentry)
 	return d_really_is_positive(dentry) && !d_unhashed(dentry);
 }
 
+static inline bool d_is_tail_negative(const struct dentry *dentry)
+{
+	return unlikely(dentry->d_flags & DCACHE_TAIL_NEGATIVE);
+}
+
 extern void d_set_fallthru(struct dentry *dentry);
 
 static inline bool d_is_fallthru(const struct dentry *dentry)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH RFC 2/6] fsnotify: stop walking child dentries if remaining tail is negative
  2021-01-21 13:19 [PATCH RFC 0/6] fix the negative dentres bloating system memory usage Gautham Ananthakrishna
  2021-01-21 13:19 ` [PATCH RFC 1/6] dcache: sweep cached negative dentries to the end of list of siblings Gautham Ananthakrishna
@ 2021-01-21 13:19 ` Gautham Ananthakrishna
  2021-01-21 13:19 ` [PATCH RFC 3/6] dcache: add action D_WALK_SKIP_SIBLINGS to d_walk() Gautham Ananthakrishna
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Gautham Ananthakrishna @ 2021-01-21 13:19 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: viro, matthew.wilcox, khlebnikov, gautham.ananthakrishna

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

When notification starts/stops listening events from inode's children it
have to update dentry->d_flags of all positive child dentries. Scanning
may took a long time if directory has a lot of negative child dentries.

This is main beneficiary of sweeping cached negative dentries to the end.

Before patch:

nr_dentry = 24172597    24.2M
nr_buckets = 8388608    2.9 avg
nr_unused = 24158110    99.9%
nr_negative = 24142810  99.9%

inotify time: 0.507182 seconds

After patch:

nr_dentry = 24562747    24.6M
nr_buckets = 8388608    2.9 avg
nr_unused = 24548714    99.9%
nr_negative = 24543867  99.9%

inotify time: 0.000010 seconds

Negative dentries no longer slow down inotify op at parent directory.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
---
 fs/notify/fsnotify.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 8d3ad5e..4ccb59d 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -127,8 +127,12 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
 		 * original inode) */
 		spin_lock(&alias->d_lock);
 		list_for_each_entry(child, &alias->d_subdirs, d_child) {
-			if (!child->d_inode)
+			if (!child->d_inode) {
+				/* all remaining children are negative */
+				if (d_is_tail_negative(child))
+					break;
 				continue;
+			}
 
 			spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
 			if (watched)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH RFC 3/6] dcache: add action D_WALK_SKIP_SIBLINGS to d_walk()
  2021-01-21 13:19 [PATCH RFC 0/6] fix the negative dentres bloating system memory usage Gautham Ananthakrishna
  2021-01-21 13:19 ` [PATCH RFC 1/6] dcache: sweep cached negative dentries to the end of list of siblings Gautham Ananthakrishna
  2021-01-21 13:19 ` [PATCH RFC 2/6] fsnotify: stop walking child dentries if remaining tail is negative Gautham Ananthakrishna
@ 2021-01-21 13:19 ` Gautham Ananthakrishna
  2021-01-21 13:19 ` [PATCH RFC 4/6] dcache: stop walking siblings if remaining dentries all negative Gautham Ananthakrishna
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Gautham Ananthakrishna @ 2021-01-21 13:19 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: viro, matthew.wilcox, khlebnikov, gautham.ananthakrishna

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

This lets skip remaining siblings at seeing d_is_tail_negative().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
---
 fs/dcache.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index a506169..894e6da 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1320,12 +1320,14 @@ void shrink_dcache_sb(struct super_block *sb)
  * @D_WALK_QUIT:	quit walk
  * @D_WALK_NORETRY:	quit when retry is needed
  * @D_WALK_SKIP:	skip this dentry and its children
+ * @D_WALK_SKIP_SIBLINGS: skip siblings and their children
  */
 enum d_walk_ret {
 	D_WALK_CONTINUE,
 	D_WALK_QUIT,
 	D_WALK_NORETRY,
 	D_WALK_SKIP,
+	D_WALK_SKIP_SIBLINGS,
 };
 
 /**
@@ -1356,6 +1358,7 @@ static void d_walk(struct dentry *parent, void *data,
 		break;
 	case D_WALK_QUIT:
 	case D_WALK_SKIP:
+	case D_WALK_SKIP_SIBLINGS:
 		goto out_unlock;
 	case D_WALK_NORETRY:
 		retry = false;
@@ -1387,6 +1390,9 @@ static void d_walk(struct dentry *parent, void *data,
 		case D_WALK_SKIP:
 			spin_unlock(&dentry->d_lock);
 			continue;
+		case D_WALK_SKIP_SIBLINGS:
+			spin_unlock(&dentry->d_lock);
+			goto skip_siblings;
 		}
 
 		if (!list_empty(&dentry->d_subdirs)) {
@@ -1398,6 +1404,7 @@ static void d_walk(struct dentry *parent, void *data,
 		}
 		spin_unlock(&dentry->d_lock);
 	}
+skip_siblings:
 	/*
 	 * All done at this level ... ascend and resume the search.
 	 */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH RFC 4/6] dcache: stop walking siblings if remaining dentries all negative
  2021-01-21 13:19 [PATCH RFC 0/6] fix the negative dentres bloating system memory usage Gautham Ananthakrishna
                   ` (2 preceding siblings ...)
  2021-01-21 13:19 ` [PATCH RFC 3/6] dcache: add action D_WALK_SKIP_SIBLINGS to d_walk() Gautham Ananthakrishna
@ 2021-01-21 13:19 ` Gautham Ananthakrishna
  2021-01-21 13:19 ` [PATCH RFC 5/6] dcache: push releasing dentry lock into sweep_negative Gautham Ananthakrishna
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Gautham Ananthakrishna @ 2021-01-21 13:19 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: viro, matthew.wilcox, khlebnikov, gautham.ananthakrishna

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

Most walkers are interested only in positive dentries.

Changes in simple_* libfs helpers are mostly cosmetic: it shouldn't cache
negative dentries unless uses d_delete other than always_delete_dentry().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
---
 fs/dcache.c | 9 +++++++++
 fs/libfs.c  | 3 +++
 2 files changed, 12 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 894e6da..492a42f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1459,6 +1459,8 @@ static enum d_walk_ret path_check_mount(void *data, struct dentry *dentry)
 	struct check_mount *info = data;
 	struct path path = { .mnt = info->mnt, .dentry = dentry };
 
+	if (d_is_tail_negative(dentry))
+		return D_WALK_SKIP_SIBLINGS;
 	if (likely(!d_mountpoint(dentry)))
 		return D_WALK_CONTINUE;
 	if (__path_is_mountpoint(&path)) {
@@ -1705,6 +1707,10 @@ void shrink_dcache_for_umount(struct super_block *sb)
 static enum d_walk_ret find_submount(void *_data, struct dentry *dentry)
 {
 	struct dentry **victim = _data;
+
+	if (d_is_tail_negative(dentry))
+		return D_WALK_SKIP_SIBLINGS;
+
 	if (d_mountpoint(dentry)) {
 		__dget_dlock(dentry);
 		*victim = dentry;
@@ -3174,6 +3180,9 @@ static enum d_walk_ret d_genocide_kill(void *data, struct dentry *dentry)
 {
 	struct dentry *root = data;
 	if (dentry != root) {
+		if (d_is_tail_negative(dentry))
+			return D_WALK_SKIP_SIBLINGS;
+
 		if (d_unhashed(dentry) || !dentry->d_inode)
 			return D_WALK_SKIP;
 
diff --git a/fs/libfs.c b/fs/libfs.c
index 7124c2e..15d5ecf 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -410,6 +410,9 @@ int simple_empty(struct dentry *dentry)
 
 	spin_lock(&dentry->d_lock);
 	list_for_each_entry(child, &dentry->d_subdirs, d_child) {
+		if (d_is_tail_negative(child))
+			break;
+
 		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
 		if (simple_positive(child)) {
 			spin_unlock(&child->d_lock);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH RFC 5/6] dcache: push releasing dentry lock into sweep_negative
  2021-01-21 13:19 [PATCH RFC 0/6] fix the negative dentres bloating system memory usage Gautham Ananthakrishna
                   ` (3 preceding siblings ...)
  2021-01-21 13:19 ` [PATCH RFC 4/6] dcache: stop walking siblings if remaining dentries all negative Gautham Ananthakrishna
@ 2021-01-21 13:19 ` Gautham Ananthakrishna
  2021-01-21 13:19 ` [PATCH RFC 6/6] dcache: prevent flooding with negative dentries Gautham Ananthakrishna
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Gautham Ananthakrishna @ 2021-01-21 13:19 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: viro, matthew.wilcox, khlebnikov, gautham.ananthakrishna

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

Release the dentry lock inside the sweep_negative() function.  This
is in preparation for a follow up patch and doesn't change runtime
behavior.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Gautham Ananthakrisha <gautham.ananthakrishna@oracle.com>
---
 fs/dcache.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 492a42f..22c990b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -638,13 +638,14 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
  * Must be called at dput of negative dentry.
  */
 static void sweep_negative(struct dentry *dentry)
+	__releases(dentry->d_lock)
 {
 	struct dentry *parent;
 
 	if (!d_is_tail_negative(dentry)) {
 		parent = lock_parent(dentry);
 		if (!parent)
-			return;
+			goto out;
 
 		if (!d_count(dentry) && d_is_negative(dentry) &&
 		    !d_is_tail_negative(dentry)) {
@@ -654,6 +655,8 @@ static void sweep_negative(struct dentry *dentry)
 
 		spin_unlock(&parent->d_lock);
 	}
+out:
+	spin_unlock(&dentry->d_lock);
 }
 
 /*
@@ -922,7 +925,8 @@ void dput(struct dentry *dentry)
 		if (likely(retain_dentry(dentry))) {
 			if (d_is_negative(dentry))
 				sweep_negative(dentry);
-			spin_unlock(&dentry->d_lock);
+			else
+				spin_unlock(&dentry->d_lock);
 			return;
 		}
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH RFC 6/6] dcache: prevent flooding with negative dentries
  2021-01-21 13:19 [PATCH RFC 0/6] fix the negative dentres bloating system memory usage Gautham Ananthakrishna
                   ` (4 preceding siblings ...)
  2021-01-21 13:19 ` [PATCH RFC 5/6] dcache: push releasing dentry lock into sweep_negative Gautham Ananthakrishna
@ 2021-01-21 13:19 ` Gautham Ananthakrishna
  2021-04-14  3:56   ` Al Viro
  2021-03-31 14:23 ` [PATCH RFC 0/6] fix the negative dentres bloating system memory usage Matthew Wilcox
  2021-04-14  2:40 ` Al Viro
  7 siblings, 1 reply; 14+ messages in thread
From: Gautham Ananthakrishna @ 2021-01-21 13:19 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: viro, matthew.wilcox, khlebnikov, gautham.ananthakrishna

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

Without memory pressure count of negative dentries isn't bounded.
They could consume all memory and drain all other inactive caches.

Typical scenario is an idle system where some process periodically creates
temporary files and removes them. After some time, memory will be filled
with negative dentries for these random file names. Reclaiming them took
some time because slab frees pages only when all related objects are gone.
Time of dentry lookup is usually unaffected because hash table grows along
with size of memory. Unless somebody especially crafts hash collisions.
Simple lookup of random names also generates negative dentries very fast.

This patch implements heuristic which detects such scenarios and prevents
unbounded growth of completely unneeded negative dentries. It keeps up to
three latest negative dentry in each bucket unless they were referenced.

At first dput of negative dentry when it swept to the tail of siblings
we'll also clear it's reference flag and look at next dentries in chain.
Then kill third in series of negative, unused and unreferenced denries.

This way each hash bucket will preserve three negative dentry to let them
get reference and survive. Adding positive or used dentry into hash chain
also protects few recent negative dentries. In result total size of dcache
asymptotically limited by count of buckets and positive or used dentries.

Before patch: tool 'dcache_stress' could fill entire memory with dentries.

nr_dentry = 104913261   104.9M
nr_buckets = 8388608    12.5 avg
nr_unused = 104898729   100.0%
nr_negative = 104883218 100.0%

After this patch count of dentries saturates at around 3 per bucket:

nr_dentry = 24619259    24.6M
nr_buckets = 8388608    2.9 avg
nr_unused = 24605226    99.9%
nr_negative = 24600351  99.9%

This heuristic isn't bulletproof and solves only most practical case.
It's easy to deceive: just touch same random name twice.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
---
 fs/dcache.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 22c990b..6281938 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -633,6 +633,58 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
 }
 
 /*
+ * Called at first dput of each negative dentry.
+ * Prevents filling cache with never reused negative dentries.
+ *
+ * This clears reference and then looks at following dentries in hash chain.
+ * If they are negative, unused and unreferenced then keep two and kill third.
+ */
+static void trim_negative(struct dentry *dentry)
+	__releases(dentry->d_lock)
+{
+	struct dentry *victim, *parent;
+	struct hlist_bl_node *next;
+	int keep = 2;
+
+	rcu_read_lock();
+
+	dentry->d_flags &= ~DCACHE_REFERENCED;
+	spin_unlock(&dentry->d_lock);
+
+	next = rcu_dereference_raw(dentry->d_hash.next);
+	while (1) {
+		victim = hlist_bl_entry(next, struct dentry, d_hash);
+
+		if (!next || d_count(victim) || !d_is_negative(victim) ||
+		    (victim->d_flags & DCACHE_REFERENCED)) {
+			rcu_read_unlock();
+			return;
+		}
+
+		if (!keep--)
+			break;
+
+		next = rcu_dereference_raw(next->next);
+	}
+
+	spin_lock(&victim->d_lock);
+	parent = lock_parent(victim);
+
+	rcu_read_unlock();
+
+	if (d_count(victim) || !d_is_negative(victim) ||
+	    (victim->d_flags & DCACHE_REFERENCED)) {
+		if (parent)
+			spin_unlock(&parent->d_lock);
+		spin_unlock(&victim->d_lock);
+		return;
+	}
+
+	__dentry_kill(victim);
+	dput(parent);
+}
+
+/*
  * Move cached negative dentry to the tail of parent->d_subdirs.
  * This lets walkers skip them all together at first sight.
  * Must be called at dput of negative dentry.
@@ -654,6 +706,8 @@ static void sweep_negative(struct dentry *dentry)
 		}
 
 		spin_unlock(&parent->d_lock);
+
+		return trim_negative(dentry);
 	}
 out:
 	spin_unlock(&dentry->d_lock);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 0/6] fix the negative dentres bloating system memory usage
  2021-01-21 13:19 [PATCH RFC 0/6] fix the negative dentres bloating system memory usage Gautham Ananthakrishna
                   ` (5 preceding siblings ...)
  2021-01-21 13:19 ` [PATCH RFC 6/6] dcache: prevent flooding with negative dentries Gautham Ananthakrishna
@ 2021-03-31 14:23 ` Matthew Wilcox
  2021-04-14  2:40 ` Al Viro
  7 siblings, 0 replies; 14+ messages in thread
From: Matthew Wilcox @ 2021-03-31 14:23 UTC (permalink / raw)
  To: Gautham Ananthakrishna
  Cc: linux-kernel, linux-fsdevel, linux-mm, viro, matthew.wilcox, khlebnikov

Ping?  These patches are looking pretty good in our internal testing.

On Thu, Jan 21, 2021 at 06:49:39PM +0530, Gautham Ananthakrishna wrote:
> For most filesystems result of every negative lookup is cached, content of
> directories is usually cached too. Production of negative dentries isn't
> limited with disk speed. It's really easy to generate millions of them if
> system has enough memory.
> 
> Getting this memory back ins't that easy because slab frees pages only when
> all related objects are gone. While dcache shrinker works in LRU order.
> 
> Typical scenario is an idle system where some process periodically creates
> temporary files and removes them. After some time, memory will be filled
> with negative dentries for these random file names.
> 
> Simple lookup of random names also generates negative dentries very fast.
> Constant flow of such negative denries drains all other inactive caches.
> Too many negative dentries in the system can cause memory fragmentation
> and memory compaction.
> 
> Negative dentries are linked into siblings list along with normal positive
> dentries. Some operations walks dcache tree but looks only for positive
> dentries: most important is fsnotify/inotify. Hordes of negative dentries
> slow down these operations significantly.
> 
> Time of dentry lookup is usually unaffected because hash table grows along
> with size of memory. Unless somebody especially crafts hash collisions.
> 
> This patch set solves all of these problems:
> 
> Move negative denries to the end of sliblings list, thus walkers could
> skip them at first sight (patches 1-4).
> 
> Keep in dcache at most three unreferenced negative denties in row in each
> hash bucket (patches 5-6).
> 
> We tested this patch set recently and found it limiting negative dentry to a
> small part of total memory. The following is the test result we ran on two
> types of servers, one is 256G memory with 24 CPUS and another is 3T memory
> with 384 CPUS. The test case is using a lot of processes to generate negative
> dentry in parallel, the following is the test result after 72 hours, the
> negative dentry number is stable around that number even after running longer
> for much longer time. Without the patch set, in less than half an hour 197G was
> taken by negative dentry on 256G system, in 1 day 2.4T was taken on 3T system.
> 
> system memory   neg-dentry-number   neg-dentry-mem-usage
> 256G            55259084            10.6G
> 3T              202306756           38.8G
> 
> For perf test, we ran the following, and no regression found.
> 
> 1. create 1M negative dentry and then touch them to convert them to positive
>    dentry
> 
> 2. create 10K/100K/1M files
> 
> 3. remove 10K/100K/1M files
> 
> 4. kernel compile
> 
> To verify the fsnotify fix, we used inotifywait to watch file create/open in
> some directory where there is a lot of negative dentry, without the patch set,
> the system would run into soft lockup, with it, no soft lockup was found.
> 
> We also tried to defeat the limitation by making different processes generate
> negative dentry with the same name, that will make one negative dentry being
> accessed couple times around same time, DCACHE_REFERENCED will be set on it
> and it can't be trimmed easily.
> 
> There were a lot of customer cases on this issue. It makes no sense to leave
> so many negative dentry, it just causes memory fragmentation and compaction
> and does not help a lot.
> 
> Konstantin Khlebnikov (6):
>   dcache: sweep cached negative dentries to the end of list of siblings
>   fsnotify: stop walking child dentries if remaining tail is negative
>   dcache: add action D_WALK_SKIP_SIBLINGS to d_walk()
>   dcache: stop walking siblings if remaining dentries all negative
>   dcache: push releasing dentry lock into sweep_negative
>   dcache: prevent flooding with negative dentries
> 
>  fs/dcache.c            | 135 +++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/libfs.c             |   3 ++
>  fs/notify/fsnotify.c   |   6 ++-
>  include/linux/dcache.h |   6 +++
>  4 files changed, 145 insertions(+), 5 deletions(-)
> 
> -- 
> 1.8.3.1
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 0/6] fix the negative dentres bloating system memory usage
  2021-01-21 13:19 [PATCH RFC 0/6] fix the negative dentres bloating system memory usage Gautham Ananthakrishna
                   ` (6 preceding siblings ...)
  2021-03-31 14:23 ` [PATCH RFC 0/6] fix the negative dentres bloating system memory usage Matthew Wilcox
@ 2021-04-14  2:40 ` Al Viro
  7 siblings, 0 replies; 14+ messages in thread
From: Al Viro @ 2021-04-14  2:40 UTC (permalink / raw)
  To: Gautham Ananthakrishna
  Cc: linux-kernel, linux-fsdevel, linux-mm, matthew.wilcox, khlebnikov

On Thu, Jan 21, 2021 at 06:49:39PM +0530, Gautham Ananthakrishna wrote:

> We tested this patch set recently and found it limiting negative dentry to a
> small part of total memory. The following is the test result we ran on two
> types of servers, one is 256G memory with 24 CPUS and another is 3T memory
> with 384 CPUS. The test case is using a lot of processes to generate negative
> dentry in parallel, the following is the test result after 72 hours, the
> negative dentry number is stable around that number even after running longer
> for much longer time. Without the patch set, in less than half an hour 197G was
> taken by negative dentry on 256G system, in 1 day 2.4T was taken on 3T system.
> 
> system memory   neg-dentry-number   neg-dentry-mem-usage
> 256G            55259084            10.6G
> 3T              202306756           38.8G
> 
> For perf test, we ran the following, and no regression found.
> 
> 1. create 1M negative dentry and then touch them to convert them to positive
>    dentry
> 
> 2. create 10K/100K/1M files
> 
> 3. remove 10K/100K/1M files
> 
> 4. kernel compile

Good for you; how would that work for thinner boxen, though?  I agree that if you
have 8M hash buckets your "no more than 3 unused negatives per bucket" is generous
enough for everything, but that's less obvious for something with e.g 4 or 8 gigs.
And believe it or not, there are real-world boxen like that ;-)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 1/6] dcache: sweep cached negative dentries to the end of list of siblings
  2021-01-21 13:19 ` [PATCH RFC 1/6] dcache: sweep cached negative dentries to the end of list of siblings Gautham Ananthakrishna
@ 2021-04-14  3:00   ` Al Viro
  2021-04-15 16:50     ` Al Viro
  2021-04-14  3:41   ` Al Viro
  1 sibling, 1 reply; 14+ messages in thread
From: Al Viro @ 2021-04-14  3:00 UTC (permalink / raw)
  To: Gautham Ananthakrishna
  Cc: linux-kernel, linux-fsdevel, linux-mm, matthew.wilcox, khlebnikov

On Thu, Jan 21, 2021 at 06:49:40PM +0530, Gautham Ananthakrishna wrote:
> From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> 
> For disk filesystems result of every negative lookup is cached, content of
> directories is usually cached too. Production of negative dentries isn't
> limited with disk speed. It's really easy to generate millions of them if
> system has enough memory. Negative dentries are linked into siblings list
> along with normal positive dentries. Some operations walks dcache tree but
> looks only for positive dentries: most important is fsnotify/inotify.
> 
> This patch moves negative dentries to the end of list at final dput() and
> marks with flag which tells that all following dentries are negative too.
> Reverse operation is required before instantiating negative dentry.

> +static void sweep_negative(struct dentry *dentry)
> +{
> +	struct dentry *parent;
> +
> +	if (!d_is_tail_negative(dentry)) {
> +		parent = lock_parent(dentry);
> +		if (!parent)
> +			return;
> +
> +		if (!d_count(dentry) && d_is_negative(dentry) &&
> +		    !d_is_tail_negative(dentry)) {
> +			dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
> +			list_move_tail(&dentry->d_child, &parent->d_subdirs);
> +		}
> +
> +		spin_unlock(&parent->d_lock);
> +	}
> +}

Ugh...  So when dput() drives the refcount down to 0 you hit lock_parent()
and only then bother to check if the sucker had been negative in the first
place?

> @@ -1970,6 +2021,8 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
>  {
>  	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
>  	if (inode) {
> +		if (d_is_tail_negative(entry))
> +			recycle_negative(entry);
>  		security_d_instantiate(entry, inode);
>  		spin_lock(&inode->i_lock);
>  		__d_instantiate(entry, inode);

Wait a bloody minute.  What about d_instantiate_new() right next to it?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 1/6] dcache: sweep cached negative dentries to the end of list of siblings
  2021-01-21 13:19 ` [PATCH RFC 1/6] dcache: sweep cached negative dentries to the end of list of siblings Gautham Ananthakrishna
  2021-04-14  3:00   ` Al Viro
@ 2021-04-14  3:41   ` Al Viro
  2021-04-15 16:25     ` Al Viro
  1 sibling, 1 reply; 14+ messages in thread
From: Al Viro @ 2021-04-14  3:41 UTC (permalink / raw)
  To: Gautham Ananthakrishna
  Cc: linux-kernel, linux-fsdevel, linux-mm, matthew.wilcox, khlebnikov

On Thu, Jan 21, 2021 at 06:49:40PM +0530, Gautham Ananthakrishna wrote:

> +static void sweep_negative(struct dentry *dentry)
> +{
> +	struct dentry *parent;
> +
> +	if (!d_is_tail_negative(dentry)) {
> +		parent = lock_parent(dentry);
> +		if (!parent)
> +			return;

Wait a minute.  It's not a good environment for calling lock_parent().
Who said that dentry won't get freed right under it?

Right now callers of __lock_parent() either hold a reference to dentry
*or* are called for a positive dentry, with inode->i_lock held.
You are introducing something very different - 

>  		if (likely(retain_dentry(dentry))) {
> +			if (d_is_negative(dentry))
> +				sweep_negative(dentry);
>  			spin_unlock(&dentry->d_lock);

Here we can be called for a negative dentry with refcount already *NOT*
held by us.  Look:

static inline struct dentry *lock_parent(struct dentry *dentry)
{
        struct dentry *parent = dentry->d_parent;
	if (IS_ROOT(dentry))
		return NULL;
isn't a root

	if (likely(spin_trylock(&parent->d_lock)))
		return parent;

no such luck - someone's already holding parent's ->d_lock

	return __lock_parent(dentry);
and here we have
static struct dentry *__lock_parent(struct dentry *dentry)
{
	struct dentry *parent;
	rcu_read_lock();  

OK, anything we see in its ->d_parent is guaranteed to stay
allocated until we get to matching rcu_read_unlock()

	spin_unlock(&dentry->d_lock);
dropped the spinlock, now it's fair game for d_move(), d_drop(), etc.

again:
	parent = READ_ONCE(dentry->d_parent);
dentry couldn't have been reused, so it's the last value stored there.
Points to still allocated struct dentry instance, so we can...

	spin_lock(&parent->d_lock);
grab its ->d_lock.

	/*
	 * We can't blindly lock dentry until we are sure
	 * that we won't violate the locking order.
	 * Any changes of dentry->d_parent must have
	 * been done with parent->d_lock held, so
	 * spin_lock() above is enough of a barrier
	 * for checking if it's still our child.
	 */
	if (unlikely(parent != dentry->d_parent)) {
		spin_unlock(&parent->d_lock);
		goto again;
	}
Nevermind, it's still equal to our ->d_parent.  So we have
the last valid parent's ->d_lock held

	rcu_read_unlock();
What's to hold dentry allocated now?  IF we held its refcount - no
problem, it can't go away.  If we held its ->d_inode->i_lock - ditto
(it wouldn't get to __dentry_kill() until we drop that, since all
callers do acquire that lock and it couldn't get scheduled for
freeing until it gets through most of __dentry_kill()).

IOW, we are free to grab dentry->d_lock again.
	if (parent != dentry)
		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
	else
		parent = NULL;
	return parent;
}

With your patch, though, you've got a call site where neither condition
is guaranteed.  Current kernel is fine - we are holding ->d_lock there,
and we don't touch dentry after it gets dropped.  Again, it can't get
scheduled for freeing until after we drop ->d_lock, so we are safe.
With that change, however, you've got a hard-to-hit memory corruptor
there...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 6/6] dcache: prevent flooding with negative dentries
  2021-01-21 13:19 ` [PATCH RFC 6/6] dcache: prevent flooding with negative dentries Gautham Ananthakrishna
@ 2021-04-14  3:56   ` Al Viro
  0 siblings, 0 replies; 14+ messages in thread
From: Al Viro @ 2021-04-14  3:56 UTC (permalink / raw)
  To: Gautham Ananthakrishna
  Cc: linux-kernel, linux-fsdevel, linux-mm, matthew.wilcox, khlebnikov

On Thu, Jan 21, 2021 at 06:49:45PM +0530, Gautham Ananthakrishna wrote:

> +	spin_lock(&victim->d_lock);
> +	parent = lock_parent(victim);
> +
> +	rcu_read_unlock();

Similar story.  As soon as you hit that rcu_read_unlock(), the memory
pointed to by victim might be reused.  If you have hit __lock_parent(),
victim->d_lock had been dropped and regained.  Which means that freeing
might've been already scheduled.  Unlike #1/6, here you won't get
memory corruption in lock_parent() itself, but...

> +
> +	if (d_count(victim) || !d_is_negative(victim) ||
> +	    (victim->d_flags & DCACHE_REFERENCED)) {
> +		if (parent)
> +			spin_unlock(&parent->d_lock);
> +		spin_unlock(&victim->d_lock);

... starting from here you just might.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 1/6] dcache: sweep cached negative dentries to the end of list of siblings
  2021-04-14  3:41   ` Al Viro
@ 2021-04-15 16:25     ` Al Viro
  0 siblings, 0 replies; 14+ messages in thread
From: Al Viro @ 2021-04-15 16:25 UTC (permalink / raw)
  To: Gautham Ananthakrishna
  Cc: linux-kernel, linux-fsdevel, linux-mm, matthew.wilcox, khlebnikov

On Wed, Apr 14, 2021 at 03:41:10AM +0000, Al Viro wrote:

> > +	if (!d_is_tail_negative(dentry)) {
> > +		parent = lock_parent(dentry);
> > +		if (!parent)
> > +			return;
> 
> Wait a minute.  It's not a good environment for calling lock_parent().
> Who said that dentry won't get freed right under it?

[snip]

FWIW, in that state (dentry->d_lock held) we have
	* stable ->d_flags
	* stable ->d_count
	* stable ->d_inode
IOW, we can bloody well check ->d_count *before* bothering with lock_parent().
It does not get rid of the problems with lifetimes, though.  We could
do something along the lines of

	rcu_read_lock()
	if retain_dentry()
		parent = NULL
		if that dentry might need to be moved in list
			parent = lock_parent()
			// if reached __dentry_kill(), d_count() will be -128,
			// so the check below will exclude those
			if that dentry does need to be moved
				move it to the end of list
		unlock dentry and parent (if any)
		rcu_read_unlock()
		return
here, but your other uses of lock_parent() also need attention.  And
recursive call of dput() in trim_negative() (#6/6) is really asking
for trouble.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 1/6] dcache: sweep cached negative dentries to the end of list of siblings
  2021-04-14  3:00   ` Al Viro
@ 2021-04-15 16:50     ` Al Viro
  0 siblings, 0 replies; 14+ messages in thread
From: Al Viro @ 2021-04-15 16:50 UTC (permalink / raw)
  To: Gautham Ananthakrishna
  Cc: linux-kernel, linux-fsdevel, linux-mm, matthew.wilcox, khlebnikov

On Wed, Apr 14, 2021 at 03:00:48AM +0000, Al Viro wrote:

> Ugh...  So when dput() drives the refcount down to 0 you hit lock_parent()
> and only then bother to check if the sucker had been negative in the first
                                              ^^^^^^^^^^^^^^^^^
                                              had zero refcount, of course.
> place?

> > @@ -1970,6 +2021,8 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
> >  {
> >  	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
> >  	if (inode) {
> > +		if (d_is_tail_negative(entry))
> > +			recycle_negative(entry);
> >  		security_d_instantiate(entry, inode);
> >  		spin_lock(&inode->i_lock);
> >  		__d_instantiate(entry, inode);
> 
> Wait a bloody minute.  What about d_instantiate_new() right next to it?

Another fun question: where's the proof that __d_add(dentry, non_NULL_inode)
won't happen to dentry marked tail-negative?  From a quick grep I see at
least one such place - on success cifs_do_create() does
        d_drop(direntry);
	d_add(direntry, newinode);
and it would bloody well evade what you are doing in d_instantiate().
Same seems to be true for nfs_link()...

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-04-15 16:51 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-21 13:19 [PATCH RFC 0/6] fix the negative dentres bloating system memory usage Gautham Ananthakrishna
2021-01-21 13:19 ` [PATCH RFC 1/6] dcache: sweep cached negative dentries to the end of list of siblings Gautham Ananthakrishna
2021-04-14  3:00   ` Al Viro
2021-04-15 16:50     ` Al Viro
2021-04-14  3:41   ` Al Viro
2021-04-15 16:25     ` Al Viro
2021-01-21 13:19 ` [PATCH RFC 2/6] fsnotify: stop walking child dentries if remaining tail is negative Gautham Ananthakrishna
2021-01-21 13:19 ` [PATCH RFC 3/6] dcache: add action D_WALK_SKIP_SIBLINGS to d_walk() Gautham Ananthakrishna
2021-01-21 13:19 ` [PATCH RFC 4/6] dcache: stop walking siblings if remaining dentries all negative Gautham Ananthakrishna
2021-01-21 13:19 ` [PATCH RFC 5/6] dcache: push releasing dentry lock into sweep_negative Gautham Ananthakrishna
2021-01-21 13:19 ` [PATCH RFC 6/6] dcache: prevent flooding with negative dentries Gautham Ananthakrishna
2021-04-14  3:56   ` Al Viro
2021-03-31 14:23 ` [PATCH RFC 0/6] fix the negative dentres bloating system memory usage Matthew Wilcox
2021-04-14  2:40 ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).