* [PATCH] vfs: add simple direct-mapped dcache lookup front-end
@ 2016-06-11 23:51 George Spelvin
  2016-06-12  1:16 ` Randy Dunlap
  2016-06-13  3:20 ` [PATCH v2] " George Spelvin
  0 siblings, 2 replies; 3+ messages in thread
From: George Spelvin @ 2016-06-11 23:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux, mszeredi, npiggin, paulmck, torvalds, viro

This is an old patch by Linus, which he asked me to fix the race
conditions in.  It is posted for comment on the RCU abuse (search for
"Evil RCU Hack") and on performance in general.

[Linus speaking, Thu May 31, 2012]

I've pushed __d_lookup_rcu() just about as far as I could, and it still
had some problems.

The problems were mainly due to:

- the complexity of the slow-case handling causes register spills,
- the hash chain lookup loop causes not only register pressure, but
  also the extra magic "mask off lock bit from the hash chain head
  pointer" etc. logic, and
- the hash list needs to be dynamically sized (we want *big* caches, but
  you can't use the same size for big and small machines), which causes
  the initial hash lookup itself to be more complex.

This looks like a viable solution to all three problems, and it is
actually surprisingly simple: make a trivial fixed-size direct-mapped L1
dentry cache.  No chains, no locking, no nothing.
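
To make that concrete, the fast path is roughly the following (the names
here are just for illustration, and all of the d_seq/RCU validity
checking that the real code below needs is left out):

/* Illustration only: a simplified sketch, not the actual patch code. */
#define L1_BITS 13			/* example: 8192 slots */
#define L1_SIZE (1u << L1_BITS)

/* One dentry pointer per slot, no chains, fixed size. */
static struct dentry *l1_table[L1_SIZE];

static unsigned int l1_slot(unsigned int hash, const struct dentry *parent)
{
	/* Mix the name hash with the parent pointer, fold down to L1_BITS. */
	hash += (unsigned long)parent / L1_CACHE_BYTES;
	return (hash + (hash >> L1_BITS)) & (L1_SIZE - 1);
}

static struct dentry *l1_lookup(const struct dentry *parent,
				const struct qstr *name)
{
	struct dentry *dentry =
		READ_ONCE(l1_table[l1_slot(name->hash, parent)]);

	/* One load and a few compares; any mismatch is just an L1 miss. */
	if (dentry && dentry->d_parent == parent &&
	    dentry->d_name.hash_len == name->hash_len &&
	    !dentry_cmp(dentry, name->name, name->len))
		return dentry;
	return NULL;	/* miss: fall back to the chained hash lookup */
}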

This gives measurable improvement on my microbenchmark, and gets good
hit-rates on both kernel compiles and even on something like "updatedb",
which I'd have expected to be one of the worst possible cases.
Apparently updatedb still ends up looking up the same files (/etc/fstab
etc) a lot.  So those good hit-rates seem to often be due to really
stupid programming, but hey, I think we all agree that "stupid
programming" is likely the common case that we generally do need to also
optimize for ;)

For my kernel compile benchmark ("make -j" on a fully built tree), the
profile shows (this is kernel-only profile, so user space overhead
removed):

8.19%  [k] link_path_walk
7.74%  [k] __d_lookup_rcu
5.66%  [k] selinux_inode_permission
3.73%  [k] do_lookup
2.86%  [k] path_lookupat
2.72%  [k] avc_has_perm_noaudit
2.71%  [k] inode_has_perm.isra.49.constprop.71
2.68%  [k] avc_lookup
2.51%  [k] generic_permission
...
0.78%  [k] __d_lookup_rcu_slow
...

where "__d_lookup_rcu_slow()" is the exact same old __d_lookup_rcu(), so
it's not really "slow", but it's quite noticeably slower than the new
streamlined __d_lookup_rcu().  And as you can tell, that means that we
must have a 90%+ hitrate in the new L1 dcache lookup, since we only see
10% as much time in the slow routine as in the L1 front-end.
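
Spelling out that arithmetic (assuming a slow-path call costs at least
as much as the L1 front-end check it follows):

	misses / lookups  <=  0.78% / 7.74%  ~=  0.10

so at most about one lookup in ten falls through to __d_lookup_rcu_slow(),
which is where the 90%+ hit-rate figure comes from.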

[George Spelvin speaking]

I fixed the race conditions in Linus's code, added Kconfig support,
and a number of comments.

I have two concerns about the performance of this code:

1) Since it was first written, RCU dcache lookup has gone completely
   lockless and is even faster.
2) By adding a shared data structure that is written randomly by multiple
   CPUs, this patch undoes a lot of that optimization.  Even if it's a
   win on a single-socket system, it may be a net loss on a larger one.
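
To put rough numbers on concern 2: with the 13-bit default, on a typical
64-bit machine with 64-byte cache lines, the L1 array is

	8192 slots * 8 bytes/pointer = 64 KiB  (8 slots per cache line)

and every successful slow-path lookup (and every __d_drop) rewrites one
of those shared lines, so unrelated lookups on different CPUs can keep
pulling cache lines away from each other.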

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
---
[Damn it, mail aliases screwed up.]

 fs/Kconfig  |  28 ++++++++
 fs/dcache.c | 207 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 231 insertions(+), 4 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index b8fcb416..3b111b77 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -8,6 +8,34 @@ menu "File systems"
 config DCACHE_WORD_ACCESS
        bool
 
+config L1_DCACHE_BITS
+	int "Dcache level-1 cache size (bits)"
+	range 0 20
+	default 0 if !EXPERT
+	default 0 if NUMA
+	default 10 if BASE_SMALL
+	default 13
+	help
+	  The Linus kernel maintains a large cache of "dentries"
+	  (directory entries) for the performance-critical task of
+	  converting file names to inodes.  This option enables a smaller
+	  direct-mapped "level-1 cache" in front of the main dcache.
+
+	  (This software "dcache" is quite different from the CPU's data
+	  cache, or "D-cache".	Sorry for the confusingly similar names.)
+
+	  This option specifies the size of this cache, as a power of 2.
+	  For example, 13 means 2^13 = 8192 entries in the L1 dcache.
+	  Specify 0 to turn off the L1 dcache entirely.
+
+	  The cost of enabling this is one pointer per entry, plus a
+	  small amount of code.
+
+	  This is an experimental feature which hopes to speed up
+	  single-socket machines.  On larger systems, the extra updates
+	  generated by the L1 dcache probably cause too much cache-line
+	  bouncing to be worth it.
+
 if BLOCK
 
 source "fs/ext2/Kconfig"
diff --git a/fs/dcache.c b/fs/dcache.c
index 35c989f2..941f5232 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -278,6 +278,44 @@ static inline int dname_external(const struct dentry *dentry)
 	return dentry->d_name.name != dentry->d_iname;
 }
 
+#if CONFIG_L1_DCACHE_BITS > 0
+/*
+ * Reschedule RCU reclamation for the *next* grace period.
+ * That's a longer wait than usually necessary, but this is so rare
+ * that optimizing it is not important.
+ */
+static void __d_free_delayed(struct rcu_head *head)
+{
+	struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
+
+	if (dname_external(dentry))
+		call_rcu(&dentry->d_u.d_rcu, __d_free_external);
+	else
+		call_rcu(&dentry->d_u.d_rcu, __d_free);
+}
+
+/*
+ * This is an Evil RCU Hack.  A rare race condition leaks an RCU-protected
+ * pointer to a freed dentry to another CPU.  An even rarer race results
+ * in the first CPU exiting an RCU read locked region and triggering
+ * the end of the grace period while the second CPU is still using the
+ * pointer.  So if we detect the first race happening, we call this to
+ * defer the RCU reclamation for a second grace period.
+ *
+ * The evil part: we can't remove it from the singly-linked RCU callback
+ * list, but we can safely change the callback function out from under
+ * the RCU code.
+ *
+ * The other evil part: This is a harmless no-op if the dentry is not
+ * on the RCU list at all.  We just need to be sure that the d_alias
+ * list, which overlaps with d_rcu, is not used.
+ */
+static void dentry_delay_free(struct dentry *dentry)
+{
+	WRITE_ONCE(dentry->d_u.d_rcu.func, __d_free_delayed);
+}
+#endif
+
 static inline void __d_set_inode_and_type(struct dentry *dentry,
 					  struct inode *inode,
 					  unsigned type_flags)
@@ -309,6 +347,7 @@ static void dentry_free(struct dentry *dentry)
 			call_rcu(&dentry->d_u.d_rcu, __d_free_external);
 			return;
 		}
+		dentry->d_name.name = dentry->d_iname;
 	}
 	/* if dentry was never visible to RCU, immediate free is OK */
 	if (!(dentry->d_flags & DCACHE_RCUACCESS))
@@ -459,6 +498,120 @@ static void dentry_lru_add(struct dentry *dentry)
 		d_lru_add(dentry);
 }
 
+#if CONFIG_L1_DCACHE_BITS > 0
+/*
+ * In front of the main dcache, we have an inclusive "L1 dcache".
+ * This is a simple direct-mapped hash table.
+ *
+ * The L1 dcache is *written* to locklessly, which is fragile as heck.
+ * The main point is that write/write races don't matter.  As long
+ * as the reader gets *some* valid dentry pointer, it doesn't much
+ * matter which.
+ *
+ * The nasty cases occur when we add an entry to the L1 cache as it's
+ * being deallocated.  There's no way to atomically check for validity
+ * and store, so we have to notice the problem after storing the entry and
+ * undo the damage.  RCU helps a lot here, but doesn't solve everything.
+ */
+
+/*
+ * This has a NULL parent and zero length, and will thus
+ * never match anything. But it means that the dcache_l1
+ * array never contains NULL, so you don't need to check.
+ */
+static struct dentry invalid_dentry;
+
+#define L1_DCACHE_SIZE (1u << CONFIG_L1_DCACHE_BITS)
+
+static struct dentry *dcache_l1[L1_DCACHE_SIZE] = {
+	[0 ... L1_DCACHE_SIZE-1] = &invalid_dentry
+};
+
+static unsigned int dentry_l1_index(unsigned int hash, const struct dentry *dir)
+{
+	hash += (unsigned long) dir / L1_CACHE_BYTES;
+	hash = hash + (hash >> CONFIG_L1_DCACHE_BITS);
+	return hash & (L1_DCACHE_SIZE-1);
+}
+
+static struct dentry *d_lookup_l1(unsigned int hash, const struct dentry *dir)
+{
+	unsigned int n = dentry_l1_index(hash, dir);
+
+	return READ_ONCE(dcache_l1[n]);
+}
+
+static void d_remove_from_l1(const struct dentry *dentry)
+{
+	unsigned int n = dentry_l1_index(dentry->d_name.hash, dentry->d_parent);
+
+	WRITE_ONCE(dcache_l1[n], &invalid_dentry);
+}
+
+/*
+ * Add a dentry to L1, and take it right back out again if it was changed
+ * in the meantime.  There are two races we have to worry about:
+ * 1) A race with __d_drop (but not __d_kill):
+ *    An invalid entry just causes an L1 miss, but we have to be sure that
+ *    *someone* removes the dentry from L1.  __d_drop removes it *after*
+ *    bumping the seqno, so if we don't see sequence jump, it's safe to
+ *    leave it in L1.
+ * 2) A race with __dentry_kill.  This is a *very* narrow race, and probably
+ *    impossible to hit without preemption, but it exists.
+ *
+ *    Suppose that just as we (on CPU #1) entered this function, CPU
+ *    #2 performed __dentry_kill.  I.e. after we checked the dentry
+ *    for validity but before the actual store.  We noticed the problem
+ *    *after* the store and will pull the entry out, but the following
+ *    events happen in the following order:
+ *    a) CPU #2 schedules reclamation with call_rcu(__d_free).
+ *    b) CPU #3 enters an RCU read-side critical section and fetches
+ *       the pointer from the L1 array.
+ *    c) Then CPU #1 invalidates the L1 cache entry.
+ *    d) Then CPU #1 goes RCU-idle (e.g. returns to user space).
+ *    e) This ends an RCU grace period and triggers reclamation.
+ *    f) Then CPU #3 tries to read from the reclaimed dentry.  Oops!
+ *    The problem is that RCU depends on knowing that, after call_rcu(),
+ *    only CPUs already in read-side critical sections can possibly
+ *    access the object.  By publishing the pointer in dcache_l1[], we
+ *    broke that guarantee.
+ *
+ *    The solution is to keep the dentry alive for a little bit longer.
+ */
+static bool d_add_to_l1(struct dentry *dentry, unsigned int hash,
+	const struct dentry *parent, unsigned int seq)
+{
+	unsigned int n = dentry_l1_index(hash, parent);
+	bool b;
+
+	WRITE_ONCE(dcache_l1[n], dentry);
+	smp_mb();	/* Pairs with dentry_rcuwalk_invalidate in __d_drop */
+	b = __read_seqcount_retry(&dentry->d_seq, seq);
+	if (unlikely(b)) {
+		WRITE_ONCE(dcache_l1[n], &invalid_dentry);
+		smp_mb();	/* Pairs with lock/unlock(dentry->d_lock) */
+		if (dentry->d_flags & DCACHE_DENTRY_KILLED)
+			dentry_delay_free(dentry);
+	}
+	return !b;
+}
+#else
+static inline void d_remove_from_l1(const struct dentry *dentry)
+{
+	(void)dentry;
+}
+
+static inline bool d_add_to_l1(struct dentry *dentry, unsigned int hash,
+	const struct dentry *parent, unsigned int seq)
+{
+	(void)dentry;
+	(void)hash;
+	(void)parent;
+	(void)seq;
+	return true;
+}
+#endif
+
 /**
  * d_drop - drop a dentry
  * @dentry: dentry to drop
@@ -493,6 +646,7 @@ void __d_drop(struct dentry *dentry)
 		dentry->d_hash.pprev = NULL;
 		hlist_bl_unlock(b);
 		dentry_rcuwalk_invalidate(dentry);
+		d_remove_from_l1(dentry);
 	}
 }
 EXPORT_SYMBOL(__d_drop);
@@ -2058,7 +2212,7 @@ static noinline enum slow_d_compare slow_dentry_cmp(
 }
 
 /**
- * __d_lookup_rcu - search for a dentry (racy, store-free)
+ * __d_lookup_rcu_slow - search for a dentry (racy, store-free)
  * @parent: parent dentry
  * @name: qstr of name we wish to find
  * @seqp: returns d_seq value at the point where the dentry was found
@@ -2086,7 +2240,7 @@ static noinline enum slow_d_compare slow_dentry_cmp(
  * NOTE! The caller *has* to check the resulting dentry against the sequence
  * number we've returned before using any of the resulting dentry state!
  */
-struct dentry *__d_lookup_rcu(const struct dentry *parent,
+static noinline struct dentry *__d_lookup_rcu_slow(const struct dentry *parent,
 				const struct qstr *name,
 				unsigned *seqp)
 {
@@ -2157,12 +2311,57 @@ seqretry:
 		if (dentry->d_name.hash_len != hashlen)
 			continue;
 		*seqp = seq;
-		if (!dentry_cmp(dentry, str, hashlen_len(hashlen)))
-			return dentry;
+		if (dentry_cmp(dentry, str, hashlen_len(hashlen)))
+			continue;
+		if (!d_add_to_l1(dentry, hashlen_hash(hashlen), parent, seq))
+			goto seqretry;
+		return dentry;
 	}
 	return NULL;
 }
 
+#if CONFIG_L1_DCACHE_BITS > 0
+/*
+ * Fast non-chained L1 hash lookup.
+ *
+ * This queries a fixed-size non-chained "level 1" hash table
+ * before falling back to the main dcache.
+ *
+ * The lookup is simplified by not putting anything in the hash
+ * table that would complicate the fast path.  In particular:
+ * - We don't need to worry about NULL pointers because the cache
+ *   is filled with dummy sentinel dentries that match nothing.
+ * - We don't need to worry about DCACHE_OP_COMPARE, because such
+ *   dentries are never added to the L1 cache.
+ */
+struct dentry *__d_lookup_rcu(const struct dentry *parent,
+				const struct qstr *name,
+				unsigned int *seqp)
+{
+	u64 hashlen = name->hash_len;
+	struct dentry *dentry = d_lookup_l1(hashlen_hash(hashlen), parent);
+	unsigned int seq = raw_seqcount_begin(&dentry->d_seq);
+
+	do {
+		if (unlikely(dentry->d_parent != parent))
+			break;
+		if (unlikely(dentry->d_name.hash_len != hashlen))
+			break;
+		*seqp = seq;
+		if (unlikely(dentry_cmp(dentry, name->name, hashlen_len(hashlen))))
+			break;
+		if (unlikely(d_unhashed(dentry)))
+			break;
+		return dentry;
+	} while (0);
+	return __d_lookup_rcu_slow(parent, name, seqp);
+}
+#else
+__alias(__d_lookup_rcu_slow) struct dentry *__d_lookup_rcu(
+	const struct dentry *parent,
+	const struct qstr *name, unsigned int *seqp);
+#endif
+
 /**
  * d_lookup - search for a dentry
  * @parent: parent dentry
-- 
2.8.1


* Re: [PATCH] vfs: add simple direct-mapped dcache lookup front-end
  2016-06-11 23:51 [PATCH] vfs: add simple direct-mapped dcache lookup front-end George Spelvin
@ 2016-06-12  1:16 ` Randy Dunlap
  2016-06-13  3:20 ` [PATCH v2] " George Spelvin
  1 sibling, 0 replies; 3+ messages in thread
From: Randy Dunlap @ 2016-06-12  1:16 UTC (permalink / raw)
  To: George Spelvin, linux-kernel; +Cc: mszeredi, npiggin, paulmck, torvalds, viro

Hi,
Just a small typo in Kconfig below..

On 06/11/16 16:51, George Spelvin wrote:
> 
>  fs/Kconfig  |  28 ++++++++
>  fs/dcache.c | 207 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 231 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index b8fcb416..3b111b77 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -8,6 +8,34 @@ menu "File systems"
>  config DCACHE_WORD_ACCESS
>         bool
>  
> +config L1_DCACHE_BITS
> +	int "Dcache level-1 cache size (bits)"
> +	range 0 20
> +	default 0 if !EXPERT
> +	default 0 if NUMA
> +	default 10 if BASE_SMALL
> +	default 13
> +	help
> +	  The Linus kernel maintains a large cache of "dentries"

	      Linux

> +	  (directory entries) for the performance-critical task of
> +	  converting file names to inodes.  This option enables a smaller
> +	  direct-mapped "level-1 cache" in front of the main dcache.
> +
> +	  (This software "dcache" is quite different from the CPU's data
> +	  cache, or "D-cache".	Sorry for the confusingly similar names.)
> +
> +	  This option specifies the size of this cache, as a power of 2.
> +	  For example, 13 means 2^13 = 8192 entries in the L1 dcache.
> +	  Specify 0 to turn off the L1 dcache entirely.
> +
> +	  The cost of enabling this is one pointer per entry, plus a
> +	  small amount of code.
> +
> +	  This is an experimental feature which hopes to speed up
> +	  single-socket machines.  On larger systems, the extra updates
> +	  generated by the L1 dcache probably cause too much cache-line
> +	  bouncing to be worth it.
> +
>  if BLOCK
>  
>  source "fs/ext2/Kconfig"


-- 
~Randy


* [PATCH v2] vfs: add simple direct-mapped dcache lookup front-end
  2016-06-11 23:51 [PATCH] vfs: add simple direct-mapped dcache lookup front-end George Spelvin
  2016-06-12  1:16 ` Randy Dunlap
@ 2016-06-13  3:20 ` George Spelvin
  1 sibling, 0 replies; 3+ messages in thread
From: George Spelvin @ 2016-06-13  3:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux, mszeredi, npiggin, paulmck, rdunlap, torvalds, viro

This is an old patch by Linus, which he asked me to fix the race
conditions in.  It is posted for comment on the RCU abuse (search for
"Evil RCU Hack") and on performance in general.

[Linus speaking, Thu May 31, 2012]

I've pushed __d_lookup_rcu() just about as far as I could, and it still
had some problems.

The problems were mainly due to:

- the complexity of the slow-case handling causes register spills,
- the hash chain lookup loop causes not only register pressure, but
  also the extra magic "mask off lock bit from the hash chain head
  pointer" etc. logic, and
- the hash list needs to be dynamically sized (we want *big* caches, but
  you can't use the same size for big and small machines), which causes
  the initial hash lookup itself to be more complex.

This looks like a viable solution to all three problems, and it is
actually surprisingly simple: make a trivial fixed-size direct-mapped L1
dentry cache.  No chains, no locking, no nothing.

This gives measurable improvement on my microbenchmark, and gets good
hit-rates on both kernel compiles and even on something like "updatedb",
which I'd have expected to be one of the worst possible cases.
Apparently updatedb still ends up looking up the same files (/etc/fstab
etc) a lot.  So those good hit-rates seem to often be due to really
stupid programming, but hey, I think we all agree that "stupid
programming" is likely the common case that we generally do need to also
optimize for ;)

For my kernel compile benchmark ("make -j" on a fully built tree), the
profile shows (this is kernel-only profile, so user space overhead
removed):

8.19%  [k] link_path_walk
7.74%  [k] __d_lookup_rcu
5.66%  [k] selinux_inode_permission
3.73%  [k] do_lookup
2.86%  [k] path_lookupat
2.72%  [k] avc_has_perm_noaudit
2.71%  [k] inode_has_perm.isra.49.constprop.71
2.68%  [k] avc_lookup
2.51%  [k] generic_permission
...
0.78%  [k] __d_lookup_rcu_slow
...

where "__d_lookup_rcu_slow()" is the exact same old __d_lookup_rcu(), so
it's not really "slow", but it's quite noticeably slower than the new
streamlined __d_lookup_rcu().  And as you can tell, that means that we
must have a 90%+ hitrate in the new L1 dcache lookup, since we only see
10% as much time in the slow routine as in the L1 front-end.

[George Spelvin speaking]

I fixed the race conditions in Linus's code, added Kconfig support,
and a number of comments.

I have two concerns about the performance of this code:

1) Since it was first written, RCU dcache lookup has gone completely
   lockless and is even faster.
2) By adding a shared data structure that is written randomly by multiple
   CPUs, this patch undoes a lot of that optimization.  Even if it's a
   win on a single-socket system, it may be a net loss on a larger one.

Thanks to Fengguang Wu's 0day build bot for catching a race that
triggered a WARN_ON in v1 of this patch.  dentry_delay_free is now
much more careful about what it writes over.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
---
The 0day build bot caught a race condition I added: dentry_delay_free
was too eager to set the callback pointer, tripping the
WARN_ON(!hlist_unhashed) in dentry_free().  Updated to be more
conservative, with analysis.

That change actually reduced the overhead slightly.  v1 had to make
dentry_free() adjust d_name.name in unusual cases so that the delayed
callback could tell which final callback to invoke.

v2 doesn't need that; it inspects the function pointer instead.

 fs/Kconfig  |  28 ++++++++
 fs/dcache.c | 226 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 250 insertions(+), 4 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index b8fcb416..f0c8ed44 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -8,6 +8,34 @@ menu "File systems"
 config DCACHE_WORD_ACCESS
        bool
 
+config L1_DCACHE_BITS
+	int "Dcache level-1 cache size (bits)"
+	range 0 20
+	default 0 if !EXPERT
+	default 0 if NUMA
+	default 10 if BASE_SMALL
+	default 13
+	help
+	  The Linux kernel maintains a large cache of "dentries"
+	  (directory entries) for the performance-critical task of
+	  converting file names to inodes.  This option enables a smaller
+	  direct-mapped "level-1 cache" in front of the main dcache.
+
+	  (This software "dcache" is quite different from the CPU's data
+	  cache, or "D-cache".	Sorry for the confusingly similar names.)
+
+	  This option specifies the size of this cache, as a power of 2.
+	  For example, 13 means 2^13 = 8192 entries in the L1 dcache.
+	  Specify 0 to turn off the L1 dcache entirely.
+
+	  The cost of enabling this is one pointer per entry, plus a
+	  small amount of code.
+
+	  This is an experimental feature which hopes to speed up
+	  single-socket machines.  On larger systems, the extra updates
+	  generated by the L1 dcache probably cause too much cache-line
+	  bouncing to be worth it.
+
 if BLOCK
 
 source "fs/ext2/Kconfig"
diff --git a/fs/dcache.c b/fs/dcache.c
index 35c989f2..b0451659 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -278,6 +278,64 @@ static inline int dname_external(const struct dentry *dentry)
 	return dentry->d_name.name != dentry->d_iname;
 }
 
+#if CONFIG_L1_DCACHE_BITS > 0
+/*
+ * Reschedule RCU reclamation for the *next* grace period.
+ * That's a longer wait than actually necessary, but this is so rare
+ * that optimizing it is not important.
+ */
+static void __d_free_delayed(struct rcu_head *head)
+{
+	call_rcu(head, __d_free);
+}
+
+static void __d_free_external_delayed(struct rcu_head *head)
+{
+	call_rcu(head, __d_free_external);
+}
+
+/*
+ * This is an Evil RCU Hack.  A rare race condition leaks an RCU-protected
+ * pointer to a freed dentry to another CPU.  An even rarer race results
+ * in the first CPU exiting an RCU read locked region and triggering
+ * the end of the grace period while the second CPU is still using the
+ * pointer.  So if we detect the first race happening, we call this to
+ * defer the RCU reclamation for a second grace period.
+ *
+ * The evil part: we can't remove it from the singly-linked RCU callback
+ * list, but we can safely change the callback function out from under
+ * the RCU code.  It's safe because RCU provides memory barriers between
+ * this CPU going idle and the CPU calling the callbacks, which means
+ * before reading the callback function pointers.
+ *
+ * The other evil part: We have to be sure that it's being used as an
+ * RCU callback pointer.  It might still be d_u.d_alias.pprev.
+ * To solve this, read the pointer and see if it points to a function.
+ *
+ * This is safe because our caller and __call_rcu both have memory
+ * barriers which pair as follows:
+ *
+ *	d_add_to_l1():			__call_rcu():
+ *	dcache_l1[n] = &invalid_dentry	head->func = func
+ *	smp_mb()			smp_mb();
+ *	dentry_delay_free()		Add head to RCU_NEXT_TAIL
+ *
+ * The barriers pair in both directions.  If the head has been queued
+ * inside RCU, then we are guaranteed to see the func set.  If we don't
+ * see func set, then L1 invalidation has been seen by RCU readers and
+ * there's no need to delay the reclamation.
+ */
+static void dentry_delay_free(struct rcu_head *head)
+{
+	void (*func)(struct rcu_head *) = READ_ONCE(head->func);
+
+	if (func == __d_free)
+		WRITE_ONCE(head->func, __d_free_delayed);
+	else if (func == __d_free_external)
+		WRITE_ONCE(head->func, __d_free_external_delayed);
+}
+#endif
+
 static inline void __d_set_inode_and_type(struct dentry *dentry,
 					  struct inode *inode,
 					  unsigned type_flags)
@@ -459,6 +517,120 @@ static void dentry_lru_add(struct dentry *dentry)
 		d_lru_add(dentry);
 }
 
+#if CONFIG_L1_DCACHE_BITS > 0
+/*
+ * In front of the main dcache, we have an inclusive "L1 dcache".
+ * This is a simple direct-mapped hash table.
+ *
+ * The L1 dcache is *written* to locklessly, which makes for extremely
+ * delicate synchronization.  The main point is that write/write races
+ * don't matter.  As long as the reader gets *some* valid dentry pointer,
+ * it doesn't much matter which.
+ *
+ * The nasty cases occur when we add an entry to the L1 cache as it's
+ * being deallocated.  There's no way to atomically check for validity
+ * and store, so we have to notice the problem after storing the entry and
+ * undo the damage.  RCU helps a lot here, but doesn't solve everything.
+ */
+
+/*
+ * This has a NULL parent and zero length, and will thus
+ * never match anything. But it means that the dcache_l1
+ * array never contains NULL, so you don't need to check.
+ */
+static struct dentry invalid_dentry;
+
+#define L1_DCACHE_SIZE (1u << CONFIG_L1_DCACHE_BITS)
+
+static struct dentry *dcache_l1[L1_DCACHE_SIZE] = {
+	[0 ... L1_DCACHE_SIZE-1] = &invalid_dentry
+};
+
+static unsigned int dentry_l1_index(unsigned int hash, const struct dentry *dir)
+{
+	hash += (unsigned long) dir / L1_CACHE_BYTES;
+	hash = hash + (hash >> CONFIG_L1_DCACHE_BITS);
+	return hash & (L1_DCACHE_SIZE-1);
+}
+
+static struct dentry *d_lookup_l1(unsigned int hash, const struct dentry *dir)
+{
+	unsigned int n = dentry_l1_index(hash, dir);
+
+	return READ_ONCE(dcache_l1[n]);
+}
+
+static void d_remove_from_l1(const struct dentry *dentry)
+{
+	unsigned int n = dentry_l1_index(dentry->d_name.hash, dentry->d_parent);
+
+	WRITE_ONCE(dcache_l1[n], &invalid_dentry);
+}
+
+/*
+ * Add a dentry to L1, and take it right back out again if it was changed
+ * in the meantime.  There are two races we have to worry about:
+ * 1) A race with __d_drop (but not __d_kill):
+ *    An invalid entry just causes an L1 miss, but we have to be sure that
+ *    *someone* removes the dentry from L1.  __d_drop removes it *after*
+ *    bumping the seqno, so if we don't see sequence jump, it's safe to
+ *    leave it in L1.
+ * 2) A race with __dentry_kill.  This is a *very* narrow race, and probably
+ *    impossible to hit without preemption, but it exists.
+ *
+ *    Suppose that just as we (on CPU #1) entered this function, CPU #2
+ *    performed __dentry_kill.  I.e. after we checked the dentry for
+ *    validity but before the actual store.  We noticed the problem after
+ *    the store and are about to pull the entry out, but the following
+ *    events happen in the following order:
+ *    a) CPU #2 schedules reclamation with call_rcu(__d_free).
+ *    b) CPU #3 enters an RCU read-side critical section and fetches
+ *       the pointer from the L1 array.
+ *    c) Then CPU #1 invalidates the L1 cache entry.
+ *    d) Then CPU #1 goes RCU-idle (e.g. returns to user space).
+ *    e) This ends an RCU grace period and triggers reclamation.
+ *    f) Then CPU #3 tries to read from the reclaimed dentry.  Oops!
+ *    The problem is that RCU depends on knowing that, after call_rcu(),
+ *    only CPUs already in read-side critical sections can possibly
+ *    access the object, so CPU #3 can't see it.  By publishing the
+ *    pointer in dcache_l1[], we broke that guarantee.
+ *
+ *    The solution is to delay reclamation a little bit more.
+ */
+static bool d_add_to_l1(struct dentry *dentry, unsigned int hash,
+	const struct dentry *parent, unsigned int seq)
+{
+	unsigned int n = dentry_l1_index(hash, parent);
+	bool b;
+
+	WRITE_ONCE(dcache_l1[n], dentry);
+	smp_mb();	/* Pairs with dentry_rcuwalk_invalidate in __d_drop */
+	b = __read_seqcount_retry(&dentry->d_seq, seq);
+	if (unlikely(b)) {
+		WRITE_ONCE(dcache_l1[n], &invalid_dentry);
+		smp_mb();	/* Pairs with lock/unlock(dentry->d_lock) */
+		if (dentry->d_flags & DCACHE_DENTRY_KILLED)
+			dentry_delay_free(&dentry->d_u.d_rcu);
+	}
+	return !b;
+}
+#else
+static inline void d_remove_from_l1(const struct dentry *dentry)
+{
+	(void)dentry;
+}
+
+static inline bool d_add_to_l1(struct dentry *dentry, unsigned int hash,
+	const struct dentry *parent, unsigned int seq)
+{
+	(void)dentry;
+	(void)hash;
+	(void)parent;
+	(void)seq;
+	return true;
+}
+#endif
+
 /**
  * d_drop - drop a dentry
  * @dentry: dentry to drop
@@ -493,6 +665,7 @@ void __d_drop(struct dentry *dentry)
 		dentry->d_hash.pprev = NULL;
 		hlist_bl_unlock(b);
 		dentry_rcuwalk_invalidate(dentry);
+		d_remove_from_l1(dentry);
 	}
 }
 EXPORT_SYMBOL(__d_drop);
@@ -2058,7 +2231,7 @@ static noinline enum slow_d_compare slow_dentry_cmp(
 }
 
 /**
- * __d_lookup_rcu - search for a dentry (racy, store-free)
+ * __d_lookup_rcu_slow - search for a dentry (racy, store-free)
  * @parent: parent dentry
  * @name: qstr of name we wish to find
  * @seqp: returns d_seq value at the point where the dentry was found
@@ -2086,7 +2259,7 @@ static noinline enum slow_d_compare slow_dentry_cmp(
  * NOTE! The caller *has* to check the resulting dentry against the sequence
  * number we've returned before using any of the resulting dentry state!
  */
-struct dentry *__d_lookup_rcu(const struct dentry *parent,
+static noinline struct dentry *__d_lookup_rcu_slow(const struct dentry *parent,
 				const struct qstr *name,
 				unsigned *seqp)
 {
@@ -2157,12 +2330,57 @@ seqretry:
 		if (dentry->d_name.hash_len != hashlen)
 			continue;
 		*seqp = seq;
-		if (!dentry_cmp(dentry, str, hashlen_len(hashlen)))
-			return dentry;
+		if (dentry_cmp(dentry, str, hashlen_len(hashlen)))
+			continue;
+		if (!d_add_to_l1(dentry, hashlen_hash(hashlen), parent, seq))
+			goto seqretry;
+		return dentry;
 	}
 	return NULL;
 }
 
+#if CONFIG_L1_DCACHE_BITS > 0
+/*
+ * Fast non-chained L1 hash lookup.
+ *
+ * This queries a fixed-size non-chained "level 1" hash table
+ * before falling back to the main dcache.
+ *
+ * The lookup is simplified by not putting anything in the hash
+ * table that would complicate the fast path.  In particular:
+ * - We don't need to worry about NULL pointers because the cache
+ *   is filled with dummy sentinel dentries that match nothing.
+ * - We don't need to worry about DCACHE_OP_COMPARE, because such
+ *   dentries are never added to the L1 cache.
+ */
+struct dentry *__d_lookup_rcu(const struct dentry *parent,
+				const struct qstr *name,
+				unsigned int *seqp)
+{
+	u64 hashlen = name->hash_len;
+	struct dentry *dentry = d_lookup_l1(hashlen_hash(hashlen), parent);
+	unsigned int seq = raw_seqcount_begin(&dentry->d_seq);
+
+	do {
+		if (unlikely(dentry->d_parent != parent))
+			break;
+		if (unlikely(dentry->d_name.hash_len != hashlen))
+			break;
+		*seqp = seq;
+		if (unlikely(dentry_cmp(dentry, name->name, hashlen_len(hashlen))))
+			break;
+		if (unlikely(d_unhashed(dentry)))
+			break;
+		return dentry;
+	} while (0);
+	return __d_lookup_rcu_slow(parent, name, seqp);
+}
+#else
+__alias(__d_lookup_rcu_slow) struct dentry *__d_lookup_rcu(
+	const struct dentry *parent,
+	const struct qstr *name, unsigned int *seqp);
+#endif
+
 /**
  * d_lookup - search for a dentry
  * @parent: parent dentry
-- 
2.8.1

