linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/1] Rosebush, a new hash table
@ 2024-02-22 20:37 Matthew Wilcox (Oracle)
  2024-02-22 20:37 ` [PATCH 1/1] rosebush: Add new data structure Matthew Wilcox (Oracle)
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-22 20:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Matthew Wilcox (Oracle),
	Thomas Graf, Herbert Xu, netdev, linux-fsdevel, maple-tree, rcu

Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table.
I've written a load of documentation about how it works, mostly in
Documentation/core-api/rosebush.rst but some is dotted through the
rosebush.c file too.

You can see this code as a further exploration of the "Linked lists are
evil" design space.  For the workloads which a hashtable is suited to,
it has lower overhead than either the maple tree or the rhashtable.
It cannot support ranges (so is not a replacement for the maple tree),
but it does have per-bucket locks so is more scalable for write-heavy
workloads.  I suspect one could reimplement the rhashtable API on top
of the rosebush, but I am not interested in doing that work myself.

The per-object overhead is 12 bytes, as opposed to 16 bytes for the
rhashtable and 32 bytes for the maple tree.  The constant overhead is also
small, being just 16 bytes for the struct rosebush.  The exact amount
of memory consumed for a given number of objects is going to depend on
the distribution of hashes; here are some estimated consumptions for
power-of-ten entries distributed evenly over a 32-bit hash space in the
various data structures:

number	xarray	maple	rhash	rosebush
1	3472	272	280	256
10	32272	784	424	256
100	262kB	3600	1864	2080
1000	[1]	34576	17224	16432
10k	[1]	343k	168392	131344
100k	[1]	3.4M	1731272	2101264

As you can see, rosebush and rhashtable are close the whole way.
Rosebush moves in larger chunks because it doubles each time; there's
no actual need to double the bucket size, but that works well with
the slab allocator's existing slabs.  As noted in the documentation,
we could create our own slabs and get closer to the 12 bytes per object
minimum consumption. [2]
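
For concreteness, the 12-byte figure and the chunky growth both fall out
of the bucket layout in the patch: on 64-bit (ignoring lock debugging),
each entry is a 4-byte hash plus an 8-byte slot pointer after an 8-byte
bucket header, and a bucket grows to the next kmalloc slab size.  A
minimal userspace sketch of that arithmetic, mirroring the
ENTRY_SIZE/BUCKET_SIZE macros in lib/rosebush.c (not the kernel code
itself):

#include <stdio.h>

/* 64-bit layout of struct rbh_bucket: 8-byte header (spinlock,
 * contention counter, entry count), then nr hashes and nr slots.
 */
#define ENTRY_SIZE	(sizeof(unsigned int) + sizeof(void *))
#define BUCKET_SIZE(nr)	(8 + ENTRY_SIZE * (nr))

int main(void)
{
	printf("per entry: %zu bytes\n", ENTRY_SIZE);		/* 12 */
	printf("10 slots:  %zu bytes\n", BUCKET_SIZE(10));	/* 128, kmalloc-128 */
	printf("340 slots: %zu bytes\n", BUCKET_SIZE(340));	/* 4088, kmalloc-4096 */
	return 0;
}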

Where I expect rosebush to shine is on dependent cache misses.
I've assumed an average chain length of 10 for rhashtable in the above
memory calculations.  That means on average a lookup would take five cache
misses that can't be speculated.  Rosebush does a linear walk of 4-byte
hashes looking for matches, so the CPU can usefully speculate the entire
array of hash values (indeed, we tell it exactly how many we're going to
look at) and then take a single cache miss fetching the correct pointer.
Add that to the cache miss to fetch the bucket and that's just two cache
misses rather than five.
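
To make that access pattern concrete, here is a userspace sketch of the
lookup shape (the names are illustrative, not the kernel API): the hashes
sit in one flat array the CPU can scan speculatively, and only the
matching slot costs a dependent load.

#include <stdint.h>
#include <stddef.h>

/* Toy fixed-size bucket; the real struct rbh_bucket sizes its arrays
 * dynamically.  Scanning hashes[] is easy to speculate; slots[i] is
 * the single dependent cache miss.
 */
struct toy_bucket {
	uint16_t entries;
	uint32_t hashes[16];
	void *slots[16];
};

void *toy_lookup(const struct toy_bucket *b, uint32_t hash)
{
	for (uint16_t i = 0; i < b->entries; i++) {
		if (b->hashes[i] != hash)
			continue;
		/* caller must still check the object really matches */
		return b->slots[i];
	}
	return NULL;
}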

I have not yet converted any code to use the rosebush.  The API is
designed for use by the dcache, and I expect it will evolve as it actually
gets used.  I think there's probably some more refactoring to be done.
I am not aware of any bugs, but the test suite is pitifully small.
The current algorithm grows the buckets more aggressively than the table;
that's probably exactly the wrong thing to do for good performance.

This version is full of debugging printks.  You should probably take
them out if you're going to try to benchmark it.  The regex '^print'
should find them all.  Contributions welcome; I really want to get back
to working on folios, but this felt like an urgent problem to be fixed.

[1] I stopped trying to estimate the memory costs for xarray; I couldn't
be bothered to as it's not a serious competitor for this use case.

[2] We have ideas for improving the maple tree memory consumption for
this kind of workload; a new node type for pivots that fit in 4 bytes and
sparse nodes to avoid putting a NULL entry after each occupied entry.
The maple tree really is optimised for densely packed ranges at the
moment.

Matthew Wilcox (Oracle) (1):
  rosebush: Add new data structure

 Documentation/core-api/index.rst    |   1 +
 Documentation/core-api/rosebush.rst | 135 ++++++
 MAINTAINERS                         |   8 +
 include/linux/rosebush.h            |  41 ++
 lib/Kconfig.debug                   |   3 +
 lib/Makefile                        |   3 +-
 lib/rosebush.c                      | 707 ++++++++++++++++++++++++++++
 lib/test_rosebush.c                 | 135 ++++++
 8 files changed, 1032 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/core-api/rosebush.rst
 create mode 100644 include/linux/rosebush.h
 create mode 100644 lib/rosebush.c
 create mode 100644 lib/test_rosebush.c

-- 
2.43.0



* [PATCH 1/1] rosebush: Add new data structure
  2024-02-22 20:37 [PATCH 0/1] Rosebush, a new hash table Matthew Wilcox (Oracle)
@ 2024-02-22 20:37 ` Matthew Wilcox (Oracle)
  2024-02-25  6:38   ` Al Viro
  2024-02-23 11:37 ` [PATCH 0/1] Rosebush, a new hash table Peng Zhang
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox (Oracle) @ 2024-02-22 20:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Matthew Wilcox (Oracle),
	Thomas Graf, Herbert Xu, netdev, linux-fsdevel, maple-tree, rcu

Rosebush is a resizing hash table.  See
Documentation/core-api/rosebush.rst for details.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 Documentation/core-api/index.rst    |   1 +
 Documentation/core-api/rosebush.rst | 135 ++++++
 MAINTAINERS                         |   8 +
 include/linux/rosebush.h            |  41 ++
 lib/Kconfig.debug                   |   3 +
 lib/Makefile                        |   3 +-
 lib/rosebush.c                      | 707 ++++++++++++++++++++++++++++
 lib/test_rosebush.c                 | 135 ++++++
 8 files changed, 1032 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/core-api/rosebush.rst
 create mode 100644 include/linux/rosebush.h
 create mode 100644 lib/rosebush.c
 create mode 100644 lib/test_rosebush.c

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 7a3a08d81f11..380dfb30b073 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -36,6 +36,7 @@ Library functionality that is used throughout the kernel.
    kobject
    kref
    assoc_array
+   rosebush
    xarray
    maple_tree
    idr
diff --git a/Documentation/core-api/rosebush.rst b/Documentation/core-api/rosebush.rst
new file mode 100644
index 000000000000..ed17b2572f06
--- /dev/null
+++ b/Documentation/core-api/rosebush.rst
@@ -0,0 +1,135 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+========
+Rosebush
+========
+
+:Author: Matthew Wilcox
+
+Overview
+========
+
+Rosebush is a hashtable, different from the rhashtable.  It is scalable
+(one spinlock per bucket), resizing in two dimensions (number and size
+of buckets), and concurrent (can be iterated under the RCU read lock).
+It is designed to minimise dependent cache misses, which can stall a
+modern CPU for thousands of instructions.
+
+Objects stored in a rosebush do not have an embedded linked list.
+They can therefore be placed into the same rosebush multiple times and
+be placed in multiple rosebushes.  It is also possible to store pointers
+which have special meaning like ERR_PTR().  It is not possible to store
+a NULL pointer in a rosebush, as this is the return value that indicates
+the iteration has finished.
+
+The user of the rosebush is responsible for calculating their own hash.
+A high quality hash is desirable to keep the scalable properties of
+the rosebush, but a hash with poor distribution in the lower bits will
+not lead to a catastrophic breakdown.  It may lead to excessive memory
+consumption and a lot of CPU time spent during lookup.
+
+Rosebush is not yet IRQ or BH safe.  It can be iterated in interrupt
+context, but not modified.
+
+RCU Iteration
+-------------
+
+There is no rosebush_lookup() function.  This is because the hash value
+may not be unique.  Instead, the user should iterate the rosebush,
+which will return pointers which probably have matching hash values.
+It is the user's responsibility to determine if the returned pointer is
+one they are interested in.
+
+Rosebush iteration guarantees to return all pointers which have a
+matching hash, were in the rosebush before the iteration started and
+remain in the rosebush after iteration ends.  It may return additional
+pointers, including pointers which do not have a matching hash value,
+but it guarantees not to skip any pointers, and it guarantees to only
+return pointers which have (at some point) been stored in the rosebush.
+
+If the rosebush is modified while the iteration is in progress, newly
+added entries may or may not be returned and removed entries may or may
+not be returned.  Causality is not honoured; e.g. if Entry A is known
+to be inserted before Entry B, it is possible for an iteration to return
+Entry B and not Entry A.
+
+Functions and structures
+========================
+
+.. kernel-doc:: include/linux/rosebush.h
+.. kernel-doc:: lib/rosebush.c
+
+Internals
+=========
+
+The rosebush is organised into a top level table which contains pointers
+to buckets.  Each bucket contains a spinlock (for modifications to the
+bucket), the number of entries in the bucket and a contention counter.
+
+The top level table is a power of two in size.  The bottom M bits of
+the hash are used to index into this table, which contains a bucket.
+The bucket contains W hash values followed by W pointers.  We also track
+a contention count, which lets us know if this spinlock is overloaded
+and we should split the bucket to improve scalability.
+
+If the bucket is full, we can increase the size of the bucket.  Currently
+we double the size of the bucket because the slab allocator has slabs for
+powers of two.  We could use any size bucket, or even create slab caches
+for our own buckets (eg a 584 byte bucket would give us 48 entries and
+we'd get 7 buckets per page).  The larger the bucket is, the longer it
+will take to walk the bucket, for modifications and lookup, so there's
+a reasonable limit on the size of an individual bucket.
+
+When we decide that the table needs to be resized, we allocate a new
+table, and duplicate the current contents of the table into each half
+of the new table.  At this point, all buckets in the table are shared.
+A bucket may be shared between multiple table entries.  For simplicity,
+we require that all buckets are shared between a power-of-two number
+of slots.  For example, a table with 8 entries might have entries that
+point to buckets 0, 1, 0, 1, 0, 2, 0, 3.  If we were to then split bucket
+0, we would have to replace either the first or last pair of pointers
+with pointers to bucket 4.  This is akin to the buddy page allocator.
+
+We need to decide when to unshare a bucket.  The most eager option would
+be to do it at table resize, but this seems like a high penalty
+for the unlucky caller who causes a table resize.  Slightly more lazy is
+when we're about to insert into a bucket and discover that it is shared.
+Lazier again would be when we discover that a bucket has no free entries
+in it.  Laziest of all would be to only unshare a bucket when it has been
+grown to the maximum size possible.  All these options have consequences
+for lookup, insertion & deletion time.
+
+In all but the first case, we also need to decide how far to unshare
+the bucket.  For example, if a bucket is currently shared between eight
+slots, do we allocate two new buckets (shared between four slots each),
+three new buckets (one for four slots, two for two slots) or four new
+buckets (one for four slots, one for two slots, and two unshared buckets)?
+Unless the hash is of poor quality, or one particular bucket is highly
+contended, we are unlikely to encounter this situation.  We also need
+to decide how large a bucket to allocate when unsharing a bucket; we
+can count the number of entries that will be in it and choose a bucket
+at least x% larger.  Or we can assume that the new bucket will grow to
+at least the current size.
+
+IRQ / BH safety
+---------------
+
+If we decide to make the rosebush modifiable in IRQ context, we need
+to take the locks in an irq-safe way; we need to figure out how to
+allocate the top level table without vmalloc(), and we need to manage
+without kvfree_rcu_mightsleep().  These all have solutions, but those
+solutions have a cost that isn't worth paying until we have users.
+
+Some of those problems go away if we limit our support to removal in IRQ
+context and only allow insertions in process context (as we do not need
+to reallocate the table or bucket when removing an item).
+
+Small rosebushes
+----------------
+
+As an optimisation, if the rosebush has no entries, the buckets pointer
+is NULL.  If the rosebush has only a few entries, there are only two
+buckets (allocated as a single allocation) and the table pointer points
+directly to the first one instead of pointing to a table.  In this case,
+both buckets are very small and the rosebush is likely to allocate a
+table and separate buckets very early on.
diff --git a/MAINTAINERS b/MAINTAINERS
index 722b894f305e..d8296f0aebf2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19008,6 +19008,14 @@ F:	include/net/rose.h
 F:	include/uapi/linux/rose.h
 F:	net/rose/
 
+ROSEBUSH DATA STRUCTURE
+M:	Matthew Wilcox <willy@infradead.org>
+L:	maple-tree@lists.infradead.org
+S:	Supported
+F:	Documentation/core-api/rosebush.rst
+F:	include/linux/rosebush.h
+F:	lib/*rosebush.c
+
 ROTATION DRIVER FOR ALLWINNER A83T
 M:	Jernej Skrabec <jernej.skrabec@gmail.com>
 L:	linux-media@vger.kernel.org
diff --git a/include/linux/rosebush.h b/include/linux/rosebush.h
new file mode 100644
index 000000000000..8f96f6f9743e
--- /dev/null
+++ b/include/linux/rosebush.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/* See lib/rosebush.c */
+
+#include <linux/spinlock.h>
+
+/*
+ * Embed this struct in your struct, don't allocate it separately.
+ * None of this is for customers to use; treat it as opaque.
+ * In particular, taking the rbh_resize_lock will prevent only rbh_table
+ * from being reallocated; buckets can still be grown and split without
+ * the lock.  But you will get incomprehensible lockdep warnings!
+ */
+struct rbh {
+	spinlock_t	rbh_resize_lock;
+	unsigned long	rbh_table;	/* A tagged pointer */
+};
+
+#define DEFINE_ROSEBUSH(name)	struct rbh name = \
+	{ .rbh_resize_lock = __SPIN_LOCK_UNLOCKED(name.lock), }
+
+int rbh_insert(struct rbh *rbh, u32 hash, void *p);
+int rbh_reserve(struct rbh *rbh, u32 hash);
+int rbh_use(struct rbh *rbh, u32 hash, void *p);
+int rbh_remove(struct rbh *rbh, u32 hash, void *p);
+
+struct rbh_iter {
+	struct rbh *rbh;
+	struct rbh_bucket *bucket;
+	u32 hash;
+	unsigned int index;
+};
+
+#define RBH_ITER(name, _rbh, _hash)					\
+	struct rbh_iter name = { .rbh = _rbh, .hash = _hash, }
+
+void *rbh_next(struct rbh_iter *rbhi);
+
+void rbh_iter_remove(struct rbh_iter *rbhi, void *p);
+void rbh_iter_lock(struct rbh_iter *rbhi);
+void rbh_iter_unlock(struct rbh_iter *rbhi);
+
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 975a07f9f1cc..d78f5effa310 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2396,6 +2396,9 @@ config TEST_RHASHTABLE
 
 	  If unsure, say N.
 
+config TEST_ROSEBUSH
+	tristate "Test the Rosebush data structure"
+
 config TEST_IDA
 	tristate "Perform selftest on IDA functions"
 
diff --git a/lib/Makefile b/lib/Makefile
index 6b09731d8e61..61592aa5374d 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -28,7 +28,7 @@ CFLAGS_string.o += -fno-stack-protector
 endif
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
-	 rbtree.o radix-tree.o timerqueue.o xarray.o \
+	 rosebush.o rbtree.o radix-tree.o timerqueue.o xarray.o \
 	 maple_tree.o idr.o extable.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
@@ -75,6 +75,7 @@ obj-$(CONFIG_TEST_LIST_SORT) += test_list_sort.o
 obj-$(CONFIG_TEST_MIN_HEAP) += test_min_heap.o
 obj-$(CONFIG_TEST_LKM) += test_module.o
 obj-$(CONFIG_TEST_VMALLOC) += test_vmalloc.o
+obj-$(CONFIG_TEST_ROSEBUSH) += test_rosebush.o
 obj-$(CONFIG_TEST_RHASHTABLE) += test_rhashtable.o
 obj-$(CONFIG_TEST_SORT) += test_sort.o
 obj-$(CONFIG_TEST_USER_COPY) += test_user_copy.o
diff --git a/lib/rosebush.c b/lib/rosebush.c
new file mode 100644
index 000000000000..405ec864ef35
--- /dev/null
+++ b/lib/rosebush.c
@@ -0,0 +1,707 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Rosebush, a resizing bucket hash table
+ * Copyright (c) 2024 Oracle Corporation
+ * Author: Matthew Wilcox <willy@infradead.org>
+ */
+
+#include <linux/rosebush.h>
+#include <linux/rcupdate.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+
+/*
+ * The lock is held whenever we are modifying the contents of the bucket.
+ * The contention counter tracks whether we need to split the bucket due
+ * to contention on the spinlock.
+ * Entries tells us how many hashes (and therefore how many slots) we have.
+ * The bucket also contains the slots, but C doesn't let you declare
+ * two unsized arrays next to each other.
+ */
+struct rbh_bucket {
+	spinlock_t	rbh_lock;
+	u16		rbh_contention;
+	u16		rbh_entries;
+	u32		rbh_hashes[] __counted_by(rbh_entries);
+};
+
+struct rbh_table {
+	DECLARE_FLEX_ARRAY(struct rbh_bucket __rcu *, buckets);
+};
+
+#ifdef CONFIG_64BIT
+#define INITIAL_SLOTS	10
+/* 10 (128), 20 (248), 42 (512), 84 (1016), 170 (2048), 340 (4088) */
+#define MAX_SLOTS	340
+#else
+#define INITIAL_SLOTS	15
+/* 15 (128), 31 (256), 63 (512), 127 (1024), 255 (2048), 511 (4096) */
+#define MAX_SLOTS	511
+#endif
+
+static inline unsigned int next_bucket_size(unsigned int nr)
+{
+#ifdef CONFIG_64BIT
+	if (nr % 4 == 0)
+		return nr * 2 + 2;
+	return nr * 2;
+#else
+	return nr * 2 + 1;
+#endif
+}
+
+#define ENTRY_SIZE	(sizeof(int) + sizeof(void *))
+#define BUCKET_SIZE(nr)	(sizeof(struct rbh_bucket) + ENTRY_SIZE * (nr))
+#define INITIAL_SIZE	BUCKET_SIZE(INITIAL_SLOTS)
+
+struct rbh_initial_bucket {
+	struct rbh_bucket b;
+	struct {
+		u32		rbh_hashes[INITIAL_SLOTS];
+		void __rcu *	rbh_slots[INITIAL_SLOTS];
+	} i;
+};
+
+struct rbh_initial_table {
+	struct rbh_initial_bucket buckets[2];
+};
+
+static void __rcu **bucket_slots(const struct rbh_bucket *bucket,
+		unsigned int size)
+{
+	unsigned long p = (unsigned long)&bucket->rbh_hashes[size];
+	return (void __rcu **)((p | (sizeof(void *) - 1)) + 1);
+}
+
+/*
+ * As far as lockdep is concerned, all buckets in the same rosebush use
+ * the same lock.  We use the classes to distinguish the rbh resize lock
+ * from the bucket locks.  The only time we take two bucket locks is
+ * when we already hold the resize lock, so there is no need to define
+ * an order between bucket locks.
+ */
+#ifdef CONFIG_DEBUG_SPINLOCK
+#define bucket_lock_init(rbh, bucket)					\
+	__raw_spin_lock_init(spinlock_check(&(bucket)->rbh_lock),	\
+		"rbh", (rbh)->rbh_resize_lock.dep_map.key, LD_WAIT_SPIN)
+#else
+#define bucket_lock_init(rbh, bucket)					\
+	spin_lock_init(&(bucket)->rbh_lock)
+#endif
+
+#define rbh_resize_lock(rbh)	spin_lock_nested(&(rbh)->rbh_resize_lock, 2)
+#define rbh_resize_unlock(rbh)	spin_unlock(&(rbh)->rbh_resize_lock)
+
+enum split_state {
+	SPLIT_MAYBE,	/* Willing to try to split */
+	SPLIT_SHOULD,	/* Try to split */
+	SPLIT_DONE,	/* Already tried to split a bucket */
+};
+
+struct insert_state {
+	enum split_state split;
+	u32 mask;
+};
+
+/*
+ * A very complicated way of spelling &rbh->bucket[hash].
+ *
+ * The first complication is that we encode the number of buckets
+ * in the pointer so that we can get both in an atomic load.
+ *
+ * The second complication is that small hash tables don't have a top
+ * level table; instead the buckets pointer points to a pair of buckets
+ * of size 128 bytes.
+ *
+ * The third complication is that we reuse the first two slots of the
+ * table for RCU freeing, so we have to check that didn't happen before
+ * we return the pointer.  That does not absolve us of rechecking after
+ * we get the lock in the caller; we have to check here that we have a
+ * pointer to a bucket, and we have to check after we get the lock that
+ * this bucket is still the current bucket.
+ */
+static struct rbh_bucket *get_bucket(struct rbh *rbh, u32 hash,
+		struct insert_state *state)
+{
+	unsigned long tagged;
+	struct rbh_table *table;
+	struct rbh_initial_table *initial_table;
+	u32 mask;
+	unsigned int bid;
+
+	/* rcu_dereference(), but not a pointer */
+	tagged = READ_ONCE(rbh->rbh_table);
+	if (!tagged)
+		return NULL;
+
+	/* The lowest bits indicate how many buckets the table holds */
+	table = (struct rbh_table *)(tagged & (tagged + 1));
+	mask = tagged - (unsigned long)table;
+	bid = hash & mask;
+	if (mask != 1) {
+		if (state) {
+			if (state->split == SPLIT_MAYBE) {
+				u32 buddy = bid ^ ((mask + 1) / 2);
+				if (table->buckets[bid] ==
+				    table->buckets[buddy])
+					state->split = SPLIT_SHOULD;
+				else
+					state->split = SPLIT_DONE;
+			}
+			state->mask = mask;
+		}
+		return rcu_dereference(table->buckets[bid]);
+	}
+	initial_table = (struct rbh_initial_table *)table;
+	if (state) {
+		state->split = SPLIT_DONE;
+		state->mask = mask;
+	}
+	return &initial_table->buckets[bid].b;
+}
+
+static struct rbh_bucket *lock_bucket(struct rbh *rbh, u32 hash)
+	__cond_acquires(&bucket->rbh_lock)
+{
+	struct rbh_bucket *bucket, *new_bucket;
+
+	bucket = get_bucket(rbh, hash, NULL);
+	if (!bucket)
+		return bucket;
+again:
+	spin_lock(&bucket->rbh_lock);
+	new_bucket = get_bucket(rbh, hash, NULL);
+	if (bucket == new_bucket)
+		return bucket;
+	spin_unlock(&bucket->rbh_lock);
+	bucket = new_bucket;
+	goto again;
+}
+
+static bool rbh_first(struct rbh *rbh, u32 hash)
+{
+	struct rbh_initial_table *table;
+	int i;
+
+printk("%s: table size %zd\n", __func__, sizeof(*table));
+	table = kmalloc(sizeof(*table), GFP_KERNEL);
+	if (!table)
+		return false;
+
+	rbh_resize_lock(rbh);
+	if (rbh->rbh_table) {
+		rbh_resize_unlock(rbh);
+		/* Somebody else resized it for us */
+		kfree(table);
+		return true;
+	}
+
+	bucket_lock_init(rbh, &table->buckets[0].b);
+	table->buckets[0].b.rbh_entries = INITIAL_SLOTS;
+	table->buckets[0].b.rbh_contention = 0;
+	bucket_lock_init(rbh, &table->buckets[1].b);
+	table->buckets[1].b.rbh_entries = INITIAL_SLOTS;
+	table->buckets[1].b.rbh_contention = 0;
+	for (i = 0; i < INITIAL_SLOTS; i++) {
+		table->buckets[0].b.rbh_hashes[i] = ~0;
+		table->buckets[1].b.rbh_hashes[i] = 0;
+	}
+	/* rcu_assign_pointer() but not a pointer */
+	smp_store_release(&rbh->rbh_table, (unsigned long)table | 1);
+	rbh_resize_unlock(rbh);
+
+printk("%s: new table = %px\n", __func__, table);
+	return true;
+}
+
+static void copy_initial_buckets(const struct rbh *rbh,
+		struct rbh_table *table, struct rbh_initial_table *init_table)
+	__acquires(&init_table->buckets[0].b.rbh_lock)
+	__acquires(&init_table->buckets[1].b.rbh_lock)
+{
+	struct rbh_bucket *bucket;
+
+	bucket = (void __force *)table->buckets[0];
+	spin_lock(&init_table->buckets[0].b.rbh_lock);
+	memcpy(bucket, &init_table->buckets[0], sizeof(init_table->buckets[0]));
+	bucket_lock_init(rbh, bucket);
+
+	bucket = (void __force *)table->buckets[1];
+	spin_lock_nested(&init_table->buckets[1].b.rbh_lock, 1);
+	memcpy(bucket, &init_table->buckets[1], sizeof(init_table->buckets[1]));
+	bucket_lock_init(rbh, bucket);
+}
+
+/*
+ * When we grow the table, we duplicate the bucket pointers so this
+ * thread doesn't pay the entire cost of growing the table.
+ */
+static int rbh_grow_table(struct rbh *rbh, u32 hash)
+{
+	struct rbh_table *table, *old_table;
+	struct rbh_initial_table *init_table;
+	unsigned long tagged;
+	u32 mask, buddy;
+	size_t size;
+
+	/* If the bucket is shared, we don't need to grow the table */
+	tagged = READ_ONCE(rbh->rbh_table);
+	table = (struct rbh_table *)(tagged & (tagged + 1));
+	mask = tagged - (unsigned long)table;
+	buddy = (hash ^ ((mask + 1) / 2)) & mask;
+	if (mask > 1 && table->buckets[hash & mask] == table->buckets[buddy])
+		return 0;
+
+	old_table = table;
+	size = (mask + 1) * 2 * sizeof(void *);
+	if (size > 4 * PAGE_SIZE)
+		/* XXX: NUMA_NO_NODE doesn't necessarily interleave */
+		table = __vmalloc_node(size, size, GFP_KERNEL, NUMA_NO_NODE,
+				&table);
+	else
+		table = kvmalloc(size, GFP_KERNEL);
+	if (!table) {
+		/* Maybe somebody resized it for us */
+		if (READ_ONCE(rbh->rbh_table) != tagged)
+			return 0;
+		return -ENOMEM;
+	}
+
+	if (mask == 1) {
+		/* Don't need to bother with RCU until we publish the table */
+		table->buckets[0] = (void __rcu *)kmalloc(INITIAL_SIZE, GFP_KERNEL);
+		if (!table->buckets[0])
+			goto free_all;
+		table->buckets[1] = (void __rcu *)kmalloc(INITIAL_SIZE, GFP_KERNEL);
+		if (!table->buckets[1])
+			goto free_all;
+	}
+
+	rbh_resize_lock(rbh);
+	if (rbh->rbh_table != tagged) {
+		rbh_resize_unlock(rbh);
+		/* Somebody else resized it for us */
+		kvfree(table);
+		return 0;
+	}
+
+printk("%s: replacing old_table %px with table %px mask %d\n", __func__, old_table, table, mask);
+	if (mask == 1) {
+		init_table = (void *)old_table;
+		copy_initial_buckets(rbh, table, init_table);
+	} else {
+		memcpy(&table->buckets, &old_table->buckets,
+				(mask + 1) * sizeof(void *));
+	}
+	memcpy(&table->buckets[mask + 1], &table->buckets[0],
+			(mask + 1) * sizeof(void *));
+
+	tagged = ((unsigned long)table) | (mask << 1) | 1;
+	/* rcu_assign_pointer() but not a pointer */
+	smp_store_release(&rbh->rbh_table, tagged);
+	rbh_resize_unlock(rbh);
+	if (mask == 1) {
+		spin_unlock(&init_table->buckets[0].b.rbh_lock);
+		spin_unlock(&init_table->buckets[1].b.rbh_lock);
+	}
+	kvfree_rcu_mightsleep(old_table);
+
+	return 0;
+free_all:
+	kfree((void __force *)table->buckets[0]);
+	kvfree(table);
+	return -ENOMEM;
+}
+
+static void bucket_copy(const struct rbh *rbh, struct rbh_bucket *bucket,
+		const struct rbh_bucket *old_bucket, unsigned int nr,
+		u32 hash, u32 mask)
+{
+	unsigned int old_nr = old_bucket->rbh_entries;
+	void __rcu **old_slots = bucket_slots(old_bucket, old_nr);
+	void __rcu **slots = bucket_slots(bucket, nr);
+	unsigned int i, j = 0;
+
+	bucket->rbh_entries = nr;
+	bucket_lock_init(rbh, bucket);
+	for (i = 0; i < old_nr; i++) {
+		if ((old_bucket->rbh_hashes[i] & mask) != (hash & mask))
+			continue;
+		bucket->rbh_hashes[j] = old_bucket->rbh_hashes[i];
+		slots[j++] = old_slots[i];
+	}
+printk("%s: bucket:%px(%d) copied %d/%d entries from %px hash:%x mask:%x\n", __func__, bucket, nr, j, old_nr, old_bucket, hash, mask);
+
+	while (j < nr)
+		bucket->rbh_hashes[j++] = ~hash;
+}
+
+#define rbh_dereference_protected(p, rbh)				\
+	rcu_dereference_protected(p, lockdep_is_held(&(rbh)->rbh_resize_lock))
+
+/*
+ * Split a bucket that is shared between multiple table slots: copy the
+ * entries for this part of the hash space into a new bucket and point
+ * the appropriate table slots at it.
+ */
+static void rbh_split_bucket(struct rbh *rbh,
+		struct rbh_bucket *old_bucket, unsigned int nr, u32 hash)
+{
+	struct rbh_table *table;
+	unsigned long tagged;
+	u32 mask, bit;
+	struct rbh_bucket *bucket = kmalloc(BUCKET_SIZE(nr), GFP_KERNEL);
+
+printk("%s: adding bucket %px for hash %d\n", __func__, bucket, hash);
+	if (!bucket)
+		return;
+
+	rbh_resize_lock(rbh);
+	tagged = rbh->rbh_table;
+	table = (struct rbh_table *)(tagged & (tagged + 1));
+	mask = tagged - (unsigned long)table;
+	hash &= mask;
+	if (rbh_dereference_protected(table->buckets[hash], rbh) != old_bucket)
+		goto free;
+
+	/* Figure out how many buckets we need to fill */
+	bit = (mask + 1) / 2;
+	mask = 0;
+	while (bit > 1) {
+printk("hash:%d buddy:%d\n", hash, hash ^ bit);
+		if (rbh_dereference_protected(table->buckets[hash ^ bit], rbh)
+				!= old_bucket)
+			break;
+		mask |= bit;
+		bit /= 2;
+	}
+	if (!mask)
+		goto free;
+
+	spin_lock(&old_bucket->rbh_lock);
+	bucket->rbh_contention = 0;
+	bucket_copy(rbh, bucket, old_bucket, nr, hash, mask | 1);
+
+	mask = (mask & (mask - 1)) / 2;
+printk("hash:%d mask:%d bit:%d\n", hash, mask, bit);
+	hash &= ~mask;
+	for (;;) {
+printk("assigning bucket %px to index %d\n", bucket, hash);
+		rcu_assign_pointer(table->buckets[hash], bucket);
+		if (hash == (hash | mask))
+			break;
+		hash += bit;
+	}
+	spin_unlock(&old_bucket->rbh_lock);
+	bucket = NULL;
+free:
+	rbh_resize_unlock(rbh);
+printk("%s: freeing bucket %px\n", __func__, bucket);
+	kfree(bucket);
+}
+
+/*
+ * thorns are leftovers from when this bucket was shared.  If they are
+ * sufficiently numerous, we'll just allocate another bucket of this size.
+ * Otherwise we'll allocate a larger bucket.
+ */
+static int rbh_expand_bucket(struct rbh *rbh,
+		struct rbh_bucket *old_bucket, unsigned int nr,
+		unsigned int thorns, u32 hash, u32 old_mask)
+{
+	unsigned long tagged;
+	struct rbh_table *table;
+	struct rbh_bucket *bucket;
+	u32 mask;
+
+	/* Can't expand an initial bucket, must create a table */
+	if (old_mask == 1)
+		return rbh_grow_table(rbh, hash);
+	if (thorns < nr / 4)
+		nr = next_bucket_size(nr);
+	if (nr > MAX_SLOTS)
+		return -E2BIG;
+
+	bucket = kmalloc(BUCKET_SIZE(nr), GFP_KERNEL);
+	if (!bucket)
+		return -ENOMEM;
+
+printk("%s: adding bucket %px for hash %d\n", __func__, bucket, hash);
+	rbh_resize_lock(rbh);
+	tagged = READ_ONCE(rbh->rbh_table);
+	table = (struct rbh_table *)(tagged & (tagged + 1));
+	mask = tagged - (unsigned long)table;
+
+	/* If the table expanded while we slept just try again */
+	if (old_mask != mask)
+		goto free;
+	hash &= mask;
+	if (rbh_dereference_protected(table->buckets[hash], rbh) != old_bucket)
+		goto free;
+
+	spin_lock(&old_bucket->rbh_lock);
+	bucket->rbh_contention = old_bucket->rbh_contention;
+	bucket_copy(rbh, bucket, old_bucket, nr, hash, mask);
+	rcu_assign_pointer(table->buckets[hash], bucket);
+
+	spin_unlock(&old_bucket->rbh_lock);
+	bucket = NULL;
+free:
+	rbh_resize_unlock(rbh);
+printk("%s: freeing bucket %px\n", __func__, bucket);
+	kfree(bucket);
+	if (!bucket)
+		kvfree_rcu_mightsleep(old_bucket);
+
+	return 0;
+}
+
+static int __rbh_insert(struct rbh *rbh, u32 hash, void *p)
+{
+	struct rbh_bucket *bucket, *new_bucket;
+	unsigned int i, nr, thorns = 0;
+	int err;
+	struct insert_state state = { .split = SPLIT_MAYBE, };
+
+restart:
+	rcu_read_lock();
+	bucket = get_bucket(rbh, hash, &state);
+	if (unlikely(!bucket)) {
+		rcu_read_unlock();
+		if (!rbh_first(rbh, hash))
+			return -ENOMEM;
+		goto restart;
+	}
+
+again:
+	if (spin_trylock(&bucket->rbh_lock)) {
+		if (bucket->rbh_contention)
+			bucket->rbh_contention--;
+	} else {
+		spin_lock(&bucket->rbh_lock);
+		/* Numbers chosen ad-hoc */
+		bucket->rbh_contention += 10;
+		if (unlikely(bucket->rbh_contention > 5000)) {
+			spin_unlock(&bucket->rbh_lock);
+			rcu_read_unlock();
+			/* OK if this fails; it's only contention */
+			rbh_grow_table(rbh, hash);
+
+			rcu_read_lock();
+			bucket = get_bucket(rbh, hash, &state);
+			spin_lock(&bucket->rbh_lock);
+		}
+	}
+
+	new_bucket = get_bucket(rbh, hash, &state);
+	if (bucket != new_bucket) {
+		spin_unlock(&bucket->rbh_lock);
+		bucket = new_bucket;
+		goto again;
+	}
+
+printk("%s: bucket:%px hash %d\n", __func__, bucket, hash);
+	nr = bucket->rbh_entries;
+	/* If the bucket is shared, split it before inserting a new element */
+	if (state.split == SPLIT_SHOULD) {
+		spin_unlock(&bucket->rbh_lock);
+		rcu_read_unlock();
+		rbh_split_bucket(rbh, bucket, nr, hash);
+		state.split = SPLIT_DONE;
+		goto restart;
+	}
+
+	/* Deleted elements differ in their bottom bit */
+	for (i = 0; i < nr; i++) {
+		void __rcu **slots;
+		u32 bhash = bucket->rbh_hashes[i];
+
+		if ((bhash & state.mask) != (hash & state.mask))
+			thorns++;
+		if ((bhash & 1) == (hash & 1))
+			continue;
+printk("%s: hash:%x bhash:%x index %d\n", __func__, hash, bhash, i);
+		slots = bucket_slots(bucket, nr);
+		rcu_assign_pointer(slots[i], p);
+		/* This array is read under RCU */
+		WRITE_ONCE(bucket->rbh_hashes[i], hash);
+
+		spin_unlock(&bucket->rbh_lock);
+		rcu_read_unlock();
+		return 0;
+	}
+
+	/* No space in this bucket */
+	spin_unlock(&bucket->rbh_lock);
+	rcu_read_unlock();
+
+	err = rbh_expand_bucket(rbh, bucket, nr, thorns, hash, state.mask);
+	if (err == -ENOMEM)
+		return -ENOMEM;
+	if (err == -E2BIG) {
+		if (rbh_grow_table(rbh, hash) < 0)
+			return -ENOMEM;
+	}
+	state.split = SPLIT_MAYBE;
+	goto restart;
+}
+
+/**
+ * rbh_insert - Add a pointer to a rosebush.
+ * @rbh: The rosebush.
+ * @hash: The hash value for this pointer.
+ * @p: The pointer to add.
+ *
+ * Return: 0 on success, -ENOMEM if memory allocation fails,
+ * -EINVAL if @p is NULL.
+ */
+int rbh_insert(struct rbh *rbh, u32 hash, void *p)
+{
+	if (p == NULL)
+		return -EINVAL;
+	return __rbh_insert(rbh, hash, p);
+}
+EXPORT_SYMBOL(rbh_insert);
+
+/**
+ * rbh_remove - Remove a pointer from a rosebush.
+ * @rbh: The rosebush.
+ * @hash: The hash value for this pointer.
+ * @p: The pointer to remove.
+ *
+ * Return: 0 on success, -ENOENT if this pointer could not be found.
+ */
+int rbh_remove(struct rbh *rbh, u32 hash, void *p)
+{
+	struct rbh_bucket *bucket;
+	void __rcu **slots;
+	unsigned int i, nr;
+	int err = -ENOENT;
+
+	rcu_read_lock();
+	bucket = lock_bucket(rbh, hash);
+	if (!bucket)
+		goto rcu_unlock;
+
+	nr = bucket->rbh_entries;
+	slots = bucket_slots(bucket, nr);
+	for (i = 0; i < nr; i++) {
+		if (bucket->rbh_hashes[i] != hash)
+			continue;
+		if (rcu_dereference_protected(slots[i],
+				lockdep_is_held(&bucket->rbh_lock)) != p)
+			continue;
+		WRITE_ONCE(bucket->rbh_hashes[i], ~hash);
+		/* Do not modify the slot */
+		err = 0;
+		break;
+	}
+
+	spin_unlock(&bucket->rbh_lock);
+rcu_unlock:
+	rcu_read_unlock();
+	return err;
+}
+EXPORT_SYMBOL(rbh_remove);
+
+/**
+ * rbh_reserve - Reserve a slot in a rosebush for later use.
+ * @rbh: The rosebush.
+ * @hash: The hash value that will be used.
+ *
+ * Some callers need to take another lock before inserting an object
+ * into the rosebush.  This function reserves space for them to do that.
+ * A subsequent call to rbh_use() will not allocate memory.  If you find
+ * that you do not need the reserved space any more, call rbh_remove(),
+ * passing NULL as the pointer.
+ *
+ * Return: 0 on success, -ENOMEM on failure.
+ */
+int rbh_reserve(struct rbh *rbh, u32 hash)
+{
+	return __rbh_insert(rbh, hash, NULL);
+}
+EXPORT_SYMBOL(rbh_reserve);
+
+/**
+ * rbh_use - Use a reserved slot in a rosebush.
+ * @rbh: The rosebush.
+ * @hash: The hash value for this pointer.
+ * @p: The pointer to add.
+ *
+ * Return: 0 on success, -EINVAL if @p is NULL,
+ * -ENOENT if no reserved slot could be found.
+ */
+int rbh_use(struct rbh *rbh, u32 hash, void *p)
+{
+	struct rbh_bucket *bucket;
+	void __rcu **slots;
+	unsigned int i, nr;
+	int err = -ENOENT;
+
+	rcu_read_lock();
+	bucket = lock_bucket(rbh, hash);
+	if (!bucket)
+		goto rcu_unlock;
+
+	nr = bucket->rbh_entries;
+	slots = bucket_slots(bucket, nr);
+	for (i = 0; i < nr; i++) {
+		if (bucket->rbh_hashes[i] != hash)
+			continue;
+		if (slots[i] != NULL)
+			continue;
+		rcu_assign_pointer(slots[i], p);
+		err = 0;
+		break;
+	}
+
+	spin_unlock(&bucket->rbh_lock);
+rcu_unlock:
+	rcu_read_unlock();
+	return err;
+}
+EXPORT_SYMBOL(rbh_use);
+
+/**
+ * rbh_next - Find the next entry matching this hash
+ * @rbhi: The rosebush iterator.
+ *
+ * Return: NULL if there are no more matching hash values, otherwise
+ * the next pointer.
+ */
+void *rbh_next(struct rbh_iter *rbhi)
+{
+	struct rbh_bucket *bucket = rbhi->bucket;
+	unsigned int nr;
+	void __rcu **slots;
+	void *p;
+
+	if (!bucket) {
+		bucket = get_bucket(rbhi->rbh, rbhi->hash, NULL);
+		if (!bucket)
+			return NULL;
+		rbhi->bucket = bucket;
+		rbhi->index = UINT_MAX;
+	}
+
+	nr = bucket->rbh_entries;
+	slots = bucket_slots(bucket, nr);
+
+	while (++rbhi->index < nr) {
+		if (READ_ONCE(bucket->rbh_hashes[rbhi->index]) != rbhi->hash)
+			continue;
+		p = rcu_dereference(slots[rbhi->index]);
+		if (p)
+			return p;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL(rbh_next);
+
+/*
+ * TODO:
+ * * convert the dcache
+ * * 2 byte hashes in the bucket.  Once the table has 2^17 buckets, we can
+ *   use 10 bytes per entry instead of 12
+ * * 1 byte hashes in the bucket.  Once the table has 2^25 buckets, we can
+ *   use 9 bytes per entry instead of 10?
+ */
diff --git a/lib/test_rosebush.c b/lib/test_rosebush.c
new file mode 100644
index 000000000000..344479cb8a94
--- /dev/null
+++ b/lib/test_rosebush.c
@@ -0,0 +1,135 @@
+#include <linux/rosebush.h>
+#include <kunit/test.h>
+
+static void iter_rbh(struct kunit *test, struct rbh *rbh, u32 hash, void *p)
+{
+	RBH_ITER(iter, rbh, hash);
+	void *q;
+
+	rcu_read_lock();
+	q = rbh_next(&iter);
+	rcu_read_unlock();
+	KUNIT_EXPECT_PTR_EQ_MSG(test, p, q,
+		"rbh_next hash:%u returned %px, expected %px", hash, q, p);
+}
+
+static void check_empty_rbh(struct kunit *test, struct rbh *rbh)
+{
+	iter_rbh(test, rbh, 0, NULL);
+	iter_rbh(test, rbh, 1, NULL);
+	iter_rbh(test, rbh, 17, NULL);
+	iter_rbh(test, rbh, 42, NULL);
+}
+
+static void insert(struct kunit *test, struct rbh *rbh, u32 hash)
+{
+	void *p = (void *)((hash << 1) | 1UL);
+	int err;
+
+	err = rbh_insert(rbh, hash, p);
+	KUNIT_EXPECT_EQ(test, err, 0);
+
+	iter_rbh(test, rbh, hash, p);
+}
+
+static void reserve(struct kunit *test, struct rbh *rbh, u32 hash)
+{
+	int err;
+
+	err = rbh_reserve(rbh, hash);
+	KUNIT_EXPECT_EQ(test, err, 0);
+
+	iter_rbh(test, rbh, hash, NULL);
+}
+
+static void use(struct kunit *test, struct rbh *rbh, u32 hash)
+{
+	void *p = (void *)((hash << 1) | 1UL);
+	int err;
+
+	err = rbh_use(rbh, hash, p);
+	KUNIT_EXPECT_EQ(test, err, 0);
+
+	iter_rbh(test, rbh, hash, p);
+}
+
+static void remove(struct kunit *test, struct rbh *rbh, u32 hash)
+{
+	void *p = (void *)((hash << 1) | 1UL);
+	int err;
+
+	err = rbh_remove(rbh, hash, p);
+	KUNIT_EXPECT_EQ(test, err, 0);
+
+	iter_rbh(test, rbh, hash, NULL);
+}
+
+static DEFINE_ROSEBUSH(rosebush);
+
+/*
+ * Conduct a number of tests on a rosebush that has never been used.
+ * They should all return NULL or an errno.  We're looking for crashes
+ * here.
+ */
+static void empty(struct kunit *test)
+{
+	int err;
+
+	check_empty_rbh(test, &rosebush);
+	err = rbh_remove(&rosebush, 0, test);
+	KUNIT_EXPECT_EQ(test, err, -ENOENT);
+	err = rbh_use(&rosebush, 0, test);
+	KUNIT_EXPECT_EQ(test, err, -ENOENT);
+	KUNIT_EXPECT_EQ(test, rosebush.rbh_table, 0);
+}
+
+static void first(struct kunit *test)
+{
+	int err;
+
+	insert(test, &rosebush, 5);
+	check_empty_rbh(test, &rosebush);
+	remove(test, &rosebush, 5);
+	check_empty_rbh(test, &rosebush);
+
+	err = rbh_remove(&rosebush, 5, NULL);
+	KUNIT_EXPECT_EQ(test, err, -ENOENT);
+	reserve(test, &rosebush, 5);
+	err = rbh_remove(&rosebush, 5, test);
+	KUNIT_EXPECT_EQ(test, err, -ENOENT);
+	err = rbh_remove(&rosebush, 5, NULL);
+	KUNIT_EXPECT_EQ(test, err, 0);
+	err = rbh_remove(&rosebush, 5, NULL);
+	KUNIT_EXPECT_EQ(test, err, -ENOENT);
+
+	reserve(test, &rosebush, 5);
+	use(test, &rosebush, 5);
+	err = rbh_remove(&rosebush, 5, NULL);
+	KUNIT_EXPECT_EQ(test, err, -ENOENT);
+	remove(test, &rosebush, 5);
+}
+
+static void grow(struct kunit *test)
+{
+	int i;
+
+	for (i = 3; i < 3333; i += 2)
+		insert(test, &rosebush, i);
+}
+
+static struct kunit_case rosebush_cases[] __refdata = {
+	KUNIT_CASE(empty),
+	KUNIT_CASE(first),
+	KUNIT_CASE(grow),
+	{}
+};
+
+static struct kunit_suite rosebush_suite = {
+	.name = "rosebush",
+	.test_cases = rosebush_cases,
+};
+
+kunit_test_suite(rosebush_suite);
+
+MODULE_AUTHOR("Matthew Wilcox (Oracle) <willy@infradead.org>");
+MODULE_LICENSE("GPL");
-- 
2.43.0



* Re: [PATCH 0/1] Rosebush, a new hash table
  2024-02-22 20:37 [PATCH 0/1] Rosebush, a new hash table Matthew Wilcox (Oracle)
  2024-02-22 20:37 ` [PATCH 1/1] rosebush: Add new data structure Matthew Wilcox (Oracle)
@ 2024-02-23 11:37 ` Peng Zhang
  2024-02-23 13:55 ` Jason A. Donenfeld
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Peng Zhang @ 2024-02-23 11:37 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: Thomas Graf, Herbert Xu, netdev, linux-fsdevel, maple-tree, rcu,
	linux-kernel



On 2024/2/23 04:37, Matthew Wilcox (Oracle) wrote:
> Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table.
> I've written a load of documentation about how it works, mostly in
> Documentation/core-api/rosebush.rst but some is dotted through the
> rosebush.c file too.
> 
> You can see this code as a further exploration of the "Linked lists are
> evil" design space.  For the workloads which a hashtable is suited to,
> it has lower overhead than either the maple tree or the rhashtable.
> It cannot support ranges (so is not a replacement for the maple tree),
> but it does have per-bucket locks so is more scalable for write-heavy
> workloads.  I suspect one could reimplement the rhashtable API on top
> of the rosebush, but I am not interested in doing that work myself.
> 
> The per-object overhead is 12 bytes, as opposed to 16 bytes for the
> rhashtable and 32 bytes for the maple tree.  The constant overhead is also
> small, being just 16 bytes for the struct rosebush.  The exact amount
> of memory consumed for a given number of objects is going to depend on
> the distribution of hashes; here are some estimated consumptions for
> power-of-ten entries distributed evenly over a 32-bit hash space in the
> various data structures:
> 
> number	xarray	maple	rhash	rosebush
> 1	3472	272	280	256
> 10	32272	784	424	256
> 100	262kB	3600	1864	2080
> 1000	[1]	34576	17224	16432
> 10k	[1]	343k	168392	131344
> 100k	[1]	3.4M	1731272	2101264
> 
> As you can see, rosebush and rhashtable are close the whole way.
> Rosebush moves in larger chunks because it doubles each time; there's
> no actual need to double the bucket size, but that works well with
> the slab allocator's existing slabs.  As noted in the documentation,
> we could create our own slabs and get closer to the 12 bytes per object
> minimum consumption. [2]
> 
> Where I expect rosebush to shine is on dependent cache misses.
> I've assumed an average chain length of 10 for rhashtable in the above
> memory calculations.  That means on average a lookup would take five cache
> misses that can't be speculated.  Rosebush does a linear walk of 4-byte
> hashes looking for matches, so the CPU can usefully speculate the entire
> array of hash values (indeed, we tell it exactly how many we're going to
> look at) and then take a single cache miss fetching the correct pointer.
> Add that to the cache miss to fetch the bucket and that's just two cache
> misses rather than five.
> 
> I have not yet converted any code to use the rosebush.  The API is
> designed for use by the dcache, and I expect it will evolve as it actually
Hello,

It seems that the advantage of this hash table is that it does not need to
traverse a linked list and so has fewer cache misses.  How much of a
performance improvement is expected if it is applied to the dcache?

Thanks,
Peng
> gets used.  I think there's probably some more refactoring to be done.
> I am not aware of any bugs, but the test suite is pitifully small.
> The current algorithm grows the buckets more aggressively than the table;
> that's probably exactly the wrong thing to do for good performance.
> 
> This version is full of debugging printks.  You should probably take
> them out if you're going to try to benchmark it.  The regex '^print'
> should find them all.  Contributions welcome; I really want to get back
> to working on folios, but this felt like an urgent problem to be fixed.
> 
> [1] I stopped trying to estimate the memory costs for xarray; I couldn't
> be bothered to as it's not a serious competitor for this use case.
> 
> [2] We have ideas for improving the maple tree memory consumption for
> this kind of workload; a new node type for pivots that fit in 4 bytes and
> sparse nodes to avoid putting a NULL entry after each occupied entry.
> The maple tree really is optimised for densely packed ranges at the
> moment.
> 
> Matthew Wilcox (Oracle) (1):
>    rosebush: Add new data structure
> 
>   Documentation/core-api/index.rst    |   1 +
>   Documentation/core-api/rosebush.rst | 135 ++++++
>   MAINTAINERS                         |   8 +
>   include/linux/rosebush.h            |  41 ++
>   lib/Kconfig.debug                   |   3 +
>   lib/Makefile                        |   3 +-
>   lib/rosebush.c                      | 707 ++++++++++++++++++++++++++++
>   lib/test_rosebush.c                 | 135 ++++++
>   8 files changed, 1032 insertions(+), 1 deletion(-)
>   create mode 100644 Documentation/core-api/rosebush.rst
>   create mode 100644 include/linux/rosebush.h
>   create mode 100644 lib/rosebush.c
>   create mode 100644 lib/test_rosebush.c
> 


* Re: [PATCH 0/1] Rosebush, a new hash table
  2024-02-22 20:37 [PATCH 0/1] Rosebush, a new hash table Matthew Wilcox (Oracle)
  2024-02-22 20:37 ` [PATCH 1/1] rosebush: Add new data structure Matthew Wilcox (Oracle)
  2024-02-23 11:37 ` [PATCH 0/1] Rosebush, a new hash table Peng Zhang
@ 2024-02-23 13:55 ` Jason A. Donenfeld
  2024-02-23 18:40 ` Kent Overstreet
  2024-02-24  0:20 ` Herbert Xu
  4 siblings, 0 replies; 19+ messages in thread
From: Jason A. Donenfeld @ 2024-02-23 13:55 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-kernel, Thomas Graf, Herbert Xu, netdev, linux-fsdevel,
	maple-tree, rcu

Hi Matthew,

On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote:
> Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table.
> I've written a load of documentation about how it works, mostly in
> Documentation/core-api/rosebush.rst but some is dotted through the
> rosebush.c file too.

If you're interested, WireGuard has some pretty primitive hashtables,
for which maybe Rosebush would be an interesting replacement:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireguard/peerlookup.c
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireguard/peerlookup.h#n17
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireguard/ratelimiter.c#n167

In peerlookup.c, note the "At the moment, we limit" comment for an idea
of some of the hairy issues involved in replacing these. But I wouldn't
be entirely opposed to it, if you see some interesting potential for
Rosebush here. That's a very tentative interest -- maybe it won't work
out in the end -- but nonetheless, seeing this piqued my curiosity. If
you're looking to see how this behaves in a place beyond dcache, this
might be something to play with.

Jason


* Re: [PATCH 0/1] Rosebush, a new hash table
  2024-02-22 20:37 [PATCH 0/1] Rosebush, a new hash table Matthew Wilcox (Oracle)
                   ` (2 preceding siblings ...)
  2024-02-23 13:55 ` Jason A. Donenfeld
@ 2024-02-23 18:40 ` Kent Overstreet
  2024-02-24  0:20 ` Herbert Xu
  4 siblings, 0 replies; 19+ messages in thread
From: Kent Overstreet @ 2024-02-23 18:40 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-kernel, Thomas Graf, Herbert Xu, netdev, linux-fsdevel,
	maple-tree, rcu

On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote:
> Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table.
> I've written a load of documentation about how it works, mostly in
> Documentation/core-api/rosebush.rst but some is dotted through the
> rosebush.c file too.
> 
> You can see this code as a further exploration of the "Linked lists are
> evil" design space.  For the workloads which a hashtable is suited to,
> it has lower overhead than either the maple tree or the rhashtable.
> It cannot support ranges (so is not a replacement for the maple tree),
> but it does have per-bucket locks so is more scalable for write-heavy
> workloads.  I suspect one could reimplement the rhashtable API on top
> of the rosebush, but I am not interested in doing that work myself.
> 
> The per-object overhead is 12 bytes, as opposed to 16 bytes for the
> rhashtable and 32 bytes for the maple tree.  The constant overhead is also
> small, being just 16 bytes for the struct rosebush.  The exact amount
> of memory consumed for a given number of objects is going to depend on
> the distribution of hashes; here are some estimated consumptions for
> power-of-ten entries distributed evenly over a 32-bit hash space in the
> various data structures:
> 
> number	xarray	maple	rhash	rosebush
> 1	3472	272	280	256
> 10	32272	784	424	256
> 100	262kB	3600	1864	2080
> 1000	[1]	34576	17224	16432
> 10k	[1]	343k	168392	131344
> 100k	[1]	3.4M	1731272	2101264

So I think the interesting numbers to see (besides performance numbers)
are going to be the fill factor numbers under real world use.

It's an interesting technique, I've played around with it a bit
(actually using it in bcachefs for the nocow locks hash table), but no
idea if it makes sense as a general purpose thing yet...

you also mentioned that a motivation was API mismatch between rhashtable
and dcache - could you elaborate on that?


* Re: [PATCH 0/1] Rosebush, a new hash table
  2024-02-22 20:37 [PATCH 0/1] Rosebush, a new hash table Matthew Wilcox (Oracle)
                   ` (3 preceding siblings ...)
  2024-02-23 18:40 ` Kent Overstreet
@ 2024-02-24  0:20 ` Herbert Xu
  2024-02-24 22:10   ` David Laight
  4 siblings, 1 reply; 19+ messages in thread
From: Herbert Xu @ 2024-02-24  0:20 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-kernel, Thomas Graf, netdev, linux-fsdevel, maple-tree, rcu

On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote:
>
> Where I expect rosebush to shine is on dependent cache misses.
> I've assumed an average chain length of 10 for rhashtable in the above
> memory calculations.  That means on average a lookup would take five cache
> misses that can't be speculated.  Rosebush does a linear walk of 4-byte

Normally an rhashtable gets resized when it reaches 75% capacity
so the average chain length should always be one.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* RE: [PATCH 0/1] Rosebush, a new hash table
  2024-02-24  0:20 ` Herbert Xu
@ 2024-02-24 22:10   ` David Laight
  2024-02-25  0:50     ` Herbert Xu
  2024-02-25  3:18     ` Kent Overstreet
  0 siblings, 2 replies; 19+ messages in thread
From: David Laight @ 2024-02-24 22:10 UTC (permalink / raw)
  To: 'Herbert Xu', Matthew Wilcox (Oracle)
  Cc: linux-kernel, Thomas Graf, netdev, linux-fsdevel, maple-tree, rcu

From: Herbert Xu
> Sent: 24 February 2024 00:21
> 
> On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote:
> >
> > Where I expect rosebush to shine is on dependent cache misses.
> > I've assumed an average chain length of 10 for rhashtable in the above
> > memory calculations.  That means on average a lookup would take five cache
> > misses that can't be speculated.  Rosebush does a linear walk of 4-byte
> 
> Normally an rhashtable gets resized when it reaches 75% capacity
> so the average chain length should always be one.

The average length of non-empty hash chains is more interesting.
You don't usually search for items in empty chains.
The only way you'll get all the chains of length one is if you've
carefully picked the data so that it hashed that way.

I remember playing around with the elf symbol table for a browser
and all its shared libraries.
While the hash function is pretty trivial, it really didn't matter
whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash
chains were always long.
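
The "pretty trivial" hash in question is presumably the classic SysV ELF
symbol hash; a sketch for reference:

unsigned long sysv_elf_hash(const unsigned char *name)
{
	unsigned long h = 0, g;

	while (*name) {
		h = (h << 4) + *name++;
		g = h & 0xf0000000;
		if (g)
			h ^= g >> 24;
		h &= ~g;
	}
	return h;
}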

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* Re: [PATCH 0/1] Rosebush, a new hash table
  2024-02-24 22:10   ` David Laight
@ 2024-02-25  0:50     ` Herbert Xu
  2024-02-25  3:20       ` Kent Overstreet
  2024-02-25  3:18     ` Kent Overstreet
  1 sibling, 1 reply; 19+ messages in thread
From: Herbert Xu @ 2024-02-25  0:50 UTC (permalink / raw)
  To: David Laight
  Cc: Matthew Wilcox (Oracle),
	linux-kernel, Thomas Graf, netdev, linux-fsdevel, maple-tree,
	rcu

On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote:
>
> > Normally an rhashtable gets resized when it reaches 75% capacity
> > so the average chain length should always be one.
> 
> The average length of non-empty hash chains is more interesting.
> You don't usually search for items in empty chains.
> The only way you'll get all the chains of length one is if you've
> carefully picked the data so that it hashed that way.

Sure.  But given the 75% capacity, you'd need a really bad hash
function to get an *average* (not worst-case) chain length of
10.
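
As a back-of-the-envelope check, with an ideal hash at 75% load the
per-bucket occupancy is roughly Poisson with lambda = 0.75:

	fraction of empty buckets      = e^-0.75              ~= 0.47
	average non-empty chain        = 0.75 / (1 - e^-0.75)  ~= 1.4
	chain seen by a random element = 1 + 0.75              =  1.75

so even counting only the chains that lookups actually touch, the
average stays well under two unless the hash is badly broken.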

> I remember playing around with the elf symbol table for a browser
> and all its shared libraries.
> While the hash function is pretty trivial, it really didn't matter
> whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash
> chains were always long.

Even in the unlikely event of bad luck and everything bunches up
together, we change the hash function (through hash_rnd) every
time we resize so you would expect things to even out after the
resize event.

A rehash is also automatically triggered if the worst-case chain
length exceeds 16.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH 0/1] Rosebush, a new hash table
  2024-02-24 22:10   ` David Laight
  2024-02-25  0:50     ` Herbert Xu
@ 2024-02-25  3:18     ` Kent Overstreet
  2024-02-25  5:01       ` Matthew Wilcox
  2024-02-25 14:47       ` David Laight
  1 sibling, 2 replies; 19+ messages in thread
From: Kent Overstreet @ 2024-02-25  3:18 UTC (permalink / raw)
  To: David Laight
  Cc: 'Herbert Xu', Matthew Wilcox (Oracle),
	linux-kernel, Thomas Graf, netdev, linux-fsdevel, maple-tree,
	rcu

On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote:
> From: Herbert Xu
> > Sent: 24 February 2024 00:21
> > 
> > On Thu, Feb 22, 2024 at 08:37:23PM +0000, Matthew Wilcox (Oracle) wrote:
> > >
> > > Where I expect rosebush to shine is on dependent cache misses.
> > > I've assumed an average chain length of 10 for rhashtable in the above
> > > memory calculations.  That means on average a lookup would take five cache
> > > misses that can't be speculated.  Rosebush does a linear walk of 4-byte
> > 
> > Normally an rhashtable gets resized when it reaches 75% capacity
> > so the average chain length should always be one.
> 
> The average length of non-empty hash chains is more interesting.
> You don't usually search for items in empty chains.
> The only way you'll get all the chains of length one is if you've
> carefully picked the data so that it hashed that way.
> 
> I remember playing around with the elf symbol table for a browser
> and all its shared libraries.
> While the hash function is pretty trivial, it really didn't matter
> whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash
> chains were always long.

that's a pretty bad hash, even golden ratio hash would be better, but
still bad; you really should be using at least jhash.

you really want a good avalanche effect, because in real world usage
your entropy is often only in a relatively few bits.

when I implemented cuckoo (which is more obviously sensitive to a weak
hash function), I had to go with siphash, even jhash wasn't giving me
great results. and looking at the code it's not hard to see why, it's all
adds, and the rotates are byte aligned... you want mixed adds and xors
and the rotates to be more prime-ish.

right idea, just old...

what would be ideal is something more like siphash, but with fewer
rounds, so same number of instructions as jhash. xxhash might fit the
bill, I haven't looked at the code yet...
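
For reference, all three options discussed here already exist in the
tree; a minimal sketch of how each would be called (the key handling and
function names below are illustrative only):

#include <linux/hash.h>
#include <linux/jhash.h>
#include <linux/siphash.h>

static siphash_key_t table_key;	/* seeded once, e.g. get_random_bytes() */

/* multiplicative (golden ratio) hash: one multiply, weak mixing */
static u32 hash_golden(u32 key)
{
	return hash_32(key, 32);
}

/* jhash: Bob Jenkins' lookup3, unkeyed here */
static u32 hash_jenkins(const void *key, u32 len)
{
	return jhash(key, len, 0);
}

/* siphash: keyed, strongest avalanche of the three, more rounds */
static u32 hash_sip(const void *key, size_t len)
{
	return (u32)siphash(key, len, &table_key);
}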


* Re: [PATCH 0/1] Rosebush, a new hash table
  2024-02-25  0:50     ` Herbert Xu
@ 2024-02-25  3:20       ` Kent Overstreet
  0 siblings, 0 replies; 19+ messages in thread
From: Kent Overstreet @ 2024-02-25  3:20 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Laight, Matthew Wilcox (Oracle),
	linux-kernel, Thomas Graf, netdev, linux-fsdevel, maple-tree,
	rcu

On Sun, Feb 25, 2024 at 08:50:36AM +0800, Herbert Xu wrote:
> On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote:
> >
> > > Normally an rhashtable gets resized when it reaches 75% capacity
> > > so the average chain length should always be one.
> > 
> > The average length of non-empty hash chains is more interesting.
> > You don't usually search for items in empty chains.
> > The only way you'll get all the chains of length one is if you've
> > carefully picked the data so that it hashed that way.
> 
> Sure.  But given the 75% capacity, you'd need a really bad hash
> function to get an *average* (not worst-case) chain length of
> 10.
> 
> > I remember playing around with the elf symbol table for a browser
> > and all its shared libraries.
> > While the hash function is pretty trivial, it really didn't matter
> > whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash
> > chains were always long.
> 
> Even in the unlikely event of bad luck where everything bunches up
> together, we change the hash function (through hash_rnd) every
> time we resize, so you would expect things to even out after the
> resize event.
> 
> A rehash is also automatically triggered if the worst-case chain
> length exceeds 16.

16!? that's crap, use a decent hash function and 3-5 should be your
worst upper bound.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/1] Rosebush, a new hash table
  2024-02-25  3:18     ` Kent Overstreet
@ 2024-02-25  5:01       ` Matthew Wilcox
  2024-02-25  5:32         ` Herbert Xu
  2024-02-25  5:51         ` Kent Overstreet
  2024-02-25 14:47       ` David Laight
  1 sibling, 2 replies; 19+ messages in thread
From: Matthew Wilcox @ 2024-02-25  5:01 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: David Laight, 'Herbert Xu',
	linux-kernel, Thomas Graf, netdev, linux-fsdevel, maple-tree,
	rcu

On Sat, Feb 24, 2024 at 10:18:31PM -0500, Kent Overstreet wrote:
> On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote:
> > I remember playing around with the elf symbol table for a browser
> > and all its shared libraries.
> > While the hash function is pretty trivial, it really didn't matter
> > whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash
> > chains were always long.
> 
> that's a pretty bad hash, even golden ratio hash would be better, but
> still bad; you really should be using at least jhash.

There's a "fun" effect; essentially the "biased observer" effect which
leads students to erroneously conclude that the majority of classes are
oversubscribed.  As somebody observed in this thread, for some usecases
you only look up hashes which actually exist.

Take a trivial example where you have four entries unevenly distributed
between two buckets, three in one bucket and one in the other.  Now 3/4
of your lookups hit in one bucket and 1/4 in the other bucket.
Obviously it's not as pronounced if you have 1000 buckets with 1000
entries randomly distributed between the buckets.  But that distribution
is not nearly as even as you might expect:

$ ./distrib
0: 362
1: 371
2: 193
3: 57
4: 13
5: 4

That's using lrand48() to decide which bucket to use, so not even a
"quality of hash" problem, just a "your mathematical intuition may not
be right here".

To put this data another way, 371 entries are in a bucket with a single
entry, 386 are in a bucket with two entries, 171 are in a 3-entry
bucket, 52 are in a 4-entry bucket and 20 are in a 5-entry bucket.

$ cat distrib.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

int bucket[1000];
int freq[10];

int main(int argc, char **argv)
{
	int i;

	for (i = 0; i < 1000; i++)
		bucket[lrand48() % 1000]++;

	for (i = 0; i < 1000; i++)
		freq[bucket[i]]++;

	for (i = 0; i < 10; i++)
		printf("%d: %d\n", i, freq[i]);

	return 0;
}

(OK, quibble about "well, 1000 doesn't divide INT_MAX evenly so your
random number generation is biased", but I maintain that will not
materially affect these results, since it affects only 0.00003% of the
numbers generated)
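
For comparison, the bucket sizes here should approach a Poisson
distribution with mean 1, and the expected counts sit close to the
sample above (a quick sketch):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double p = exp(-1.0);	/* P(bucket holds 0 entries), lambda = 1 */
	int k;

	for (k = 0; k < 6; k++) {
		printf("%d: %.0f\n", k, 1000 * p);
		p /= k + 1;	/* P(k+1) = P(k) * lambda / (k+1), lambda = 1 */
	}
	return 0;
}

which prints 368, 368, 184, 61, 15, 3 for buckets of size 0 through 5.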

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/1] Rosebush, a new hash table
  2024-02-25  5:01       ` Matthew Wilcox
@ 2024-02-25  5:32         ` Herbert Xu
  2024-02-25  5:51         ` Kent Overstreet
  1 sibling, 0 replies; 19+ messages in thread
From: Herbert Xu @ 2024-02-25  5:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kent Overstreet, David Laight, linux-kernel, Thomas Graf, netdev,
	linux-fsdevel, maple-tree, rcu

On Sun, Feb 25, 2024 at 05:01:19AM +0000, Matthew Wilcox wrote:
> 
> Take a trivial example where you have four entries unevenly distributed
> between two buckets, three in one bucket and one in the other.  Now 3/4
> of your lookups hit in one bucket and 1/4 in the other bucket.
> Obviously it's not as pronounced if you have 1000 buckets with 1000
> entries randomly distributed between the buckets.  But that distribution
> is not nearly as even as you might expect:
> 
> $ ./distrib
> 0: 362
> 1: 371
> 2: 193
> 3: 57
> 4: 13
> 5: 4

Indeed, that's why rhashtable only triggers a forced rehash at
a chain length of 16 even though we expect the average chain length
to be just 1.

The theoretical worst-case value is expected to be O(lg n/lg lg n).
However, I think 16 was picked because it was sufficient even for a
hash table that filled all memory.  Of course if anyone can provide
some calculation showing that this is insufficient I'm happy to raise
the limit.
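
As a rough check (a sketch using the textbook balls-into-bins bound,
nothing rhashtable-specific), lg n / lg lg n stays in single digits even
for a table that fills all memory, which leaves headroom below 16 for
the constant factor:

#include <math.h>
#include <stdio.h>

int main(void)
{
	int bits;

	/* n = 2^bits entries; print lg(n) / lg(lg(n)) */
	for (bits = 20; bits <= 48; bits += 4)
		printf("n = 2^%d: %.1f\n", bits, bits / log2(bits));
	return 0;
}

This rises only from about 4.6 at 2^20 entries to about 8.6 at 2^48.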

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/1] Rosebush, a new hash table
  2024-02-25  5:01       ` Matthew Wilcox
  2024-02-25  5:32         ` Herbert Xu
@ 2024-02-25  5:51         ` Kent Overstreet
  2024-02-25  5:53           ` Herbert Xu
  1 sibling, 1 reply; 19+ messages in thread
From: Kent Overstreet @ 2024-02-25  5:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Laight, 'Herbert Xu',
	linux-kernel, Thomas Graf, netdev, linux-fsdevel, maple-tree,
	rcu

On Sun, Feb 25, 2024 at 05:01:19AM +0000, Matthew Wilcox wrote:
> On Sat, Feb 24, 2024 at 10:18:31PM -0500, Kent Overstreet wrote:
> > On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote:
> > > I remember playing around with the elf symbol table for a browser
> > > and all its shared libraries.
> > > While the hash function is pretty trivial, it really didn't matter
> > > whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash
> > > chains were always long.
> > 
> > that's a pretty bad hash, even golden ratio hash would be better, but
> > still bad; you really should be using at least jhash.
> 
> There's a "fun" effect; essentially the "biased observer" effect which
> leads students to erroneously conclude that the majority of classes are
> oversubscribed.  As somebody observed in this thread, for some usecases
> you only look up hashes which actually exist.
> 
> > Take a trivial example where you have four entries unevenly distributed
> between two buckets, three in one bucket and one in the other.  Now 3/4
> of your lookups hit in one bucket and 1/4 in the other bucket.
> Obviously it's not as pronounced if you have 1000 buckets with 1000
> entries randomly distributed between the buckets.  But that distribution
> is not nearly as even as you might expect:
> 
> $ ./distrib
> 0: 362
> 1: 371
> 2: 193
> 3: 57
> 4: 13
> 5: 4
> 
> That's using lrand48() to decide which bucket to use, so not even a
> "quality of hash" problem, just a "your mathematical intuition may not
> be right here".

well, golden ratio hash - hash_32(i, 32)
0: 368
1: 264
2: 368
3: 0

but your distribution actually is accurate in general, golden ratio hash
is really nice for sequential integers. the actual problem with your test
is that you're testing 100% occupancy - no one does that.

75% occupancy, siphash:
0: 933
1: 60
2: 6
3: 1
4: 0

that looks about right to me.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/1] Rosebush, a new hash table
  2024-02-25  5:51         ` Kent Overstreet
@ 2024-02-25  5:53           ` Herbert Xu
  2024-02-25  6:14             ` Kent Overstreet
  0 siblings, 1 reply; 19+ messages in thread
From: Herbert Xu @ 2024-02-25  5:53 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, David Laight, linux-kernel, Thomas Graf, netdev,
	linux-fsdevel, maple-tree, rcu

On Sun, Feb 25, 2024 at 12:51:06AM -0500, Kent Overstreet wrote:
>
> but your distribution actually is accurate in general, golden ratio hash
> is really nice for sequential integers. the actual problem with your test
> is that you're testing 100% occupancy - no one does that.
> 
> 75% occupancy, siphash:
> 0: 933
> 1: 60
> 2: 6
> 3: 1
> 4: 0
> 
> that looks about right to me.

The point is that the worst-case length grows with the size of
the table so it won't always be 3.  You need to take into account
the largest table size that you will support.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/1] Rosebush, a new hash table
  2024-02-25  5:53           ` Herbert Xu
@ 2024-02-25  6:14             ` Kent Overstreet
  2024-02-25  6:17               ` Herbert Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Kent Overstreet @ 2024-02-25  6:14 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Matthew Wilcox, David Laight, linux-kernel, Thomas Graf, netdev,
	linux-fsdevel, maple-tree, rcu

On Sun, Feb 25, 2024 at 01:53:35PM +0800, Herbert Xu wrote:
> On Sun, Feb 25, 2024 at 12:51:06AM -0500, Kent Overstreet wrote:
> >
> > but your distribution actually is accurate in general, golden ratio hash
> > is really nice for sequential integers. the actual problem with your test
> > is that you're testing 100% occupancy - no one does that.
> > 
> > 75% occupancy, siphash:
> > 0: 933
> > 1: 60
> > 2: 6
> > 3: 1
> > 4: 0
> > 
> > that looks about right to me.
> 
> The point is that the worst-case length grows with the size of
> the table so it won't always be 3.  You need to take into account
> the largest table size that you will support.

ok, but - one million entries, siphash, 75% fill factor

0: 472053
1: 354786
2: 132663
3: 33267
4: 6218
5: 884
6: 110
7: 17
8: 2
9: 0

100 million:

0: 51342703
1: 34224025
2: 11413241
3: 2534946
4: 421816
5: 56271
6: 6346
7: 593
8: 56
9: 3
10: 0

it's a log curve - chain length of 16 means you picked a bad hash
function.
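
fwiw the test is just the earlier distrib.c scaled up, with lrand48()
standing in for siphash (a sketch):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 1000000

int bucket[NBUCKETS];
int freq[32];		/* far more headroom than lambda = 0.75 needs */

int main(void)
{
	int i, entries = NBUCKETS * 3 / 4;	/* 75% fill factor */

	for (i = 0; i < entries; i++)
		bucket[lrand48() % NBUCKETS]++;
	for (i = 0; i < NBUCKETS; i++)
		freq[bucket[i]]++;
	for (i = 0; i < 12; i++)
		printf("%d: %d\n", i, freq[i]);
	return 0;
}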

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/1] Rosebush, a new hash table
  2024-02-25  6:14             ` Kent Overstreet
@ 2024-02-25  6:17               ` Herbert Xu
  0 siblings, 0 replies; 19+ messages in thread
From: Herbert Xu @ 2024-02-25  6:17 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, David Laight, linux-kernel, Thomas Graf, netdev,
	linux-fsdevel, maple-tree, rcu

On Sun, Feb 25, 2024 at 01:14:39AM -0500, Kent Overstreet wrote:
>
> it's a log curve -

Yes it's O(log N/log log N).

> chain length of 16 means you picked a bad hash
> function.

Or that someone is trying to attack you.  The number 16 is the
cut-off where we decide that someone has discovered our hash
secret and can cause a particular chain to grow without bound.

At this point rhashtable will force a rehash with a different
secret.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/1] rosebush: Add new data structure
  2024-02-22 20:37 ` [PATCH 1/1] rosebush: Add new data structure Matthew Wilcox (Oracle)
@ 2024-02-25  6:38   ` Al Viro
  0 siblings, 0 replies; 19+ messages in thread
From: Al Viro @ 2024-02-25  6:38 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-kernel, Thomas Graf, Herbert Xu, netdev, linux-fsdevel,
	maple-tree, rcu

On Thu, Feb 22, 2024 at 08:37:24PM +0000, Matthew Wilcox (Oracle) wrote:
> Rosebush is a resizing hash table.  See
> Documentation/core-api/rosebush.rst for details.

Interesting...  A few obvious questions wrt dcache:
	* how do the locks in that thing nest wrt ->d_lock?  Inside?
	* same for rename_lock (on rename dentries change names/parents/hash
values).
	* we would be really forced to maintain DCACHE_UNHASHED in ->d_flags;
it's not impossible to do, but it will take care (protection of ->d_flags has
several dark corners).
	* the cost of d_drop() goes up.  Might or might not get painful.
	* if the bucket locks nest inside ->d_lock, we'd better look out
for the latter getting held for longer.  Might get unpleasant.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 0/1] Rosebush, a new hash table
  2024-02-25  3:18     ` Kent Overstreet
  2024-02-25  5:01       ` Matthew Wilcox
@ 2024-02-25 14:47       ` David Laight
  2024-02-25 21:48         ` Kent Overstreet
  1 sibling, 1 reply; 19+ messages in thread
From: David Laight @ 2024-02-25 14:47 UTC (permalink / raw)
  To: 'Kent Overstreet'
  Cc: 'Herbert Xu', Matthew Wilcox (Oracle),
	linux-kernel, Thomas Graf, netdev, linux-fsdevel, maple-tree,
	rcu

From: Kent Overstreet
> Sent: 25 February 2024 03:19
..
> when I implemented cuckoo (which is more obviously sensitive to a weak
> hash function), I had to go with siphash, even jhash wasn't giving me
> > great results. and looking at the code it's not hard to see why, it's all
> adds, and the rotates are byte aligned... you want mixed adds and xors
> and the rotates to be more prime-ish.
> 
> right idea, just old...
> 
> what would be ideal is something more like siphash, but with fewer
> rounds, so same number of instructions as jhash. xxhash might fit the
> bill, I haven't looked at the code yet...

There is likely to be a point where scanning a list of values
for the right hash value is faster than executing a hash function
that is good enough to spread them into separate buckets.

You don't want to scan a linked list because it has a horrid
cache footprint.
The locking is equally horrid - especially for remove.
Arrays of pointers are the way forward :-)
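
(For the sake of argument, a sketch of what that scan looks like with
the hash values packed into an array - a made-up struct, not the
rosebush layout:)

#include <stdint.h>
#include <stddef.h>

struct toy_bucket {
	unsigned int nr;	/* occupied slots */
	uint32_t hashes[14];	/* packed hash values, scanned linearly */
	void *objs[14];		/* the corresponding objects */
};

static void *toy_lookup(const struct toy_bucket *b, uint32_t hash)
{
	unsigned int i;

	/* the compares are cheap and fully speculable; only a matching
	 * hash costs a dependent load.  A real table would compare the
	 * full key here and keep scanning on a false match. */
	for (i = 0; i < b->nr; i++)
		if (b->hashes[i] == hash)
			return b->objs[i];
	return NULL;
}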

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/1] Rosebush, a new hash table
  2024-02-25 14:47       ` David Laight
@ 2024-02-25 21:48         ` Kent Overstreet
  0 siblings, 0 replies; 19+ messages in thread
From: Kent Overstreet @ 2024-02-25 21:48 UTC (permalink / raw)
  To: David Laight
  Cc: 'Herbert Xu', Matthew Wilcox (Oracle),
	linux-kernel, Thomas Graf, netdev, linux-fsdevel, maple-tree,
	rcu

On Sun, Feb 25, 2024 at 02:47:45PM +0000, David Laight wrote:
> From: Kent Overstreet
> > Sent: 25 February 2024 03:19
> ..
> > when I implemented cuckoo (which is more obviously sensitive to a weak
> > hash function), I had to go with siphash, even jhash wasn't giving me
> > > great results. and looking at the code it's not hard to see why, it's all
> > adds, and the rotates are byte aligned... you want mixed adds and xors
> > and the rotates to be more prime-ish.
> > 
> > right idea, just old...
> > 
> > what would be ideal is something more like siphash, but with fewer
> > rounds, so same number of instructions as jhash. xxhash might fit the
> > bill, I haven't looked at the code yet...
> 
> There is likely to be a point where scanning a list of values
> for the right hash value is faster than executing a hash function
> that is good enough to spread them into separate buckets.

Executing a bunch of ALU ops that parallelize nicely is _fast_
(it's all xors/adds/rotates, chacha20 is a good example of how
the modern stuff works).

Also, for 2-way cuckoo there's an xor trick so you don't need to compute
a second hash.
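
(that trick, roughly - a sketch of the standard partial-key scheme, not
any particular implementation: derive the alternate bucket from the
current bucket and a hash of the stored tag, so either bucket can find
its partner without recomputing the full hash)

#include <stdint.h>

static inline uint32_t tag_hash(uint8_t tag)
{
	return tag * 0x5bd1e995u;	/* any decent mixer will do */
}

static inline uint32_t alt_bucket(uint32_t bucket, uint8_t tag, uint32_t mask)
{
	/* symmetric: alt_bucket(alt_bucket(b)) == b for masked b, so a
	 * displaced entry can always locate its partner from the tag */
	return (bucket ^ tag_hash(tag)) & mask;
}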

But the key thing about cuckoo is that the loads on lookup _aren't
dependent_ - they can run in parallel. Every cache miss that goes all
the way to DRAM is stupidly expensive, remember - hundreds of
instructions, had you been able to keep the pipeline fed.

> You don't want to scan a linked list because it has a horrid
> cache footprint.
> The locking is equally horrid - especially for remove.
> Arrays of pointers are the way forward :-)

Well, maybe; I'm waiting to see fill factor numbers and benchmarks. Fill
factor was my concern when I was playing around with the concept; I used
it in a place where the hash table didn't need to be that big, and the
point was to avoid having to separately allocate the entries (and it
avoids the hassle of tombstone entries with linear/quadratic probing).

The fact that it requires a second dependent load, because buckets are
separately allocated, also looks like a big negative to me. I still
think a good lockless cuckoo hashing implementation ought to beat it.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-02-25 21:48 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-22 20:37 [PATCH 0/1] Rosebush, a new hash table Matthew Wilcox (Oracle)
2024-02-22 20:37 ` [PATCH 1/1] rosebush: Add new data structure Matthew Wilcox (Oracle)
2024-02-25  6:38   ` Al Viro
2024-02-23 11:37 ` [PATCH 0/1] Rosebush, a new hash table Peng Zhang
2024-02-23 13:55 ` Jason A. Donenfeld
2024-02-23 18:40 ` Kent Overstreet
2024-02-24  0:20 ` Herbert Xu
2024-02-24 22:10   ` David Laight
2024-02-25  0:50     ` Herbert Xu
2024-02-25  3:20       ` Kent Overstreet
2024-02-25  3:18     ` Kent Overstreet
2024-02-25  5:01       ` Matthew Wilcox
2024-02-25  5:32         ` Herbert Xu
2024-02-25  5:51         ` Kent Overstreet
2024-02-25  5:53           ` Herbert Xu
2024-02-25  6:14             ` Kent Overstreet
2024-02-25  6:17               ` Herbert Xu
2024-02-25 14:47       ` David Laight
2024-02-25 21:48         ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).