linux-bcachefs.vger.kernel.org archive mirror
* [PATCH 00/32] bcachefs - a new COW filesystem
@ 2023-05-09 16:56 Kent Overstreet
  2023-05-09 16:56 ` [PATCH 01/32] Compiler Attributes: add __flatten Kent Overstreet
                   ` (32 more replies)
  0 siblings, 33 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block, linux-mm, linux-bcachefs
  Cc: Kent Overstreet, viro, akpm, boqun.feng, brauner, hch, colyli,
	djwong, mingo, jack, axboe, willy, ojeda, ming.lei, ndesaulniers,
	peterz, phillip, urezki, longman, will

I'm submitting the bcachefs filesystem for review and inclusion.

Included in this patch series are all the non-fs/bcachefs/ patches. The
entire tree, based on v6.3, may be found at:

  http://evilpiepirate.org/git/bcachefs.git bcachefs-for-upstream

----------------------------------------------------------------

bcachefs overview, status:

Features:
 - too many to list

Known bugs:
 - too many to list

Status:
 - Snapshots have been declared stable; one serious bug report is still
   outstanding to look into, but most users report them working well.

   These are RW, btrfs-style snapshots, but with far better scalability:
   thanks to key-level versioning, sparse snapshots don't cause
   scalability issues.

 - Erasure coding is getting really close; I hope to have it ready for
   users to beat on by this summer. This is a novel RAID/erasure coding
   design with no write hole and no fragmentation of writes (unlike
   RAIDZ).

 - Tons of scalability work has been finished over the past year; users
   are running it on 100 TB filesystems without complaint, and we're
   waiting for the first 1 PB user. The next thing to address re:
   scalability is fsck/recovery memory usage.

 - Test infrastructure! A major project milestone; check out our test
   dashboard at
     https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs

Other project notes:

irc://irc.oftc.net/bcache is where most activity happens; I'm always
there, and most code review happens there - I find the conversational
format more productive.

------------------------------------------------

patches in this series:

Christopher James Halse Rogers (1):
  stacktrace: Export stack_trace_save_tsk

Daniel Hill (1):
  lib: add mean and variance module.

Dave Chinner (3):
  vfs: factor out inode hash head calculation
  hlist-bl: add hlist_bl_fake()
  vfs: inode cache conversion to hash-bl

Kent Overstreet (27):
  Compiler Attributes: add __flatten
  locking/lockdep: lock_class_is_held()
  locking/lockdep: lockdep_set_no_check_recursion()
  locking: SIX locks (shared/intent/exclusive)
  MAINTAINERS: Add entry for six locks
  sched: Add task_struct->faults_disabled_mapping
  mm: Bring back vmalloc_exec
  fs: factor out d_mark_tmpfile()
  block: Add some exports for bcachefs
  block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
  block: Bring back zero_fill_bio_iter
  block: Rework bio_for_each_segment_all()
  block: Rework bio_for_each_folio_all()
  block: Don't block on s_umount from __invalidate_super()
  bcache: move closures to lib/
  MAINTAINERS: Add entry for closures
  closures: closure_wait_event()
  closures: closure_nr_remaining()
  closures: Add a missing include
  iov_iter: copy_folio_from_iter_atomic()
  MAINTAINERS: Add entry for generic-radix-tree
  lib/generic-radix-tree.c: Don't overflow in peek()
  lib/generic-radix-tree.c: Add a missing include
  lib/generic-radix-tree.c: Add peek_prev()
  lib/string_helpers: string_get_size() now returns characters written
  lib: Export errname
  MAINTAINERS: Add entry for bcachefs

 MAINTAINERS                                   |  39 +
 block/bdev.c                                  |   2 +-
 block/bio.c                                   |  57 +-
 block/blk-core.c                              |   1 +
 block/blk-map.c                               |  38 +-
 block/blk.h                                   |   1 -
 block/bounce.c                                |  12 +-
 drivers/md/bcache/Kconfig                     |  10 +-
 drivers/md/bcache/Makefile                    |   4 +-
 drivers/md/bcache/bcache.h                    |   2 +-
 drivers/md/bcache/btree.c                     |   8 +-
 drivers/md/bcache/super.c                     |   1 -
 drivers/md/bcache/util.h                      |   3 +-
 drivers/md/dm-crypt.c                         |  10 +-
 drivers/md/raid1.c                            |   4 +-
 fs/btrfs/disk-io.c                            |   4 +-
 fs/btrfs/extent_io.c                          |  50 +-
 fs/btrfs/raid56.c                             |  14 +-
 fs/crypto/bio.c                               |   9 +-
 fs/dcache.c                                   |  12 +-
 fs/erofs/zdata.c                              |   4 +-
 fs/ext4/page-io.c                             |   8 +-
 fs/ext4/readpage.c                            |   4 +-
 fs/f2fs/data.c                                |  20 +-
 fs/gfs2/lops.c                                |  10 +-
 fs/gfs2/meta_io.c                             |   8 +-
 fs/inode.c                                    | 218 +++--
 fs/iomap/buffered-io.c                        |  14 +-
 fs/mpage.c                                    |   4 +-
 fs/squashfs/block.c                           |  48 +-
 fs/squashfs/lz4_wrapper.c                     |  17 +-
 fs/squashfs/lzo_wrapper.c                     |  17 +-
 fs/squashfs/xz_wrapper.c                      |  19 +-
 fs/squashfs/zlib_wrapper.c                    |  18 +-
 fs/squashfs/zstd_wrapper.c                    |  19 +-
 fs/super.c                                    |  40 +-
 fs/verity/verify.c                            |   9 +-
 include/linux/bio.h                           | 132 +--
 include/linux/blkdev.h                        |   1 +
 include/linux/bvec.h                          |  70 +-
 .../md/bcache => include/linux}/closure.h     |  46 +-
 include/linux/compiler_attributes.h           |   5 +
 include/linux/dcache.h                        |   1 +
 include/linux/fs.h                            |  10 +-
 include/linux/generic-radix-tree.h            |  68 +-
 include/linux/list_bl.h                       |  22 +
 include/linux/lockdep.h                       |  10 +
 include/linux/lockdep_types.h                 |   2 +-
 include/linux/mean_and_variance.h             | 219 +++++
 include/linux/sched.h                         |   1 +
 include/linux/six.h                           | 210 +++++
 include/linux/string_helpers.h                |   4 +-
 include/linux/uio.h                           |   2 +
 include/linux/vmalloc.h                       |   1 +
 init/init_task.c                              |   1 +
 kernel/Kconfig.locks                          |   3 +
 kernel/locking/Makefile                       |   1 +
 kernel/locking/lockdep.c                      |  46 ++
 kernel/locking/six.c                          | 779 ++++++++++++++++++
 kernel/module/main.c                          |   4 +-
 kernel/stacktrace.c                           |   2 +
 lib/Kconfig                                   |   3 +
 lib/Kconfig.debug                             |  18 +
 lib/Makefile                                  |   2 +
 {drivers/md/bcache => lib}/closure.c          |  36 +-
 lib/errname.c                                 |   1 +
 lib/generic-radix-tree.c                      |  76 +-
 lib/iov_iter.c                                |  53 +-
 lib/math/Kconfig                              |   3 +
 lib/math/Makefile                             |   2 +
 lib/math/mean_and_variance.c                  | 136 +++
 lib/math/mean_and_variance_test.c             | 155 ++++
 lib/string_helpers.c                          |   8 +-
 mm/nommu.c                                    |  18 +
 mm/vmalloc.c                                  |  21 +
 75 files changed, 2485 insertions(+), 445 deletions(-)
 rename {drivers/md/bcache => include/linux}/closure.h (93%)
 create mode 100644 include/linux/mean_and_variance.h
 create mode 100644 include/linux/six.h
 create mode 100644 kernel/locking/six.c
 rename {drivers/md/bcache => lib}/closure.c (88%)
 create mode 100644 lib/math/mean_and_variance.c
 create mode 100644 lib/math/mean_and_variance_test.c

-- 
2.40.1



* [PATCH 01/32] Compiler Attributes: add __flatten
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 17:04   ` Miguel Ojeda
  2023-05-09 16:56 ` [PATCH 02/32] locking/lockdep: lock_class_is_held() Kent Overstreet
                   ` (31 subsequent siblings)
  32 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Miguel Ojeda, Nick Desaulniers, Kent Overstreet

From: Kent Overstreet <kent.overstreet@gmail.com>

This makes __attribute__((flatten)) available as __flatten, which is
used by bcachefs.
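
A purely illustrative sketch (function names made up): the attribute
asks the compiler to inline, where possible, every call made inside the
annotated function:

  static void example_helper(int *x)
  {
  	*x += 1;
  }

  /* every call inside this function is inlined into it, if possible */
  static void __flatten example_hot_path(int *x)
  {
  	example_helper(x);
  	example_helper(x);
  }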

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Miguel Ojeda <ojeda@kernel.org> (maintainer:COMPILER ATTRIBUTES)
Cc: Nick Desaulniers <ndesaulniers@google.com> (reviewer:COMPILER ATTRIBUTES)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/compiler_attributes.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/compiler_attributes.h b/include/linux/compiler_attributes.h
index e659cb6fde..e56793bc08 100644
--- a/include/linux/compiler_attributes.h
+++ b/include/linux/compiler_attributes.h
@@ -366,4 +366,9 @@
  */
 #define __fix_address noinline __noclone
 
+/*
+ *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-flatten-function-attribute
+ */
+#define __flatten __attribute__((flatten))
+
 #endif /* __LINUX_COMPILER_ATTRIBUTES_H */
-- 
2.40.1



* [PATCH 02/32] locking/lockdep: lock_class_is_held()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
  2023-05-09 16:56 ` [PATCH 01/32] Compiler Attributes: add __flatten Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 19:30   ` Peter Zijlstra
  2023-05-09 16:56 ` [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion() Kent Overstreet
                   ` (30 subsequent siblings)
  32 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

From: Kent Overstreet <kent.overstreet@gmail.com>

This patch adds lock_class_is_held(), which can be used to assert that
the current task does not hold any lock of a particular class.
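
A minimal sketch of the intended usage (the key name is hypothetical):

  static struct lock_class_key example_btree_node_key;

  static void example_assert_no_btree_node_locks_held(void)
  {
  	/* e.g. before blocking on IO: */
  	BUG_ON(lock_class_is_held(&example_btree_node_key));
  }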

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/lockdep.h  |  4 ++++
 kernel/locking/lockdep.c | 20 ++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 1023f349af..e858c288c7 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -339,6 +339,8 @@ extern void lock_unpin_lock(struct lockdep_map *lock, struct pin_cookie);
 #define lockdep_repin_lock(l,c)	lock_repin_lock(&(l)->dep_map, (c))
 #define lockdep_unpin_lock(l,c)	lock_unpin_lock(&(l)->dep_map, (c))
 
+int lock_class_is_held(struct lock_class_key *key);
+
 #else /* !CONFIG_LOCKDEP */
 
 static inline void lockdep_init_task(struct task_struct *task)
@@ -427,6 +429,8 @@ extern int lockdep_is_held(const void *);
 #define lockdep_repin_lock(l, c)		do { (void)(l); (void)(c); } while (0)
 #define lockdep_unpin_lock(l, c)		do { (void)(l); (void)(c); } while (0)
 
+static inline int lock_class_is_held(struct lock_class_key *key) { return 0; }
+
 #endif /* !LOCKDEP */
 
 enum xhlock_context_t {
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 50d4863974..e631464070 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -6487,6 +6487,26 @@ void debug_check_no_locks_held(void)
 }
 EXPORT_SYMBOL_GPL(debug_check_no_locks_held);
 
+#ifdef CONFIG_LOCKDEP
+int lock_class_is_held(struct lock_class_key *key)
+{
+	struct task_struct *curr = current;
+	struct held_lock *hlock;
+
+	if (unlikely(!debug_locks))
+		return 0;
+
+	for (hlock = curr->held_locks;
+	     hlock < curr->held_locks + curr->lockdep_depth;
+	     hlock++)
+		if (hlock->instance->key == key)
+			return 1;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(lock_class_is_held);
+#endif
+
 #ifdef __KERNEL__
 void debug_show_all_locks(void)
 {
-- 
2.40.1



* [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
  2023-05-09 16:56 ` [PATCH 01/32] Compiler Attributes: add __flatten Kent Overstreet
  2023-05-09 16:56 ` [PATCH 02/32] locking/lockdep: lock_class_is_held() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 19:31   ` Peter Zijlstra
  2023-05-09 16:56 ` [PATCH 04/32] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
                   ` (29 subsequent siblings)
  32 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Waiman Long, Boqun Feng

This adds a method to tell lockdep not to check lock ordering within a
lock class - but to still check lock ordering w.r.t. other lock types.

This is for bcachefs, where for btree node locks we have our own
deadlock avoidance strategy w.r.t. other btree node locks (cycle
detection), but we still want lockdep to check lock ordering w.r.t.
other lock types.
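
A hypothetical sketch (names made up): a lock class opts out of
same-class ordering checks after its lockdep map is initialized, while
ordering against other lock classes is still checked:

  static struct lock_class_key example_key;
  static struct lockdep_map example_map;

  static void example_init(void)
  {
  	lockdep_init_map(&example_map, "example_btree_node_lock",
  			 &example_key, 0);
  	/* our own cycle detection handles same-class ordering */
  	lockdep_set_no_check_recursion(&example_map);
  }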

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/lockdep.h       |  6 ++++++
 include/linux/lockdep_types.h |  2 +-
 kernel/locking/lockdep.c      | 26 ++++++++++++++++++++++++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index e858c288c7..f6cc8709e2 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -665,4 +665,10 @@ lockdep_rcu_suspicious(const char *file, const int line, const char *s)
 }
 #endif
 
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+void lockdep_set_no_check_recursion(struct lockdep_map *map);
+#else
+static inline void lockdep_set_no_check_recursion(struct lockdep_map *map) {}
+#endif
+
 #endif /* __LINUX_LOCKDEP_H */
diff --git a/include/linux/lockdep_types.h b/include/linux/lockdep_types.h
index d22430840b..506e769b4a 100644
--- a/include/linux/lockdep_types.h
+++ b/include/linux/lockdep_types.h
@@ -128,7 +128,7 @@ struct lock_class {
 	u8				wait_type_inner;
 	u8				wait_type_outer;
 	u8				lock_type;
-	/* u8				hole; */
+	u8				no_check_recursion;
 
 #ifdef CONFIG_LOCK_STAT
 	unsigned long			contention_point[LOCKSTAT_POINTS];
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index e631464070..f022b58dfa 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3024,6 +3024,9 @@ check_deadlock(struct task_struct *curr, struct held_lock *next)
 		if ((next->read == 2) && prev->read)
 			continue;
 
+		if (hlock_class(next)->no_check_recursion)
+			continue;
+
 		/*
 		 * We're holding the nest_lock, which serializes this lock's
 		 * nesting behaviour.
@@ -3085,6 +3088,10 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev,
 		return 2;
 	}
 
+	if (hlock_class(prev) == hlock_class(next) &&
+	    hlock_class(prev)->no_check_recursion)
+		return 2;
+
 	/*
 	 * Prove that the new <prev> -> <next> dependency would not
 	 * create a circular dependency in the graph. (We do this by
@@ -6620,3 +6627,22 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
 	warn_rcu_exit(rcu);
 }
 EXPORT_SYMBOL_GPL(lockdep_rcu_suspicious);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+void lockdep_set_no_check_recursion(struct lockdep_map *lock)
+{
+	struct lock_class *class = lock->class_cache[0];
+	unsigned long flags;
+
+	raw_local_irq_save(flags);
+	lockdep_recursion_inc();
+
+	if (!class)
+		class = register_lock_class(lock, 0, 0);
+	if (class)
+		class->no_check_recursion = true;
+	lockdep_recursion_finish();
+	raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lockdep_set_no_check_recursion);
+#endif
-- 
2.40.1



* [PATCH 04/32] locking: SIX locks (shared/intent/exclusive)
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (2 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-11 12:14   ` Jan Engelhardt
  2023-05-14 12:15   ` Jeff Layton
  2023-05-09 16:56 ` [PATCH 05/32] MAINTAINERS: Add entry for six locks Kent Overstreet
                   ` (28 subsequent siblings)
  32 siblings, 2 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

From: Kent Overstreet <kent.overstreet@gmail.com>

New lock for bcachefs, like read/write locks but with a third state,
intent.

Intent locks conflict with each other, but not with read locks; taking a
write lock requires first holding an intent lock.
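
For orientation, the basic usage pattern looks like this (a sketch
mirroring the header comment added below; foo->lock is a struct six_lock
initialized with six_lock_init(), and the should_sleep_fn callback is
left NULL):

  six_lock_read(&foo->lock, NULL, NULL);
  six_unlock_read(&foo->lock);

  /* a write lock requires holding an intent lock first: */
  six_lock_intent(&foo->lock, NULL, NULL);
  six_lock_write(&foo->lock, NULL, NULL);
  six_unlock_write(&foo->lock);
  six_unlock_intent(&foo->lock);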

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/six.h     | 210 +++++++++++
 kernel/Kconfig.locks    |   3 +
 kernel/locking/Makefile |   1 +
 kernel/locking/six.c    | 779 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 993 insertions(+)
 create mode 100644 include/linux/six.h
 create mode 100644 kernel/locking/six.c

diff --git a/include/linux/six.h b/include/linux/six.h
new file mode 100644
index 0000000000..41ddf63b74
--- /dev/null
+++ b/include/linux/six.h
@@ -0,0 +1,210 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_SIX_H
+#define _LINUX_SIX_H
+
+/*
+ * Shared/intent/exclusive locks: sleepable read/write locks, much like rw
+ * semaphores, except with a third intermediate state, intent. Basic operations
+ * are:
+ *
+ * six_lock_read(&foo->lock);
+ * six_unlock_read(&foo->lock);
+ *
+ * six_lock_intent(&foo->lock);
+ * six_unlock_intent(&foo->lock);
+ *
+ * six_lock_write(&foo->lock);
+ * six_unlock_write(&foo->lock);
+ *
+ * Intent locks block other intent locks, but do not block read locks, and you
+ * must have an intent lock held before taking a write lock, like so:
+ *
+ * six_lock_intent(&foo->lock);
+ * six_lock_write(&foo->lock);
+ * six_unlock_write(&foo->lock);
+ * six_unlock_intent(&foo->lock);
+ *
+ * Other operations:
+ *
+ *   six_trylock_read()
+ *   six_trylock_intent()
+ *   six_trylock_write()
+ *
+ *   six_lock_downgrade():	convert from intent to read
+ *   six_lock_tryupgrade():	attempt to convert from read to intent
+ *
+ * Locks also embed a sequence number, which is incremented when the lock is
+ * locked or unlocked for write. The current sequence number can be grabbed
+ * while a lock is held from lock->state.seq; then, if you drop the lock you can
+ * use six_relock_(read|intent|write)(lock, seq) to attempt to retake the lock
+ * iff it hasn't been locked for write in the meantime.
+ *
+ * There are also operations that take the lock type as a parameter, where the
+ * type is one of SIX_LOCK_read, SIX_LOCK_intent, or SIX_LOCK_write:
+ *
+ *   six_lock_type(lock, type)
+ *   six_unlock_type(lock, type)
+ *   six_relock(lock, type, seq)
+ *   six_trylock_type(lock, type)
+ *   six_trylock_convert(lock, from, to)
+ *
+ * A lock may be held multiple times by the same thread (for read or intent,
+ * not write). However, the six locks code does _not_ implement the actual
+ * recursive checks itself - rather, if your code (e.g. btree iterator
+ * code) knows that the current thread already has a lock held, and for the
+ * correct type, six_lock_increment() may be used to bump up the counter for
+ * that type - the only effect is that one more call to unlock will be required
+ * before the lock is unlocked.
+ */
+
+#include <linux/lockdep.h>
+#include <linux/osq_lock.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+
+#define SIX_LOCK_SEPARATE_LOCKFNS
+
+union six_lock_state {
+	struct {
+		atomic64_t	counter;
+	};
+
+	struct {
+		u64		v;
+	};
+
+	struct {
+		/* for waitlist_bitnr() */
+		unsigned long	l;
+	};
+
+	struct {
+		unsigned	read_lock:27;
+		unsigned	write_locking:1;
+		unsigned	intent_lock:1;
+		unsigned	waiters:3;
+		/*
+		 * seq works much like in seqlocks: it's incremented every time
+		 * we lock and unlock for write.
+		 *
+		 * If it's odd write lock is held, even unlocked.
+		 *
+		 * Thus readers can unlock, and then lock again later iff it
+		 * hasn't been modified in the meantime.
+		 */
+		u32		seq;
+	};
+};
+
+enum six_lock_type {
+	SIX_LOCK_read,
+	SIX_LOCK_intent,
+	SIX_LOCK_write,
+};
+
+struct six_lock {
+	union six_lock_state	state;
+	unsigned		intent_lock_recurse;
+	struct task_struct	*owner;
+	struct optimistic_spin_queue osq;
+	unsigned __percpu	*readers;
+
+	raw_spinlock_t		wait_lock;
+	struct list_head	wait_list[2];
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+#endif
+};
+
+typedef int (*six_lock_should_sleep_fn)(struct six_lock *lock, void *);
+
+static __always_inline void __six_lock_init(struct six_lock *lock,
+					    const char *name,
+					    struct lock_class_key *key)
+{
+	atomic64_set(&lock->state.counter, 0);
+	raw_spin_lock_init(&lock->wait_lock);
+	INIT_LIST_HEAD(&lock->wait_list[SIX_LOCK_read]);
+	INIT_LIST_HEAD(&lock->wait_list[SIX_LOCK_intent]);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	debug_check_no_locks_freed((void *) lock, sizeof(*lock));
+	lockdep_init_map(&lock->dep_map, name, key, 0);
+#endif
+}
+
+#define six_lock_init(lock)						\
+do {									\
+	static struct lock_class_key __key;				\
+									\
+	__six_lock_init((lock), #lock, &__key);				\
+} while (0)
+
+#define __SIX_VAL(field, _v)	(((union six_lock_state) { .field = _v }).v)
+
+#define __SIX_LOCK(type)						\
+bool six_trylock_##type(struct six_lock *);				\
+bool six_relock_##type(struct six_lock *, u32);				\
+int six_lock_##type(struct six_lock *, six_lock_should_sleep_fn, void *);\
+void six_unlock_##type(struct six_lock *);
+
+__SIX_LOCK(read)
+__SIX_LOCK(intent)
+__SIX_LOCK(write)
+#undef __SIX_LOCK
+
+#define SIX_LOCK_DISPATCH(type, fn, ...)			\
+	switch (type) {						\
+	case SIX_LOCK_read:					\
+		return fn##_read(__VA_ARGS__);			\
+	case SIX_LOCK_intent:					\
+		return fn##_intent(__VA_ARGS__);		\
+	case SIX_LOCK_write:					\
+		return fn##_write(__VA_ARGS__);			\
+	default:						\
+		BUG();						\
+	}
+
+static inline bool six_trylock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	SIX_LOCK_DISPATCH(type, six_trylock, lock);
+}
+
+static inline bool six_relock_type(struct six_lock *lock, enum six_lock_type type,
+				   unsigned seq)
+{
+	SIX_LOCK_DISPATCH(type, six_relock, lock, seq);
+}
+
+static inline int six_lock_type(struct six_lock *lock, enum six_lock_type type,
+				six_lock_should_sleep_fn should_sleep_fn, void *p)
+{
+	SIX_LOCK_DISPATCH(type, six_lock, lock, should_sleep_fn, p);
+}
+
+static inline void six_unlock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	SIX_LOCK_DISPATCH(type, six_unlock, lock);
+}
+
+void six_lock_downgrade(struct six_lock *);
+bool six_lock_tryupgrade(struct six_lock *);
+bool six_trylock_convert(struct six_lock *, enum six_lock_type,
+			 enum six_lock_type);
+
+void six_lock_increment(struct six_lock *, enum six_lock_type);
+
+void six_lock_wakeup_all(struct six_lock *);
+
+void six_lock_pcpu_free_rcu(struct six_lock *);
+void six_lock_pcpu_free(struct six_lock *);
+void six_lock_pcpu_alloc(struct six_lock *);
+
+struct six_lock_count {
+	unsigned read;
+	unsigned intent;
+};
+
+struct six_lock_count six_lock_counts(struct six_lock *);
+
+#endif /* _LINUX_SIX_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 4198f0273e..b2abd9a5d9 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -259,3 +259,6 @@ config ARCH_HAS_MMIOWB
 config MMIOWB
 	def_bool y if ARCH_HAS_MMIOWB
 	depends on SMP
+
+config SIXLOCKS
+	bool
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 0db4093d17..a095dbbf01 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -32,3 +32,4 @@ obj-$(CONFIG_QUEUED_RWLOCKS) += qrwlock.o
 obj-$(CONFIG_LOCK_TORTURE_TEST) += locktorture.o
 obj-$(CONFIG_WW_MUTEX_SELFTEST) += test-ww_mutex.o
 obj-$(CONFIG_LOCK_EVENT_COUNTS) += lock_events.o
+obj-$(CONFIG_SIXLOCKS) += six.o
diff --git a/kernel/locking/six.c b/kernel/locking/six.c
new file mode 100644
index 0000000000..5b2d92c6e9
--- /dev/null
+++ b/kernel/locking/six.c
@@ -0,0 +1,779 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/export.h>
+#include <linux/log2.h>
+#include <linux/percpu.h>
+#include <linux/preempt.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/sched/rt.h>
+#include <linux/six.h>
+#include <linux/slab.h>
+
+#ifdef DEBUG
+#define EBUG_ON(cond)		BUG_ON(cond)
+#else
+#define EBUG_ON(cond)		do {} while (0)
+#endif
+
+#define six_acquire(l, t)	lock_acquire(l, 0, t, 0, 0, NULL, _RET_IP_)
+#define six_release(l)		lock_release(l, _RET_IP_)
+
+struct six_lock_vals {
+	/* Value we add to the lock in order to take the lock: */
+	u64			lock_val;
+
+	/* If the lock has this value (used as a mask), taking the lock fails: */
+	u64			lock_fail;
+
+	/* Value we add to the lock in order to release the lock: */
+	u64			unlock_val;
+
+	/* Mask that indicates lock is held for this type: */
+	u64			held_mask;
+
+	/* Waitlist we wakeup when releasing the lock: */
+	enum six_lock_type	unlock_wakeup;
+};
+
+#define __SIX_LOCK_HELD_read	__SIX_VAL(read_lock, ~0)
+#define __SIX_LOCK_HELD_intent	__SIX_VAL(intent_lock, ~0)
+#define __SIX_LOCK_HELD_write	__SIX_VAL(seq, 1)
+
+#define LOCK_VALS {							\
+	[SIX_LOCK_read] = {						\
+		.lock_val	= __SIX_VAL(read_lock, 1),		\
+		.lock_fail	= __SIX_LOCK_HELD_write + __SIX_VAL(write_locking, 1),\
+		.unlock_val	= -__SIX_VAL(read_lock, 1),		\
+		.held_mask	= __SIX_LOCK_HELD_read,			\
+		.unlock_wakeup	= SIX_LOCK_write,			\
+	},								\
+	[SIX_LOCK_intent] = {						\
+		.lock_val	= __SIX_VAL(intent_lock, 1),		\
+		.lock_fail	= __SIX_LOCK_HELD_intent,		\
+		.unlock_val	= -__SIX_VAL(intent_lock, 1),		\
+		.held_mask	= __SIX_LOCK_HELD_intent,		\
+		.unlock_wakeup	= SIX_LOCK_intent,			\
+	},								\
+	[SIX_LOCK_write] = {						\
+		.lock_val	= __SIX_VAL(seq, 1),			\
+		.lock_fail	= __SIX_LOCK_HELD_read,			\
+		.unlock_val	= __SIX_VAL(seq, 1),			\
+		.held_mask	= __SIX_LOCK_HELD_write,		\
+		.unlock_wakeup	= SIX_LOCK_read,			\
+	},								\
+}
+
+static inline void six_set_owner(struct six_lock *lock, enum six_lock_type type,
+				 union six_lock_state old)
+{
+	if (type != SIX_LOCK_intent)
+		return;
+
+	if (!old.intent_lock) {
+		EBUG_ON(lock->owner);
+		lock->owner = current;
+	} else {
+		EBUG_ON(lock->owner != current);
+	}
+}
+
+static inline unsigned pcpu_read_count(struct six_lock *lock)
+{
+	unsigned read_count = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		read_count += *per_cpu_ptr(lock->readers, cpu);
+	return read_count;
+}
+
+struct six_lock_waiter {
+	struct list_head	list;
+	struct task_struct	*task;
+};
+
+/* This is probably up there with the more evil things I've done */
+#define waitlist_bitnr(id) ilog2((((union six_lock_state) { .waiters = 1 << (id) }).l))
+
+static inline void six_lock_wakeup(struct six_lock *lock,
+				   union six_lock_state state,
+				   unsigned waitlist_id)
+{
+	if (waitlist_id == SIX_LOCK_write) {
+		if (state.write_locking && !state.read_lock) {
+			struct task_struct *p = READ_ONCE(lock->owner);
+			if (p)
+				wake_up_process(p);
+		}
+	} else {
+		struct list_head *wait_list = &lock->wait_list[waitlist_id];
+		struct six_lock_waiter *w, *next;
+
+		if (!(state.waiters & (1 << waitlist_id)))
+			return;
+
+		clear_bit(waitlist_bitnr(waitlist_id),
+			  (unsigned long *) &lock->state.v);
+
+		raw_spin_lock(&lock->wait_lock);
+
+		list_for_each_entry_safe(w, next, wait_list, list) {
+			list_del_init(&w->list);
+
+			if (wake_up_process(w->task) &&
+			    waitlist_id != SIX_LOCK_read) {
+				if (!list_empty(wait_list))
+					set_bit(waitlist_bitnr(waitlist_id),
+						(unsigned long *) &lock->state.v);
+				break;
+			}
+		}
+
+		raw_spin_unlock(&lock->wait_lock);
+	}
+}
+
+static __always_inline bool do_six_trylock_type(struct six_lock *lock,
+						enum six_lock_type type,
+						bool try)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old, new;
+	bool ret;
+	u64 v;
+
+	EBUG_ON(type == SIX_LOCK_write && lock->owner != current);
+	EBUG_ON(type == SIX_LOCK_write && (lock->state.seq & 1));
+
+	EBUG_ON(type == SIX_LOCK_write && (try != !(lock->state.write_locking)));
+
+	/*
+	 * Percpu reader mode:
+	 *
+	 * The basic idea behind this algorithm is that you can implement a lock
+	 * between two threads without any atomics, just memory barriers:
+	 *
+	 * For two threads you'll need two variables, one variable for "thread a
+	 * has the lock" and another for "thread b has the lock".
+	 *
+	 * To take the lock, a thread sets its variable indicating that it holds
+	 * the lock, then issues a full memory barrier, then reads from the
+	 * other thread's variable to check if the other thread thinks it has
+	 * the lock. If we raced, we backoff and retry/sleep.
+	 */
+
+	if (type == SIX_LOCK_read && lock->readers) {
+retry:
+		preempt_disable();
+		this_cpu_inc(*lock->readers); /* signal that we own lock */
+
+		smp_mb();
+
+		old.v = READ_ONCE(lock->state.v);
+		ret = !(old.v & l[type].lock_fail);
+
+		this_cpu_sub(*lock->readers, !ret);
+		preempt_enable();
+
+		/*
+		 * If we failed because a writer was trying to take the
+		 * lock, issue a wakeup because we might have caused a
+		 * spurious trylock failure:
+		 */
+		if (old.write_locking) {
+			struct task_struct *p = READ_ONCE(lock->owner);
+
+			if (p)
+				wake_up_process(p);
+		}
+
+		/*
+		 * If we failed from the lock path and the waiting bit wasn't
+		 * set, set it:
+		 */
+		if (!try && !ret) {
+			v = old.v;
+
+			do {
+				new.v = old.v = v;
+
+				if (!(old.v & l[type].lock_fail))
+					goto retry;
+
+				if (new.waiters & (1 << type))
+					break;
+
+				new.waiters |= 1 << type;
+			} while ((v = atomic64_cmpxchg(&lock->state.counter,
+						       old.v, new.v)) != old.v);
+		}
+	} else if (type == SIX_LOCK_write && lock->readers) {
+		if (try) {
+			atomic64_add(__SIX_VAL(write_locking, 1),
+				     &lock->state.counter);
+			smp_mb__after_atomic();
+		}
+
+		ret = !pcpu_read_count(lock);
+
+		/*
+		 * On success, we increment lock->seq; also we clear
+		 * write_locking unless we failed from the lock path:
+		 */
+		v = 0;
+		if (ret)
+			v += __SIX_VAL(seq, 1);
+		if (ret || try)
+			v -= __SIX_VAL(write_locking, 1);
+
+		if (try && !ret) {
+			old.v = atomic64_add_return(v, &lock->state.counter);
+			six_lock_wakeup(lock, old, SIX_LOCK_read);
+		} else {
+			atomic64_add(v, &lock->state.counter);
+		}
+	} else {
+		v = READ_ONCE(lock->state.v);
+		do {
+			new.v = old.v = v;
+
+			if (!(old.v & l[type].lock_fail)) {
+				new.v += l[type].lock_val;
+
+				if (type == SIX_LOCK_write)
+					new.write_locking = 0;
+			} else if (!try && type != SIX_LOCK_write &&
+				   !(new.waiters & (1 << type)))
+				new.waiters |= 1 << type;
+			else
+				break; /* waiting bit already set */
+		} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+					old.v, new.v)) != old.v);
+
+		ret = !(old.v & l[type].lock_fail);
+
+		EBUG_ON(ret && !(lock->state.v & l[type].held_mask));
+	}
+
+	if (ret)
+		six_set_owner(lock, type, old);
+
+	EBUG_ON(type == SIX_LOCK_write && (try || ret) && (lock->state.write_locking));
+
+	return ret;
+}
+
+__always_inline __flatten
+static bool __six_trylock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	if (!do_six_trylock_type(lock, type, true))
+		return false;
+
+	if (type != SIX_LOCK_write)
+		six_acquire(&lock->dep_map, 1);
+	return true;
+}
+
+__always_inline __flatten
+static bool __six_relock_type(struct six_lock *lock, enum six_lock_type type,
+			      unsigned seq)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old;
+	u64 v;
+
+	EBUG_ON(type == SIX_LOCK_write);
+
+	if (type == SIX_LOCK_read &&
+	    lock->readers) {
+		bool ret;
+
+		preempt_disable();
+		this_cpu_inc(*lock->readers);
+
+		smp_mb();
+
+		old.v = READ_ONCE(lock->state.v);
+		ret = !(old.v & l[type].lock_fail) && old.seq == seq;
+
+		this_cpu_sub(*lock->readers, !ret);
+		preempt_enable();
+
+		/*
+		 * Similar to the lock path, we may have caused a spurious write
+		 * lock fail and need to issue a wakeup:
+		 */
+		if (old.write_locking) {
+			struct task_struct *p = READ_ONCE(lock->owner);
+
+			if (p)
+				wake_up_process(p);
+		}
+
+		if (ret)
+			six_acquire(&lock->dep_map, 1);
+
+		return ret;
+	}
+
+	v = READ_ONCE(lock->state.v);
+	do {
+		old.v = v;
+
+		if (old.seq != seq || old.v & l[type].lock_fail)
+			return false;
+	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+				old.v,
+				old.v + l[type].lock_val)) != old.v);
+
+	six_set_owner(lock, type, old);
+	if (type != SIX_LOCK_write)
+		six_acquire(&lock->dep_map, 1);
+	return true;
+}
+
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER
+
+static inline int six_can_spin_on_owner(struct six_lock *lock)
+{
+	struct task_struct *owner;
+	int retval = 1;
+
+	if (need_resched())
+		return 0;
+
+	rcu_read_lock();
+	owner = READ_ONCE(lock->owner);
+	if (owner)
+		retval = owner->on_cpu;
+	rcu_read_unlock();
+	/*
+	 * if lock->owner is not set, the mutex owner may have just acquired
+	 * it and not set the owner yet or the mutex has been released.
+	 */
+	return retval;
+}
+
+static inline bool six_spin_on_owner(struct six_lock *lock,
+				     struct task_struct *owner)
+{
+	bool ret = true;
+
+	rcu_read_lock();
+	while (lock->owner == owner) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking lock->owner still matches owner. If that fails,
+		 * owner might point to freed memory. If it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+
+		if (!owner->on_cpu || need_resched()) {
+			ret = false;
+			break;
+		}
+
+		cpu_relax();
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+{
+	struct task_struct *task = current;
+
+	if (type == SIX_LOCK_write)
+		return false;
+
+	preempt_disable();
+	if (!six_can_spin_on_owner(lock))
+		goto fail;
+
+	if (!osq_lock(&lock->osq))
+		goto fail;
+
+	while (1) {
+		struct task_struct *owner;
+
+		/*
+		 * If there's an owner, wait for it to either
+		 * release the lock or go to sleep.
+		 */
+		owner = READ_ONCE(lock->owner);
+		if (owner && !six_spin_on_owner(lock, owner))
+			break;
+
+		if (do_six_trylock_type(lock, type, false)) {
+			osq_unlock(&lock->osq);
+			preempt_enable();
+			return true;
+		}
+
+		/*
+		 * When there's no owner, we might have preempted between the
+		 * owner acquiring the lock and setting the owner field. If
+		 * we're an RT task that will live-lock because we won't let
+		 * the owner complete.
+		 */
+		if (!owner && (need_resched() || rt_task(task)))
+			break;
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		cpu_relax();
+	}
+
+	osq_unlock(&lock->osq);
+fail:
+	preempt_enable();
+
+	/*
+	 * If we fell out of the spin path because of need_resched(),
+	 * reschedule now, before we try-lock again. This avoids getting
+	 * scheduled out right after we obtained the lock.
+	 */
+	if (need_resched())
+		schedule();
+
+	return false;
+}
+
+#else /* CONFIG_LOCK_SPIN_ON_OWNER */
+
+static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+{
+	return false;
+}
+
+#endif
+
+noinline
+static int __six_lock_type_slowpath(struct six_lock *lock, enum six_lock_type type,
+				    six_lock_should_sleep_fn should_sleep_fn, void *p)
+{
+	union six_lock_state old;
+	struct six_lock_waiter wait;
+	int ret = 0;
+
+	if (type == SIX_LOCK_write) {
+		EBUG_ON(lock->state.write_locking);
+		atomic64_add(__SIX_VAL(write_locking, 1), &lock->state.counter);
+		smp_mb__after_atomic();
+	}
+
+	ret = should_sleep_fn ? should_sleep_fn(lock, p) : 0;
+	if (ret)
+		goto out_before_sleep;
+
+	if (six_optimistic_spin(lock, type))
+		goto out_before_sleep;
+
+	lock_contended(&lock->dep_map, _RET_IP_);
+
+	INIT_LIST_HEAD(&wait.list);
+	wait.task = current;
+
+	while (1) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (type == SIX_LOCK_write)
+			EBUG_ON(lock->owner != current);
+		else if (list_empty_careful(&wait.list)) {
+			raw_spin_lock(&lock->wait_lock);
+			list_add_tail(&wait.list, &lock->wait_list[type]);
+			raw_spin_unlock(&lock->wait_lock);
+		}
+
+		if (do_six_trylock_type(lock, type, false))
+			break;
+
+		ret = should_sleep_fn ? should_sleep_fn(lock, p) : 0;
+		if (ret)
+			break;
+
+		schedule();
+	}
+
+	__set_current_state(TASK_RUNNING);
+
+	if (!list_empty_careful(&wait.list)) {
+		raw_spin_lock(&lock->wait_lock);
+		list_del_init(&wait.list);
+		raw_spin_unlock(&lock->wait_lock);
+	}
+out_before_sleep:
+	if (ret && type == SIX_LOCK_write) {
+		old.v = atomic64_sub_return(__SIX_VAL(write_locking, 1),
+					    &lock->state.counter);
+		six_lock_wakeup(lock, old, SIX_LOCK_read);
+	}
+
+	return ret;
+}
+
+__always_inline
+static int __six_lock_type(struct six_lock *lock, enum six_lock_type type,
+			   six_lock_should_sleep_fn should_sleep_fn, void *p)
+{
+	int ret;
+
+	if (type != SIX_LOCK_write)
+		six_acquire(&lock->dep_map, 0);
+
+	ret = do_six_trylock_type(lock, type, true) ? 0
+		: __six_lock_type_slowpath(lock, type, should_sleep_fn, p);
+
+	if (ret && type != SIX_LOCK_write)
+		six_release(&lock->dep_map);
+	if (!ret)
+		lock_acquired(&lock->dep_map, _RET_IP_);
+
+	return ret;
+}
+
+__always_inline __flatten
+static void __six_unlock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state state;
+
+	EBUG_ON(type == SIX_LOCK_write &&
+		!(lock->state.v & __SIX_LOCK_HELD_intent));
+
+	if (type != SIX_LOCK_write)
+		six_release(&lock->dep_map);
+
+	if (type == SIX_LOCK_intent) {
+		EBUG_ON(lock->owner != current);
+
+		if (lock->intent_lock_recurse) {
+			--lock->intent_lock_recurse;
+			return;
+		}
+
+		lock->owner = NULL;
+	}
+
+	if (type == SIX_LOCK_read &&
+	    lock->readers) {
+		smp_mb(); /* unlock barrier */
+		this_cpu_dec(*lock->readers);
+		smp_mb(); /* between unlocking and checking for waiters */
+		state.v = READ_ONCE(lock->state.v);
+	} else {
+		EBUG_ON(!(lock->state.v & l[type].held_mask));
+		state.v = atomic64_add_return_release(l[type].unlock_val,
+						      &lock->state.counter);
+	}
+
+	six_lock_wakeup(lock, state, l[type].unlock_wakeup);
+}
+
+#define __SIX_LOCK(type)						\
+bool six_trylock_##type(struct six_lock *lock)				\
+{									\
+	return __six_trylock_type(lock, SIX_LOCK_##type);		\
+}									\
+EXPORT_SYMBOL_GPL(six_trylock_##type);					\
+									\
+bool six_relock_##type(struct six_lock *lock, u32 seq)			\
+{									\
+	return __six_relock_type(lock, SIX_LOCK_##type, seq);		\
+}									\
+EXPORT_SYMBOL_GPL(six_relock_##type);					\
+									\
+int six_lock_##type(struct six_lock *lock,				\
+		    six_lock_should_sleep_fn should_sleep_fn, void *p)	\
+{									\
+	return __six_lock_type(lock, SIX_LOCK_##type, should_sleep_fn, p);\
+}									\
+EXPORT_SYMBOL_GPL(six_lock_##type);					\
+									\
+void six_unlock_##type(struct six_lock *lock)				\
+{									\
+	__six_unlock_type(lock, SIX_LOCK_##type);			\
+}									\
+EXPORT_SYMBOL_GPL(six_unlock_##type);
+
+__SIX_LOCK(read)
+__SIX_LOCK(intent)
+__SIX_LOCK(write)
+
+#undef __SIX_LOCK
+
+/* Convert from intent to read: */
+void six_lock_downgrade(struct six_lock *lock)
+{
+	six_lock_increment(lock, SIX_LOCK_read);
+	six_unlock_intent(lock);
+}
+EXPORT_SYMBOL_GPL(six_lock_downgrade);
+
+bool six_lock_tryupgrade(struct six_lock *lock)
+{
+	union six_lock_state old, new;
+	u64 v = READ_ONCE(lock->state.v);
+
+	do {
+		new.v = old.v = v;
+
+		if (new.intent_lock)
+			return false;
+
+		if (!lock->readers) {
+			EBUG_ON(!new.read_lock);
+			new.read_lock--;
+		}
+
+		new.intent_lock = 1;
+	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+				old.v, new.v)) != old.v);
+
+	if (lock->readers)
+		this_cpu_dec(*lock->readers);
+
+	six_set_owner(lock, SIX_LOCK_intent, old);
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(six_lock_tryupgrade);
+
+bool six_trylock_convert(struct six_lock *lock,
+			 enum six_lock_type from,
+			 enum six_lock_type to)
+{
+	EBUG_ON(to == SIX_LOCK_write || from == SIX_LOCK_write);
+
+	if (to == from)
+		return true;
+
+	if (to == SIX_LOCK_read) {
+		six_lock_downgrade(lock);
+		return true;
+	} else {
+		return six_lock_tryupgrade(lock);
+	}
+}
+EXPORT_SYMBOL_GPL(six_trylock_convert);
+
+/*
+ * Increment read/intent lock count, assuming we already have it read or intent
+ * locked:
+ */
+void six_lock_increment(struct six_lock *lock, enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+
+	six_acquire(&lock->dep_map, 0);
+
+	/* XXX: assert already locked, and that we don't overflow: */
+
+	switch (type) {
+	case SIX_LOCK_read:
+		if (lock->readers) {
+			this_cpu_inc(*lock->readers);
+		} else {
+			EBUG_ON(!lock->state.read_lock &&
+				!lock->state.intent_lock);
+			atomic64_add(l[type].lock_val, &lock->state.counter);
+		}
+		break;
+	case SIX_LOCK_intent:
+		EBUG_ON(!lock->state.intent_lock);
+		lock->intent_lock_recurse++;
+		break;
+	case SIX_LOCK_write:
+		BUG();
+		break;
+	}
+}
+EXPORT_SYMBOL_GPL(six_lock_increment);
+
+void six_lock_wakeup_all(struct six_lock *lock)
+{
+	struct six_lock_waiter *w;
+
+	raw_spin_lock(&lock->wait_lock);
+
+	list_for_each_entry(w, &lock->wait_list[0], list)
+		wake_up_process(w->task);
+	list_for_each_entry(w, &lock->wait_list[1], list)
+		wake_up_process(w->task);
+
+	raw_spin_unlock(&lock->wait_lock);
+}
+EXPORT_SYMBOL_GPL(six_lock_wakeup_all);
+
+struct free_pcpu_rcu {
+	struct rcu_head		rcu;
+	void __percpu		*p;
+};
+
+static void free_pcpu_rcu_fn(struct rcu_head *_rcu)
+{
+	struct free_pcpu_rcu *rcu =
+		container_of(_rcu, struct free_pcpu_rcu, rcu);
+
+	free_percpu(rcu->p);
+	kfree(rcu);
+}
+
+void six_lock_pcpu_free_rcu(struct six_lock *lock)
+{
+	struct free_pcpu_rcu *rcu = kzalloc(sizeof(*rcu), GFP_KERNEL);
+
+	if (!rcu)
+		return;
+
+	rcu->p = lock->readers;
+	lock->readers = NULL;
+
+	call_rcu(&rcu->rcu, free_pcpu_rcu_fn);
+}
+EXPORT_SYMBOL_GPL(six_lock_pcpu_free_rcu);
+
+void six_lock_pcpu_free(struct six_lock *lock)
+{
+	BUG_ON(lock->readers && pcpu_read_count(lock));
+	BUG_ON(lock->state.read_lock);
+
+	free_percpu(lock->readers);
+	lock->readers = NULL;
+}
+EXPORT_SYMBOL_GPL(six_lock_pcpu_free);
+
+void six_lock_pcpu_alloc(struct six_lock *lock)
+{
+#ifdef __KERNEL__
+	if (!lock->readers)
+		lock->readers = alloc_percpu(unsigned);
+#endif
+}
+EXPORT_SYMBOL_GPL(six_lock_pcpu_alloc);
+
+/*
+ * Returns lock held counts, for both read and intent
+ */
+struct six_lock_count six_lock_counts(struct six_lock *lock)
+{
+	struct six_lock_count ret = { 0, lock->state.intent_lock };
+
+	if (!lock->readers)
+		ret.read += lock->state.read_lock;
+	else {
+		int cpu;
+
+		for_each_possible_cpu(cpu)
+			ret.read += *per_cpu_ptr(lock->readers, cpu);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(six_lock_counts);
-- 
2.40.1



* [PATCH 05/32] MAINTAINERS: Add entry for six locks
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (3 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 04/32] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping Kent Overstreet
                   ` (27 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Kent Overstreet

SIX locks are a new locking primitive, shared/intent/exclusive,
currently used by bcachefs but available for other uses. Mark them as
maintained.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 MAINTAINERS | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index c6545eb541..3fc37de3d6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19166,6 +19166,14 @@ S:	Maintained
 W:	http://www.winischhofer.at/linuxsisusbvga.shtml
 F:	drivers/usb/misc/sisusbvga/
 
+SIX LOCKS
+M:	Kent Overstreet <kent.overstreet@linux.dev>
+L:	linux-bcachefs@vger.kernel.org
+S:	Supported
+C:	irc://irc.oftc.net/bcache
+F:	include/linux/six.h
+F:	kernel/locking/six.c
+
 SL28 CPLD MFD DRIVER
 M:	Michael Walle <michael@walle.cc>
 S:	Maintained
-- 
2.40.1



* [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (4 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 05/32] MAINTAINERS: Add entry for six locks Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-10  1:07   ` Jan Kara
  2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
                   ` (26 subsequent siblings)
  32 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet, Jan Kara, Darrick J . Wong

From: Kent Overstreet <kent.overstreet@gmail.com>

This is used by bcachefs to fix a page cache coherency issue with
O_DIRECT writes.

Also relevant: mapping->invalidate_lock, see below.

O_DIRECT writes (and other filesystem operations that modify file data
while bypassing the page cache) need to shoot down ranges of the page
cache - and additionally, need locking to prevent those pages from
being pulled back in.

But O_DIRECT writes invoke the page fault handler (via get_user_pages),
and the page fault handler will need to take that same lock - this is a
classic recursive deadlock if userspace has mmapped the file it's DIO
writing to and uses those pages as the buffer to write from, and it's a
lock ordering deadlock in general.

Thus we need a way to signal from the dio code to the page fault handler
when we already are holding the pagecache add lock on an address space -
this patch just adds a member to task_struct for this purpose. For now
only bcachefs is implementing this locking, though it may be moved out
of bcachefs and made available to other filesystems in the future.
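
To make the mechanism concrete, a rough sketch (the lock and the
surrounding code are illustrative, not the actual bcachefs code):

  /* DIO write path: after taking the filesystem's pagecache-add lock,
   * flag the mapping so our own page faults can detect the situation: */
  current->faults_disabled_mapping = mapping;
  /* ... pin user pages (get_user_pages() via the iov_iter), which may
   * fault on this same mapping ... */
  current->faults_disabled_mapping = NULL;

  /* filesystem fault path: */
  if (current->faults_disabled_mapping == mapping) {
  	/* we already hold the pagecache-add lock for this mapping;
  	 * don't take it again - use a non-deadlocking fallback instead */
  }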

---------------------------------

The closest current VFS equivalent is mapping->invalidate_lock, which
comes from XFS. However, it's not used by direct IO.  Instead, direct IO
paths shoot down the page cache twice - before starting the IO and at
the end, and they're still technically racy w.r.t. page cache coherency.

This is a more complete approach: in the future we might consider
replacing mapping->invalidate_lock with the bcachefs code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jan Kara <jack@suse.cz>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
---
 include/linux/sched.h | 1 +
 init/init_task.c      | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b..f2a56f64f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -869,6 +869,7 @@ struct task_struct {
 
 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;
+	struct address_space		*faults_disabled_mapping;
 
 	int				exit_state;
 	int				exit_code;
diff --git a/init/init_task.c b/init/init_task.c
index ff6c4b9bfe..f703116e05 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -85,6 +85,7 @@ struct task_struct init_task
 	.nr_cpus_allowed= NR_CPUS,
 	.mm		= NULL,
 	.active_mm	= &init_mm,
+	.faults_disabled_mapping = NULL,
 	.restart_block	= {
 		.fn = do_no_restart_syscall,
 	},
-- 
2.40.1



* [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (5 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 18:19   ` Lorenzo Stoakes
                     ` (4 more replies)
  2023-05-09 16:56 ` [PATCH 08/32] fs: factor out d_mark_tmpfile() Kent Overstreet
                   ` (25 subsequent siblings)
  32 siblings, 5 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, linux-mm

From: Kent Overstreet <kent.overstreet@gmail.com>

This is needed for bcachefs, which dynamically generates per-btree node
unpack functions.
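
For illustration only (generated_code and generated_code_len are
placeholders), the intended usage is along these lines:

  /* space for a small, dynamically generated function */
  void *buf = vmalloc_exec(PAGE_SIZE, GFP_KERNEL);

  if (!buf)
  	return -ENOMEM;

  memcpy(buf, generated_code, generated_code_len);	/* placeholders */
  /* ... then call into buf; free with vfree() when done ... */
  vfree(buf);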

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: linux-mm@kvack.org
---
 include/linux/vmalloc.h |  1 +
 kernel/module/main.c    |  4 +---
 mm/nommu.c              | 18 ++++++++++++++++++
 mm/vmalloc.c            | 21 +++++++++++++++++++++
 4 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 69250efa03..ff147fe115 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -145,6 +145,7 @@ extern void *vzalloc(unsigned long size) __alloc_size(1);
 extern void *vmalloc_user(unsigned long size) __alloc_size(1);
 extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
 extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
+extern void *vmalloc_exec(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
 extern void *vmalloc_32(unsigned long size) __alloc_size(1);
 extern void *vmalloc_32_user(unsigned long size) __alloc_size(1);
 extern void *__vmalloc(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index d3be89de70..9eaa89e84c 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1607,9 +1607,7 @@ static void dynamic_debug_remove(struct module *mod, struct _ddebug_info *dyndbg
 
 void * __weak module_alloc(unsigned long size)
 {
-	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-			GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
-			NUMA_NO_NODE, __builtin_return_address(0));
+	return vmalloc_exec(size, GFP_KERNEL);
 }
 
 bool __weak module_init_section(const char *name)
diff --git a/mm/nommu.c b/mm/nommu.c
index 57ba243c6a..8d9ab19e39 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -280,6 +280,24 @@ void *vzalloc_node(unsigned long size, int node)
 }
 EXPORT_SYMBOL(vzalloc_node);
 
+/**
+ *	vmalloc_exec  -  allocate virtually contiguous, executable memory
+ *	@size:		allocation size
+ *
+ *	Kernel-internal function to allocate enough pages to cover @size
+ *	from the page level allocator and map them into contiguous and
+ *	executable kernel virtual space.
+ *
+ *	For tight control over page level allocator and protection flags
+ *	use __vmalloc() instead.
+ */
+
+void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
+{
+	return __vmalloc(size, gfp_mask);
+}
+EXPORT_SYMBOL_GPL(vmalloc_exec);
+
 /**
  * vmalloc_32  -  allocate virtually contiguous memory (32bit addressable)
  *	@size:		allocation size
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 31ff782d36..2ebb9ea7f0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3401,6 +3401,27 @@ void *vzalloc_node(unsigned long size, int node)
 }
 EXPORT_SYMBOL(vzalloc_node);
 
+/**
+ * vmalloc_exec - allocate virtually contiguous, executable memory
+ * @size:	  allocation size
+ *
+ * Kernel-internal function to allocate enough pages to cover @size
+ * from the page level allocator and map them into contiguous and
+ * executable kernel virtual space.
+ *
+ * For tight control over page level allocator and protection flags
+ * use __vmalloc() instead.
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
+{
+	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
+			NUMA_NO_NODE, __builtin_return_address(0));
+}
+EXPORT_SYMBOL_GPL(vmalloc_exec);
+
 #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
 #define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
 #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
-- 
2.40.1



* [PATCH 08/32] fs: factor out d_mark_tmpfile()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (6 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 09/32] block: Add some exports for bcachefs Kent Overstreet
                   ` (24 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet, Alexander Viro, Christian Brauner

From: Kent Overstreet <kent.overstreet@gmail.com>

New helper for bcachefs: bcachefs doesn't want the
inode_dec_link_count() call that d_tmpfile() does; it handles i_nlink on
its own, atomically with other btree updates.
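
For illustration (a sketch, not the actual bcachefs code), such a
filesystem's ->tmpfile() would then end with:

  /* i_nlink is accounted for atomically with the filesystem's own
   * (btree) update, so skip d_tmpfile()'s inode_dec_link_count(): */
  d_mark_tmpfile(file, inode);
  d_instantiate(file->f_path.dentry, inode);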

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/dcache.c            | 12 ++++++++++--
 include/linux/dcache.h |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 52e6d5fdab..dbdafa2617 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3249,11 +3249,10 @@ void d_genocide(struct dentry *parent)
 
 EXPORT_SYMBOL(d_genocide);
 
-void d_tmpfile(struct file *file, struct inode *inode)
+void d_mark_tmpfile(struct file *file, struct inode *inode)
 {
 	struct dentry *dentry = file->f_path.dentry;
 
-	inode_dec_link_count(inode);
 	BUG_ON(dentry->d_name.name != dentry->d_iname ||
 		!hlist_unhashed(&dentry->d_u.d_alias) ||
 		!d_unlinked(dentry));
@@ -3263,6 +3262,15 @@ void d_tmpfile(struct file *file, struct inode *inode)
 				(unsigned long long)inode->i_ino);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dentry->d_parent->d_lock);
+}
+EXPORT_SYMBOL(d_mark_tmpfile);
+
+void d_tmpfile(struct file *file, struct inode *inode)
+{
+	struct dentry *dentry = file->f_path.dentry;
+
+	inode_dec_link_count(inode);
+	d_mark_tmpfile(file, inode);
 	d_instantiate(dentry, inode);
 }
 EXPORT_SYMBOL(d_tmpfile);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 6b351e009f..3da2f0545d 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -251,6 +251,7 @@ extern struct dentry * d_make_root(struct inode *);
 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
 
+extern void d_mark_tmpfile(struct file *, struct inode *);
 extern void d_tmpfile(struct file *, struct inode *);
 
 extern struct dentry *d_find_alias(struct inode *);
-- 
2.40.1



* [PATCH 09/32] block: Add some exports for bcachefs
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (7 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 08/32] fs: factor out d_mark_tmpfile() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 10/32] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset Kent Overstreet
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, linux-block, Jens Axboe, Kent Overstreet

From: Kent Overstreet <kent.overstreet@gmail.com>

 - bio_set_pages_dirty(), bio_check_pages_dirty() - dio path
 - blk_status_to_str() - error messages
 - bio_add_folio() - this should definitely be exported for everyone;
   it's the modern version of bio_add_page() (see the sketch below)
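
A minimal sketch of how an exported bio_add_folio() gets used when
building a bio from folios; 'bio', 'folio', 'len' and 'offset' are
placeholders supplied by the caller:

  if (!bio_add_folio(bio, folio, len, offset)) {
          /* bio is full: submit it and retry with a fresh bio */
          submit_bio(bio);
  }

The two dirty-page helpers follow the usual direct-IO pattern:
bio_set_pages_dirty() before submitting a read into user pages, and
bio_check_pages_dirty() in the completion path.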

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: linux-block@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 block/bio.c            | 3 +++
 block/blk-core.c       | 1 +
 block/blk.h            | 1 -
 include/linux/blkdev.h | 1 +
 4 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index fd11614bba..1e75840d17 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1159,6 +1159,7 @@ bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
 		return false;
 	return bio_add_page(bio, &folio->page, len, off) > 0;
 }
+EXPORT_SYMBOL(bio_add_folio);
 
 void __bio_release_pages(struct bio *bio, bool mark_dirty)
 {
@@ -1480,6 +1481,7 @@ void bio_set_pages_dirty(struct bio *bio)
 			set_page_dirty_lock(bvec->bv_page);
 	}
 }
+EXPORT_SYMBOL_GPL(bio_set_pages_dirty);
 
 /*
  * bio_check_pages_dirty() will check that all the BIO's pages are still dirty.
@@ -1539,6 +1541,7 @@ void bio_check_pages_dirty(struct bio *bio)
 	spin_unlock_irqrestore(&bio_dirty_lock, flags);
 	schedule_work(&bio_dirty_work);
 }
+EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
 
 static inline bool bio_remaining_done(struct bio *bio)
 {
diff --git a/block/blk-core.c b/block/blk-core.c
index 42926e6cb8..f19bcc684b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -205,6 +205,7 @@ const char *blk_status_to_str(blk_status_t status)
 		return "<null>";
 	return blk_errors[idx].name;
 }
+EXPORT_SYMBOL_GPL(blk_status_to_str);
 
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
diff --git a/block/blk.h b/block/blk.h
index cc4e8873df..cc04dc73e9 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -259,7 +259,6 @@ static inline void blk_integrity_del(struct gendisk *disk)
 
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
-const char *blk_status_to_str(blk_status_t status);
 
 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 		unsigned int nr_segs);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 941304f174..7cac183112 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -867,6 +867,7 @@ extern const char *blk_op_str(enum req_op op);
 
 int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
+const char *blk_status_to_str(blk_status_t status);
 
 /* only poll the hardware once, don't continue until a completion was found */
 #define BLK_POLL_ONESHOT		(1 << 0)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 10/32] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (8 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 09/32] block: Add some exports for bcachefs Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 11/32] block: Bring back zero_fill_bio_iter Kent Overstreet
                   ` (22 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Jens Axboe, linux-block

bio_iov_iter_get_pages() trims the IO based on the block size of the
block device the IO will be issued to.

However, bcachefs is a multi-device filesystem; when we're creating the
bio we don't yet know which block device the bio will be submitted to -
we have to handle the alignment checks elsewhere.

Thus, skipping the trim when bio->bi_bdev is unset is needed to avoid a
NULL pointer dereference.
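
A rough sketch of the resulting usage, with placeholder names
('nr_pages', 'iter', 'fs_block_size') - not actual bcachefs code: the
bio is allocated without a block device, and the filesystem checks
alignment itself once the pages have been added:

  struct bio *bio = bio_alloc_bioset(NULL, nr_pages, REQ_OP_WRITE,
                                     GFP_KERNEL, &fs_bio_set);
  int ret = bio_iov_iter_get_pages(bio, iter);

  if (!ret && (bio->bi_iter.bi_size & (fs_block_size - 1))) {
          /* misaligned (power-of-two block size assumed):
           * handled by the filesystem, not the block layer */
  }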

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
---
 block/bio.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 1e75840d17..e74a04ea14 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1245,7 +1245,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct page **pages = (struct page **)bv;
 	ssize_t size, left;
 	unsigned len, i = 0;
-	size_t offset, trim;
+	size_t offset;
 	int ret = 0;
 
 	/*
@@ -1274,10 +1274,12 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
-	trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
-	iov_iter_revert(iter, trim);
+	if (bio->bi_bdev) {
+		size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
+		iov_iter_revert(iter, trim);
+		size -= trim;
+	}
 
-	size -= trim;
 	if (unlikely(!size)) {
 		ret = -EFAULT;
 		goto out;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 11/32] block: Bring back zero_fill_bio_iter
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (9 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 10/32] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 12/32] block: Rework bio_for_each_segment_all() Kent Overstreet
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet, Jens Axboe, linux-block

From: Kent Overstreet <kent.overstreet@gmail.com>

This reverts the commit that deleted it; it's used by bcachefs.
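
A minimal sketch of why the iter variant is useful ('bytes_done' is a
placeholder): zero only the tail of a bio, starting from a saved
position, instead of the whole thing:

  struct bvec_iter iter = bio->bi_iter;

  bio_advance_iter(bio, &iter, bytes_done); /* skip what was written */
  zero_fill_bio_iter(bio, iter);            /* zero the remainder */

zero_fill_bio() is now just zero_fill_bio_iter(bio, bio->bi_iter), as
the diff shows.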

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
---
 block/bio.c         | 6 +++---
 include/linux/bio.h | 7 ++++++-
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index e74a04ea14..70b5c987bc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -606,15 +606,15 @@ struct bio *bio_kmalloc(unsigned short nr_vecs, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(bio_kmalloc);
 
-void zero_fill_bio(struct bio *bio)
+void zero_fill_bio_iter(struct bio *bio, struct bvec_iter start)
 {
 	struct bio_vec bv;
 	struct bvec_iter iter;
 
-	bio_for_each_segment(bv, bio, iter)
+	__bio_for_each_segment(bv, bio, iter, start)
 		memzero_bvec(&bv);
 }
-EXPORT_SYMBOL(zero_fill_bio);
+EXPORT_SYMBOL(zero_fill_bio_iter);
 
 /**
  * bio_truncate - truncate the bio to small size of @new_size
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d766be7152..3536f28c05 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -484,7 +484,12 @@ extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
 extern void bio_copy_data(struct bio *dst, struct bio *src);
 extern void bio_free_pages(struct bio *bio);
 void guard_bio_eod(struct bio *bio);
-void zero_fill_bio(struct bio *bio);
+void zero_fill_bio_iter(struct bio *bio, struct bvec_iter iter);
+
+static inline void zero_fill_bio(struct bio *bio)
+{
+	zero_fill_bio_iter(bio, bio->bi_iter);
+}
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 12/32] block: Rework bio_for_each_segment_all()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (10 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 11/32] block: Bring back zero_fill_bio_iter Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 13/32] block: Rework bio_for_each_folio_all() Kent Overstreet
                   ` (20 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Jens Axboe, linux-block, Ming Lei, Phillip Lougher

This patch reworks bio_for_each_segment_all() to be more in line with how
the other bio iterators work:

 - bio_iter_all_peek() now returns a synthesized bio_vec; we no longer
   stash one in the iterator and pass a pointer to it, which was a bad
   pattern. This makes it clearer what's a constructed value vs. a
   reference to something pre-existing, and it will also help with
   cleaning up and consolidating code with bio_for_each_folio_all().

 - We now provide bio_for_each_segment_all_continue(), for squashfs:
   this makes its code clearer (see the sketch below).
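
A minimal sketch of the continue variant, modelled on the squashfs
conversion in the diff ('buf', 'offset' and 'len' are placeholders):
initialize the iterator, skip 'offset' bytes, then walk the remaining
segments, each yielded by value:

  struct bvec_iter_all iter;
  struct bio_vec bv;

  bvec_iter_all_init(&iter);
  bio_iter_all_advance(bio, &iter, offset);

  bio_for_each_segment_all_continue(bv, bio, iter) {
          unsigned avail = min_t(unsigned, len, bv.bv_len);

          memcpy(buf, bvec_virt(&bv), avail);
          buf += avail;
          len -= avail;
          if (!len)
                  break;
  }

Plain bio_for_each_segment_all() callers change the same way: the loop
variable is now a struct bio_vec, so accesses become bv.bv_page rather
than bv->bv_page.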

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
---
 block/bio.c                | 38 ++++++++++++------------
 block/blk-map.c            | 38 ++++++++++++------------
 block/bounce.c             | 12 ++++----
 drivers/md/bcache/btree.c  |  8 ++---
 drivers/md/dm-crypt.c      | 10 +++----
 drivers/md/raid1.c         |  4 +--
 fs/btrfs/disk-io.c         |  4 +--
 fs/btrfs/extent_io.c       | 50 +++++++++++++++----------------
 fs/btrfs/raid56.c          | 14 ++++-----
 fs/erofs/zdata.c           |  4 +--
 fs/ext4/page-io.c          |  8 ++---
 fs/ext4/readpage.c         |  4 +--
 fs/f2fs/data.c             | 20 ++++++-------
 fs/gfs2/lops.c             | 10 +++----
 fs/gfs2/meta_io.c          |  8 ++---
 fs/mpage.c                 |  4 +--
 fs/squashfs/block.c        | 48 +++++++++++++++++-------------
 fs/squashfs/lz4_wrapper.c  | 17 ++++++-----
 fs/squashfs/lzo_wrapper.c  | 17 ++++++-----
 fs/squashfs/xz_wrapper.c   | 19 ++++++------
 fs/squashfs/zlib_wrapper.c | 18 ++++++-----
 fs/squashfs/zstd_wrapper.c | 19 ++++++------
 include/linux/bio.h        | 34 ++++++++++++++++-----
 include/linux/bvec.h       | 61 ++++++++++++++++++++++----------------
 24 files changed, 256 insertions(+), 213 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 70b5c987bc..f2845d4e47 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1163,13 +1163,13 @@ EXPORT_SYMBOL(bio_add_folio);
 
 void __bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	struct bvec_iter_all iter_all;
-	struct bio_vec *bvec;
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		if (mark_dirty && !PageCompound(bvec->bv_page))
-			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+	bio_for_each_segment_all(bvec, bio, iter) {
+		if (mark_dirty && !PageCompound(bvec.bv_page))
+			set_page_dirty_lock(bvec.bv_page);
+		put_page(bvec.bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1436,11 +1436,11 @@ EXPORT_SYMBOL(bio_copy_data);
 
 void bio_free_pages(struct bio *bio)
 {
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 
-	bio_for_each_segment_all(bvec, bio, iter_all)
-		__free_page(bvec->bv_page);
+	bio_for_each_segment_all(bvec, bio, iter)
+		__free_page(bvec.bv_page);
 }
 EXPORT_SYMBOL(bio_free_pages);
 
@@ -1475,12 +1475,12 @@ EXPORT_SYMBOL(bio_free_pages);
  */
 void bio_set_pages_dirty(struct bio *bio)
 {
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		if (!PageCompound(bvec->bv_page))
-			set_page_dirty_lock(bvec->bv_page);
+	bio_for_each_segment_all(bvec, bio, iter) {
+		if (!PageCompound(bvec.bv_page))
+			set_page_dirty_lock(bvec.bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(bio_set_pages_dirty);
@@ -1524,12 +1524,12 @@ static void bio_dirty_fn(struct work_struct *work)
 
 void bio_check_pages_dirty(struct bio *bio)
 {
-	struct bio_vec *bvec;
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 	unsigned long flags;
-	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		if (!PageDirty(bvec->bv_page) && !PageCompound(bvec->bv_page))
+	bio_for_each_segment_all(bvec, bio, iter) {
+		if (!PageDirty(bvec.bv_page) && !PageCompound(bvec.bv_page))
 			goto defer;
 	}
 
diff --git a/block/blk-map.c b/block/blk-map.c
index 9137d16cec..5774a9e467 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -46,21 +46,21 @@ static struct bio_map_data *bio_alloc_map_data(struct iov_iter *data,
  */
 static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
 {
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter_all bv_iter;
+	struct bio_vec bvec;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
+	bio_for_each_segment_all(bvec, bio, bv_iter) {
 		ssize_t ret;
 
-		ret = copy_page_from_iter(bvec->bv_page,
-					  bvec->bv_offset,
-					  bvec->bv_len,
+		ret = copy_page_from_iter(bvec.bv_page,
+					  bvec.bv_offset,
+					  bvec.bv_len,
 					  iter);
 
 		if (!iov_iter_count(iter))
 			break;
 
-		if (ret < bvec->bv_len)
+		if (ret < bvec.bv_len)
 			return -EFAULT;
 	}
 
@@ -77,21 +77,21 @@ static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
  */
 static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
 {
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter_all bv_iter;
+	struct bio_vec bvec;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
+	bio_for_each_segment_all(bvec, bio, bv_iter) {
 		ssize_t ret;
 
-		ret = copy_page_to_iter(bvec->bv_page,
-					bvec->bv_offset,
-					bvec->bv_len,
+		ret = copy_page_to_iter(bvec.bv_page,
+					bvec.bv_offset,
+					bvec.bv_len,
 					&iter);
 
 		if (!iov_iter_count(&iter))
 			break;
 
-		if (ret < bvec->bv_len)
+		if (ret < bvec.bv_len)
 			return -EFAULT;
 	}
 
@@ -442,12 +442,12 @@ static void bio_copy_kern_endio(struct bio *bio)
 static void bio_copy_kern_endio_read(struct bio *bio)
 {
 	char *p = bio->bi_private;
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		memcpy_from_bvec(p, bvec);
-		p += bvec->bv_len;
+	bio_for_each_segment_all(bvec, bio, iter) {
+		memcpy_from_bvec(p, &bvec);
+		p += bvec.bv_len;
 	}
 
 	bio_copy_kern_endio(bio);
diff --git a/block/bounce.c b/block/bounce.c
index 7cfcb242f9..e701832d76 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -102,18 +102,18 @@ static void copy_to_high_bio_irq(struct bio *to, struct bio *from)
 static void bounce_end_io(struct bio *bio)
 {
 	struct bio *bio_orig = bio->bi_private;
-	struct bio_vec *bvec, orig_vec;
+	struct bio_vec bvec, orig_vec;
 	struct bvec_iter orig_iter = bio_orig->bi_iter;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter_all iter;
 
 	/*
 	 * free up bounce indirect pages used
 	 */
-	bio_for_each_segment_all(bvec, bio, iter_all) {
+	bio_for_each_segment_all(bvec, bio, iter) {
 		orig_vec = bio_iter_iovec(bio_orig, orig_iter);
-		if (bvec->bv_page != orig_vec.bv_page) {
-			dec_zone_page_state(bvec->bv_page, NR_BOUNCE);
-			mempool_free(bvec->bv_page, &page_pool);
+		if (bvec.bv_page != orig_vec.bv_page) {
+			dec_zone_page_state(bvec.bv_page, NR_BOUNCE);
+			mempool_free(bvec.bv_page, &page_pool);
 		}
 		bio_advance_iter(bio_orig, &orig_iter, orig_vec.bv_len);
 	}
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 147c493a98..98ce12b239 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -373,12 +373,12 @@ static void do_btree_node_write(struct btree *b)
 		       bset_sector_offset(&b->keys, i));
 
 	if (!bch_bio_alloc_pages(b->bio, __GFP_NOWARN|GFP_NOWAIT)) {
-		struct bio_vec *bv;
+		struct bio_vec bv;
 		void *addr = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
-		struct bvec_iter_all iter_all;
+		struct bvec_iter_all iter;
 
-		bio_for_each_segment_all(bv, b->bio, iter_all) {
-			memcpy(page_address(bv->bv_page), addr, PAGE_SIZE);
+		bio_for_each_segment_all(bv, b->bio, iter) {
+			memcpy(page_address(bv.bv_page), addr, PAGE_SIZE);
 			addr += PAGE_SIZE;
 		}
 
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 3ba53dc3cc..166bb4fdb4 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1713,12 +1713,12 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned int size)
 
 static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
 {
-	struct bio_vec *bv;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter_all iter;
+	struct bio_vec bv;
 
-	bio_for_each_segment_all(bv, clone, iter_all) {
-		BUG_ON(!bv->bv_page);
-		mempool_free(bv->bv_page, &cc->page_pool);
+	bio_for_each_segment_all(bv, clone, iter) {
+		BUG_ON(!bv.bv_page);
+		mempool_free(bv.bv_page, &cc->page_pool);
 	}
 }
 
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 68a9e2d998..4f58cae37e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2188,7 +2188,7 @@ static void process_checks(struct r1bio *r1_bio)
 		blk_status_t status = sbio->bi_status;
 		struct page **ppages = get_resync_pages(pbio)->pages;
 		struct page **spages = get_resync_pages(sbio)->pages;
-		struct bio_vec *bi;
+		struct bio_vec bi;
 		int page_len[RESYNC_PAGES] = { 0 };
 		struct bvec_iter_all iter_all;
 
@@ -2198,7 +2198,7 @@ static void process_checks(struct r1bio *r1_bio)
 		sbio->bi_status = 0;
 
 		bio_for_each_segment_all(bi, sbio, iter_all)
-			page_len[j++] = bi->bv_len;
+			page_len[j++] = bi.bv_len;
 
 		if (!status) {
 			for (j = vcnt; j-- ; ) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9e1596bb20..92b3396c15 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3804,12 +3804,12 @@ ALLOW_ERROR_INJECTION(open_ctree, ERRNO);
 static void btrfs_end_super_write(struct bio *bio)
 {
 	struct btrfs_device *device = bio->bi_private;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 	struct page *page;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		page = bvec->bv_page;
+		page = bvec.bv_page;
 
 		if (bio->bi_status) {
 			btrfs_warn_rl_in_rcu(device->fs_info,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 40300e8e5f..5796c99ea1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -581,34 +581,34 @@ static void end_bio_extent_writepage(struct btrfs_bio *bbio)
 {
 	struct bio *bio = &bbio->bio;
 	int error = blk_status_to_errno(bio->bi_status);
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	u64 start;
 	u64 end;
 	struct bvec_iter_all iter_all;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 		const u32 sectorsize = fs_info->sectorsize;
 
 		/* Our read/write should always be sector aligned. */
-		if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
+		if (!IS_ALIGNED(bvec.bv_offset, sectorsize))
 			btrfs_err(fs_info,
 		"partial page write in btrfs with offset %u and length %u",
-				  bvec->bv_offset, bvec->bv_len);
-		else if (!IS_ALIGNED(bvec->bv_len, sectorsize))
+				  bvec.bv_offset, bvec.bv_len);
+		else if (!IS_ALIGNED(bvec.bv_len, sectorsize))
 			btrfs_info(fs_info,
 		"incomplete page write with offset %u and length %u",
-				   bvec->bv_offset, bvec->bv_len);
+				   bvec.bv_offset, bvec.bv_len);
 
-		start = page_offset(page) + bvec->bv_offset;
-		end = start + bvec->bv_len - 1;
+		start = page_offset(page) + bvec.bv_offset;
+		end = start + bvec.bv_len - 1;
 
 		end_extent_writepage(page, error, start, end);
 
-		btrfs_page_clear_writeback(fs_info, page, start, bvec->bv_len);
+		btrfs_page_clear_writeback(fs_info, page, start, bvec.bv_len);
 	}
 
 	bio_put(bio);
@@ -736,7 +736,7 @@ static struct extent_buffer *find_extent_buffer_readpage(
 static void end_bio_extent_readpage(struct btrfs_bio *bbio)
 {
 	struct bio *bio = &bbio->bio;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct processed_extent processed = { 0 };
 	/*
 	 * The offset to the beginning of a bio, since one bio can never be
@@ -749,7 +749,7 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		bool uptodate = !bio->bi_status;
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 		const u32 sectorsize = fs_info->sectorsize;
@@ -769,19 +769,19 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
 		 * for unaligned offsets, and an error if they don't add up to
 		 * a full sector.
 		 */
-		if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
+		if (!IS_ALIGNED(bvec.bv_offset, sectorsize))
 			btrfs_err(fs_info,
 		"partial page read in btrfs with offset %u and length %u",
-				  bvec->bv_offset, bvec->bv_len);
-		else if (!IS_ALIGNED(bvec->bv_offset + bvec->bv_len,
+				  bvec.bv_offset, bvec.bv_len);
+		else if (!IS_ALIGNED(bvec.bv_offset + bvec.bv_len,
 				     sectorsize))
 			btrfs_info(fs_info,
 		"incomplete page read with offset %u and length %u",
-				   bvec->bv_offset, bvec->bv_len);
+				   bvec.bv_offset, bvec.bv_len);
 
-		start = page_offset(page) + bvec->bv_offset;
-		end = start + bvec->bv_len - 1;
-		len = bvec->bv_len;
+		start = page_offset(page) + bvec.bv_offset;
+		end = start + bvec.bv_len - 1;
+		len = bvec.bv_len;
 
 		mirror = bbio->mirror_num;
 		if (uptodate && !is_data_inode(inode) &&
@@ -1993,7 +1993,7 @@ static void end_bio_subpage_eb_writepage(struct btrfs_bio *bbio)
 {
 	struct bio *bio = &bbio->bio;
 	struct btrfs_fs_info *fs_info;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	fs_info = btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);
@@ -2001,12 +2001,12 @@ static void end_bio_subpage_eb_writepage(struct btrfs_bio *bbio)
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
-		u64 bvec_start = page_offset(page) + bvec->bv_offset;
-		u64 bvec_end = bvec_start + bvec->bv_len - 1;
+		struct page *page = bvec.bv_page;
+		u64 bvec_start = page_offset(page) + bvec.bv_offset;
+		u64 bvec_end = bvec_start + bvec.bv_len - 1;
 		u64 cur_bytenr = bvec_start;
 
-		ASSERT(IS_ALIGNED(bvec->bv_len, fs_info->nodesize));
+		ASSERT(IS_ALIGNED(bvec.bv_len, fs_info->nodesize));
 
 		/* Iterate through all extent buffers in the range */
 		while (cur_bytenr <= bvec_end) {
@@ -2050,14 +2050,14 @@ static void end_bio_subpage_eb_writepage(struct btrfs_bio *bbio)
 static void end_bio_extent_buffer_writepage(struct btrfs_bio *bbio)
 {
 	struct bio *bio = &bbio->bio;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct extent_buffer *eb;
 	int done;
 	struct bvec_iter_all iter_all;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 
 		eb = (struct extent_buffer *)page->private;
 		BUG_ON(!eb);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 642828c1b2..39d8101541 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1388,7 +1388,7 @@ static struct sector_ptr *find_stripe_sector(struct btrfs_raid_bio *rbio,
 static void set_bio_pages_uptodate(struct btrfs_raid_bio *rbio, struct bio *bio)
 {
 	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
@@ -1397,9 +1397,9 @@ static void set_bio_pages_uptodate(struct btrfs_raid_bio *rbio, struct bio *bio)
 		struct sector_ptr *sector;
 		int pgoff;
 
-		for (pgoff = bvec->bv_offset; pgoff - bvec->bv_offset < bvec->bv_len;
+		for (pgoff = bvec.bv_offset; pgoff - bvec.bv_offset < bvec.bv_len;
 		     pgoff += sectorsize) {
-			sector = find_stripe_sector(rbio, bvec->bv_page, pgoff);
+			sector = find_stripe_sector(rbio, bvec.bv_page, pgoff);
 			ASSERT(sector);
 			if (sector)
 				sector->uptodate = 1;
@@ -1453,7 +1453,7 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
 {
 	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
 	int total_sector_nr = get_bio_sector_nr(rbio, bio);
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	/* No data csum for the whole stripe, no need to verify. */
@@ -1467,8 +1467,8 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		int bv_offset;
 
-		for (bv_offset = bvec->bv_offset;
-		     bv_offset < bvec->bv_offset + bvec->bv_len;
+		for (bv_offset = bvec.bv_offset;
+		     bv_offset < bvec.bv_offset + bvec.bv_len;
 		     bv_offset += fs_info->sectorsize, total_sector_nr++) {
 			u8 csum_buf[BTRFS_CSUM_SIZE];
 			u8 *expected_csum = rbio->csum_buf +
@@ -1479,7 +1479,7 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
 			if (!test_bit(total_sector_nr, rbio->csum_bitmap))
 				continue;
 
-			ret = btrfs_check_sector_csum(fs_info, bvec->bv_page,
+			ret = btrfs_check_sector_csum(fs_info, bvec.bv_page,
 				bv_offset, csum_buf, expected_csum);
 			if (ret < 0)
 				set_bit(total_sector_nr, rbio->error_bitmap);
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index f1708c77a9..1fd0f01d11 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1651,11 +1651,11 @@ static void z_erofs_decompressqueue_endio(struct bio *bio)
 {
 	struct z_erofs_decompressqueue *q = bio->bi_private;
 	blk_status_t err = bio->bi_status;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 
 		DBG_BUGON(PageUptodate(page));
 		DBG_BUGON(z_erofs_page_is_invalidated(page));
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 1e4db96a04..81a1cc4518 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -99,15 +99,15 @@ static void buffer_io_error(struct buffer_head *bh)
 
 static void ext4_finish_bio(struct bio *bio)
 {
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 		struct page *bounce_page = NULL;
 		struct buffer_head *bh, *head;
-		unsigned bio_start = bvec->bv_offset;
-		unsigned bio_end = bio_start + bvec->bv_len;
+		unsigned bio_start = bvec.bv_offset;
+		unsigned bio_end = bio_start + bvec.bv_len;
 		unsigned under_io = 0;
 		unsigned long flags;
 
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index c61dc8a7c0..ce42b3d5c9 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -69,11 +69,11 @@ struct bio_post_read_ctx {
 static void __read_end_io(struct bio *bio)
 {
 	struct page *page;
-	struct bio_vec *bv;
+	struct bio_vec bv;
 	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment_all(bv, bio, iter_all) {
-		page = bv->bv_page;
+		page = bv.bv_page;
 
 		if (bio->bi_status)
 			ClearPageUptodate(page);
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 06b552a0ab..e44bd8586f 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -139,12 +139,12 @@ struct bio_post_read_ctx {
  */
 static void f2fs_finish_read_bio(struct bio *bio, bool in_task)
 {
-	struct bio_vec *bv;
+	struct bio_vec bv;
 	struct bvec_iter_all iter_all;
 	struct bio_post_read_ctx *ctx = bio->bi_private;
 
 	bio_for_each_segment_all(bv, bio, iter_all) {
-		struct page *page = bv->bv_page;
+		struct page *page = bv.bv_page;
 
 		if (f2fs_is_compressed_page(page)) {
 			if (ctx && !ctx->decompression_attempted)
@@ -189,11 +189,11 @@ static void f2fs_verify_bio(struct work_struct *work)
 	 * as those were handled separately by f2fs_end_read_compressed_page().
 	 */
 	if (may_have_compressed_pages) {
-		struct bio_vec *bv;
+		struct bio_vec bv;
 		struct bvec_iter_all iter_all;
 
 		bio_for_each_segment_all(bv, bio, iter_all) {
-			struct page *page = bv->bv_page;
+			struct page *page = bv.bv_page;
 
 			if (!f2fs_is_compressed_page(page) &&
 			    !fsverity_verify_page(page)) {
@@ -241,13 +241,13 @@ static void f2fs_verify_and_finish_bio(struct bio *bio, bool in_task)
 static void f2fs_handle_step_decompress(struct bio_post_read_ctx *ctx,
 		bool in_task)
 {
-	struct bio_vec *bv;
+	struct bio_vec bv;
 	struct bvec_iter_all iter_all;
 	bool all_compressed = true;
 	block_t blkaddr = ctx->fs_blkaddr;
 
 	bio_for_each_segment_all(bv, ctx->bio, iter_all) {
-		struct page *page = bv->bv_page;
+		struct page *page = bv.bv_page;
 
 		if (f2fs_is_compressed_page(page))
 			f2fs_end_read_compressed_page(page, false, blkaddr,
@@ -327,7 +327,7 @@ static void f2fs_read_end_io(struct bio *bio)
 static void f2fs_write_end_io(struct bio *bio)
 {
 	struct f2fs_sb_info *sbi;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	iostat_update_and_unbind_ctx(bio);
@@ -337,7 +337,7 @@ static void f2fs_write_end_io(struct bio *bio)
 		bio->bi_status = BLK_STS_IOERR;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 		enum count_type type = WB_DATA_TYPE(page);
 
 		if (page_private_dummy(page)) {
@@ -583,7 +583,7 @@ static void __submit_merged_bio(struct f2fs_bio_info *io)
 static bool __has_merged_page(struct bio *bio, struct inode *inode,
 						struct page *page, nid_t ino)
 {
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	if (!bio)
@@ -593,7 +593,7 @@ static bool __has_merged_page(struct bio *bio, struct inode *inode,
 		return true;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *target = bvec->bv_page;
+		struct page *target = bvec.bv_page;
 
 		if (fscrypt_is_bounce_page(target)) {
 			target = fscrypt_pagecache_page(target);
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 1902413d5d..7f62fe8eb7 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -202,7 +202,7 @@ static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp,
 static void gfs2_end_log_write(struct bio *bio)
 {
 	struct gfs2_sbd *sdp = bio->bi_private;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct page *page;
 	struct bvec_iter_all iter_all;
 
@@ -217,9 +217,9 @@ static void gfs2_end_log_write(struct bio *bio)
 	}
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		page = bvec->bv_page;
+		page = bvec.bv_page;
 		if (page_has_buffers(page))
-			gfs2_end_log_write_bh(sdp, bvec, bio->bi_status);
+			gfs2_end_log_write_bh(sdp, &bvec, bio->bi_status);
 		else
 			mempool_free(page, gfs2_page_pool);
 	}
@@ -395,11 +395,11 @@ static void gfs2_log_write_page(struct gfs2_sbd *sdp, struct page *page)
 static void gfs2_end_log_read(struct bio *bio)
 {
 	struct page *page;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		page = bvec->bv_page;
+		page = bvec.bv_page;
 		if (bio->bi_status) {
 			int err = blk_status_to_errno(bio->bi_status);
 
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 924361fa51..832572784e 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -193,15 +193,15 @@ struct buffer_head *gfs2_meta_new(struct gfs2_glock *gl, u64 blkno)
 
 static void gfs2_meta_read_endio(struct bio *bio)
 {
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 		struct buffer_head *bh = page_buffers(page);
-		unsigned int len = bvec->bv_len;
+		unsigned int len = bvec.bv_len;
 
-		while (bh_offset(bh) < bvec->bv_offset)
+		while (bh_offset(bh) < bvec.bv_offset)
 			bh = bh->b_this_page;
 		do {
 			struct buffer_head *next = bh->b_this_page;
diff --git a/fs/mpage.c b/fs/mpage.c
index 22b9de5ddd..49505456ba 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -45,11 +45,11 @@
  */
 static void mpage_end_io(struct bio *bio)
 {
-	struct bio_vec *bv;
+	struct bio_vec bv;
 	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment_all(bv, bio, iter_all) {
-		struct page *page = bv->bv_page;
+		struct page *page = bv.bv_page;
 		page_endio(page, bio_op(bio),
 			   blk_status_to_errno(bio->bi_status));
 	}
diff --git a/fs/squashfs/block.c b/fs/squashfs/block.c
index bed3bb8b27..83e8b44518 100644
--- a/fs/squashfs/block.c
+++ b/fs/squashfs/block.c
@@ -35,30 +35,33 @@ static int copy_bio_to_actor(struct bio *bio,
 			     int offset, int req_length)
 {
 	void *actor_addr;
-	struct bvec_iter_all iter_all = {};
-	struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 	int copied_bytes = 0;
 	int actor_offset = 0;
+	int bytes_to_copy;
 
 	squashfs_actor_nobuff(actor);
 	actor_addr = squashfs_first_page(actor);
 
-	if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all)))
-		return 0;
+	bvec_iter_all_init(&iter);
+	bio_iter_all_advance(bio, &iter, offset);
 
-	while (copied_bytes < req_length) {
-		int bytes_to_copy = min_t(int, bvec->bv_len - offset,
+	while (copied_bytes < req_length &&
+	       iter.idx < bio->bi_vcnt) {
+		bvec = bio_iter_all_peek(bio, &iter);
+
+		bytes_to_copy = min_t(int, bvec.bv_len,
 					  PAGE_SIZE - actor_offset);
 
 		bytes_to_copy = min_t(int, bytes_to_copy,
 				      req_length - copied_bytes);
 		if (!IS_ERR(actor_addr))
-			memcpy(actor_addr + actor_offset, bvec_virt(bvec) +
-					offset, bytes_to_copy);
+			memcpy(actor_addr + actor_offset, bvec_virt(&bvec),
+			       bytes_to_copy);
 
 		actor_offset += bytes_to_copy;
 		copied_bytes += bytes_to_copy;
-		offset += bytes_to_copy;
 
 		if (actor_offset >= PAGE_SIZE) {
 			actor_addr = squashfs_next_page(actor);
@@ -66,11 +69,8 @@ static int copy_bio_to_actor(struct bio *bio,
 				break;
 			actor_offset = 0;
 		}
-		if (offset >= bvec->bv_len) {
-			if (!bio_next_segment(bio, &iter_all))
-				break;
-			offset = 0;
-		}
+
+		bio_iter_all_advance(bio, &iter, bytes_to_copy);
 	}
 	squashfs_finish_page(actor);
 	return copied_bytes;
@@ -159,8 +159,10 @@ int squashfs_read_data(struct super_block *sb, u64 index, int length,
 		 * Metadata block.
 		 */
 		const u8 *data;
-		struct bvec_iter_all iter_all = {};
-		struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+		struct bvec_iter_all iter;
+		struct bio_vec bvec;
+
+		bvec_iter_all_init(&iter);
 
 		if (index + 2 > msblk->bytes_used) {
 			res = -EIO;
@@ -170,21 +172,25 @@ int squashfs_read_data(struct super_block *sb, u64 index, int length,
 		if (res)
 			goto out;
 
-		if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all))) {
+		bvec = bio_iter_all_peek(bio, &iter);
+
+		if (WARN_ON_ONCE(!bvec.bv_len)) {
 			res = -EIO;
 			goto out_free_bio;
 		}
 		/* Extract the length of the metadata block */
-		data = bvec_virt(bvec);
+		data = bvec_virt(&bvec);
 		length = data[offset];
-		if (offset < bvec->bv_len - 1) {
+		if (offset < bvec.bv_len - 1) {
 			length |= data[offset + 1] << 8;
 		} else {
-			if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all))) {
+			bio_iter_all_advance(bio, &iter, bvec.bv_len);
+
+			if (WARN_ON_ONCE(!bvec.bv_len)) {
 				res = -EIO;
 				goto out_free_bio;
 			}
-			data = bvec_virt(bvec);
+			data = bvec_virt(&bvec);
 			length |= data[0] << 8;
 		}
 		bio_free_pages(bio);
diff --git a/fs/squashfs/lz4_wrapper.c b/fs/squashfs/lz4_wrapper.c
index 49797729f1..bd0dd787d2 100644
--- a/fs/squashfs/lz4_wrapper.c
+++ b/fs/squashfs/lz4_wrapper.c
@@ -92,20 +92,23 @@ static int lz4_uncompress(struct squashfs_sb_info *msblk, void *strm,
 	struct bio *bio, int offset, int length,
 	struct squashfs_page_actor *output)
 {
-	struct bvec_iter_all iter_all = {};
-	struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 	struct squashfs_lz4 *stream = strm;
 	void *buff = stream->input, *data;
 	int bytes = length, res;
 
-	while (bio_next_segment(bio, &iter_all)) {
-		int avail = min(bytes, ((int)bvec->bv_len) - offset);
+	bvec_iter_all_init(&iter);
+	bio_iter_all_advance(bio, &iter, offset);
 
-		data = bvec_virt(bvec);
-		memcpy(buff, data + offset, avail);
+	bio_for_each_segment_all_continue(bvec, bio, iter) {
+		unsigned avail = min_t(unsigned, bytes, bvec.bv_len);
+
+		memcpy(buff, bvec_virt(&bvec), avail);
 		buff += avail;
 		bytes -= avail;
-		offset = 0;
+		if (!bytes)
+			break;
 	}
 
 	res = LZ4_decompress_safe(stream->input, stream->output,
diff --git a/fs/squashfs/lzo_wrapper.c b/fs/squashfs/lzo_wrapper.c
index d216aeefa8..bccfcfa12e 100644
--- a/fs/squashfs/lzo_wrapper.c
+++ b/fs/squashfs/lzo_wrapper.c
@@ -66,21 +66,24 @@ static int lzo_uncompress(struct squashfs_sb_info *msblk, void *strm,
 	struct bio *bio, int offset, int length,
 	struct squashfs_page_actor *output)
 {
-	struct bvec_iter_all iter_all = {};
-	struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 	struct squashfs_lzo *stream = strm;
 	void *buff = stream->input, *data;
 	int bytes = length, res;
 	size_t out_len = output->length;
 
-	while (bio_next_segment(bio, &iter_all)) {
-		int avail = min(bytes, ((int)bvec->bv_len) - offset);
+	bvec_iter_all_init(&iter);
+	bio_iter_all_advance(bio, &iter, offset);
 
-		data = bvec_virt(bvec);
-		memcpy(buff, data + offset, avail);
+	bio_for_each_segment_all_continue(bvec, bio, iter) {
+		unsigned avail = min_t(unsigned, bytes, bvec.bv_len);
+
+		memcpy(buff, bvec_virt(&bvec), avail);
 		buff += avail;
 		bytes -= avail;
-		offset = 0;
+		if (!bytes)
+			break;
 	}
 
 	res = lzo1x_decompress_safe(stream->input, (size_t)length,
diff --git a/fs/squashfs/xz_wrapper.c b/fs/squashfs/xz_wrapper.c
index 6c49481a2f..6cf0e11e3b 100644
--- a/fs/squashfs/xz_wrapper.c
+++ b/fs/squashfs/xz_wrapper.c
@@ -120,8 +120,7 @@ static int squashfs_xz_uncompress(struct squashfs_sb_info *msblk, void *strm,
 	struct bio *bio, int offset, int length,
 	struct squashfs_page_actor *output)
 {
-	struct bvec_iter_all iter_all = {};
-	struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+	struct bvec_iter_all iter;
 	int total = 0, error = 0;
 	struct squashfs_xz *stream = strm;
 
@@ -136,26 +135,28 @@ static int squashfs_xz_uncompress(struct squashfs_sb_info *msblk, void *strm,
 		goto finish;
 	}
 
+	bvec_iter_all_init(&iter);
+	bio_iter_all_advance(bio, &iter, offset);
+
 	for (;;) {
 		enum xz_ret xz_err;
 
 		if (stream->buf.in_pos == stream->buf.in_size) {
-			const void *data;
-			int avail;
+			struct bio_vec bvec = bio_iter_all_peek(bio, &iter);
+			unsigned avail = min_t(unsigned, length, bvec.bv_len);
 
-			if (!bio_next_segment(bio, &iter_all)) {
+			if (iter.idx >= bio->bi_vcnt) {
 				/* XZ_STREAM_END must be reached. */
 				error = -EIO;
 				break;
 			}
 
-			avail = min(length, ((int)bvec->bv_len) - offset);
-			data = bvec_virt(bvec);
 			length -= avail;
-			stream->buf.in = data + offset;
+			stream->buf.in = bvec_virt(&bvec);
 			stream->buf.in_size = avail;
 			stream->buf.in_pos = 0;
-			offset = 0;
+
+			bio_iter_all_advance(bio, &iter, avail);
 		}
 
 		if (stream->buf.out_pos == stream->buf.out_size) {
diff --git a/fs/squashfs/zlib_wrapper.c b/fs/squashfs/zlib_wrapper.c
index cbb7afe7bc..981ca5e410 100644
--- a/fs/squashfs/zlib_wrapper.c
+++ b/fs/squashfs/zlib_wrapper.c
@@ -53,8 +53,7 @@ static int zlib_uncompress(struct squashfs_sb_info *msblk, void *strm,
 	struct bio *bio, int offset, int length,
 	struct squashfs_page_actor *output)
 {
-	struct bvec_iter_all iter_all = {};
-	struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+	struct bvec_iter_all iter;
 	int zlib_init = 0, error = 0;
 	z_stream *stream = strm;
 
@@ -67,25 +66,28 @@ static int zlib_uncompress(struct squashfs_sb_info *msblk, void *strm,
 		goto finish;
 	}
 
+	bvec_iter_all_init(&iter);
+	bio_iter_all_advance(bio, &iter, offset);
+
 	for (;;) {
 		int zlib_err;
 
 		if (stream->avail_in == 0) {
-			const void *data;
+			struct bio_vec bvec = bio_iter_all_peek(bio, &iter);
 			int avail;
 
-			if (!bio_next_segment(bio, &iter_all)) {
+			if (iter.idx >= bio->bi_vcnt) {
 				/* Z_STREAM_END must be reached. */
 				error = -EIO;
 				break;
 			}
 
-			avail = min(length, ((int)bvec->bv_len) - offset);
-			data = bvec_virt(bvec);
+			avail = min_t(unsigned, length, bvec.bv_len);
 			length -= avail;
-			stream->next_in = data + offset;
+			stream->next_in = bvec_virt(&bvec);
 			stream->avail_in = avail;
-			offset = 0;
+
+			bio_iter_all_advance(bio, &iter, avail);
 		}
 
 		if (stream->avail_out == 0) {
diff --git a/fs/squashfs/zstd_wrapper.c b/fs/squashfs/zstd_wrapper.c
index 0e407c4d8b..658e5d462a 100644
--- a/fs/squashfs/zstd_wrapper.c
+++ b/fs/squashfs/zstd_wrapper.c
@@ -68,8 +68,7 @@ static int zstd_uncompress(struct squashfs_sb_info *msblk, void *strm,
 	int error = 0;
 	zstd_in_buffer in_buf = { NULL, 0, 0 };
 	zstd_out_buffer out_buf = { NULL, 0, 0 };
-	struct bvec_iter_all iter_all = {};
-	struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+	struct bvec_iter_all iter;
 
 	stream = zstd_init_dstream(wksp->window_size, wksp->mem, wksp->mem_size);
 
@@ -85,25 +84,27 @@ static int zstd_uncompress(struct squashfs_sb_info *msblk, void *strm,
 		goto finish;
 	}
 
+	bvec_iter_all_init(&iter);
+	bio_iter_all_advance(bio, &iter, offset);
+
 	for (;;) {
 		size_t zstd_err;
 
 		if (in_buf.pos == in_buf.size) {
-			const void *data;
-			int avail;
+			struct bio_vec bvec = bio_iter_all_peek(bio, &iter);
+			unsigned avail = min_t(unsigned, length, bvec.bv_len);
 
-			if (!bio_next_segment(bio, &iter_all)) {
+			if (iter.idx >= bio->bi_vcnt) {
 				error = -EIO;
 				break;
 			}
 
-			avail = min(length, ((int)bvec->bv_len) - offset);
-			data = bvec_virt(bvec);
 			length -= avail;
-			in_buf.src = data + offset;
+			in_buf.src = bvec_virt(&bvec);
 			in_buf.size = avail;
 			in_buf.pos = 0;
-			offset = 0;
+
+			bio_iter_all_advance(bio, &iter, avail);
 		}
 
 		if (out_buf.pos == out_buf.size) {
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 3536f28c05..f86c7190c3 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -78,22 +78,40 @@ static inline void *bio_data(struct bio *bio)
 	return NULL;
 }
 
-static inline bool bio_next_segment(const struct bio *bio,
-				    struct bvec_iter_all *iter)
+static inline struct bio_vec bio_iter_all_peek(const struct bio *bio,
+					       struct bvec_iter_all *iter)
 {
-	if (iter->idx >= bio->bi_vcnt)
-		return false;
+	if (WARN_ON(iter->idx >= bio->bi_vcnt))
+		return (struct bio_vec) { NULL };
 
-	bvec_advance(&bio->bi_io_vec[iter->idx], iter);
-	return true;
+	return bvec_iter_all_peek(bio->bi_io_vec, iter);
+}
+
+static inline void bio_iter_all_advance(const struct bio *bio,
+					struct bvec_iter_all *iter,
+					unsigned bytes)
+{
+	bvec_iter_all_advance(bio->bi_io_vec, iter, bytes);
+
+	WARN_ON(iter->idx > bio->bi_vcnt ||
+		(iter->idx == bio->bi_vcnt && iter->done));
 }
 
+#define bio_for_each_segment_all_continue(bvl, bio, iter)		\
+	for (;								\
+	     iter.idx < bio->bi_vcnt &&					\
+		((bvl = bio_iter_all_peek(bio, &iter)), true);		\
+	     bio_iter_all_advance((bio), &iter, bvl.bv_len))
+
 /*
  * drivers should _never_ use the all version - the bio may have been split
  * before it got to the driver and the driver won't own all of it
  */
-#define bio_for_each_segment_all(bvl, bio, iter) \
-	for (bvl = bvec_init_iter_all(&iter); bio_next_segment((bio), &iter); )
+#define bio_for_each_segment_all(bvl, bio, iter)			\
+	for (bvec_iter_all_init(&iter);					\
+	     iter.idx < (bio)->bi_vcnt &&				\
+		((bvl = bio_iter_all_peek((bio), &iter)), true);		\
+	     bio_iter_all_advance((bio), &iter, bvl.bv_len))
 
 static inline void bio_advance_iter(const struct bio *bio,
 				    struct bvec_iter *iter, unsigned int bytes)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 555aae5448..635fb54143 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -85,12 +85,6 @@ struct bvec_iter {
 						   current bvec */
 } __packed;
 
-struct bvec_iter_all {
-	struct bio_vec	bv;
-	int		idx;
-	unsigned	done;
-};
-
 /*
  * various member access, note that bio_data should of course not be used
  * on highmem page vectors
@@ -184,7 +178,10 @@ static inline void bvec_iter_advance_single(const struct bio_vec *bv,
 		((bvl = bvec_iter_bvec((bio_vec), (iter))), 1);	\
 	     bvec_iter_advance_single((bio_vec), &(iter), (bvl).bv_len))
 
-/* for iterating one bio from start to end */
+/*
+ * bvec_iter_all: for advancing over a bio as it was originally created, but
+ * with the usual bio_for_each_segment interface - nonstandard, do not use:
+ */
 #define BVEC_ITER_ALL_INIT (struct bvec_iter)				\
 {									\
 	.bi_sector	= 0,						\
@@ -193,33 +190,45 @@ static inline void bvec_iter_advance_single(const struct bio_vec *bv,
 	.bi_bvec_done	= 0,						\
 }
 
-static inline struct bio_vec *bvec_init_iter_all(struct bvec_iter_all *iter_all)
+/*
+ * bvec_iter_all: for advancing over individual pages in a bio, as it was when
+ * it was first created:
+ */
+struct bvec_iter_all {
+	int		idx;
+	unsigned	done;
+};
+
+static inline void bvec_iter_all_init(struct bvec_iter_all *iter_all)
 {
 	iter_all->done = 0;
 	iter_all->idx = 0;
+}
 
-	return &iter_all->bv;
+static inline struct bio_vec bvec_iter_all_peek(const struct bio_vec *bvec,
+						struct bvec_iter_all *iter)
+{
+	struct bio_vec bv = bvec[iter->idx];
+
+	bv.bv_offset	+= iter->done;
+	bv.bv_len	-= iter->done;
+
+	bv.bv_page	+= bv.bv_offset >> PAGE_SHIFT;
+	bv.bv_offset	&= ~PAGE_MASK;
+	bv.bv_len	= min_t(unsigned, PAGE_SIZE - bv.bv_offset, bv.bv_len);
+
+	return bv;
 }
 
-static inline void bvec_advance(const struct bio_vec *bvec,
-				struct bvec_iter_all *iter_all)
+static inline void bvec_iter_all_advance(const struct bio_vec *bvec,
+					 struct bvec_iter_all *iter,
+					 unsigned bytes)
 {
-	struct bio_vec *bv = &iter_all->bv;
-
-	if (iter_all->done) {
-		bv->bv_page++;
-		bv->bv_offset = 0;
-	} else {
-		bv->bv_page = bvec->bv_page + (bvec->bv_offset >> PAGE_SHIFT);
-		bv->bv_offset = bvec->bv_offset & ~PAGE_MASK;
-	}
-	bv->bv_len = min_t(unsigned int, PAGE_SIZE - bv->bv_offset,
-			   bvec->bv_len - iter_all->done);
-	iter_all->done += bv->bv_len;
+	iter->done += bytes;
 
-	if (iter_all->done == bvec->bv_len) {
-		iter_all->idx++;
-		iter_all->done = 0;
+	while (iter->done && iter->done >= bvec[iter->idx].bv_len) {
+		iter->done -= bvec[iter->idx].bv_len;
+		iter->idx++;
 	}
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 13/32] block: Rework bio_for_each_folio_all()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (11 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 12/32] block: Rework bio_for_each_segment_all() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 14/32] block: Don't block on s_umount from __invalidate_super() Kent Overstreet
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Matthew Wilcox, linux-block

This reimplements bio_for_each_folio_all() on top of the newly reworked
bvec_iter_all, and since it's now trivial, we also provide
bio_for_each_folio().
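
A minimal sketch of the new calling convention, matching the fs/crypto
and iomap conversions in the diff ('process_folio' is a placeholder):
the caller now declares the iterator, and each step yields a struct
folio_vec by value:

  struct bvec_iter_all iter;
  struct folio_vec fv;

  bio_for_each_folio_all(fv, bio, iter)
          process_folio(fv.fv_folio, fv.fv_offset, fv.fv_len);

bio_for_each_folio() works the same way but takes an ordinary struct
bvec_iter, so it starts from bi_iter like the other non-_all iterators.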

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: linux-block@vger.kernel.org
---
 fs/crypto/bio.c        |  9 +++--
 fs/iomap/buffered-io.c | 14 ++++---
 fs/verity/verify.c     |  9 +++--
 include/linux/bio.h    | 91 +++++++++++++++++++++---------------------
 include/linux/bvec.h   | 15 +++++--
 5 files changed, 75 insertions(+), 63 deletions(-)

diff --git a/fs/crypto/bio.c b/fs/crypto/bio.c
index d57d0a020f..6469861add 100644
--- a/fs/crypto/bio.c
+++ b/fs/crypto/bio.c
@@ -30,11 +30,12 @@
  */
 bool fscrypt_decrypt_bio(struct bio *bio)
 {
-	struct folio_iter fi;
+	struct bvec_iter_all iter;
+	struct folio_vec fv;
 
-	bio_for_each_folio_all(fi, bio) {
-		int err = fscrypt_decrypt_pagecache_blocks(fi.folio, fi.length,
-							   fi.offset);
+	bio_for_each_folio_all(fv, bio, iter) {
+		int err = fscrypt_decrypt_pagecache_blocks(fv.fv_folio, fv.fv_len,
+							   fv.fv_offset);
 
 		if (err) {
 			bio->bi_status = errno_to_blk_status(err);
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6f4c97a6d7..60661c87d5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -187,10 +187,11 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
 static void iomap_read_end_io(struct bio *bio)
 {
 	int error = blk_status_to_errno(bio->bi_status);
-	struct folio_iter fi;
+	struct bvec_iter_all iter;
+	struct folio_vec fv;
 
-	bio_for_each_folio_all(fi, bio)
-		iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);
+	bio_for_each_folio_all(fv, bio, iter)
+		iomap_finish_folio_read(fv.fv_folio, fv.fv_offset, fv.fv_len, error);
 	bio_put(bio);
 }
 
@@ -1328,7 +1329,8 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 	u32 folio_count = 0;
 
 	for (bio = &ioend->io_inline_bio; bio; bio = next) {
-		struct folio_iter fi;
+		struct bvec_iter_all iter;
+		struct folio_vec fv;
 
 		/*
 		 * For the last bio, bi_private points to the ioend, so we
@@ -1340,8 +1342,8 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 			next = bio->bi_private;
 
 		/* walk all folios in bio, ending page IO on them */
-		bio_for_each_folio_all(fi, bio) {
-			iomap_finish_folio_write(inode, fi.folio, fi.length,
+		bio_for_each_folio_all(fv, bio, iter) {
+			iomap_finish_folio_write(inode, fv.fv_folio, fv.fv_len,
 					error);
 			folio_count++;
 		}
diff --git a/fs/verity/verify.c b/fs/verity/verify.c
index e250822275..b111ab0102 100644
--- a/fs/verity/verify.c
+++ b/fs/verity/verify.c
@@ -340,7 +340,8 @@ void fsverity_verify_bio(struct bio *bio)
 	struct inode *inode = bio_first_page_all(bio)->mapping->host;
 	struct fsverity_info *vi = inode->i_verity_info;
 	struct ahash_request *req;
-	struct folio_iter fi;
+	struct bvec_iter_all iter;
+	struct folio_vec fv;
 	unsigned long max_ra_pages = 0;
 
 	/* This allocation never fails, since it's mempool-backed. */
@@ -359,9 +360,9 @@ void fsverity_verify_bio(struct bio *bio)
 		max_ra_pages = bio->bi_iter.bi_size >> (PAGE_SHIFT + 2);
 	}
 
-	bio_for_each_folio_all(fi, bio) {
-		if (!verify_data_blocks(inode, vi, req, fi.folio, fi.length,
-					fi.offset, max_ra_pages)) {
+	bio_for_each_folio_all(fv, bio, iter) {
+		if (!verify_data_blocks(inode, vi, req, fv.fv_folio, fv.fv_len,
+					fv.fv_offset, max_ra_pages)) {
 			bio->bi_status = BLK_STS_IOERR;
 			break;
 		}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index f86c7190c3..7ced281734 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -169,6 +169,42 @@ static inline void bio_advance(struct bio *bio, unsigned int nbytes)
 #define bio_for_each_segment(bvl, bio, iter)				\
 	__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
 
+struct folio_vec {
+	struct folio	*fv_folio;
+	size_t		fv_offset;
+	size_t		fv_len;
+};
+
+static inline struct folio_vec biovec_to_foliovec(struct bio_vec bv)
+{
+
+	struct folio *folio	= page_folio(bv.bv_page);
+	size_t offset		= (folio_page_idx(folio, bv.bv_page) << PAGE_SHIFT) +
+		bv.bv_offset;
+	size_t len = min_t(size_t, folio_size(folio) - offset, bv.bv_len);
+
+	return (struct folio_vec) {
+		.fv_folio	= folio,
+		.fv_offset	= offset,
+		.fv_len		= len,
+	};
+}
+
+static inline struct folio_vec bio_iter_iovec_folio(struct bio *bio,
+						    struct bvec_iter iter)
+{
+	return biovec_to_foliovec(bio_iter_iovec(bio, iter));
+}
+
+#define __bio_for_each_folio(bvl, bio, iter, start)			\
+	for (iter = (start);						\
+	     (iter).bi_size &&						\
+		((bvl = bio_iter_iovec_folio((bio), (iter))), 1);	\
+	     bio_advance_iter_single((bio), &(iter), (bvl).fv_len))
+
+#define bio_for_each_folio(bvl, bio, iter)				\
+	__bio_for_each_folio(bvl, bio, iter, (bio)->bi_iter)
+
 #define __bio_for_each_bvec(bvl, bio, iter, start)		\
 	for (iter = (start);						\
 	     (iter).bi_size &&						\
@@ -277,59 +313,22 @@ static inline struct bio_vec *bio_last_bvec_all(struct bio *bio)
 	return &bio->bi_io_vec[bio->bi_vcnt - 1];
 }
 
-/**
- * struct folio_iter - State for iterating all folios in a bio.
- * @folio: The current folio we're iterating.  NULL after the last folio.
- * @offset: The byte offset within the current folio.
- * @length: The number of bytes in this iteration (will not cross folio
- *	boundary).
- */
-struct folio_iter {
-	struct folio *folio;
-	size_t offset;
-	size_t length;
-	/* private: for use by the iterator */
-	struct folio *_next;
-	size_t _seg_count;
-	int _i;
-};
-
-static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
-				   int i)
-{
-	struct bio_vec *bvec = bio_first_bvec_all(bio) + i;
-
-	fi->folio = page_folio(bvec->bv_page);
-	fi->offset = bvec->bv_offset +
-			PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
-	fi->_seg_count = bvec->bv_len;
-	fi->length = min(folio_size(fi->folio) - fi->offset, fi->_seg_count);
-	fi->_next = folio_next(fi->folio);
-	fi->_i = i;
-}
-
-static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
+static inline struct folio_vec bio_folio_iter_all_peek(const struct bio *bio,
+						       const struct bvec_iter_all *iter)
 {
-	fi->_seg_count -= fi->length;
-	if (fi->_seg_count) {
-		fi->folio = fi->_next;
-		fi->offset = 0;
-		fi->length = min(folio_size(fi->folio), fi->_seg_count);
-		fi->_next = folio_next(fi->folio);
-	} else if (fi->_i + 1 < bio->bi_vcnt) {
-		bio_first_folio(fi, bio, fi->_i + 1);
-	} else {
-		fi->folio = NULL;
-	}
+	return biovec_to_foliovec(__bvec_iter_all_peek(bio->bi_io_vec, iter));
 }
 
 /**
  * bio_for_each_folio_all - Iterate over each folio in a bio.
- * @fi: struct folio_iter which is updated for each folio.
+ * @fi: struct bio_folio_iter_all which is updated for each folio.
  * @bio: struct bio to iterate over.
  */
-#define bio_for_each_folio_all(fi, bio)				\
-	for (bio_first_folio(&fi, bio, 0); fi.folio; bio_next_folio(&fi, bio))
+#define bio_for_each_folio_all(fv, bio, iter)				\
+	for (bvec_iter_all_init(&iter);					\
+	     iter.idx < bio->bi_vcnt &&					\
+		((fv = bio_folio_iter_all_peek(bio, &iter)), true);	\
+	     bio_iter_all_advance((bio), &iter, fv.fv_len))
 
 enum bip_flags {
 	BIP_BLOCK_INTEGRITY	= 1 << 0, /* block layer owns integrity data */
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 635fb54143..d238f959e3 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -205,18 +205,27 @@ static inline void bvec_iter_all_init(struct bvec_iter_all *iter_all)
 	iter_all->idx = 0;
 }
 
-static inline struct bio_vec bvec_iter_all_peek(const struct bio_vec *bvec,
-						struct bvec_iter_all *iter)
+static inline struct bio_vec __bvec_iter_all_peek(const struct bio_vec *bvec,
+						  const struct bvec_iter_all *iter)
 {
 	struct bio_vec bv = bvec[iter->idx];
 
+	BUG_ON(iter->done >= bv.bv_len);
+
 	bv.bv_offset	+= iter->done;
 	bv.bv_len	-= iter->done;
 
 	bv.bv_page	+= bv.bv_offset >> PAGE_SHIFT;
 	bv.bv_offset	&= ~PAGE_MASK;
-	bv.bv_len	= min_t(unsigned, PAGE_SIZE - bv.bv_offset, bv.bv_len);
+	return bv;
+}
+
+static inline struct bio_vec bvec_iter_all_peek(const struct bio_vec *bvec,
+						const struct bvec_iter_all *iter)
+{
+	struct bio_vec bv = __bvec_iter_all_peek(bvec, iter);
 
+	bv.bv_len = min_t(unsigned, PAGE_SIZE - bv.bv_offset, bv.bv_len);
 	return bv;
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 14/32] block: Don't block on s_umount from __invalidate_super()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (12 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 13/32] block: Rework bio_for_each_folio_all() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 15/32] bcache: move closures to lib/ Kent Overstreet
                   ` (18 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Kent Overstreet

__invalidate_super() is used to flush any filesystem mounted on a
device, generally on some sort of media change event.

However, when unmounting a filesystem and closing the underlying block
devices, we can deadlock if the block driver then calls
__invalidate_device() (e.g. because the block device goes away when it
is no longer in use).

This happens with bcachefs on top of loopback, and can be triggered by
fstests generic/042:

  put_super
    -> blkdev_put
    -> lo_release
    -> disk_force_media_change
    -> __invalidate_device
    -> get_super

This isn't inherently specific to bcachefs - it hasn't shown up with
other filesystems before because most other filesystems use the sget()
mechanism for opening/closing block devices (and enforcing exclusion).
However, sget() has its own downsides and weird/sketchy behaviour w.r.t.
block device open lifetime; if that ever gets fixed, more code will run
into this issue.

The __invalidate_device() call here is really a best effort "I just
yanked the device for a mounted filesystem, please try not to lose my
data" - if it's ever actually needed the user has already done something
crazy, and we probably shouldn't make things worse by deadlocking.
Switching to a trylock seems in keeping with what the code is trying to
do.

If we ever get revoke() at the block layer, perhaps we would look at
rearchitecting to use that instead.
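
A minimal sketch of the calling pattern (illustrative only - the real
caller is __invalidate_device(), shown in the diff below):

  struct super_block *sb = try_get_super(bdev);

  if (sb) {
          /* s_umount held for read: do the best-effort invalidation */
          ...
          drop_super(sb);
  }
  /* NULL: no filesystem mounted, or s_umount was contended - just skip */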

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 block/bdev.c       |  2 +-
 fs/super.c         | 40 +++++++++++++++++++++++++++++++---------
 include/linux/fs.h |  1 +
 3 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 1795c7d4b9..743e969b7b 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -922,7 +922,7 @@ EXPORT_SYMBOL(lookup_bdev);
 
 int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 {
-	struct super_block *sb = get_super(bdev);
+	struct super_block *sb = try_get_super(bdev);
 	int res = 0;
 
 	if (sb) {
diff --git a/fs/super.c b/fs/super.c
index 04bc62ab7d..a2decce02f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -791,14 +791,7 @@ void iterate_supers_type(struct file_system_type *type,
 
 EXPORT_SYMBOL(iterate_supers_type);
 
-/**
- * get_super - get the superblock of a device
- * @bdev: device to get the superblock for
- *
- * Scans the superblock list and finds the superblock of the file system
- * mounted on the device given. %NULL is returned if no match is found.
- */
-struct super_block *get_super(struct block_device *bdev)
+static struct super_block *__get_super(struct block_device *bdev, bool try)
 {
 	struct super_block *sb;
 
@@ -813,7 +806,12 @@ struct super_block *get_super(struct block_device *bdev)
 		if (sb->s_bdev == bdev) {
 			sb->s_count++;
 			spin_unlock(&sb_lock);
-			down_read(&sb->s_umount);
+
+			if (!try)
+				down_read(&sb->s_umount);
+			else if (!down_read_trylock(&sb->s_umount))
+				return NULL;
+
 			/* still alive? */
 			if (sb->s_root && (sb->s_flags & SB_BORN))
 				return sb;
@@ -828,6 +826,30 @@ struct super_block *get_super(struct block_device *bdev)
 	return NULL;
 }
 
+/**
+ * get_super - get the superblock of a device
+ * @bdev: device to get the superblock for
+ *
+ * Scans the superblock list and finds the superblock of the file system
+ * mounted on the device given. %NULL is returned if no match is found.
+ */
+struct super_block *get_super(struct block_device *bdev)
+{
+	return __get_super(bdev, false);
+}
+
+/**
+ * try_get_super - get the superblock of a device, using trylock on sb->s_umount
+ * @bdev: device to get the superblock for
+ *
+ * Scans the superblock list and finds the superblock of the file system
+ * mounted on the device given. %NULL is returned if no match is found.
+ */
+struct super_block *try_get_super(struct block_device *bdev)
+{
+	return __get_super(bdev, true);
+}
+
 /**
  * get_active_super - get an active reference to the superblock of a device
  * @bdev: device to get the superblock for
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7..1a6f951942 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2878,6 +2878,7 @@ extern struct file_system_type *get_filesystem(struct file_system_type *fs);
 extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
 extern struct super_block *get_super(struct block_device *);
+extern struct super_block *try_get_super(struct block_device *);
 extern struct super_block *get_active_super(struct block_device *bdev);
 extern void drop_super(struct super_block *sb);
 extern void drop_super_exclusive(struct super_block *sb);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 15/32] bcache: move closures to lib/
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (13 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 14/32] block: Don't block on s_umount from __invalidate_super() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-10  1:10   ` Randy Dunlap
  2023-05-09 16:56 ` [PATCH 16/32] MAINTAINERS: Add entry for closures Kent Overstreet
                   ` (17 subsequent siblings)
  32 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet, Coly Li

From: Kent Overstreet <kent.overstreet@gmail.com>

Prep work for bcachefs - being a fork of bcache, it also uses closures.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/Kconfig                     | 10 +-----
 drivers/md/bcache/Makefile                    |  4 +--
 drivers/md/bcache/bcache.h                    |  2 +-
 drivers/md/bcache/super.c                     |  1 -
 drivers/md/bcache/util.h                      |  3 +-
 .../md/bcache => include/linux}/closure.h     | 17 +++++----
 lib/Kconfig                                   |  3 ++
 lib/Kconfig.debug                             |  9 +++++
 lib/Makefile                                  |  2 ++
 {drivers/md/bcache => lib}/closure.c          | 35 +++++++++----------
 10 files changed, 43 insertions(+), 43 deletions(-)
 rename {drivers/md/bcache => include/linux}/closure.h (97%)
 rename {drivers/md/bcache => lib}/closure.c (88%)

diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
index 529c9d04e9..b2d10063d3 100644
--- a/drivers/md/bcache/Kconfig
+++ b/drivers/md/bcache/Kconfig
@@ -4,6 +4,7 @@ config BCACHE
 	tristate "Block device as cache"
 	select BLOCK_HOLDER_DEPRECATED if SYSFS
 	select CRC64
+	select CLOSURES
 	help
 	Allows a block device to be used as cache for other devices; uses
 	a btree for indexing and the layout is optimized for SSDs.
@@ -19,15 +20,6 @@ config BCACHE_DEBUG
 	Enables extra debugging tools, allows expensive runtime checks to be
 	turned on.
 
-config BCACHE_CLOSURES_DEBUG
-	bool "Debug closures"
-	depends on BCACHE
-	select DEBUG_FS
-	help
-	Keeps all active closures in a linked list and provides a debugfs
-	interface to list them, which makes it possible to see asynchronous
-	operations that get stuck.
-
 config BCACHE_ASYNC_REGISTRATION
 	bool "Asynchronous device registration"
 	depends on BCACHE
diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
index 5b87e59676..054e8a33a7 100644
--- a/drivers/md/bcache/Makefile
+++ b/drivers/md/bcache/Makefile
@@ -2,6 +2,6 @@
 
 obj-$(CONFIG_BCACHE)	+= bcache.o
 
-bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
-	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
+bcache-y		:= alloc.o bset.o btree.o debug.o extents.o io.o\
+	journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
 	util.o writeback.o features.o
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index aebb7ef10e..c8b4914ad8 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -179,6 +179,7 @@
 #define pr_fmt(fmt) "bcache: %s() " fmt, __func__
 
 #include <linux/bio.h>
+#include <linux/closure.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
 #include <linux/mutex.h>
@@ -192,7 +193,6 @@
 #include "bcache_ondisk.h"
 #include "bset.h"
 #include "util.h"
-#include "closure.h"
 
 struct bucket {
 	atomic_t	pin;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index ba3909bb6b..31b68a1b87 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2912,7 +2912,6 @@ static int __init bcache_init(void)
 		goto err;
 
 	bch_debug_init();
-	closure_debug_init();
 
 	bcache_is_reboot = false;
 
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index 6f3cb7c921..f61ab1bada 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@
 #define _BCACHE_UTIL_H
 
 #include <linux/blkdev.h>
+#include <linux/closure.h>
 #include <linux/errno.h>
 #include <linux/kernel.h>
 #include <linux/sched/clock.h>
@@ -13,8 +14,6 @@
 #include <linux/workqueue.h>
 #include <linux/crc64.h>
 
-#include "closure.h"
-
 struct closure;
 
 #ifdef CONFIG_BCACHE_DEBUG
diff --git a/drivers/md/bcache/closure.h b/include/linux/closure.h
similarity index 97%
rename from drivers/md/bcache/closure.h
rename to include/linux/closure.h
index c88cdc4ae4..0ec9e7bc8d 100644
--- a/drivers/md/bcache/closure.h
+++ b/include/linux/closure.h
@@ -155,7 +155,7 @@ struct closure {
 
 	atomic_t		remaining;
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 #define CLOSURE_MAGIC_DEAD	0xc054dead
 #define CLOSURE_MAGIC_ALIVE	0xc054a11e
 
@@ -184,15 +184,13 @@ static inline void closure_sync(struct closure *cl)
 		__closure_sync(cl);
 }
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 
-void closure_debug_init(void);
 void closure_debug_create(struct closure *cl);
 void closure_debug_destroy(struct closure *cl);
 
 #else
 
-static inline void closure_debug_init(void) {}
 static inline void closure_debug_create(struct closure *cl) {}
 static inline void closure_debug_destroy(struct closure *cl) {}
 
@@ -200,21 +198,21 @@ static inline void closure_debug_destroy(struct closure *cl) {}
 
 static inline void closure_set_ip(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->ip = _THIS_IP_;
 #endif
 }
 
 static inline void closure_set_ret_ip(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->ip = _RET_IP_;
 #endif
 }
 
 static inline void closure_set_waiting(struct closure *cl, unsigned long f)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->waiting_on = f;
 #endif
 }
@@ -243,6 +241,7 @@ static inline void closure_queue(struct closure *cl)
 	 */
 	BUILD_BUG_ON(offsetof(struct closure, fn)
 		     != offsetof(struct work_struct, func));
+
 	if (wq) {
 		INIT_WORK(&cl->work, cl->work.func);
 		BUG_ON(!queue_work(wq, &cl->work));
@@ -255,7 +254,7 @@ static inline void closure_queue(struct closure *cl)
  */
 static inline void closure_get(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	BUG_ON((atomic_inc_return(&cl->remaining) &
 		CLOSURE_REMAINING_MASK) <= 1);
 #else
@@ -271,7 +270,7 @@ static inline void closure_get(struct closure *cl)
  */
 static inline void closure_init(struct closure *cl, struct closure *parent)
 {
-	memset(cl, 0, sizeof(struct closure));
+	cl->fn = NULL;
 	cl->parent = parent;
 	if (parent)
 		closure_get(parent);
diff --git a/lib/Kconfig b/lib/Kconfig
index ce2abffb9e..1aa1c15a83 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -504,6 +504,9 @@ config ASSOCIATIVE_ARRAY
 
 	  for more information.
 
+config CLOSURES
+	bool
+
 config HAS_IOMEM
 	bool
 	depends on !NO_IOMEM
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 39d1d93164..3dba7a9aff 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1618,6 +1618,15 @@ config DEBUG_NOTIFIERS
 	  This is a relatively cheap check but if you care about maximum
 	  performance, say N.
 
+config DEBUG_CLOSURES
+	bool "Debug closures (bcache async widgets)"
+	depends on CLOSURES
+	select DEBUG_FS
+	help
+	Keeps all active closures in a linked list and provides a debugfs
+	interface to list them, which makes it possible to see asynchronous
+	operations that get stuck.
+
 config BUG_ON_DATA_CORRUPTION
 	bool "Trigger a BUG when data corruption is detected"
 	select DEBUG_LIST
diff --git a/lib/Makefile b/lib/Makefile
index baf2821f7a..fd13ca6e0e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -245,6 +245,8 @@ obj-$(CONFIG_ATOMIC64_SELFTEST) += atomic64_test.o
 
 obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o
 
+obj-$(CONFIG_CLOSURES) += closure.o
+
 obj-$(CONFIG_DQL) += dynamic_queue_limits.o
 
 obj-$(CONFIG_GLOB) += glob.o
diff --git a/drivers/md/bcache/closure.c b/lib/closure.c
similarity index 88%
rename from drivers/md/bcache/closure.c
rename to lib/closure.c
index d8d9394a6b..b38ded00b9 100644
--- a/drivers/md/bcache/closure.c
+++ b/lib/closure.c
@@ -6,13 +6,12 @@
  * Copyright 2012 Google, Inc.
  */
 
+#include <linux/closure.h>
 #include <linux/debugfs.h>
-#include <linux/module.h>
+#include <linux/export.h>
 #include <linux/seq_file.h>
 #include <linux/sched/debug.h>
 
-#include "closure.h"
-
 static inline void closure_put_after_sub(struct closure *cl, int flags)
 {
 	int r = flags & CLOSURE_REMAINING_MASK;
@@ -45,6 +44,7 @@ void closure_sub(struct closure *cl, int v)
 {
 	closure_put_after_sub(cl, atomic_sub_return(v, &cl->remaining));
 }
+EXPORT_SYMBOL(closure_sub);
 
 /*
  * closure_put - decrement a closure's refcount
@@ -53,6 +53,7 @@ void closure_put(struct closure *cl)
 {
 	closure_put_after_sub(cl, atomic_dec_return(&cl->remaining));
 }
+EXPORT_SYMBOL(closure_put);
 
 /*
  * closure_wake_up - wake up all closures on a wait list, without memory barrier
@@ -74,6 +75,7 @@ void __closure_wake_up(struct closure_waitlist *wait_list)
 		closure_sub(cl, CLOSURE_WAITING + 1);
 	}
 }
+EXPORT_SYMBOL(__closure_wake_up);
 
 /**
  * closure_wait - add a closure to a waitlist
@@ -93,6 +95,7 @@ bool closure_wait(struct closure_waitlist *waitlist, struct closure *cl)
 
 	return true;
 }
+EXPORT_SYMBOL(closure_wait);
 
 struct closure_syncer {
 	struct task_struct	*task;
@@ -127,8 +130,9 @@ void __sched __closure_sync(struct closure *cl)
 
 	__set_current_state(TASK_RUNNING);
 }
+EXPORT_SYMBOL(__closure_sync);
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 
 static LIST_HEAD(closure_list);
 static DEFINE_SPINLOCK(closure_list_lock);
@@ -144,6 +148,7 @@ void closure_debug_create(struct closure *cl)
 	list_add(&cl->all, &closure_list);
 	spin_unlock_irqrestore(&closure_list_lock, flags);
 }
+EXPORT_SYMBOL(closure_debug_create);
 
 void closure_debug_destroy(struct closure *cl)
 {
@@ -156,8 +161,7 @@ void closure_debug_destroy(struct closure *cl)
 	list_del(&cl->all);
 	spin_unlock_irqrestore(&closure_list_lock, flags);
 }
-
-static struct dentry *closure_debug;
+EXPORT_SYMBOL(closure_debug_destroy);
 
 static int debug_show(struct seq_file *f, void *data)
 {
@@ -181,7 +185,7 @@ static int debug_show(struct seq_file *f, void *data)
 			seq_printf(f, " W %pS\n",
 				   (void *) cl->waiting_on);
 
-		seq_printf(f, "\n");
+		seq_puts(f, "\n");
 	}
 
 	spin_unlock_irq(&closure_list_lock);
@@ -190,18 +194,11 @@ static int debug_show(struct seq_file *f, void *data)
 
 DEFINE_SHOW_ATTRIBUTE(debug);
 
-void  __init closure_debug_init(void)
+static int __init closure_debug_init(void)
 {
-	if (!IS_ERR_OR_NULL(bcache_debug))
-		/*
-		 * it is unnecessary to check return value of
-		 * debugfs_create_file(), we should not care
-		 * about this.
-		 */
-		closure_debug = debugfs_create_file(
-			"closures", 0400, bcache_debug, NULL, &debug_fops);
+	debugfs_create_file("closures", 0400, NULL, NULL, &debug_fops);
+	return 0;
 }
-#endif
+late_initcall(closure_debug_init)
 
-MODULE_AUTHOR("Kent Overstreet <koverstreet@google.com>");
-MODULE_LICENSE("GPL");
+#endif
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 16/32] MAINTAINERS: Add entry for closures
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (14 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 15/32] bcache: move closures to lib/ Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 17:05   ` Coly Li
  2023-05-09 21:03   ` Randy Dunlap
  2023-05-09 16:56 ` [PATCH 17/32] closures: closure_wait_event() Kent Overstreet
                   ` (16 subsequent siblings)
  32 siblings, 2 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Kent Overstreet, Coly Li

closures, from bcache, are async widgets with a variety of uses.
bcachefs also uses them, so they're being moved to lib/; mark them as
maintained.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Coly Li <colyli@suse.de>
---
 MAINTAINERS | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3fc37de3d6..5d76169140 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5044,6 +5044,14 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
 F:	Documentation/devicetree/bindings/timer/
 F:	drivers/clocksource/
 
+CLOSURES:
+M:	Kent Overstreet <kent.overstreet@linux.dev>
+L:	linux-bcachefs@vger.kernel.org
+S:	Supported
+C:	irc://irc.oftc.net/bcache
+F:	include/linux/closure.h
+F:	lib/closure.c
+
 CMPC ACPI DRIVER
 M:	Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
 M:	Daniel Oliveira Nascimento <don@syst.com.br>
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 17/32] closures: closure_wait_event()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (15 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 16/32] MAINTAINERS: Add entry for closures Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 18/32] closures: closure_nr_remaining() Kent Overstreet
                   ` (15 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Coly Li, Kent Overstreet

From: Kent Overstreet <kent.overstreet@gmail.com>

Like wait_event(), except that because it uses closures and closure
waitlists, it doesn't have wait_event()'s restriction on modifying task
state inside the condition check.
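
A minimal usage sketch (the waitlist and condition names are
illustrative, not from this patch):

  /*
   * Sleep on the waitlist until the condition becomes true; unlike
   * wait_event(), the condition itself may sleep or change task state.
   */
  closure_wait_event(&c->freelist_wait, allocator_has_space(c));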

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/closure.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/closure.h b/include/linux/closure.h
index 0ec9e7bc8d..36b4a83f9b 100644
--- a/include/linux/closure.h
+++ b/include/linux/closure.h
@@ -374,4 +374,26 @@ static inline void closure_call(struct closure *cl, closure_fn fn,
 	continue_at_nobarrier(cl, fn, wq);
 }
 
+#define __closure_wait_event(waitlist, _cond)				\
+do {									\
+	struct closure cl;						\
+									\
+	closure_init_stack(&cl);					\
+									\
+	while (1) {							\
+		closure_wait(waitlist, &cl);				\
+		if (_cond)						\
+			break;						\
+		closure_sync(&cl);					\
+	}								\
+	closure_wake_up(waitlist);					\
+	closure_sync(&cl);						\
+} while (0)
+
+#define closure_wait_event(waitlist, _cond)				\
+do {									\
+	if (!(_cond))							\
+		__closure_wait_event(waitlist, _cond);			\
+} while (0)
+
 #endif /* _LINUX_CLOSURE_H */
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 18/32] closures: closure_nr_remaining()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (16 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 17/32] closures: closure_wait_event() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 19/32] closures: Add a missing include Kent Overstreet
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Kent Overstreet

Factor out a new helper, which returns the number of events outstanding.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/closure.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/closure.h b/include/linux/closure.h
index 36b4a83f9b..722a586bb2 100644
--- a/include/linux/closure.h
+++ b/include/linux/closure.h
@@ -172,6 +172,11 @@ void __closure_wake_up(struct closure_waitlist *list);
 bool closure_wait(struct closure_waitlist *list, struct closure *cl);
 void __closure_sync(struct closure *cl);
 
+static inline unsigned closure_nr_remaining(struct closure *cl)
+{
+	return atomic_read(&cl->remaining) & CLOSURE_REMAINING_MASK;
+}
+
 /**
  * closure_sync - sleep until a closure a closure has nothing left to wait on
  *
@@ -180,7 +185,7 @@ void __closure_sync(struct closure *cl);
  */
 static inline void closure_sync(struct closure *cl)
 {
-	if ((atomic_read(&cl->remaining) & CLOSURE_REMAINING_MASK) != 1)
+	if (closure_nr_remaining(cl) != 1)
 		__closure_sync(cl);
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 19/32] closures: Add a missing include
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (17 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 18/32] closures: closure_nr_remaining() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 20/32] vfs: factor out inode hash head calculation Kent Overstreet
                   ` (13 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Kent Overstreet

Fixes building in userspace.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 lib/closure.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/closure.c b/lib/closure.c
index b38ded00b9..0855e698ce 100644
--- a/lib/closure.c
+++ b/lib/closure.c
@@ -9,6 +9,7 @@
 #include <linux/closure.h>
 #include <linux/debugfs.h>
 #include <linux/export.h>
+#include <linux/rcupdate.h>
 #include <linux/seq_file.h>
 #include <linux/sched/debug.h>
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 20/32] vfs: factor out inode hash head calculation
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (18 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 19/32] closures: Add a missing include Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-23  9:27   ` (subset) " Christian Brauner
  2023-05-09 16:56 ` [PATCH 21/32] hlist-bl: add hlist_bl_fake() Kent Overstreet
                   ` (12 subsequent siblings)
  32 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Dave Chinner, Alexander Viro, Christian Brauner, Kent Overstreet

From: Dave Chinner <dchinner@redhat.com>

In preparation for changing the inode hash table implementation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/inode.c | 44 +++++++++++++++++++++++++-------------------
 1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 4558dc2f13..41a10bcda1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -60,6 +60,22 @@ static unsigned int i_hash_shift __read_mostly;
 static struct hlist_head *inode_hashtable __read_mostly;
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
+static unsigned long hash(struct super_block *sb, unsigned long hashval)
+{
+	unsigned long tmp;
+
+	tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
+			L1_CACHE_BYTES;
+	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> i_hash_shift);
+	return tmp & i_hash_mask;
+}
+
+static inline struct hlist_head *i_hash_head(struct super_block *sb,
+		unsigned int hashval)
+{
+	return inode_hashtable + hash(sb, hashval);
+}
+
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -506,16 +522,6 @@ static inline void inode_sb_list_del(struct inode *inode)
 	}
 }
 
-static unsigned long hash(struct super_block *sb, unsigned long hashval)
-{
-	unsigned long tmp;
-
-	tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
-			L1_CACHE_BYTES;
-	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> i_hash_shift);
-	return tmp & i_hash_mask;
-}
-
 /**
  *	__insert_inode_hash - hash an inode
  *	@inode: unhashed inode
@@ -1163,7 +1169,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
 			    int (*test)(struct inode *, void *),
 			    int (*set)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+	struct hlist_head *head = i_hash_head(inode->i_sb, hashval);
 	struct inode *old;
 
 again:
@@ -1267,7 +1273,7 @@ EXPORT_SYMBOL(iget5_locked);
  */
 struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_head *head = i_hash_head(sb, ino);
 	struct inode *inode;
 again:
 	spin_lock(&inode_hash_lock);
@@ -1335,7 +1341,7 @@ EXPORT_SYMBOL(iget_locked);
  */
 static int test_inode_iunique(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *b = inode_hashtable + hash(sb, ino);
+	struct hlist_head *b = i_hash_head(sb, ino);
 	struct inode *inode;
 
 	hlist_for_each_entry_rcu(inode, b, i_hash) {
@@ -1422,7 +1428,7 @@ EXPORT_SYMBOL(igrab);
 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_head *head = i_hash_head(sb, hashval);
 	struct inode *inode;
 
 	spin_lock(&inode_hash_lock);
@@ -1477,7 +1483,7 @@ EXPORT_SYMBOL(ilookup5);
  */
 struct inode *ilookup(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_head *head = i_hash_head(sb, ino);
 	struct inode *inode;
 again:
 	spin_lock(&inode_hash_lock);
@@ -1526,7 +1532,7 @@ struct inode *find_inode_nowait(struct super_block *sb,
 					     void *),
 				void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_head *head = i_hash_head(sb, hashval);
 	struct inode *inode, *ret_inode = NULL;
 	int mval;
 
@@ -1571,7 +1577,7 @@ EXPORT_SYMBOL(find_inode_nowait);
 struct inode *find_inode_rcu(struct super_block *sb, unsigned long hashval,
 			     int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_head *head = i_hash_head(sb, hashval);
 	struct inode *inode;
 
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
@@ -1609,7 +1615,7 @@ EXPORT_SYMBOL(find_inode_rcu);
 struct inode *find_inode_by_ino_rcu(struct super_block *sb,
 				    unsigned long ino)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_head *head = i_hash_head(sb, ino);
 	struct inode *inode;
 
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
@@ -1629,7 +1635,7 @@ int insert_inode_locked(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 	ino_t ino = inode->i_ino;
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_head *head = i_hash_head(sb, ino);
 
 	while (1) {
 		struct inode *old = NULL;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 21/32] hlist-bl: add hlist_bl_fake()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (19 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 20/32] vfs: factor out inode hash head calculation Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-10  4:48   ` Dave Chinner
  2023-05-23  9:27   ` (subset) " Christian Brauner
  2023-05-09 16:56 ` [PATCH 22/32] vfs: inode cache conversion to hash-bl Kent Overstreet
                   ` (11 subsequent siblings)
  32 siblings, 2 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Dave Chinner, Kent Overstreet

From: Dave Chinner <dchinner@redhat.com>

In preparation for switching the VFS inode cache over to hlist_bl
lists, we need to be able to fake a list node that looks like it is
hashed, for correct operation of filesystems that don't directly use
the VFS inode cache.
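
For example (sketch only - the real user is the inode cache conversion
later in this series, see remove_inode_hash()):

  /* Mark an inode as "hashed" without putting it on any hash list: */
  hlist_bl_add_fake(&inode->i_hash);

  /* Later, unhashing simply skips fake nodes: */
  if (!inode_unhashed(inode) && !hlist_bl_fake(&inode->i_hash))
          __remove_inode_hash(inode);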

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/list_bl.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
index ae1b541446..8ee2bf5af1 100644
--- a/include/linux/list_bl.h
+++ b/include/linux/list_bl.h
@@ -143,6 +143,28 @@ static inline void hlist_bl_del_init(struct hlist_bl_node *n)
 	}
 }
 
+/**
+ * hlist_bl_add_fake - create a fake list consisting of a single headless node
+ * @n: Node to make a fake list out of
+ *
+ * This makes @n appear to be its own predecessor on a headless hlist.
+ * The point of this is to allow things like hlist_bl_del() to work correctly
+ * in cases where there is no list.
+ */
+static inline void hlist_bl_add_fake(struct hlist_bl_node *n)
+{
+	n->pprev = &n->next;
+}
+
+/**
+ * hlist_bl_fake - Is this node a fake hlist_bl?
+ * @n: Node to check for being a self-referential fake hlist.
+ */
+static inline bool hlist_bl_fake(struct hlist_bl_node *n)
+{
+	return n->pprev == &n->next;
+}
+
 static inline void hlist_bl_lock(struct hlist_bl_head *b)
 {
 	bit_spin_lock(0, (unsigned long *)b);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (20 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 21/32] hlist-bl: add hlist_bl_fake() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-10  4:45   ` Dave Chinner
  2023-05-23  9:28   ` (subset) " Christian Brauner
  2023-05-09 16:56 ` [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic() Kent Overstreet
                   ` (10 subsequent siblings)
  32 siblings, 2 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Dave Chinner, Alexander Viro, Christian Brauner, Kent Overstreet

From: Dave Chinner <dchinner@redhat.com>

Because scalability of the global inode_hash_lock really, really
sucks.

32-way concurrent create on a couple of different filesystems
before:

-   52.13%     0.04%  [kernel]            [k] ext4_create
   - 52.09% ext4_create
      - 41.03% __ext4_new_inode
         - 29.92% insert_inode_locked
            - 25.35% _raw_spin_lock
               - do_raw_spin_lock
                  - 24.97% __pv_queued_spin_lock_slowpath

-   72.33%     0.02%  [kernel]            [k] do_filp_open
   - 72.31% do_filp_open
      - 72.28% path_openat
         - 57.03% bch2_create
            - 56.46% __bch2_create
               - 40.43% inode_insert5
                  - 36.07% _raw_spin_lock
                     - do_raw_spin_lock
                          35.86% __pv_queued_spin_lock_slowpath
                    4.02% find_inode

Convert the inode hash table to a RCU-aware hash-bl table just like
the dentry cache. Note that we need to store a pointer to the
hlist_bl_head the inode has been added to in the inode so that when
it comes to unhash the inode we know what list to lock. We need to
do this because the hash value that is used to hash the inode is
generated from the inode itself - filesystems can provide this
themselves so we have to either store the hash or the head pointer
in the inode to be able to find the right list head for removal...
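
In sketch form, insertion now records the list head so that removal
knows which list to lock (see __insert_inode_hash_head() in the diff
below):

  hlist_bl_lock(b);
  spin_lock(&inode->i_lock);
  hlist_bl_add_head_rcu(&inode->i_hash, b);
  inode->i_hash_head = b;         /* remembered for __remove_inode_hash() */
  spin_unlock(&inode->i_lock);
  hlist_bl_unlock(b);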

Same workload after:

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/inode.c         | 200 ++++++++++++++++++++++++++++-----------------
 include/linux/fs.h |   9 +-
 2 files changed, 132 insertions(+), 77 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 41a10bcda1..d446b054ec 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -57,8 +57,7 @@
 
 static unsigned int i_hash_mask __read_mostly;
 static unsigned int i_hash_shift __read_mostly;
-static struct hlist_head *inode_hashtable __read_mostly;
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
+static struct hlist_bl_head *inode_hashtable __read_mostly;
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
 {
@@ -70,7 +69,7 @@ static unsigned long hash(struct super_block *sb, unsigned long hashval)
 	return tmp & i_hash_mask;
 }
 
-static inline struct hlist_head *i_hash_head(struct super_block *sb,
+static inline struct hlist_bl_head *i_hash_head(struct super_block *sb,
 		unsigned int hashval)
 {
 	return inode_hashtable + hash(sb, hashval);
@@ -433,7 +432,7 @@ EXPORT_SYMBOL(address_space_init_once);
 void inode_init_once(struct inode *inode)
 {
 	memset(inode, 0, sizeof(*inode));
-	INIT_HLIST_NODE(&inode->i_hash);
+	INIT_HLIST_BL_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_io_list);
 	INIT_LIST_HEAD(&inode->i_wb_list);
@@ -522,6 +521,17 @@ static inline void inode_sb_list_del(struct inode *inode)
 	}
 }
 
+/*
+ * Ensure that we store the hash head in the inode when we insert the inode into
+ * the hlist_bl_head...
+ */
+static inline void
+__insert_inode_hash_head(struct inode *inode, struct hlist_bl_head *b)
+{
+	hlist_bl_add_head_rcu(&inode->i_hash, b);
+	inode->i_hash_head = b;
+}
+
 /**
  *	__insert_inode_hash - hash an inode
  *	@inode: unhashed inode
@@ -532,13 +542,13 @@ static inline void inode_sb_list_del(struct inode *inode)
  */
 void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 {
-	struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);
+	struct hlist_bl_head *b = i_hash_head(inode->i_sb, hashval);
 
-	spin_lock(&inode_hash_lock);
+	hlist_bl_lock(b);
 	spin_lock(&inode->i_lock);
-	hlist_add_head_rcu(&inode->i_hash, b);
+	__insert_inode_hash_head(inode, b);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_hash_lock);
+	hlist_bl_unlock(b);
 }
 EXPORT_SYMBOL(__insert_inode_hash);
 
@@ -550,11 +560,44 @@ EXPORT_SYMBOL(__insert_inode_hash);
  */
 void __remove_inode_hash(struct inode *inode)
 {
-	spin_lock(&inode_hash_lock);
-	spin_lock(&inode->i_lock);
-	hlist_del_init_rcu(&inode->i_hash);
-	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_hash_lock);
+	struct hlist_bl_head *b = inode->i_hash_head;
+
+	/*
+	 * There are some callers that come through here without synchronisation
+	 * and potentially with multiple references to the inode. Hence we have
+	 * to handle the case that we might race with a remove and insert to a
+	 * different list. Coda, in particular, seems to have a userspace API
+	 * that can directly trigger "unhash/rehash to different list" behaviour
+	 * without any serialisation at all.
+	 *
+	 * Hence we have to handle the situation where the inode->i_hash_head
+	 * might point to a different list than what we expect, indicating that
+	 * we raced with another unhash and potentially a new insertion. This
+	 * means we have to retest the head once we have everything locked up
+	 * and loop again if it doesn't match.
+	 */
+	while (b) {
+		hlist_bl_lock(b);
+		spin_lock(&inode->i_lock);
+		if (b != inode->i_hash_head) {
+			hlist_bl_unlock(b);
+			b = inode->i_hash_head;
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		/*
+		 * Need to set the pprev pointer to NULL after list removal so
+		 * that both RCU traversals and hlist_bl_unhashed() work
+		 * correctly at this point.
+		 */
+		hlist_bl_del_rcu(&inode->i_hash);
+		inode->i_hash.pprev = NULL;
+		inode->i_hash_head = NULL;
+		spin_unlock(&inode->i_lock);
+		hlist_bl_unlock(b);
+		break;
+	}
+
 }
 EXPORT_SYMBOL(__remove_inode_hash);
 
@@ -904,26 +947,28 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
 	return freed;
 }
 
-static void __wait_on_freeing_inode(struct inode *inode);
+static void __wait_on_freeing_inode(struct hlist_bl_head *b,
+				struct inode *inode);
 /*
  * Called with the inode lock held.
  */
 static struct inode *find_inode(struct super_block *sb,
-				struct hlist_head *head,
+				struct hlist_bl_head *b,
 				int (*test)(struct inode *, void *),
 				void *data)
 {
+	struct hlist_bl_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, head, i_hash) {
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
 		if (!test(inode, data))
 			continue;
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
-			__wait_on_freeing_inode(inode);
+			__wait_on_freeing_inode(b, inode);
 			goto repeat;
 		}
 		if (unlikely(inode->i_state & I_CREATING)) {
@@ -942,19 +987,20 @@ static struct inode *find_inode(struct super_block *sb,
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+				struct hlist_bl_head *b, unsigned long ino)
 {
+	struct hlist_bl_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, head, i_hash) {
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_ino != ino)
 			continue;
 		if (inode->i_sb != sb)
 			continue;
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
-			__wait_on_freeing_inode(inode);
+			__wait_on_freeing_inode(b, inode);
 			goto repeat;
 		}
 		if (unlikely(inode->i_state & I_CREATING)) {
@@ -1162,25 +1208,25 @@ EXPORT_SYMBOL(unlock_two_nondirectories);
  * return it locked, hashed, and with the I_NEW flag set. The file system gets
  * to fill it in before unlocking it via unlock_new_inode().
  *
- * Note both @test and @set are called with the inode_hash_lock held, so can't
- * sleep.
+ * Note both @test and @set are called with the inode hash chain lock held,
+ * so can't sleep.
  */
 struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
 			    int (*test)(struct inode *, void *),
 			    int (*set)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = i_hash_head(inode->i_sb, hashval);
+	struct hlist_bl_head *b = i_hash_head(inode->i_sb, hashval);
 	struct inode *old;
 
 again:
-	spin_lock(&inode_hash_lock);
-	old = find_inode(inode->i_sb, head, test, data);
+	hlist_bl_lock(b);
+	old = find_inode(inode->i_sb, b, test, data);
 	if (unlikely(old)) {
 		/*
 		 * Uhhuh, somebody else created the same inode under us.
 		 * Use the old inode instead of the preallocated one.
 		 */
-		spin_unlock(&inode_hash_lock);
+		hlist_bl_unlock(b);
 		if (IS_ERR(old))
 			return NULL;
 		wait_on_inode(old);
@@ -1202,7 +1248,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
 	 */
 	spin_lock(&inode->i_lock);
 	inode->i_state |= I_NEW;
-	hlist_add_head_rcu(&inode->i_hash, head);
+	__insert_inode_hash_head(inode, b);
 	spin_unlock(&inode->i_lock);
 
 	/*
@@ -1212,7 +1258,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
 	if (list_empty(&inode->i_sb_list))
 		inode_sb_list_add(inode);
 unlock:
-	spin_unlock(&inode_hash_lock);
+	hlist_bl_unlock(b);
 
 	return inode;
 }
@@ -1273,12 +1319,12 @@ EXPORT_SYMBOL(iget5_locked);
  */
 struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = i_hash_head(sb, ino);
+	struct hlist_bl_head *b = i_hash_head(sb, ino);
 	struct inode *inode;
 again:
-	spin_lock(&inode_hash_lock);
-	inode = find_inode_fast(sb, head, ino);
-	spin_unlock(&inode_hash_lock);
+	hlist_bl_lock(b);
+	inode = find_inode_fast(sb, b, ino);
+	hlist_bl_unlock(b);
 	if (inode) {
 		if (IS_ERR(inode))
 			return NULL;
@@ -1294,17 +1340,17 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 	if (inode) {
 		struct inode *old;
 
-		spin_lock(&inode_hash_lock);
+		hlist_bl_lock(b);
 		/* We released the lock, so.. */
-		old = find_inode_fast(sb, head, ino);
+		old = find_inode_fast(sb, b, ino);
 		if (!old) {
 			inode->i_ino = ino;
 			spin_lock(&inode->i_lock);
 			inode->i_state = I_NEW;
-			hlist_add_head_rcu(&inode->i_hash, head);
+			__insert_inode_hash_head(inode, b);
 			spin_unlock(&inode->i_lock);
 			inode_sb_list_add(inode);
-			spin_unlock(&inode_hash_lock);
+			hlist_bl_unlock(b);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -1317,7 +1363,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		spin_unlock(&inode_hash_lock);
+		hlist_bl_unlock(b);
 		destroy_inode(inode);
 		if (IS_ERR(old))
 			return NULL;
@@ -1341,10 +1387,11 @@ EXPORT_SYMBOL(iget_locked);
  */
 static int test_inode_iunique(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *b = i_hash_head(sb, ino);
+	struct hlist_bl_head *b = i_hash_head(sb, ino);
+	struct hlist_bl_node *node;
 	struct inode *inode;
 
-	hlist_for_each_entry_rcu(inode, b, i_hash) {
+	hlist_bl_for_each_entry_rcu(inode, node, b, i_hash) {
 		if (inode->i_ino == ino && inode->i_sb == sb)
 			return 0;
 	}
@@ -1428,12 +1475,12 @@ EXPORT_SYMBOL(igrab);
 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = i_hash_head(sb, hashval);
+	struct hlist_bl_head *b = i_hash_head(sb, hashval);
 	struct inode *inode;
 
-	spin_lock(&inode_hash_lock);
-	inode = find_inode(sb, head, test, data);
-	spin_unlock(&inode_hash_lock);
+	hlist_bl_lock(b);
+	inode = find_inode(sb, b, test, data);
+	hlist_bl_unlock(b);
 
 	return IS_ERR(inode) ? NULL : inode;
 }
@@ -1483,12 +1530,12 @@ EXPORT_SYMBOL(ilookup5);
  */
 struct inode *ilookup(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = i_hash_head(sb, ino);
+	struct hlist_bl_head *b = i_hash_head(sb, ino);
 	struct inode *inode;
 again:
-	spin_lock(&inode_hash_lock);
-	inode = find_inode_fast(sb, head, ino);
-	spin_unlock(&inode_hash_lock);
+	hlist_bl_lock(b);
+	inode = find_inode_fast(sb, b, ino);
+	hlist_bl_unlock(b);
 
 	if (inode) {
 		if (IS_ERR(inode))
@@ -1532,12 +1579,13 @@ struct inode *find_inode_nowait(struct super_block *sb,
 					     void *),
 				void *data)
 {
-	struct hlist_head *head = i_hash_head(sb, hashval);
+	struct hlist_bl_head *b = i_hash_head(sb, hashval);
+	struct hlist_bl_node *node;
 	struct inode *inode, *ret_inode = NULL;
 	int mval;
 
-	spin_lock(&inode_hash_lock);
-	hlist_for_each_entry(inode, head, i_hash) {
+	hlist_bl_lock(b);
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
 		mval = match(inode, hashval, data);
@@ -1548,7 +1596,7 @@ struct inode *find_inode_nowait(struct super_block *sb,
 		goto out;
 	}
 out:
-	spin_unlock(&inode_hash_lock);
+	hlist_bl_unlock(b);
 	return ret_inode;
 }
 EXPORT_SYMBOL(find_inode_nowait);
@@ -1577,13 +1625,14 @@ EXPORT_SYMBOL(find_inode_nowait);
 struct inode *find_inode_rcu(struct super_block *sb, unsigned long hashval,
 			     int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = i_hash_head(sb, hashval);
+	struct hlist_bl_head *b = i_hash_head(sb, hashval);
+	struct hlist_bl_node *node;
 	struct inode *inode;
 
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
 			 "suspicious find_inode_rcu() usage");
 
-	hlist_for_each_entry_rcu(inode, head, i_hash) {
+	hlist_bl_for_each_entry_rcu(inode, node, b, i_hash) {
 		if (inode->i_sb == sb &&
 		    !(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)) &&
 		    test(inode, data))
@@ -1615,13 +1664,14 @@ EXPORT_SYMBOL(find_inode_rcu);
 struct inode *find_inode_by_ino_rcu(struct super_block *sb,
 				    unsigned long ino)
 {
-	struct hlist_head *head = i_hash_head(sb, ino);
+	struct hlist_bl_head *b = i_hash_head(sb, ino);
+	struct hlist_bl_node *node;
 	struct inode *inode;
 
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
 			 "suspicious find_inode_by_ino_rcu() usage");
 
-	hlist_for_each_entry_rcu(inode, head, i_hash) {
+	hlist_bl_for_each_entry_rcu(inode, node, b, i_hash) {
 		if (inode->i_ino == ino &&
 		    inode->i_sb == sb &&
 		    !(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)))
@@ -1635,39 +1685,42 @@ int insert_inode_locked(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 	ino_t ino = inode->i_ino;
-	struct hlist_head *head = i_hash_head(sb, ino);
+	struct hlist_bl_head *b = i_hash_head(sb, ino);
 
 	while (1) {
-		struct inode *old = NULL;
-		spin_lock(&inode_hash_lock);
-		hlist_for_each_entry(old, head, i_hash) {
-			if (old->i_ino != ino)
+		struct hlist_bl_node *node;
+		struct inode *old = NULL, *t;
+
+		hlist_bl_lock(b);
+		hlist_bl_for_each_entry(t, node, b, i_hash) {
+			if (t->i_ino != ino)
 				continue;
-			if (old->i_sb != sb)
+			if (t->i_sb != sb)
 				continue;
-			spin_lock(&old->i_lock);
-			if (old->i_state & (I_FREEING|I_WILL_FREE)) {
-				spin_unlock(&old->i_lock);
+			spin_lock(&t->i_lock);
+			if (t->i_state & (I_FREEING|I_WILL_FREE)) {
+				spin_unlock(&t->i_lock);
 				continue;
 			}
+			old = t;
 			break;
 		}
 		if (likely(!old)) {
 			spin_lock(&inode->i_lock);
 			inode->i_state |= I_NEW | I_CREATING;
-			hlist_add_head_rcu(&inode->i_hash, head);
+			__insert_inode_hash_head(inode, b);
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&inode_hash_lock);
+			hlist_bl_unlock(b);
 			return 0;
 		}
 		if (unlikely(old->i_state & I_CREATING)) {
 			spin_unlock(&old->i_lock);
-			spin_unlock(&inode_hash_lock);
+			hlist_bl_unlock(b);
 			return -EBUSY;
 		}
 		__iget(old);
 		spin_unlock(&old->i_lock);
-		spin_unlock(&inode_hash_lock);
+		hlist_bl_unlock(b);
 		wait_on_inode(old);
 		if (unlikely(!inode_unhashed(old))) {
 			iput(old);
@@ -2192,17 +2245,18 @@ EXPORT_SYMBOL(inode_needs_sync);
  * wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list
  * will DTRT.
  */
-static void __wait_on_freeing_inode(struct inode *inode)
+static void __wait_on_freeing_inode(struct hlist_bl_head *b,
+				struct inode *inode)
 {
 	wait_queue_head_t *wq;
 	DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_hash_lock);
+	hlist_bl_unlock(b);
 	schedule();
 	finish_wait(wq, &wait.wq_entry);
-	spin_lock(&inode_hash_lock);
+	hlist_bl_lock(b);
 }
 
 static __initdata unsigned long ihash_entries;
@@ -2228,7 +2282,7 @@ void __init inode_init_early(void)
 
 	inode_hashtable =
 		alloc_large_system_hash("Inode-cache",
-					sizeof(struct hlist_head),
+					sizeof(struct hlist_bl_head),
 					ihash_entries,
 					14,
 					HASH_EARLY | HASH_ZERO,
@@ -2254,7 +2308,7 @@ void __init inode_init(void)
 
 	inode_hashtable =
 		alloc_large_system_hash("Inode-cache",
-					sizeof(struct hlist_head),
+					sizeof(struct hlist_bl_head),
 					ihash_entries,
 					14,
 					HASH_ZERO,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1a6f951942..db8d49cbf7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -647,7 +647,8 @@ struct inode {
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 	unsigned long		dirtied_time_when;
 
-	struct hlist_node	i_hash;
+	struct hlist_bl_node	i_hash;
+	struct hlist_bl_head	*i_hash_head;
 	struct list_head	i_io_list;	/* backing dev IO list */
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct bdi_writeback	*i_wb;		/* the associated cgroup wb */
@@ -713,7 +714,7 @@ static inline unsigned int i_blocksize(const struct inode *node)
 
 static inline int inode_unhashed(struct inode *inode)
 {
-	return hlist_unhashed(&inode->i_hash);
+	return hlist_bl_unhashed(&inode->i_hash);
 }
 
 /*
@@ -724,7 +725,7 @@ static inline int inode_unhashed(struct inode *inode)
  */
 static inline void inode_fake_hash(struct inode *inode)
 {
-	hlist_add_fake(&inode->i_hash);
+	hlist_bl_add_fake(&inode->i_hash);
 }
 
 /*
@@ -2695,7 +2696,7 @@ static inline void insert_inode_hash(struct inode *inode)
 extern void __remove_inode_hash(struct inode *);
 static inline void remove_inode_hash(struct inode *inode)
 {
-	if (!inode_unhashed(inode) && !hlist_fake(&inode->i_hash))
+	if (!inode_unhashed(inode) && !hlist_bl_fake(&inode->i_hash))
 		__remove_inode_hash(inode);
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (21 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 22/32] vfs: inode cache conversion to hash-bl Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-10  2:20   ` kernel test robot
  2023-05-11  2:08   ` kernel test robot
  2023-05-09 16:56 ` [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree Kent Overstreet
                   ` (9 subsequent siblings)
  32 siblings, 2 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Alexander Viro, Matthew Wilcox

Add a foliated version of copy_page_from_iter_atomic()
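
A minimal caller sketch (variable names are illustrative, not from this
patch):

  size_t copied = copy_folio_from_iter_atomic(folio, offset, bytes, iter);

  /*
   * As with copy_page_from_iter_atomic(), 'copied' may be short of
   * 'bytes' if the source couldn't be copied without faulting; callers
   * fault the iterator in and retry.
   */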

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
---
 include/linux/uio.h |  2 ++
 lib/iov_iter.c      | 53 ++++++++++++++++++++++++++++++++++++---------
 2 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 27e3fd9429..b2c281cb10 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -154,6 +154,8 @@ static inline struct iovec iov_iter_iovec(const struct iov_iter *iter)
 
 size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
 				  size_t bytes, struct iov_iter *i);
+size_t copy_folio_from_iter_atomic(struct folio *folio, size_t offset,
+				   size_t bytes, struct iov_iter *i);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
 size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 274014e4ea..27ba7e9f9e 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -800,18 +800,10 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_zero);
 
-size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
-				  struct iov_iter *i)
+static inline size_t __copy_page_from_iter_atomic(struct page *page, unsigned offset,
+						  size_t bytes, struct iov_iter *i)
 {
 	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
-	if (!page_copy_sane(page, offset, bytes)) {
-		kunmap_atomic(kaddr);
-		return 0;
-	}
-	if (WARN_ON_ONCE(!i->data_source)) {
-		kunmap_atomic(kaddr);
-		return 0;
-	}
 	iterate_and_advance(i, bytes, base, len, off,
 		copyin(p + off, base, len),
 		memcpy(p + off, base, len)
@@ -819,8 +811,49 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t byt
 	kunmap_atomic(kaddr);
 	return bytes;
 }
+
+size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
+				  struct iov_iter *i)
+{
+	if (!page_copy_sane(page, offset, bytes))
+		return 0;
+	if (WARN_ON_ONCE(!i->data_source))
+		return 0;
+	return __copy_page_from_iter_atomic(page, offset, bytes, i);
+}
 EXPORT_SYMBOL(copy_page_from_iter_atomic);
 
+size_t copy_folio_from_iter_atomic(struct folio *folio, size_t offset,
+				   size_t bytes, struct iov_iter *i)
+{
+	size_t ret = 0;
+
+	if (WARN_ON(offset + bytes > folio_size(folio)))
+		return 0;
+	if (WARN_ON_ONCE(!i->data_source))
+		return 0;
+
+#ifdef CONFIG_HIGHMEM
+	while (bytes) {
+		struct page *page = folio_page(folio, offset >> PAGE_SHIFT);
+		unsigned b = min(bytes, PAGE_SIZE - (offset & PAGE_MASK));
+		unsigned r = __copy_page_from_iter_atomic(page, offset, b, i);
+
+		offset	+= r;
+		bytes	-= r;
+		ret	+= r;
+
+		if (r != b)
+			break;
+	}
+#else
+	ret = __copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
+#endif
+
+	return ret;
+}
+EXPORT_SYMBOL(copy_folio_from_iter_atomic);
+
 static void pipe_advance(struct iov_iter *i, size_t size)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (22 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 21:03   ` Randy Dunlap
  2023-05-09 16:56 ` [PATCH 25/32] lib/generic-radix-tree.c: Don't overflow in peek() Kent Overstreet
                   ` (8 subsequent siblings)
  32 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Kent Overstreet

lib/generic-radix-tree.c is a simple radix tree that supports storing
arbitrary types. Add a MAINTAINERS entry for it.
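
For reference, a minimal usage sketch (API names as declared in
include/linux/generic-radix-tree.h; 'struct foo' and the index are
illustrative):

  DEFINE_GENRADIX(foo_table, struct foo);
  struct foo *p;

  genradix_init(&foo_table);
  p = genradix_ptr_alloc(&foo_table, idx, GFP_KERNEL);
  if (p)
          p->bar = 1;
  genradix_free(&foo_table);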

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 MAINTAINERS | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 5d76169140..c550f5909e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8615,6 +8615,13 @@ F:	Documentation/devicetree/bindings/power/power?domain*
 F:	drivers/base/power/domain*.c
 F:	include/linux/pm_domain.h
 
+GENERIC RADIX TREE:
+M:	Kent Overstreet <kent.overstreet@linux.dev>
+S:	Supported
+C:	irc://irc.oftc.net/bcache
+F:	include/linux/generic-radix-tree.h
+F:	lib/generic-radix-tree.c
+
 GENERIC RESISTIVE TOUCHSCREEN ADC DRIVER
 M:	Eugen Hristev <eugen.hristev@microchip.com>
 L:	linux-input@vger.kernel.org
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 25/32] lib/generic-radix-tree.c: Don't overflow in peek()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (23 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 26/32] lib/generic-radix-tree.c: Add a missing include Kent Overstreet
                   ` (7 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet

From: Kent Overstreet <kent.overstreet@gmail.com>

When we started spreading new inode numbers throughout most of the
64-bit inode space, that triggered some corner-case bugs, in particular
some integer overflows related to the radix tree code. Oops.
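
Concretely, the overflow being guarded against looks like this
(illustrative values, 64-bit size_t):

  /* iter->offset near SIZE_MAX, e.g. from a very large inode number: */
  iter->offset + genradix_depth_size(level)   /* wraps past SIZE_MAX */

  /*
   * Before this fix, round_down() on the wrapped value could move the
   * iterator backwards and loop; the new check detects the wrap and
   * parks the iterator at SIZE_MAX so peek() returns NULL.
   */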

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/generic-radix-tree.h |  6 ++++++
 lib/generic-radix-tree.c           | 17 ++++++++++++++---
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
index 107613f7d7..63080822dc 100644
--- a/include/linux/generic-radix-tree.h
+++ b/include/linux/generic-radix-tree.h
@@ -184,6 +184,12 @@ void *__genradix_iter_peek(struct genradix_iter *, struct __genradix *, size_t);
 static inline void __genradix_iter_advance(struct genradix_iter *iter,
 					   size_t obj_size)
 {
+	if (iter->offset + obj_size < iter->offset) {
+		iter->offset	= SIZE_MAX;
+		iter->pos	= SIZE_MAX;
+		return;
+	}
+
 	iter->offset += obj_size;
 
 	if (!is_power_of_2(obj_size) &&
diff --git a/lib/generic-radix-tree.c b/lib/generic-radix-tree.c
index f25eb111c0..7dfa88282b 100644
--- a/lib/generic-radix-tree.c
+++ b/lib/generic-radix-tree.c
@@ -166,6 +166,10 @@ void *__genradix_iter_peek(struct genradix_iter *iter,
 	struct genradix_root *r;
 	struct genradix_node *n;
 	unsigned level, i;
+
+	if (iter->offset == SIZE_MAX)
+		return NULL;
+
 restart:
 	r = READ_ONCE(radix->root);
 	if (!r)
@@ -184,10 +188,17 @@ void *__genradix_iter_peek(struct genradix_iter *iter,
 			(GENRADIX_ARY - 1);
 
 		while (!n->children[i]) {
+			size_t objs_per_ptr = genradix_depth_size(level);
+
+			if (iter->offset + objs_per_ptr < iter->offset) {
+				iter->offset	= SIZE_MAX;
+				iter->pos	= SIZE_MAX;
+				return NULL;
+			}
+
 			i++;
-			iter->offset = round_down(iter->offset +
-					   genradix_depth_size(level),
-					   genradix_depth_size(level));
+			iter->offset = round_down(iter->offset + objs_per_ptr,
+						  objs_per_ptr);
 			iter->pos = (iter->offset >> PAGE_SHIFT) *
 				objs_per_page;
 			if (i == GENRADIX_ARY)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 26/32] lib/generic-radix-tree.c: Add a missing include
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (24 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 25/32] lib/generic-radix-tree.c: Don't overflow in peek() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 27/32] lib/generic-radix-tree.c: Add peek_prev() Kent Overstreet
                   ` (6 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet

From: Kent Overstreet <kent.overstreet@gmail.com>

We now need linux/limits.h for SIZE_MAX.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/generic-radix-tree.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
index 63080822dc..f6cd0f909d 100644
--- a/include/linux/generic-radix-tree.h
+++ b/include/linux/generic-radix-tree.h
@@ -38,6 +38,7 @@
 
 #include <asm/page.h>
 #include <linux/bug.h>
+#include <linux/limits.h>
 #include <linux/log2.h>
 #include <linux/math.h>
 #include <linux/types.h>
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 27/32] lib/generic-radix-tree.c: Add peek_prev()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (25 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 26/32] lib/generic-radix-tree.c: Add a missing include Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 28/32] stacktrace: Export stack_trace_save_tsk Kent Overstreet
                   ` (5 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet

From: Kent Overstreet <kent.overstreet@gmail.com>

This patch adds genradix_peek_prev(), genradix_iter_rewind(), and
genradix_for_each_reverse(), for iterating backwards over a generic
radix tree.
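
As a usage sketch (not part of the patch - 'foo_table' and 'struct foo' are
made-up names, with foo_table declared as a GENRADIX(struct foo)), reverse
iteration with the new helper looks like:

	struct genradix_iter iter;
	struct foo *f;

	/* genradix_for_each_reverse() initializes iter itself, starting from
	 * genradix_last_pos() and rewinding one entry per iteration: */
	genradix_for_each_reverse(&foo_table, iter, f)
		pr_info("entry at index %zu\n", iter.pos);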

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/generic-radix-tree.h | 61 +++++++++++++++++++++++++++++-
 lib/generic-radix-tree.c           | 59 +++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+), 1 deletion(-)

diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
index f6cd0f909d..c74b737699 100644
--- a/include/linux/generic-radix-tree.h
+++ b/include/linux/generic-radix-tree.h
@@ -117,6 +117,11 @@ static inline size_t __idx_to_offset(size_t idx, size_t obj_size)
 
 #define __genradix_cast(_radix)		(typeof((_radix)->type[0]) *)
 #define __genradix_obj_size(_radix)	sizeof((_radix)->type[0])
+#define __genradix_objs_per_page(_radix)			\
+	(PAGE_SIZE / sizeof((_radix)->type[0]))
+#define __genradix_page_remainder(_radix)			\
+	(PAGE_SIZE % sizeof((_radix)->type[0]))
+
 #define __genradix_idx_to_offset(_radix, _idx)			\
 	__idx_to_offset(_idx, __genradix_obj_size(_radix))
 
@@ -180,7 +185,25 @@ void *__genradix_iter_peek(struct genradix_iter *, struct __genradix *, size_t);
 #define genradix_iter_peek(_iter, _radix)			\
 	(__genradix_cast(_radix)				\
 	 __genradix_iter_peek(_iter, &(_radix)->tree,		\
-			      PAGE_SIZE / __genradix_obj_size(_radix)))
+			__genradix_objs_per_page(_radix)))
+
+void *__genradix_iter_peek_prev(struct genradix_iter *, struct __genradix *,
+				size_t, size_t);
+
+/**
+ * genradix_iter_peek_prev - get first entry at or below iterator's current
+ *			position
+ * @_iter:	a genradix_iter
+ * @_radix:	genradix being iterated over
+ *
+ * If no more entries exist at or below @_iter's current position, returns NULL
+ */
+#define genradix_iter_peek_prev(_iter, _radix)			\
+	(__genradix_cast(_radix)				\
+	 __genradix_iter_peek_prev(_iter, &(_radix)->tree,	\
+			__genradix_objs_per_page(_radix),	\
+			__genradix_obj_size(_radix) +		\
+			__genradix_page_remainder(_radix)))
 
 static inline void __genradix_iter_advance(struct genradix_iter *iter,
 					   size_t obj_size)
@@ -203,6 +226,25 @@ static inline void __genradix_iter_advance(struct genradix_iter *iter,
 #define genradix_iter_advance(_iter, _radix)			\
 	__genradix_iter_advance(_iter, __genradix_obj_size(_radix))
 
+static inline void __genradix_iter_rewind(struct genradix_iter *iter,
+					  size_t obj_size)
+{
+	if (iter->offset == 0 ||
+	    iter->offset == SIZE_MAX) {
+		iter->offset = SIZE_MAX;
+		return;
+	}
+
+	if ((iter->offset & (PAGE_SIZE - 1)) == 0)
+		iter->offset -= PAGE_SIZE % obj_size;
+
+	iter->offset -= obj_size;
+	iter->pos--;
+}
+
+#define genradix_iter_rewind(_iter, _radix)			\
+	__genradix_iter_rewind(_iter, __genradix_obj_size(_radix))
+
 #define genradix_for_each_from(_radix, _iter, _p, _start)	\
 	for (_iter = genradix_iter_init(_radix, _start);	\
 	     (_p = genradix_iter_peek(&_iter, _radix)) != NULL;	\
@@ -220,6 +262,23 @@ static inline void __genradix_iter_advance(struct genradix_iter *iter,
 #define genradix_for_each(_radix, _iter, _p)			\
 	genradix_for_each_from(_radix, _iter, _p, 0)
 
+#define genradix_last_pos(_radix)				\
+	(SIZE_MAX / PAGE_SIZE * __genradix_objs_per_page(_radix) - 1)
+
+/**
+ * genradix_for_each_reverse - iterate over entry in a genradix, reverse order
+ * @_radix:	genradix to iterate over
+ * @_iter:	a genradix_iter to track current position
+ * @_p:		pointer to genradix entry type
+ *
+ * On every iteration, @_p will point to the current entry, and @_iter.pos
+ * will be the current entry's index.
+ */
+#define genradix_for_each_reverse(_radix, _iter, _p)		\
+	for (_iter = genradix_iter_init(_radix,	genradix_last_pos(_radix));\
+	     (_p = genradix_iter_peek_prev(&_iter, _radix)) != NULL;\
+	     genradix_iter_rewind(&_iter, _radix))
+
 int __genradix_prealloc(struct __genradix *, size_t, gfp_t);
 
 /**
diff --git a/lib/generic-radix-tree.c b/lib/generic-radix-tree.c
index 7dfa88282b..41f1bcdc44 100644
--- a/lib/generic-radix-tree.c
+++ b/lib/generic-radix-tree.c
@@ -1,4 +1,5 @@
 
+#include <linux/atomic.h>
 #include <linux/export.h>
 #include <linux/generic-radix-tree.h>
 #include <linux/gfp.h>
@@ -212,6 +213,64 @@ void *__genradix_iter_peek(struct genradix_iter *iter,
 }
 EXPORT_SYMBOL(__genradix_iter_peek);
 
+void *__genradix_iter_peek_prev(struct genradix_iter *iter,
+				struct __genradix *radix,
+				size_t objs_per_page,
+				size_t obj_size_plus_page_remainder)
+{
+	struct genradix_root *r;
+	struct genradix_node *n;
+	unsigned level, i;
+
+	if (iter->offset == SIZE_MAX)
+		return NULL;
+
+restart:
+	r = READ_ONCE(radix->root);
+	if (!r)
+		return NULL;
+
+	n	= genradix_root_to_node(r);
+	level	= genradix_root_to_depth(r);
+
+	if (ilog2(iter->offset) >= genradix_depth_shift(level)) {
+		iter->offset = genradix_depth_size(level);
+		iter->pos = (iter->offset >> PAGE_SHIFT) * objs_per_page;
+
+		iter->offset -= obj_size_plus_page_remainder;
+		iter->pos--;
+	}
+
+	while (level) {
+		level--;
+
+		i = (iter->offset >> genradix_depth_shift(level)) &
+			(GENRADIX_ARY - 1);
+
+		while (!n->children[i]) {
+			size_t objs_per_ptr = genradix_depth_size(level);
+
+			iter->offset = round_down(iter->offset, objs_per_ptr);
+			iter->pos = (iter->offset >> PAGE_SHIFT) * objs_per_page;
+
+			if (!iter->offset)
+				return NULL;
+
+			iter->offset -= obj_size_plus_page_remainder;
+			iter->pos--;
+
+			if (!i)
+				goto restart;
+			--i;
+		}
+
+		n = n->children[i];
+	}
+
+	return &n->data[iter->offset & (PAGE_SIZE - 1)];
+}
+EXPORT_SYMBOL(__genradix_iter_peek_prev);
+
 static void genradix_free_recurse(struct genradix_node *n, unsigned level)
 {
 	if (level) {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 28/32] stacktrace: Export stack_trace_save_tsk
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (26 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 27/32] lib/generic-radix-tree.c: Add peek_prev() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-06-19  9:10   ` Mark Rutland
  2023-05-09 16:56 ` [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote Kent Overstreet
                   ` (4 subsequent siblings)
  32 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Christopher James Halse Rogers, Kent Overstreet

From: Christopher James Halse Rogers <raof@ubuntu.com>

The bcachefs module wants it, and there doesn't seem to be any
reason it shouldn't be exported like the other functions.

Signed-off-by: Christopher James Halse Rogers <raof@ubuntu.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 kernel/stacktrace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index 9ed5ce9894..4f65824879 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -151,6 +151,7 @@ unsigned int stack_trace_save_tsk(struct task_struct *tsk, unsigned long *store,
 	put_task_stack(tsk);
 	return c.len;
 }
+EXPORT_SYMBOL_GPL(stack_trace_save_tsk);
 
 /**
  * stack_trace_save_regs - Save a stack trace based on pt_regs into a storage array
@@ -301,6 +302,7 @@ unsigned int stack_trace_save_tsk(struct task_struct *task,
 	save_stack_trace_tsk(task, &trace);
 	return trace.nr_entries;
 }
+EXPORT_SYMBOL_GPL(stack_trace_save_tsk);
 
 /**
  * stack_trace_save_regs - Save a stack trace based on pt_regs into a storage array
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (27 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 28/32] stacktrace: Export stack_trace_save_tsk Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-07-12 19:58   ` Kees Cook
  2023-05-09 16:56 ` [PATCH 30/32] lib: Export errname Kent Overstreet
                   ` (3 subsequent siblings)
  32 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet

From: Kent Overstreet <kent.overstreet@gmail.com>

printbuf now needs to know the number of characters that would have been
written if the buffer was too small, like snprintf(); this changes
string_get_size() to return the return value of snprintf().
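
For illustration (not part of the patch), the snprintf()-style return value
lets callers detect truncation in the usual way:

	char buf[8];
	int ret = string_get_size(size, 1, STRING_UNITS_2, buf, sizeof(buf));

	/* As with snprintf(), ret is the length that would have been written
	 * given a large enough buffer, so ret >= sizeof(buf) means the output
	 * was truncated. */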

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/string_helpers.h | 4 ++--
 lib/string_helpers.c           | 8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h
index fae6beaaa2..44148f8feb 100644
--- a/include/linux/string_helpers.h
+++ b/include/linux/string_helpers.h
@@ -23,8 +23,8 @@ enum string_size_units {
 	STRING_UNITS_2,		/* use binary powers of 2^10 */
 };
 
-void string_get_size(u64 size, u64 blk_size, enum string_size_units units,
-		     char *buf, int len);
+int string_get_size(u64 size, u64 blk_size, enum string_size_units units,
+		    char *buf, int len);
 
 int parse_int_array_user(const char __user *from, size_t count, int **array);
 
diff --git a/lib/string_helpers.c b/lib/string_helpers.c
index 230020a2e0..ca36ceba0e 100644
--- a/lib/string_helpers.c
+++ b/lib/string_helpers.c
@@ -32,8 +32,8 @@
  * at least 9 bytes and will always be zero terminated.
  *
  */
-void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
-		     char *buf, int len)
+int string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
+		    char *buf, int len)
 {
 	static const char *const units_10[] = {
 		"B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"
@@ -126,8 +126,8 @@ void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
 	else
 		unit = units_str[units][i];
 
-	snprintf(buf, len, "%u%s %s", (u32)size,
-		 tmp, unit);
+	return snprintf(buf, len, "%u%s %s", (u32)size,
+			tmp, unit);
 }
 EXPORT_SYMBOL(string_get_size);
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 30/32] lib: Export errname
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (28 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 31/32] lib: add mean and variance module Kent Overstreet
                   ` (2 subsequent siblings)
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Christopher James Halse Rogers

The bcachefs module now wants this and it seems sensible.

Signed-off-by: Christopher James Halse Rogers <raof@ubuntu.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 lib/errname.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/errname.c b/lib/errname.c
index 67739b174a..dd1b998552 100644
--- a/lib/errname.c
+++ b/lib/errname.c
@@ -228,3 +228,4 @@ const char *errname(int err)
 
 	return err > 0 ? name + 1 : name;
 }
+EXPORT_SYMBOL(errname);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 31/32] lib: add mean and variance module.
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (29 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 30/32] lib: Export errname Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 32/32] MAINTAINERS: Add entry for bcachefs Kent Overstreet
  2023-06-15 20:41 ` [PATCH 00/32] bcachefs - a new COW filesystem Pavel Machek
  32 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Daniel Hill, Kent Overstreet

From: Daniel Hill <daniel@gluo.nz>

This module provides a fast 64-bit implementation of basic statistics
functions, including mean, variance and standard deviation in both
weighted and unweighted variants; the unweighted variant has a 32-bit
limitation per sample to prevent overflow when squaring.
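
A short usage sketch (not part of the patch; it mirrors the kunit tests
added below):

	struct mean_and_variance s = {};
	struct mean_and_variance_weighted w = {};

	s = mean_and_variance_update(s, 2);
	s = mean_and_variance_update(s, 4);
	/* mean_and_variance_get_mean(s) == 3, _get_variance(s) == 1 */

	w.w = 2;	/* weight = 2^2; must be set before the first update */
	w = mean_and_variance_weighted_update(w, 10);
	/* mean_and_variance_weighted_get_mean(w) == 10, _get_variance(w) == 0 */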

Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 MAINTAINERS                       |   9 ++
 include/linux/mean_and_variance.h | 219 ++++++++++++++++++++++++++++++
 lib/Kconfig.debug                 |   9 ++
 lib/math/Kconfig                  |   3 +
 lib/math/Makefile                 |   2 +
 lib/math/mean_and_variance.c      | 136 +++++++++++++++++++
 lib/math/mean_and_variance_test.c | 155 +++++++++++++++++++++
 7 files changed, 533 insertions(+)
 create mode 100644 include/linux/mean_and_variance.h
 create mode 100644 lib/math/mean_and_variance.c
 create mode 100644 lib/math/mean_and_variance_test.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c550f5909e..dbf3c33c31 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12767,6 +12767,15 @@ F:	Documentation/devicetree/bindings/net/ieee802154/mcr20a.txt
 F:	drivers/net/ieee802154/mcr20a.c
 F:	drivers/net/ieee802154/mcr20a.h
 
+MEAN AND VARIANCE LIBRARY
+M:	Daniel B. Hill <daniel@gluo.nz>
+M:	Kent Overstreet <kent.overstreet@linux.dev>
+S:	Maintained
+T:	git https://github.com/YellowOnion/linux/
+F:	include/linux/mean_and_variance.h
+F:	lib/math/mean_and_variance.c
+F:	lib/math/mean_and_variance_test.c
+
 MEASUREMENT COMPUTING CIO-DAC IIO DRIVER
 M:	William Breathitt Gray <william.gray@linaro.org>
 L:	linux-iio@vger.kernel.org
diff --git a/include/linux/mean_and_variance.h b/include/linux/mean_and_variance.h
new file mode 100644
index 0000000000..89540628e8
--- /dev/null
+++ b/include/linux/mean_and_variance.h
@@ -0,0 +1,219 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef MEAN_AND_VARIANCE_H_
+#define MEAN_AND_VARIANCE_H_
+
+#include <linux/types.h>
+#include <linux/limits.h>
+#include <linux/math64.h>
+
+#define SQRT_U64_MAX 4294967295ULL
+
+
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+
+typedef unsigned __int128 u128;
+
+static inline u128 u64_to_u128(u64 a)
+{
+	return (u128)a;
+}
+
+static inline u64 u128_to_u64(u128 a)
+{
+	return (u64)a;
+}
+
+static inline u64 u128_shr64_to_u64(u128 a)
+{
+	return (u64)(a >> 64);
+}
+
+static inline u128 u128_add(u128 a, u128 b)
+{
+	return a + b;
+}
+
+static inline u128 u128_sub(u128 a, u128 b)
+{
+	return a - b;
+}
+
+static inline u128 u128_shl(u128 i, s8 shift)
+{
+	return i << shift;
+}
+
+static inline u128 u128_shl64_add(u64 a, u64 b)
+{
+	return ((u128)a << 64) + b;
+}
+
+static inline u128 u128_square(u64 i)
+{
+	return i*i;
+}
+
+#else
+
+typedef struct {
+	u64 hi, lo;
+} u128;
+
+static inline u128 u64_to_u128(u64 a)
+{
+	return (u128){ .lo = a };
+}
+
+static inline u64 u128_to_u64(u128 a)
+{
+	return a.lo;
+}
+
+static inline u64 u128_shr64_to_u64(u128 a)
+{
+	return a.hi;
+}
+
+static inline u128 u128_add(u128 a, u128 b)
+{
+	u128 c;
+
+	c.lo = a.lo + b.lo;
+	c.hi = a.hi + b.hi + (c.lo < a.lo);
+	return c;
+}
+
+static inline u128 u128_sub(u128 a, u128 b)
+{
+	u128 c;
+
+	c.lo = a.lo - b.lo;
+	c.hi = a.hi - b.hi - (c.lo > a.lo);
+	return c;
+}
+
+static inline u128 u128_shl(u128 i, s8 shift)
+{
+	u128 r;
+
+	r.lo = i.lo << shift;
+	if (shift < 64)
+		r.hi = (i.hi << shift) | (i.lo >> (64 - shift));
+	else {
+		r.hi = i.lo << (shift - 64);
+		r.lo = 0;
+	}
+	return r;
+}
+
+static inline u128 u128_shl64_add(u64 a, u64 b)
+{
+	return u128_add(u128_shl(u64_to_u128(a), 64), u64_to_u128(b));
+}
+
+static inline u128 u128_square(u64 i)
+{
+	u128 r;
+	u64  h = i >> 32, l = i & (u64)U32_MAX;
+
+	r =             u128_shl(u64_to_u128(h*h), 64);
+	r = u128_add(r, u128_shl(u64_to_u128(h*l), 32));
+	r = u128_add(r, u128_shl(u64_to_u128(l*h), 32));
+	r = u128_add(r,          u64_to_u128(l*l));
+	return r;
+}
+
+#endif
+
+static inline u128 u128_div(u128 n, u64 d)
+{
+	u128 r;
+	u64 rem;
+	u64 hi = u128_shr64_to_u64(n);
+	u64 lo = u128_to_u64(n);
+	u64  h =  hi & ((u64)U32_MAX  << 32);
+	u64  l = (hi &  (u64)U32_MAX) << 32;
+
+	r =             u128_shl(u64_to_u128(div64_u64_rem(h,                d, &rem)), 64);
+	r = u128_add(r, u128_shl(u64_to_u128(div64_u64_rem(l  + (rem << 32), d, &rem)), 32));
+	r = u128_add(r,          u64_to_u128(div64_u64_rem(lo + (rem << 32), d, &rem)));
+	return r;
+}
+
+struct mean_and_variance {
+	s64 n;
+	s64 sum;
+	u128 sum_squares;
+};
+
+/* exponentially weighted variant */
+struct mean_and_variance_weighted {
+	bool init;
+	u8 w;
+	s64 mean;
+	u64 variance;
+};
+
+/**
+ * fast_divpow2() - fast approximation for n / (1 << d)
+ * @n: numerator
+ * @d: the power of 2 denominator.
+ *
+ * note: this rounds towards 0.
+ */
+static inline s64 fast_divpow2(s64 n, u8 d)
+{
+	return (n + ((n < 0) ? ((1 << d) - 1) : 0)) >> d;
+}
+
+static inline struct mean_and_variance
+mean_and_variance_update_inlined(struct mean_and_variance s1, s64 v1)
+{
+	struct mean_and_variance s2;
+	u64 v2 = abs(v1);
+
+	s2.n           = s1.n + 1;
+	s2.sum         = s1.sum + v1;
+	s2.sum_squares = u128_add(s1.sum_squares, u128_square(v2));
+	return s2;
+}
+
+static inline struct mean_and_variance_weighted
+mean_and_variance_weighted_update_inlined(struct mean_and_variance_weighted s1, s64 x)
+{
+	struct mean_and_variance_weighted s2;
+	// previous weighted variance.
+	u64 var_w0 = s1.variance;
+	u8 w = s2.w = s1.w;
+	// new value weighted.
+	s64 x_w = x << w;
+	s64 diff_w = x_w - s1.mean;
+	s64 diff = fast_divpow2(diff_w, w);
+	// new mean weighted.
+	s64 u_w1     = s1.mean + diff;
+
+	BUG_ON(w % 2 != 0);
+
+	if (!s1.init) {
+		s2.mean = x_w;
+		s2.variance = 0;
+	} else {
+		s2.mean = u_w1;
+		s2.variance = ((var_w0 << w) - var_w0 + ((diff_w * (x_w - u_w1)) >> w)) >> w;
+	}
+	s2.init = true;
+
+	return s2;
+}
+
+struct mean_and_variance mean_and_variance_update(struct mean_and_variance s1, s64 v1);
+       s64		 mean_and_variance_get_mean(struct mean_and_variance s);
+       u64		 mean_and_variance_get_variance(struct mean_and_variance s1);
+       u32		 mean_and_variance_get_stddev(struct mean_and_variance s);
+
+struct mean_and_variance_weighted mean_and_variance_weighted_update(struct mean_and_variance_weighted s1, s64 v1);
+       s64			  mean_and_variance_weighted_get_mean(struct mean_and_variance_weighted s);
+       u64			  mean_and_variance_weighted_get_variance(struct mean_and_variance_weighted s);
+       u32			  mean_and_variance_weighted_get_stddev(struct mean_and_variance_weighted s);
+
+#endif // MEAN_AND_VARIANCE_H_
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3dba7a9aff..9ca88e0027 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2101,6 +2101,15 @@ config CPUMASK_KUNIT_TEST
 
 	  If unsure, say N.
 
+config MEAN_AND_VARIANCE_UNIT_TEST
+	tristate "mean_and_variance unit tests" if !KUNIT_ALL_TESTS
+	depends on KUNIT
+	select MEAN_AND_VARIANCE
+	default KUNIT_ALL_TESTS
+	help
+	  This option enables the kunit tests for the mean_and_variance module.
+	  If unsure, say N.
+
 config TEST_LIST_SORT
 	tristate "Linked list sorting test" if !KUNIT_ALL_TESTS
 	depends on KUNIT
diff --git a/lib/math/Kconfig b/lib/math/Kconfig
index 0634b428d0..7530ae9a35 100644
--- a/lib/math/Kconfig
+++ b/lib/math/Kconfig
@@ -15,3 +15,6 @@ config PRIME_NUMBERS
 
 config RATIONAL
 	tristate
+
+config MEAN_AND_VARIANCE
+	tristate
diff --git a/lib/math/Makefile b/lib/math/Makefile
index bfac26ddfc..2ef1487e01 100644
--- a/lib/math/Makefile
+++ b/lib/math/Makefile
@@ -4,6 +4,8 @@ obj-y += div64.o gcd.o lcm.o int_pow.o int_sqrt.o reciprocal_div.o
 obj-$(CONFIG_CORDIC)		+= cordic.o
 obj-$(CONFIG_PRIME_NUMBERS)	+= prime_numbers.o
 obj-$(CONFIG_RATIONAL)		+= rational.o
+obj-$(CONFIG_MEAN_AND_VARIANCE) += mean_and_variance.o
 
 obj-$(CONFIG_TEST_DIV64)	+= test_div64.o
 obj-$(CONFIG_RATIONAL_KUNIT_TEST) += rational-test.o
+obj-$(CONFIG_MEAN_AND_VARIANCE_UNIT_TEST)   += mean_and_variance_test.o
diff --git a/lib/math/mean_and_variance.c b/lib/math/mean_and_variance.c
new file mode 100644
index 0000000000..6e315d3a13
--- /dev/null
+++ b/lib/math/mean_and_variance.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Functions for incremental mean and variance.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Copyright © 2022 Daniel B. Hill
+ *
+ * Author: Daniel B. Hill <daniel@gluo.nz>
+ *
+ * Description:
+ *
+ * This includes some incremental algorithms for mean and variance calculation.
+ *
+ * Derived from the paper: https://fanf2.user.srcf.net/hermes/doc/antiforgery/stats.pdf
+ *
+ * Create a struct and if it's the weighted variant set the w field (weight = 2^k).
+ *
+ * Use mean_and_variance[_weighted]_update() on the struct to update its state.
+ *
+ * Use the mean_and_variance[_weighted]_get_* functions to calculate the mean and
+ * variance; some computation is deferred to these functions for performance reasons.
+ *
+ * see lib/math/mean_and_variance_test.c for examples of usage.
+ *
+ * DO NOT access the mean and variance fields of the weighted variants directly.
+ * DO NOT change the weight after calling update.
+ */
+
+#include <linux/bug.h>
+#include <linux/compiler.h>
+#include <linux/export.h>
+#include <linux/limits.h>
+#include <linux/math.h>
+#include <linux/math64.h>
+#include <linux/mean_and_variance.h>
+#include <linux/module.h>
+
+/**
+ * mean_and_variance_update() - update a mean_and_variance struct @s1 with a new sample @v1
+ * and return it.
+ * @s1: the mean_and_variance to update.
+ * @v1: the new sample.
+ *
+ * see linked pdf equation 12.
+ */
+struct mean_and_variance mean_and_variance_update(struct mean_and_variance s1, s64 v1)
+{
+	return mean_and_variance_update_inlined(s1, v1);
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_update);
+
+/**
+ * mean_and_variance_get_mean() - get mean from @s
+ */
+s64 mean_and_variance_get_mean(struct mean_and_variance s)
+{
+	return div64_u64(s.sum, s.n);
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_get_mean);
+
+/**
+ * mean_and_variance_get_variance() -  get variance from @s1
+ *
+ * see linked pdf equation 12.
+ */
+u64 mean_and_variance_get_variance(struct mean_and_variance s1)
+{
+	u128 s2 = u128_div(s1.sum_squares, s1.n);
+	u64  s3 = abs(mean_and_variance_get_mean(s1));
+
+	return u128_to_u64(u128_sub(s2, u128_square(s3)));
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_get_variance);
+
+/**
+ * mean_and_variance_get_stddev() - get standard deviation from @s
+ */
+u32 mean_and_variance_get_stddev(struct mean_and_variance s)
+{
+	return int_sqrt64(mean_and_variance_get_variance(s));
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_get_stddev);
+
+/**
+ * mean_and_variance_weighted_update() - exponentially weighted variant of mean_and_variance_update()
+ * @s1: the weighted mean and variance to update
+ * @x:  the new sample
+ *
+ * see linked pdf: function derived from equations 140-143 where alpha = 2^w.
+ * values are stored bitshifted for performance and added precision.
+ */
+struct mean_and_variance_weighted mean_and_variance_weighted_update(struct mean_and_variance_weighted s1,
+								    s64 x)
+{
+	return mean_and_variance_weighted_update_inlined(s1, x);
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_weighted_update);
+
+/**
+ * mean_and_variance_weighted_get_mean() - get mean from @s
+ */
+s64 mean_and_variance_weighted_get_mean(struct mean_and_variance_weighted s)
+{
+	return fast_divpow2(s.mean, s.w);
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_weighted_get_mean);
+
+/**
+ * mean_and_variance_weighted_get_variance() - get variance from @s
+ */
+u64 mean_and_variance_weighted_get_variance(struct mean_and_variance_weighted s)
+{
+	// always positive don't need fast divpow2
+	return s.variance >> s.w;
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_weighted_get_variance);
+
+/**
+ * mean_and_variance_weighted_get_stddev() - get standard deviation from @s
+ */
+u32 mean_and_variance_weighted_get_stddev(struct mean_and_variance_weighted s)
+{
+	return int_sqrt64(mean_and_variance_weighted_get_variance(s));
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_weighted_get_stddev);
+
+MODULE_AUTHOR("Daniel B. Hill");
+MODULE_LICENSE("GPL");
diff --git a/lib/math/mean_and_variance_test.c b/lib/math/mean_and_variance_test.c
new file mode 100644
index 0000000000..79a96d7307
--- /dev/null
+++ b/lib/math/mean_and_variance_test.c
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <kunit/test.h>
+#include <linux/mean_and_variance.h>
+
+#define MAX_SQR (SQRT_U64_MAX*SQRT_U64_MAX)
+
+static void mean_and_variance_basic_test(struct kunit *test)
+{
+	struct mean_and_variance s = {};
+
+	s = mean_and_variance_update(s, 2);
+	s = mean_and_variance_update(s, 2);
+
+	KUNIT_EXPECT_EQ(test, mean_and_variance_get_mean(s), 2);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_get_variance(s), 0);
+	KUNIT_EXPECT_EQ(test, s.n, 2);
+
+	s = mean_and_variance_update(s, 4);
+	s = mean_and_variance_update(s, 4);
+
+	KUNIT_EXPECT_EQ(test, mean_and_variance_get_mean(s), 3);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_get_variance(s), 1);
+	KUNIT_EXPECT_EQ(test, s.n, 4);
+}
+
+/*
+ * Test values computed using a spreadsheet from the pseudocode at the bottom:
+ * https://fanf2.user.srcf.net/hermes/doc/antiforgery/stats.pdf
+ */
+
+static void mean_and_variance_weighted_test(struct kunit *test)
+{
+	struct mean_and_variance_weighted s = {};
+
+	s.w = 2;
+
+	s = mean_and_variance_weighted_update(s, 10);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 10);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 0);
+
+	s = mean_and_variance_weighted_update(s, 20);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 12);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 18);
+
+	s = mean_and_variance_weighted_update(s, 30);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 16);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 72);
+
+	s = (struct mean_and_variance_weighted){};
+	s.w = 2;
+
+	s = mean_and_variance_weighted_update(s, -10);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -10);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 0);
+
+	s = mean_and_variance_weighted_update(s, -20);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -12);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 18);
+
+	s = mean_and_variance_weighted_update(s, -30);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -16);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 72);
+
+}
+
+static void mean_and_variance_weighted_advanced_test(struct kunit *test)
+{
+	struct mean_and_variance_weighted s = {};
+	s64 i;
+
+	s.w = 8;
+	for (i = 10; i <= 100; i += 10)
+		s = mean_and_variance_weighted_update(s, i);
+
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 11);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 107);
+
+	s = (struct mean_and_variance_weighted){};
+
+	s.w = 8;
+	for (i = -10; i >= -100; i -= 10)
+		s = mean_and_variance_weighted_update(s, i);
+
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -11);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 107);
+
+}
+
+static void mean_and_variance_fast_divpow2(struct kunit *test)
+{
+	s64 i;
+	u8 d;
+
+	for (i = 0; i < 100; i++) {
+		d = 0;
+		KUNIT_EXPECT_EQ(test, fast_divpow2(i, d), div_u64(i, 1LLU << d));
+		KUNIT_EXPECT_EQ(test, abs(fast_divpow2(-i, d)), div_u64(i, 1LLU << d));
+		for (d = 1; d < 32; d++) {
+			KUNIT_EXPECT_EQ_MSG(test, abs(fast_divpow2(i, d)),
+					    div_u64(i, 1 << d), "%lld %u", i, d);
+			KUNIT_EXPECT_EQ_MSG(test, abs(fast_divpow2(-i, d)),
+					    div_u64(i, 1 << d), "%lld %u", -i, d);
+		}
+	}
+}
+
+static void mean_and_variance_u128_basic_test(struct kunit *test)
+{
+	u128 a = u128_shl64_add(0, U64_MAX);
+	u128 a1 = u128_shl64_add(0, 1);
+	u128 b = u128_shl64_add(1, 0);
+	u128 c = u128_shl64_add(0, 1LLU << 63);
+	u128 c2 = u128_shl64_add(U64_MAX, U64_MAX);
+
+	KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_add(a, a1)), 1);
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_add(a, a1)), 0);
+	KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_add(a1, a)), 1);
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_add(a1, a)), 0);
+
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_sub(b, a1)), U64_MAX);
+	KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_sub(b, a1)), 0);
+
+	KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_shl(c, 1)), 1);
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_shl(c, 1)), 0);
+
+	KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_square(U64_MAX)), U64_MAX - 1);
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_square(U64_MAX)), 1);
+
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_div(b, 2)), 1LLU << 63);
+
+	KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_div(c2, 2)), U64_MAX >> 1);
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_div(c2, 2)), U64_MAX);
+
+	KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_div(u128_shl(u64_to_u128(U64_MAX), 32), 2)), U32_MAX >> 1);
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_div(u128_shl(u64_to_u128(U64_MAX), 32), 2)), U64_MAX << 31);
+}
+
+static struct kunit_case mean_and_variance_test_cases[] = {
+	KUNIT_CASE(mean_and_variance_fast_divpow2),
+	KUNIT_CASE(mean_and_variance_u128_basic_test),
+	KUNIT_CASE(mean_and_variance_basic_test),
+	KUNIT_CASE(mean_and_variance_weighted_test),
+	KUNIT_CASE(mean_and_variance_weighted_advanced_test),
+	{}
+};
+
+static struct kunit_suite mean_and_variance_test_suite = {
+	.name		= "mean and variance tests",
+	.test_cases	= mean_and_variance_test_cases,
+};
+
+kunit_test_suite(mean_and_variance_test_suite);
+
+MODULE_AUTHOR("Daniel B. Hill");
+MODULE_LICENSE("GPL");
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 32/32] MAINTAINERS: Add entry for bcachefs
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (30 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 31/32] lib: add mean and variance module Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 21:04   ` Randy Dunlap
  2023-06-15 20:41 ` [PATCH 00/32] bcachefs - a new COW filesystem Pavel Machek
  32 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Kent Overstreet

bcachefs is a new copy-on-write filesystem; add a MAINTAINERS entry for
it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 MAINTAINERS | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index dbf3c33c31..0ac2b432f0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3509,6 +3509,13 @@ W:	http://bcache.evilpiepirate.org
 C:	irc://irc.oftc.net/bcache
 F:	drivers/md/bcache/
 
+BCACHEFS:
+M:	Kent Overstreet <kent.overstreet@linux.dev>
+L:	linux-bcachefs@vger.kernel.org
+S:	Supported
+C:	irc://irc.oftc.net/bcache
+F:	fs/bcachefs/
+
 BDISP ST MEDIA DRIVER
 M:	Fabien Dessenne <fabien.dessenne@foss.st.com>
 L:	linux-media@vger.kernel.org
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* Re: [PATCH 01/32] Compiler Attributes: add __flatten
  2023-05-09 16:56 ` [PATCH 01/32] Compiler Attributes: add __flatten Kent Overstreet
@ 2023-05-09 17:04   ` Miguel Ojeda
  2023-05-09 17:24     ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Miguel Ojeda @ 2023-05-09 17:04 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Miguel Ojeda, Nick Desaulniers

On Tue, May 9, 2023 at 6:57 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> This makes __attribute__((flatten)) available, which is used by
> bcachefs.

We already have it in mainline, so I think it is one less patch you
need to care about! :)

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 16/32] MAINTAINERS: Add entry for closures
  2023-05-09 16:56 ` [PATCH 16/32] MAINTAINERS: Add entry for closures Kent Overstreet
@ 2023-05-09 17:05   ` Coly Li
  2023-05-09 21:03   ` Randy Dunlap
  1 sibling, 0 replies; 186+ messages in thread
From: Coly Li @ 2023-05-09 17:05 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, linux-fsdevel, linux-bcachefs



> 2023年5月10日 00:56,Kent Overstreet <kent.overstreet@linux.dev> 写道:
> 
> closures, from bcache, are async widgets with a variety of uses.
> bcachefs also uses them, so they're being moved to lib/; mark them as
> maintained.
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> Cc: Coly Li <colyli@suse.de>

Acked-by: Coly Li <colyli@suse.de>

Thanks.

Coly Li

> ---
> MAINTAINERS | 8 ++++++++
> 1 file changed, 8 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 3fc37de3d6..5d76169140 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5044,6 +5044,14 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
> F: Documentation/devicetree/bindings/timer/
> F: drivers/clocksource/
> 
> +CLOSURES:
> +M: Kent Overstreet <kent.overstreet@linux.dev>
> +L: linux-bcachefs@vger.kernel.org
> +S: Supported
> +C: irc://irc.oftc.net/bcache
> +F: include/linux/closure.h
> +F: lib/closure.c
> +
> CMPC ACPI DRIVER
> M: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
> M: Daniel Oliveira Nascimento <don@syst.com.br>
> -- 
> 2.40.1
> 


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 01/32] Compiler Attributes: add __flatten
  2023-05-09 17:04   ` Miguel Ojeda
@ 2023-05-09 17:24     ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 17:24 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Miguel Ojeda, Nick Desaulniers

On Tue, May 09, 2023 at 07:04:43PM +0200, Miguel Ojeda wrote:
> On Tue, May 9, 2023 at 6:57 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > This makes __attribute__((flatten)) available, which is used by
> > bcachefs.
> 
> We already have it in mainline, so I think it is one less patch you
> need to care about! :)
> 
> Cheers,
> Miguel

Wonderful :)

Cheers,
Kent

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
@ 2023-05-09 18:19   ` Lorenzo Stoakes
  2023-05-09 20:15     ` Kent Overstreet
  2023-05-09 20:46   ` Christoph Hellwig
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 186+ messages in thread
From: Lorenzo Stoakes @ 2023-05-09 18:19 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm

On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> From: Kent Overstreet <kent.overstreet@gmail.com>
>
> This is needed for bcachefs, which dynamically generates per-btree node
> unpack functions.

Small nits -

Would be good to refer to the original patch that removed it,
i.e. 7a0e27b2a0ce ("mm: remove vmalloc_exec") something like 'patch
... folded vmalloc_exec() into its one user, however bcachefs requires this
as well so revert'.

Would also be good to mention that you are now exporting the function which
the original didn't appear to do.

>
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Uladzislau Rezki <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: linux-mm@kvack.org

Another nit: I'm a vmalloc reviewer so would be good to get cc'd too :)
(forgivable mistake as very recent change!)

> ---
>  include/linux/vmalloc.h |  1 +
>  kernel/module/main.c    |  4 +---
>  mm/nommu.c              | 18 ++++++++++++++++++
>  mm/vmalloc.c            | 21 +++++++++++++++++++++
>  4 files changed, 41 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 69250efa03..ff147fe115 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -145,6 +145,7 @@ extern void *vzalloc(unsigned long size) __alloc_size(1);
>  extern void *vmalloc_user(unsigned long size) __alloc_size(1);
>  extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
>  extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
> +extern void *vmalloc_exec(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
>  extern void *vmalloc_32(unsigned long size) __alloc_size(1);
>  extern void *vmalloc_32_user(unsigned long size) __alloc_size(1);
>  extern void *__vmalloc(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index d3be89de70..9eaa89e84c 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -1607,9 +1607,7 @@ static void dynamic_debug_remove(struct module *mod, struct _ddebug_info *dyndbg
>
>  void * __weak module_alloc(unsigned long size)
>  {
> -	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> -			GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> -			NUMA_NO_NODE, __builtin_return_address(0));
> +	return vmalloc_exec(size, GFP_KERNEL);
>  }
>
>  bool __weak module_init_section(const char *name)
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 57ba243c6a..8d9ab19e39 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -280,6 +280,24 @@ void *vzalloc_node(unsigned long size, int node)
>  }
>  EXPORT_SYMBOL(vzalloc_node);
>
> +/**
> + *	vmalloc_exec  -  allocate virtually contiguous, executable memory
> + *	@size:		allocation size
> + *
> + *	Kernel-internal function to allocate enough pages to cover @size
> + *	the page level allocator and map them into contiguous and
> + *	executable kernel virtual space.
> + *
> + *	For tight control over page level allocator and protection flags
> + *	use __vmalloc() instead.
> + */
> +
> +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> +{
> +	return __vmalloc(size, gfp_mask);
> +}
> +EXPORT_SYMBOL_GPL(vmalloc_exec);
> +
>  /**
>   * vmalloc_32  -  allocate virtually contiguous memory (32bit addressable)
>   *	@size:		allocation size
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 31ff782d36..2ebb9ea7f0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3401,6 +3401,27 @@ void *vzalloc_node(unsigned long size, int node)
>  }
>  EXPORT_SYMBOL(vzalloc_node);
>
> +/**
> + * vmalloc_exec - allocate virtually contiguous, executable memory
> + * @size:	  allocation size
> + *
> + * Kernel-internal function to allocate enough pages to cover @size
> + * the page level allocator and map them into contiguous and
> + * executable kernel virtual space.
> + *
> + * For tight control over page level allocator and protection flags
> + * use __vmalloc() instead.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> +{
> +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> +			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> +			NUMA_NO_NODE, __builtin_return_address(0));
> +}
> +EXPORT_SYMBOL_GPL(vmalloc_exec);
> +
>  #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
>  #define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
>  #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
> --
> 2.40.1
>

Otherwise lgtm, feel free to add:

Acked-by: Lorenzo Stoakes <lstoakes@gmail.com>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/32] locking/lockdep: lock_class_is_held()
  2023-05-09 16:56 ` [PATCH 02/32] locking/lockdep: lock_class_is_held() Kent Overstreet
@ 2023-05-09 19:30   ` Peter Zijlstra
  2023-05-09 20:11     ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Peter Zijlstra @ 2023-05-09 19:30 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng

On Tue, May 09, 2023 at 12:56:27PM -0400, Kent Overstreet wrote:
> From: Kent Overstreet <kent.overstreet@gmail.com>
> 
> This patch adds lock_class_is_held(), which can be used to assert that a
> particular type of lock is not held.

How is lock_is_held_type() not sufficient? Which is what's used to
implement lockdep_assert_held*().


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 16:56 ` [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion() Kent Overstreet
@ 2023-05-09 19:31   ` Peter Zijlstra
  2023-05-09 19:57     ` Kent Overstreet
  2023-05-09 20:18     ` Kent Overstreet
  0 siblings, 2 replies; 186+ messages in thread
From: Peter Zijlstra @ 2023-05-09 19:31 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> This adds a method to tell lockdep not to check lock ordering within a
> lock class - but to still check lock ordering w.r.t. other lock types.
> 
> This is for bcachefs, where for btree node locks we have our own
> deadlock avoidance strategy w.r.t. other btree node locks (cycle
> detection), but we still want lockdep to check lock ordering w.r.t.
> other lock types.
> 

ISTR you had a much nicer version of this where you gave a custom order
function -- what happened to that?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 19:31   ` Peter Zijlstra
@ 2023-05-09 19:57     ` Kent Overstreet
  2023-05-09 20:18     ` Kent Overstreet
  1 sibling, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 19:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > This adds a method to tell lockdep not to check lock ordering within a
> > lock class - but to still check lock ordering w.r.t. other lock types.
> > 
> > This is for bcachefs, where for btree node locks we have our own
> > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > detection), but we still want lockdep to check lock ordering w.r.t.
> > other lock types.
> > 
> 
> ISTR you had a much nicer version of this where you gave a custom order
> function -- what happened to that?

Probably in the other branch that I was meaning to re-mail you separately,
clearly I hadn't pulled the latest versions back into here... expect
that shortly :)

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 02/32] locking/lockdep: lock_class_is_held()
  2023-05-09 19:30   ` Peter Zijlstra
@ 2023-05-09 20:11     ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 20:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng

On Tue, May 09, 2023 at 09:30:39PM +0200, Peter Zijlstra wrote:
> On Tue, May 09, 2023 at 12:56:27PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <kent.overstreet@gmail.com>
> > 
> > This patch adds lock_class_is_held(), which can be used to assert that a
> > particular type of lock is not held.
> 
> How is lock_is_held_type() not sufficient? Which is what's used to
> implement lockdep_assert_held*().

I should've looked at that before - it returns a tristate, so it's
closer than I thought, but this is used in contexts where we don't have
a lock or lockdep_map to pass and need to pass the lock_class_key
instead.

e.g, when initializing a btree_trans, or waiting on btree node IO, we
need to assert that no btree node locks are held.

Looking at the code, __lock_is_held() -> match_held_lock() has to care
about a bunch of stuff related to subclasses that doesn't seem relevant
to lock_class_is_held() - lock_class_is_held() is practically no code in
comparison, so I'm inclined to think they should just be separate.

But I'm not the lockdep expert :) Thoughts?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 18:19   ` Lorenzo Stoakes
@ 2023-05-09 20:15     ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 20:15 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm

On Tue, May 09, 2023 at 11:19:38AM -0700, Lorenzo Stoakes wrote:
> On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <kent.overstreet@gmail.com>
> >
> > This is needed for bcachefs, which dynamically generates per-btree node
> > unpack functions.
> 
> Small nits -
> 
> Would be good to refer to the original patch that removed it,
> i.e. 7a0e27b2a0ce ("mm: remove vmalloc_exec") something like 'patch
> ... folded vmalloc_exec() into its one user, however bcachefs requires this
> as well so revert'.
> 
> Would also be good to mention that you are now exporting the function which
> the original didn't appear to do.
> 
> >
> > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Uladzislau Rezki <urezki@gmail.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: linux-mm@kvack.org
> 
> Another nit: I'm a vmalloc reviewer so would be good to get cc'd too :)
> (forgivable mistake as very recent change!)

Thanks - folded your suggestions into the commit message, and added you
for the next posting :)

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 19:31   ` Peter Zijlstra
  2023-05-09 19:57     ` Kent Overstreet
@ 2023-05-09 20:18     ` Kent Overstreet
  2023-05-09 20:27       ` Waiman Long
  2023-05-10  8:59       ` Peter Zijlstra
  1 sibling, 2 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 20:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > This adds a method to tell lockdep not to check lock ordering within a
> > lock class - but to still check lock ordering w.r.t. other lock types.
> > 
> > This is for bcachefs, where for btree node locks we have our own
> > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > detection), but we still want lockdep to check lock ordering w.r.t.
> > other lock types.
> > 
> 
> ISTR you had a much nicer version of this where you gave a custom order
> > function -- what happened to that?

Actually, I spoke too soon; this patch and the other series with the
comparison function solve different problems.

For bcachefs btree node locks, we don't have a defined lock ordering at
all - we do full runtime cycle detection, so we don't want lockdep
checking for self deadlock because we're handling that but we _do_ want
lockdep checking lock ordering of btree node locks w.r.t. other locks in
the system.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 20:18     ` Kent Overstreet
@ 2023-05-09 20:27       ` Waiman Long
  2023-05-09 20:35         ` Kent Overstreet
  2023-05-10  8:59       ` Peter Zijlstra
  1 sibling, 1 reply; 186+ messages in thread
From: Waiman Long @ 2023-05-09 20:27 UTC (permalink / raw)
  To: Kent Overstreet, Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Boqun Feng


On 5/9/23 16:18, Kent Overstreet wrote:
> On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
>> On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
>>> This adds a method to tell lockdep not to check lock ordering within a
>>> lock class - but to still check lock ordering w.r.t. other lock types.
>>>
>>> This is for bcachefs, where for btree node locks we have our own
>>> deadlock avoidance strategy w.r.t. other btree node locks (cycle
>>> detection), but we still want lockdep to check lock ordering w.r.t.
>>> other lock types.
>>>
>> ISTR you had a much nicer version of this where you gave a custom order
>> function -- what happend to that?
> Actually, I spoke too soon; this patch and the other series with the
> comparison function solve different problems.
>
> For bcachefs btree node locks, we don't have a defined lock ordering at
> all - we do full runtime cycle detection, so we don't want lockdep
> checking for self deadlock because we're handling that but we _do_ want
> lockdep checking lock ordering of btree node locks w.r.t. other locks in
> the system.

Maybe you can use lock_set_novalidate_class() instead.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 20:27       ` Waiman Long
@ 2023-05-09 20:35         ` Kent Overstreet
  2023-05-09 21:37           ` Waiman Long
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 20:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, linux-kernel, linux-fsdevel, linux-bcachefs,
	Ingo Molnar, Will Deacon, Boqun Feng

On Tue, May 09, 2023 at 04:27:46PM -0400, Waiman Long wrote:
> 
> On 5/9/23 16:18, Kent Overstreet wrote:
> > On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > > > This adds a method to tell lockdep not to check lock ordering within a
> > > > lock class - but to still check lock ordering w.r.t. other lock types.
> > > > 
> > > > This is for bcachefs, where for btree node locks we have our own
> > > > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > > > detection), but we still want lockdep to check lock ordering w.r.t.
> > > > other lock types.
> > > > 
> > > ISTR you had a much nicer version of this where you gave a custom order
> > > function -- what happend to that?
> > Actually, I spoke too soon; this patch and the other series with the
> > comparison function solve different problems.
> > 
> > For bcachefs btree node locks, we don't have a defined lock ordering at
> > all - we do full runtime cycle detection, so we don't want lockdep
> > checking for self deadlock because we're handling that but we _do_ want
> > lockdep checking lock ordering of btree node locks w.r.t. other locks in
> > the system.
> 
> Maybe you can use lock_set_novalidate_class() instead.

No, we want that to go away, this is the replacement.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
  2023-05-09 18:19   ` Lorenzo Stoakes
@ 2023-05-09 20:46   ` Christoph Hellwig
  2023-05-09 21:12     ` Lorenzo Stoakes
  2023-05-10 14:18   ` Christophe Leroy
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 186+ messages in thread
From: Christoph Hellwig @ 2023-05-09 20:46 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm

On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> From: Kent Overstreet <kent.overstreet@gmail.com>
> 
> This is needed for bcachefs, which dynamically generates per-btree node
> unpack functions.

No, we will never add back a way for random code allocating executable
memory in kernel space.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 16/32] MAINTAINERS: Add entry for closures
  2023-05-09 16:56 ` [PATCH 16/32] MAINTAINERS: Add entry for closures Kent Overstreet
  2023-05-09 17:05   ` Coly Li
@ 2023-05-09 21:03   ` Randy Dunlap
  1 sibling, 0 replies; 186+ messages in thread
From: Randy Dunlap @ 2023-05-09 21:03 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Coly Li



On 5/9/23 09:56, Kent Overstreet wrote:
> closures, from bcache, are async widgets with a variety of uses.
> bcachefs also uses them, so they're being moved to lib/; mark them as
> maintained.
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> Cc: Coly Li <colyli@suse.de>
> ---
>  MAINTAINERS | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 3fc37de3d6..5d76169140 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5044,6 +5044,14 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
>  F:	Documentation/devicetree/bindings/timer/
>  F:	drivers/clocksource/
>  
> +CLOSURES:

No colon at the end of the line.

> +M:	Kent Overstreet <kent.overstreet@linux.dev>
> +L:	linux-bcachefs@vger.kernel.org
> +S:	Supported
> +C:	irc://irc.oftc.net/bcache
> +F:	include/linux/closure.h
> +F:	lib/closure.c
> +
>  CMPC ACPI DRIVER
>  M:	Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
>  M:	Daniel Oliveira Nascimento <don@syst.com.br>

-- 
~Randy

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree
  2023-05-09 16:56 ` [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree Kent Overstreet
@ 2023-05-09 21:03   ` Randy Dunlap
  0 siblings, 0 replies; 186+ messages in thread
From: Randy Dunlap @ 2023-05-09 21:03 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs



On 5/9/23 09:56, Kent Overstreet wrote:
> lib/generic-radix-tree.c is a simple radix tree that supports storing
> arbitrary types. Add a maintainers entry for it.
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> ---
>  MAINTAINERS | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 5d76169140..c550f5909e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8615,6 +8615,13 @@ F:	Documentation/devicetree/bindings/power/power?domain*
>  F:	drivers/base/power/domain*.c
>  F:	include/linux/pm_domain.h
>  
> +GENERIC RADIX TREE:

No colon at the end of the line.

> +M:	Kent Overstreet <kent.overstreet@linux.dev>
> +S:	Supported
> +C:	irc://irc.oftc.net/bcache
> +F:	include/linux/generic-radix-tree.h
> +F:	lib/generic-radix-tree.c
> +
>  GENERIC RESISTIVE TOUCHSCREEN ADC DRIVER
>  M:	Eugen Hristev <eugen.hristev@microchip.com>
>  L:	linux-input@vger.kernel.org

-- 
~Randy

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 32/32] MAINTAINERS: Add entry for bcachefs
  2023-05-09 16:56 ` [PATCH 32/32] MAINTAINERS: Add entry for bcachefs Kent Overstreet
@ 2023-05-09 21:04   ` Randy Dunlap
  2023-05-09 21:07     ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Randy Dunlap @ 2023-05-09 21:04 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs



On 5/9/23 09:56, Kent Overstreet wrote:
> bcachefs is a new copy-on-write filesystem; add a MAINTAINERS entry for
> it.
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> ---
>  MAINTAINERS | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index dbf3c33c31..0ac2b432f0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3509,6 +3509,13 @@ W:	http://bcache.evilpiepirate.org
>  C:	irc://irc.oftc.net/bcache
>  F:	drivers/md/bcache/
>  
> +BCACHEFS:

No colon at the end of the line.


> +M:	Kent Overstreet <kent.overstreet@linux.dev>
> +L:	linux-bcachefs@vger.kernel.org
> +S:	Supported
> +C:	irc://irc.oftc.net/bcache
> +F:	fs/bcachefs/
> +
>  BDISP ST MEDIA DRIVER
>  M:	Fabien Dessenne <fabien.dessenne@foss.st.com>
>  L:	linux-media@vger.kernel.org

-- 
~Randy

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 32/32] MAINTAINERS: Add entry for bcachefs
  2023-05-09 21:04   ` Randy Dunlap
@ 2023-05-09 21:07     ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 21:07 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: linux-kernel, linux-fsdevel, linux-bcachefs

On Tue, May 09, 2023 at 02:04:00PM -0700, Randy Dunlap wrote:
> 
> 
> On 5/9/23 09:56, Kent Overstreet wrote:
> > bcachefs is a new copy-on-write filesystem; add a MAINTAINERS entry for
> > it.
> > 
> > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > ---
> >  MAINTAINERS | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index dbf3c33c31..0ac2b432f0 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -3509,6 +3509,13 @@ W:	http://bcache.evilpiepirate.org
> >  C:	irc://irc.oftc.net/bcache
> >  F:	drivers/md/bcache/
> >  
> > +BCACHEFS:
> 
> No colon at the end of the line.

Thanks, updated.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 20:46   ` Christoph Hellwig
@ 2023-05-09 21:12     ` Lorenzo Stoakes
  2023-05-09 21:29       ` Kent Overstreet
                         ` (2 more replies)
  0 siblings, 3 replies; 186+ messages in thread
From: Lorenzo Stoakes @ 2023-05-09 21:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, linux-mm

On Tue, May 09, 2023 at 01:46:09PM -0700, Christoph Hellwig wrote:
> On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <kent.overstreet@gmail.com>
> >
> > This is needed for bcachefs, which dynamically generates per-btree node
> > unpack functions.
>
> No, we will never add back a way for random code allocating executable
> memory in kernel space.

Yeah, I think I glossed over this aspect a bit, as it ostensibly looks like simply
reinstating a helper function because the code is now used in more than one
place (at LSF/MM so a little distracted :)

But it being exported is a problem. Perhaps there's another way of achieving the
same aim without having to do so?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 21:12     ` Lorenzo Stoakes
@ 2023-05-09 21:29       ` Kent Overstreet
  2023-05-10  6:48         ` Eric Biggers
  2023-05-10 11:56         ` David Laight
  2023-05-09 21:43       ` Darrick J. Wong
  2023-05-13 13:25       ` Lorenzo Stoakes
  2 siblings, 2 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 21:29 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, linux-mm

On Tue, May 09, 2023 at 02:12:41PM -0700, Lorenzo Stoakes wrote:
> On Tue, May 09, 2023 at 01:46:09PM -0700, Christoph Hellwig wrote:
> > On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > > From: Kent Overstreet <kent.overstreet@gmail.com>
> > >
> > > This is needed for bcachefs, which dynamically generates per-btree node
> > > unpack functions.
> >
> > No, we will never add back a way for random code allocating executable
> > memory in kernel space.
> 
> Yeah I think I glossed over this aspect a bit as it looks ostensibly like simply
> reinstating a helper function because the code is now used in more than one
> place (at lsf/mm so a little distracted :)
> 
> But it being exported is a problem. Perhaps there's another way of acheving the
> same aim without having to do so?

None that I see.

The background is that bcachefs generates a per btree node unpack
function, based on the packed format for that btree node, for unpacking
keys within that node. The unpack function is only ~50 bytes, and for
locality we want it to be located with the btree node's other in-memory
lookup tables so they can be prefetched all at once.

Here's the codegen:

https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/bkey.c#n727
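
For anyone who hasn't looked at the format: conceptually, each generated
function is a branchless specialization of a generic routine along these
lines (a simplified sketch with made-up names - the real bkey format has
more fields, and the generated version hard-codes one node's field widths
and offsets):

#include <linux/types.h>

struct demo_key_format {
	u8	bits[3];	/* bits per field, e.g. inode/offset/snapshot */
	u64	base[3];	/* per-node base value added back per field   */
};

/* Generic unpack: loops and branches on the format at runtime.  A
 * per-node generated unpacker does the same work with the widths and
 * bases baked in as immediates, so it is ~50 bytes and branch-free.
 */
static u64 demo_unpack_field(const u64 *packed, unsigned *bit,
			     const struct demo_key_format *f, unsigned i)
{
	unsigned bits = f->bits[i];
	u64 v = 0;

	if (bits) {
		unsigned word = *bit / 64, shift = *bit % 64;

		v = packed[word] >> shift;
		if (shift + bits > 64)
			v |= packed[word + 1] << (64 - shift);
		if (bits < 64)
			v &= (1ULL << bits) - 1;
		*bit += bits;
	}

	return v + f->base[i];
}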

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 20:35         ` Kent Overstreet
@ 2023-05-09 21:37           ` Waiman Long
  0 siblings, 0 replies; 186+ messages in thread
From: Waiman Long @ 2023-05-09 21:37 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Peter Zijlstra, linux-kernel, linux-fsdevel, linux-bcachefs,
	Ingo Molnar, Will Deacon, Boqun Feng

On 5/9/23 16:35, Kent Overstreet wrote:
> On Tue, May 09, 2023 at 04:27:46PM -0400, Waiman Long wrote:
>> On 5/9/23 16:18, Kent Overstreet wrote:
>>> On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
>>>> On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
>>>>> This adds a method to tell lockdep not to check lock ordering within a
>>>>> lock class - but to still check lock ordering w.r.t. other lock types.
>>>>>
>>>>> This is for bcachefs, where for btree node locks we have our own
>>>>> deadlock avoidance strategy w.r.t. other btree node locks (cycle
>>>>> detection), but we still want lockdep to check lock ordering w.r.t.
>>>>> other lock types.
>>>>>
>>>> ISTR you had a much nicer version of this where you gave a custom order
>>>> function -- what happend to that?
>>> Actually, I spoke too soon; this patch and the other series with the
>>> comparison function solve different problems.
>>>
>>> For bcachefs btree node locks, we don't have a defined lock ordering at
>>> all - we do full runtime cycle detection, so we don't want lockdep
>>> checking for self deadlock because we're handling that but we _do_ want
>>> lockdep checking lock ordering of btree node locks w.r.t. other locks in
>>> the system.
>> Maybe you can use lock_set_novalidate_class() instead.
> No, we want that to go away, this is the replacement.

OK, you can mention that in the commit log then.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 21:12     ` Lorenzo Stoakes
  2023-05-09 21:29       ` Kent Overstreet
@ 2023-05-09 21:43       ` Darrick J. Wong
  2023-05-09 21:54         ` Kent Overstreet
  2023-05-13 13:25       ` Lorenzo Stoakes
  2 siblings, 1 reply; 186+ messages in thread
From: Darrick J. Wong @ 2023-05-09 21:43 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Christoph Hellwig, Kent Overstreet, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Tue, May 09, 2023 at 02:12:41PM -0700, Lorenzo Stoakes wrote:
> On Tue, May 09, 2023 at 01:46:09PM -0700, Christoph Hellwig wrote:
> > On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > > From: Kent Overstreet <kent.overstreet@gmail.com>
> > >
> > > This is needed for bcachefs, which dynamically generates per-btree node
> > > unpack functions.
> >
> > No, we will never add back a way for random code allocating executable
> > memory in kernel space.
> 
> Yeah I think I glossed over this aspect a bit as it looks ostensibly like simply
> reinstating a helper function because the code is now used in more than one
> place (at lsf/mm so a little distracted :)
> 
> But it being exported is a problem. Perhaps there's another way of acheving the
> same aim without having to do so?

I already trolled Kent with this on IRC, but for the parts of bcachefs
that want better assembly code than whatever gcc generates from the C
source, could you compile code to BPF and then let the BPF JIT engines
turn that into machine code for you?

(also distracted by LSFMM)

--D

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 21:43       ` Darrick J. Wong
@ 2023-05-09 21:54         ` Kent Overstreet
  2023-05-11  5:33           ` Theodore Ts'o
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-09 21:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Tue, May 09, 2023 at 02:43:19PM -0700, Darrick J. Wong wrote:
> On Tue, May 09, 2023 at 02:12:41PM -0700, Lorenzo Stoakes wrote:
> > On Tue, May 09, 2023 at 01:46:09PM -0700, Christoph Hellwig wrote:
> > > On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > > > From: Kent Overstreet <kent.overstreet@gmail.com>
> > > >
> > > > This is needed for bcachefs, which dynamically generates per-btree node
> > > > unpack functions.
> > >
> > > No, we will never add back a way for random code allocating executable
> > > memory in kernel space.
> > 
> > Yeah I think I glossed over this aspect a bit as it looks ostensibly like simply
> > reinstating a helper function because the code is now used in more than one
> > place (at lsf/mm so a little distracted :)
> > 
> > But it being exported is a problem. Perhaps there's another way of acheving the
> > same aim without having to do so?
> 
> I already trolled Kent with this on IRC, but for the parts of bcachefs
> that want better assembly code than whatever gcc generates from the C
> source, could you compile code to BPF and then let the BPF JIT engines
> turn that into machine code for you?

It's an intriguing idea, but it'd be a _lot_ of work, and this is old
code that's never had a single bug - I'm not in a hurry to rewrite it.

And there would still be the issue that we've got lots of little
unpack functions that go with other tables. We can't just burn a full
page per unpack function - that would waste way too much memory - and if
we put them together then we're stuck writing a whole other allocator.
We'd also be mucking with the memory layout of the data structures used
in the very hottest paths in the filesystem, and I'm very wary of
introducing performance regressions there.

I think it'd be much more practical to find some way of making
vmalloc_exec() more palatable. What are the exact concerns?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-09 16:56 ` [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping Kent Overstreet
@ 2023-05-10  1:07   ` Jan Kara
  2023-05-10  6:18     ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Jan Kara @ 2023-05-10  1:07 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Jan Kara, Darrick J . Wong

On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> From: Kent Overstreet <kent.overstreet@gmail.com>
> 
> This is used by bcachefs to fix a page cache coherency issue with
> O_DIRECT writes.
> 
> Also relevant: mapping->invalidate_lock, see below.
> 
> O_DIRECT writes (and other filesystem operations that modify file data
> while bypassing the page cache) need to shoot down ranges of the page
> cache - and additionally, need locking to prevent those pages from
> pulled back in.
> 
> But O_DIRECT writes invoke the page fault handler (via get_user_pages),
> and the page fault handler will need to take that same lock - this is a
> classic recursive deadlock if userspace has mmaped the file they're DIO
> writing to and uses those pages for the buffer to write from, and it's a
> lock ordering deadlock in general.
> 
> Thus we need a way to signal from the dio code to the page fault handler
> when we already are holding the pagecache add lock on an address space -
> this patch just adds a member to task_struct for this purpose. For now
> only bcachefs is implementing this locking, though it may be moved out
> of bcachefs and made available to other filesystems in the future.

It would be nice to have at least a link to the code that's actually using
the field you are adding.

Also I think we already went through this discussion [1] and we ended up
agreeing that your scheme actually solves only the AA deadlock, but a
malicious userspace can easily create an AB-BA deadlock by running direct IO
to file A using mapped file B as a buffer *and* direct IO to file B using
mapped file A as a buffer.

[1] https://lore.kernel.org/all/20191218124052.GB19387@quack2.suse.cz

> ---------------------------------
> 
> The closest current VFS equivalent is mapping->invalidate_lock, which
> comes from XFS. However, it's not used by direct IO.  Instead, direct IO
> paths shoot down the page cache twice - before starting the IO and at
> the end, and they're still technically racy w.r.t. page cache coherency.
> 
> This is a more complete approach: in the future we might consider
> replacing mapping->invalidate_lock with the bcachefs code.

Yes, and this is because we never provided 100% consistent buffered vs.
direct IO behavior on the same file - exactly because we never found the
complexity worth the usefulness...

								Honza

> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Darrick J. Wong <djwong@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> ---
>  include/linux/sched.h | 1 +
>  init/init_task.c      | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 63d242164b..f2a56f64f7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -869,6 +869,7 @@ struct task_struct {
>  
>  	struct mm_struct		*mm;
>  	struct mm_struct		*active_mm;
> +	struct address_space		*faults_disabled_mapping;
>  
>  	int				exit_state;
>  	int				exit_code;
> diff --git a/init/init_task.c b/init/init_task.c
> index ff6c4b9bfe..f703116e05 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -85,6 +85,7 @@ struct task_struct init_task
>  	.nr_cpus_allowed= NR_CPUS,
>  	.mm		= NULL,
>  	.active_mm	= &init_mm,
> +	.faults_disabled_mapping = NULL,
>  	.restart_block	= {
>  		.fn = do_no_restart_syscall,
>  	},
> -- 
> 2.40.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 15/32] bcache: move closures to lib/
  2023-05-09 16:56 ` [PATCH 15/32] bcache: move closures to lib/ Kent Overstreet
@ 2023-05-10  1:10   ` Randy Dunlap
  0 siblings, 0 replies; 186+ messages in thread
From: Randy Dunlap @ 2023-05-10  1:10 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Coly Li



On 5/9/23 09:56, Kent Overstreet wrote:
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 39d1d93164..3dba7a9aff 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1618,6 +1618,15 @@ config DEBUG_NOTIFIERS
>  	  This is a relatively cheap check but if you care about maximum
>  	  performance, say N.
>  
> +config DEBUG_CLOSURES
> +	bool "Debug closures (bcache async widgits)"
> +	depends on CLOSURES
> +	select DEBUG_FS
> +	help
> +	Keeps all active closures in a linked list and provides a debugfs
> +	interface to list them, which makes it possible to see asynchronous
> +	operations that get stuck.

According to coding-style.rst, the help text (3 lines above) should be
indented with 2 additional spaces.

> +	help
> +	  Keeps all active closures in a linked list and provides a debugfs
> +	  interface to list them, which makes it possible to see asynchronous
> +	  operations that get stuck.

-- 
~Randy

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()
  2023-05-09 16:56 ` [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic() Kent Overstreet
@ 2023-05-10  2:20   ` kernel test robot
  2023-05-11  2:08   ` kernel test robot
  1 sibling, 0 replies; 186+ messages in thread
From: kernel test robot @ 2023-05-10  2:20 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: llvm, oe-kbuild-all, Kent Overstreet, Alexander Viro, Matthew Wilcox

Hi Kent,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/locking/core]
[cannot apply to axboe-block/for-next akpm-mm/mm-everything kdave/for-next linus/master v6.4-rc1 next-20230509]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kent-Overstreet/Compiler-Attributes-add-__flatten/20230510-010302
base:   tip/locking/core
patch link:    https://lore.kernel.org/r/20230509165657.1735798-24-kent.overstreet%40linux.dev
patch subject: [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()
config: i386-randconfig-a002 (https://download.01.org/0day-ci/archive/20230510/202305101003.uncpRKqA-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/0e5d4229f5e7671dabba56ea36583b1ca20a9a18
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Kent-Overstreet/Compiler-Attributes-add-__flatten/20230510-010302
        git checkout 0e5d4229f5e7671dabba56ea36583b1ca20a9a18
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202305101003.uncpRKqA-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> lib/iov_iter.c:839:16: warning: comparison of distinct pointer types ('typeof (bytes) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12) - (offset & (~(((1UL) << 12) - 1)))) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
                   unsigned b = min(bytes, PAGE_SIZE - (offset & PAGE_MASK));
                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
   #define min(x, y)       __careful_cmp(x, y, <)
                           ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
           __builtin_choose_expr(__safe_cmp(x, y), \
                                 ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
                   (__typecheck(x, y) && __no_side_effects(x, y))
                    ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
           (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                      ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   1 warning generated.


vim +839 lib/iov_iter.c

   825	
   826	size_t copy_folio_from_iter_atomic(struct folio *folio, size_t offset,
   827					   size_t bytes, struct iov_iter *i)
   828	{
   829		size_t ret = 0;
   830	
   831		if (WARN_ON(offset + bytes > folio_size(folio)))
   832			return 0;
   833		if (WARN_ON_ONCE(!i->data_source))
   834			return 0;
   835	
   836	#ifdef CONFIG_HIGHMEM
   837		while (bytes) {
   838			struct page *page = folio_page(folio, offset >> PAGE_SHIFT);
 > 839			unsigned b = min(bytes, PAGE_SIZE - (offset & PAGE_MASK));
   840			unsigned r = __copy_page_from_iter_atomic(page, offset, b, i);
   841	
   842			offset	+= r;
   843			bytes	-= r;
   844			ret	+= r;
   845	
   846			if (r != b)
   847				break;
   848		}
   849	#else
   850		ret = __copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
   851	#endif
   852	
   853		return ret;
   854	}
   855	EXPORT_SYMBOL(copy_folio_from_iter_atomic);
   856	
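
For reference, the usual way to resolve this kind of type mismatch (a
sketch, not a patch from this thread) is to give min() a single type via
min_t(), since bytes is a size_t while the PAGE_SIZE expression is
unsigned long:

		/* both arguments evaluated as size_t, silencing the
		 * distinct-pointer-types warning from min()'s typecheck
		 */
		unsigned b = min_t(size_t, bytes,
				   PAGE_SIZE - (offset & PAGE_MASK));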

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-05-09 16:56 ` [PATCH 22/32] vfs: inode cache conversion to hash-bl Kent Overstreet
@ 2023-05-10  4:45   ` Dave Chinner
  2023-05-16 15:45     ` Christian Brauner
  2023-05-23  9:28   ` (subset) " Christian Brauner
  1 sibling, 1 reply; 186+ messages in thread
From: Dave Chinner @ 2023-05-10  4:45 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Dave Chinner,
	Alexander Viro, Christian Brauner

On Tue, May 09, 2023 at 12:56:47PM -0400, Kent Overstreet wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because scalability of the global inode_hash_lock really, really
> sucks.
> 
> 32-way concurrent create on a couple of different filesystems
> before:
> 
> -   52.13%     0.04%  [kernel]            [k] ext4_create
>    - 52.09% ext4_create
>       - 41.03% __ext4_new_inode
>          - 29.92% insert_inode_locked
>             - 25.35% _raw_spin_lock
>                - do_raw_spin_lock
>                   - 24.97% __pv_queued_spin_lock_slowpath
> 
> -   72.33%     0.02%  [kernel]            [k] do_filp_open
>    - 72.31% do_filp_open
>       - 72.28% path_openat
>          - 57.03% bch2_create
>             - 56.46% __bch2_create
>                - 40.43% inode_insert5
>                   - 36.07% _raw_spin_lock
>                      - do_raw_spin_lock
>                           35.86% __pv_queued_spin_lock_slowpath
>                     4.02% find_inode
> 
> Convert the inode hash table to a RCU-aware hash-bl table just like
> the dentry cache. Note that we need to store a pointer to the
> hlist_bl_head the inode has been added to in the inode so that when
> it comes to unhash the inode we know what list to lock. We need to
> do this because the hash value that is used to hash the inode is
> generated from the inode itself - filesystems can provide this
> themselves so we have to either store the hash or the head pointer
> in the inode to be able to find the right list head for removal...
> 
> Same workload after:
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

I have been keeping this patchset up to date in my own local trees,
and the code in this patch looks the same. The commit message above,
however, has been mangled. The full commit message should be:

vfs: inode cache conversion to hash-bl

Because scalability of the global inode_hash_lock really, really
sucks and prevents me from doing scalability characterisation and
analysis of bcachefs algorithms.

Profiles of a 32-way concurrent create of 51.2m inodes with fsmark
on a couple of different filesystems on a 5.10 kernel:

-   52.13%     0.04%  [kernel]            [k] ext4_create
   - 52.09% ext4_create
      - 41.03% __ext4_new_inode
         - 29.92% insert_inode_locked
            - 25.35% _raw_spin_lock
               - do_raw_spin_lock
                  - 24.97% __pv_queued_spin_lock_slowpath


-   72.33%     0.02%  [kernel]            [k] do_filp_open
   - 72.31% do_filp_open
      - 72.28% path_openat
         - 57.03% bch2_create
            - 56.46% __bch2_create
               - 40.43% inode_insert5
                  - 36.07% _raw_spin_lock
                     - do_raw_spin_lock
                          35.86% __pv_queued_spin_lock_slowpath
                    4.02% find_inode

btrfs was tested but it is limited by internal lock contention at
>=2 threads on this workload, so it never hammers the inode cache lock
hard enough for this change to matter to its performance.

However, both bcachefs and ext4 demonstrate poor scaling at >=8
threads on concurrent lookup or create workloads.

Hence convert the inode hash table to a RCU-aware hash-bl table just
like the dentry cache. Note that we need to store a pointer to the
hlist_bl_head the inode has been added to in the inode so that when
it comes to unhash the inode we know what list to lock. We need to
do this because, unlike the dentry cache, the hash value that is
used to hash the inode is not generated from the inode itself. i.e.
filesystems can provide this themselves so we have to either store
the hashval or the hlist head pointer in the inode to be able to
find the right list head for removal...

Concurrent create with varying thread count (files/s):

                ext4                    bcachefs
threads         vanilla  patched        vanilla patched
2               117k     112k            80k     85k
4               185k     190k           133k    145k
8               303k     346k           185k    255k
16              389k     465k           190k    420k
32              360k     437k           142k    481k

CPU usage for both bcachefs and ext4 at 16 and 32 threads has been
halved on the patched kernel, while performance has increased
marginally on ext4 and massively on bcachefs. Internal filesystem
algorithms now limit performance on these workloads, not the global
inode_hash_lock.

Profile of the workloads on the patched kernels:

-   35.94%     0.07%  [kernel]                  [k] ext4_create
   - 35.87% ext4_create
      - 20.45% __ext4_new_inode
...
           3.36% insert_inode_locked

   - 78.43% do_filp_open
      - 78.36% path_openat
         - 53.95% bch2_create
            - 47.99% __bch2_create
....
              - 7.57% inode_insert5
                    6.94% find_inode

Spinlock contention is largely gone from the inode hash operations
and the filesystems are limited by contention in their internal
algorithms.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---

Other than that, the diffstat is the same and I don't see any obvious
differences in the code compared to what I've been running locally.
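
For readers following along, the head-pointer idea described in the
commit message boils down to something like this (an illustrative
sketch - the field and helper names here are made up, not necessarily
what the patch uses):

#include <linux/list_bl.h>

struct demo_inode {
	struct hlist_bl_node	i_hash;
	struct hlist_bl_head	*i_hash_head;	/* bucket we were inserted into */
};

static void demo_insert_inode_hash(struct demo_inode *inode,
				   struct hlist_bl_head *b)
{
	hlist_bl_lock(b);		/* per-bucket bit spinlock */
	hlist_bl_add_head(&inode->i_hash, b);
	inode->i_hash_head = b;		/* remember which bucket to lock later */
	hlist_bl_unlock(b);
}

static void demo_remove_inode_hash(struct demo_inode *inode)
{
	struct hlist_bl_head *b = inode->i_hash_head;

	/* the hash value isn't derivable from the inode alone, so the
	 * stored head pointer tells us which bucket lock to take
	 */
	hlist_bl_lock(b);
	hlist_bl_del_init(&inode->i_hash);
	inode->i_hash_head = NULL;
	hlist_bl_unlock(b);
}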

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 21/32] hlist-bl: add hlist_bl_fake()
  2023-05-09 16:56 ` [PATCH 21/32] hlist-bl: add hlist_bl_fake() Kent Overstreet
@ 2023-05-10  4:48   ` Dave Chinner
  2023-05-23  9:27   ` (subset) " Christian Brauner
  1 sibling, 0 replies; 186+ messages in thread
From: Dave Chinner @ 2023-05-10  4:48 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Dave Chinner

On Tue, May 09, 2023 at 12:56:46PM -0400, Kent Overstreet wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> in preparation for switching the VFS inode cache over the hlist_bl
  In

> lists, we nee dto be able to fake a list node that looks like it is
            need to

> hased for correct operation of filesystems that don't directly use
  hashed

> the VFS indoe cache.
          inode cache hash index.

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-10  1:07   ` Jan Kara
@ 2023-05-10  6:18     ` Kent Overstreet
  2023-05-23 13:34       ` Jan Kara
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-10  6:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Darrick J . Wong, dhowells

On Wed, May 10, 2023 at 03:07:37AM +0200, Jan Kara wrote:
> On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> > From: Kent Overstreet <kent.overstreet@gmail.com>
> > 
> > This is used by bcachefs to fix a page cache coherency issue with
> > O_DIRECT writes.
> > 
> > Also relevant: mapping->invalidate_lock, see below.
> > 
> > O_DIRECT writes (and other filesystem operations that modify file data
> > while bypassing the page cache) need to shoot down ranges of the page
> > cache - and additionally, need locking to prevent those pages from
> > pulled back in.
> > 
> > But O_DIRECT writes invoke the page fault handler (via get_user_pages),
> > and the page fault handler will need to take that same lock - this is a
> > classic recursive deadlock if userspace has mmaped the file they're DIO
> > writing to and uses those pages for the buffer to write from, and it's a
> > lock ordering deadlock in general.
> > 
> > Thus we need a way to signal from the dio code to the page fault handler
> > when we already are holding the pagecache add lock on an address space -
> > this patch just adds a member to task_struct for this purpose. For now
> > only bcachefs is implementing this locking, though it may be moved out
> > of bcachefs and made available to other filesystems in the future.
> 
> It would be nice to have at least a link to the code that's actually using
> the field you are adding.

Bit of a trick to link to a _later_ patch in the series from a commit
message, but...

https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n975
https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n2454
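
In rough outline the scheme looks like this (a condensed sketch, not the
actual fs-io.c code; the lock handling and helper names are placeholders):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/sched.h>

/* DIO write path: record which mapping we already hold the
 * pagecache-add lock for, before get_user_pages() can fault.
 */
static void demo_dio_write_begin(struct address_space *mapping)
{
	/* take the filesystem's pagecache-add lock on @mapping here */
	current->faults_disabled_mapping = mapping;
}

static void demo_dio_write_end(struct address_space *mapping)
{
	current->faults_disabled_mapping = NULL;
	/* drop the pagecache-add lock on @mapping here */
}

/* Fault path: */
static vm_fault_t demo_page_fault(struct address_space *mapping)
{
	struct address_space *fdm = current->faults_disabled_mapping;

	if (fdm == mapping)
		/* faulting in the very file we're DIO-writing: bail out
		 * rather than self-deadlock on the pagecache-add lock
		 */
		return VM_FAULT_SIGBUS;

	if (fdm) {
		/* two different files involved: take both pagecache-add
		 * locks in a fixed (e.g. address) order so that DIO to A
		 * buffered by mmap of B, and DIO to B buffered by mmap
		 * of A, cannot AB-BA deadlock
		 */
	}

	/* normal fault path: take @mapping's pagecache-add lock, etc. */
	return 0;
}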

> Also I think we were already through this discussion [1] and we ended up
> agreeing that your scheme actually solves only the AA deadlock but a
> malicious userspace can easily create AB BA deadlock by running direct IO
> to file A using mapped file B as a buffer *and* direct IO to file B using
> mapped file A as a buffer.

No, that's definitely handled (and you can see it in the code I linked),
and I wrote a torture test for fstests as well.

David Howells was also just running into a strange locking situation with
iov_iters and recursive gups - I don't recall all the details, but it
sounded like this might be a solution for that. David, did you have
thoughts on that?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 21:29       ` Kent Overstreet
@ 2023-05-10  6:48         ` Eric Biggers
  2023-05-12 18:36           ` Kent Overstreet
  2023-05-10 11:56         ` David Laight
  1 sibling, 1 reply; 186+ messages in thread
From: Eric Biggers @ 2023-05-10  6:48 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Tue, May 09, 2023 at 05:29:10PM -0400, Kent Overstreet wrote:
> On Tue, May 09, 2023 at 02:12:41PM -0700, Lorenzo Stoakes wrote:
> > On Tue, May 09, 2023 at 01:46:09PM -0700, Christoph Hellwig wrote:
> > > On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > > > From: Kent Overstreet <kent.overstreet@gmail.com>
> > > >
> > > > This is needed for bcachefs, which dynamically generates per-btree node
> > > > unpack functions.
> > >
> > > No, we will never add back a way for random code allocating executable
> > > memory in kernel space.
> > 
> > Yeah I think I glossed over this aspect a bit as it looks ostensibly like simply
> > reinstating a helper function because the code is now used in more than one
> > place (at lsf/mm so a little distracted :)
> > 
> > But it being exported is a problem. Perhaps there's another way of acheving the
> > same aim without having to do so?
> 
> None that I see.
> 
> The background is that bcachefs generates a per btree node unpack
> function, based on the packed format for that btree node, for unpacking
> keys within that node. The unpack function is only ~50 bytes, and for
> locality we want it to be located with the btree node's other in-memory
> lookup tables so they can be prefetched all at once.
> 
> Here's the codegen:
> 
> https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/bkey.c#n727

Well, it's a cool trick, but it's not clear that it actually belongs in
production kernel code.  What else in the kernel actually does dynamic codegen?
Just BPF, I think?

Among other issues, this is entirely architecture-specific, and it may cause
interoperability issues with various other features, including security
features.  Is it really safe to leave a W&X page around, for example?

What seems to be missing is any explanation for what we're actually getting from
this extremely unusual solution that cannot be gained any other way.  What is
unique about bcachefs that it really needs something like this?

- Eric

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 20:18     ` Kent Overstreet
  2023-05-09 20:27       ` Waiman Long
@ 2023-05-10  8:59       ` Peter Zijlstra
  2023-05-10 20:38         ` Kent Overstreet
  2023-05-12 20:49         ` Kent Overstreet
  1 sibling, 2 replies; 186+ messages in thread
From: Peter Zijlstra @ 2023-05-10  8:59 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Tue, May 09, 2023 at 04:18:59PM -0400, Kent Overstreet wrote:
> On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> > On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > > This adds a method to tell lockdep not to check lock ordering within a
> > > lock class - but to still check lock ordering w.r.t. other lock types.
> > > 
> > > This is for bcachefs, where for btree node locks we have our own
> > > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > > detection), but we still want lockdep to check lock ordering w.r.t.
> > > other lock types.
> > > 
> > 
> > ISTR you had a much nicer version of this where you gave a custom order
> > function -- what happend to that?
> 
> Actually, I spoke too soon; this patch and the other series with the
> comparison function solve different problems.
> 
> For bcachefs btree node locks, we don't have a defined lock ordering at
> all - we do full runtime cycle detection, so we don't want lockdep
> checking for self deadlock because we're handling that but we _do_ want
> lockdep checking lock ordering of btree node locks w.r.t. other locks in
> the system.

Have you read the ww_mutex code? If not, please do so; it does similar
things.

The way it gets around the self-nesting check is by using the nest_lock
annotation; the acquire context itself also has a dep_map for this
purpose.
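
To sketch the pattern (illustrative only - the struct and counts here are
made up; mutex_lock_nest_lock() is the stock annotation):

#include <linux/mutex.h>

struct demo {
	struct mutex	outer;		/* plays the role of mmap_lock       */
	struct mutex	obj_lock[8];	/* many locks, all one lockdep class */
};

static void demo_take_all(struct demo *d)
{
	int i;

	mutex_lock(&d->outer);

	for (i = 0; i < 8; i++)
		/* lockdep skips the same-class ordering check because
		 * d->outer is held, but still checks obj_lock against
		 * every other lock class in the system
		 */
		mutex_lock_nest_lock(&d->obj_lock[i], &d->outer);

	/* ... do the work that needed all the locks ... */

	for (i = 0; i < 8; i++)
		mutex_unlock(&d->obj_lock[i]);
	mutex_unlock(&d->outer);
}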

^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 21:29       ` Kent Overstreet
  2023-05-10  6:48         ` Eric Biggers
@ 2023-05-10 11:56         ` David Laight
  1 sibling, 0 replies; 186+ messages in thread
From: David Laight @ 2023-05-10 11:56 UTC (permalink / raw)
  To: 'Kent Overstreet', Lorenzo Stoakes
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, linux-mm

From: Kent Overstreet
> Sent: 09 May 2023 22:29
...
> The background is that bcachefs generates a per btree node unpack
> function, based on the packed format for that btree node, for unpacking
> keys within that node. The unpack function is only ~50 bytes, and for
> locality we want it to be located with the btree node's other in-memory
> lookup tables so they can be prefetched all at once.

Loading data into the d-cache isn't going to load code into
the i-cache.
Indeed, you don't want to be mixing code and data in the same
cache line, because it just wastes space in the cache.

Looks to me like you could have a few different unpack
functions and pick the correct one based on the packed format.
Quite likely the code would be just as fast (if longer)
when you allow for parallel execution on modern CPUs.
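
Something along those lines could look like this (a sketch with made-up
names, assuming the packed layouts could be reduced to a small fixed set -
which is the open question, since the format is currently derived per
btree node):

#include <linux/types.h>

enum demo_fmt { DEMO_FMT_A, DEMO_FMT_B, DEMO_FMT_NR };

typedef u64 (*demo_unpack_fn)(const u64 *packed);

/* one hand-written (compile-time) unpacker per supported layout */
static u64 demo_unpack_a(const u64 *packed)
{
	return packed[0] & 0xffffffffULL;	/* low 32 bits */
}

static u64 demo_unpack_b(const u64 *packed)
{
	return packed[0] >> 32;			/* high 32 bits */
}

static const demo_unpack_fn demo_unpack[DEMO_FMT_NR] = {
	[DEMO_FMT_A]	= demo_unpack_a,
	[DEMO_FMT_B]	= demo_unpack_b,
};

/* at lookup time: u64 v = demo_unpack[node_format_id](packed_key); */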

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
  2023-05-09 18:19   ` Lorenzo Stoakes
  2023-05-09 20:46   ` Christoph Hellwig
@ 2023-05-10 14:18   ` Christophe Leroy
  2023-05-10 15:05   ` Johannes Thumshirn
  2023-06-19  9:19   ` Mark Rutland
  4 siblings, 0 replies; 186+ messages in thread
From: Christophe Leroy @ 2023-05-10 14:18 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, linux-mm



On 09/05/2023 at 18:56, Kent Overstreet wrote:
> From: Kent Overstreet <kent.overstreet@gmail.com>
> 
> This is needed for bcachefs, which dynamically generates per-btree node
> unpack functions.
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Uladzislau Rezki <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: linux-mm@kvack.org
> ---
>   include/linux/vmalloc.h |  1 +
>   kernel/module/main.c    |  4 +---
>   mm/nommu.c              | 18 ++++++++++++++++++
>   mm/vmalloc.c            | 21 +++++++++++++++++++++
>   4 files changed, 41 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 69250efa03..ff147fe115 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -145,6 +145,7 @@ extern void *vzalloc(unsigned long size) __alloc_size(1);
>   extern void *vmalloc_user(unsigned long size) __alloc_size(1);
>   extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
>   extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
> +extern void *vmalloc_exec(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
>   extern void *vmalloc_32(unsigned long size) __alloc_size(1);
>   extern void *vmalloc_32_user(unsigned long size) __alloc_size(1);
>   extern void *__vmalloc(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index d3be89de70..9eaa89e84c 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -1607,9 +1607,7 @@ static void dynamic_debug_remove(struct module *mod, struct _ddebug_info *dyndbg
>   
>   void * __weak module_alloc(unsigned long size)
>   {
> -	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> -			GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> -			NUMA_NO_NODE, __builtin_return_address(0));
> +	return vmalloc_exec(size, GFP_KERNEL);
>   }
>   
>   bool __weak module_init_section(const char *name)
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 57ba243c6a..8d9ab19e39 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -280,6 +280,24 @@ void *vzalloc_node(unsigned long size, int node)
>   }
>   EXPORT_SYMBOL(vzalloc_node);
>   
> +/**
> + *	vmalloc_exec  -  allocate virtually contiguous, executable memory
> + *	@size:		allocation size
> + *
> + *	Kernel-internal function to allocate enough pages to cover @size
> + *	the page level allocator and map them into contiguous and
> + *	executable kernel virtual space.
> + *
> + *	For tight control over page level allocator and protection flags
> + *	use __vmalloc() instead.
> + */
> +
> +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> +{
> +	return __vmalloc(size, gfp_mask);
> +}
> +EXPORT_SYMBOL_GPL(vmalloc_exec);
> +
>   /**
>    * vmalloc_32  -  allocate virtually contiguous memory (32bit addressable)
>    *	@size:		allocation size
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 31ff782d36..2ebb9ea7f0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3401,6 +3401,27 @@ void *vzalloc_node(unsigned long size, int node)
>   }
>   EXPORT_SYMBOL(vzalloc_node);
>   
> +/**
> + * vmalloc_exec - allocate virtually contiguous, executable memory
> + * @size:	  allocation size
> + *
> + * Kernel-internal function to allocate enough pages to cover @size
> + * the page level allocator and map them into contiguous and
> + * executable kernel virtual space.
> + *
> + * For tight control over page level allocator and protection flags
> + * use __vmalloc() instead.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> +{
> +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> +			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> +			NUMA_NO_NODE, __builtin_return_address(0));
> +}

That cannot work. The VMALLOC space is mapped non-exec on powerpc/32.
You have to allocate between MODULES_VADDR and MODULES_END if you want
something executable, so you must use module_alloc(); see
https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/powerpc/kernel/module.c#L108

> +EXPORT_SYMBOL_GPL(vmalloc_exec);
> +
>   #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
>   #define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
>   #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
                     ` (2 preceding siblings ...)
  2023-05-10 14:18   ` Christophe Leroy
@ 2023-05-10 15:05   ` Johannes Thumshirn
  2023-05-11 22:28     ` Kees Cook
  2023-06-19  9:19   ` Mark Rutland
  4 siblings, 1 reply; 186+ messages in thread
From: Johannes Thumshirn @ 2023-05-10 15:05 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs, Kees Cook
  Cc: Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	linux-hardening

On 09.05.23 18:56, Kent Overstreet wrote:
> +/**
> + * vmalloc_exec - allocate virtually contiguous, executable memory
> + * @size:	  allocation size
> + *
> + * Kernel-internal function to allocate enough pages to cover @size
> + * the page level allocator and map them into contiguous and
> + * executable kernel virtual space.
> + *
> + * For tight control over page level allocator and protection flags
> + * use __vmalloc() instead.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> +{
> +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> +			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> +			NUMA_NO_NODE, __builtin_return_address(0));
> +}
> +EXPORT_SYMBOL_GPL(vmalloc_exec);

Uh, W+X memory regions.
The 90s called, they want their shellcode back.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-10  8:59       ` Peter Zijlstra
@ 2023-05-10 20:38         ` Kent Overstreet
  2023-05-11  8:25           ` Peter Zijlstra
  2023-05-12 20:49         ` Kent Overstreet
  1 sibling, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-10 20:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Wed, May 10, 2023 at 10:59:05AM +0200, Peter Zijlstra wrote:
> On Tue, May 09, 2023 at 04:18:59PM -0400, Kent Overstreet wrote:
> > On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > > > This adds a method to tell lockdep not to check lock ordering within a
> > > > lock class - but to still check lock ordering w.r.t. other lock types.
> > > > 
> > > > This is for bcachefs, where for btree node locks we have our own
> > > > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > > > detection), but we still want lockdep to check lock ordering w.r.t.
> > > > other lock types.
> > > > 
> > > 
> > > ISTR you had a much nicer version of this where you gave a custom order
> > > function -- what happend to that?
> > 
> > Actually, I spoke too soon; this patch and the other series with the
> > comparison function solve different problems.
> > 
> > For bcachefs btree node locks, we don't have a defined lock ordering at
> > all - we do full runtime cycle detection, so we don't want lockdep
> > checking for self deadlock because we're handling that but we _do_ want
> > lockdep checking lock ordering of btree node locks w.r.t. other locks in
> > the system.
> 
> Have you read the ww_mutex code? If not, please do so, it does similar
> things.
> 
> The way it gets around the self-nesting check is by using the nest_lock
> annotation, the acquire context itself also has a dep_map for this
> purpose.

This might work.

I was confused for a good bit when reading the code to figure out how
it works - nest_lock seems to be a pretty bad name; it's really not a
lock. acquire_ctx?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()
  2023-05-09 16:56 ` [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic() Kent Overstreet
  2023-05-10  2:20   ` kernel test robot
@ 2023-05-11  2:08   ` kernel test robot
  1 sibling, 0 replies; 186+ messages in thread
From: kernel test robot @ 2023-05-11  2:08 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: oe-kbuild-all, Kent Overstreet, Alexander Viro, Matthew Wilcox

Hi Kent,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/locking/core]
[cannot apply to axboe-block/for-next akpm-mm/mm-everything kdave/for-next linus/master v6.4-rc1 next-20230510]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kent-Overstreet/Compiler-Attributes-add-__flatten/20230510-010302
base:   tip/locking/core
patch link:    https://lore.kernel.org/r/20230509165657.1735798-24-kent.overstreet%40linux.dev
patch subject: [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()
config: powerpc-randconfig-s042-20230509 (https://download.01.org/0day-ci/archive/20230511/202305110949.RCcHYzkJ-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 12.1.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.4-39-gce1a6720-dirty
        # https://github.com/intel-lab-lkp/linux/commit/0e5d4229f5e7671dabba56ea36583b1ca20a9a18
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Kent-Overstreet/Compiler-Attributes-add-__flatten/20230510-010302
        git checkout 0e5d4229f5e7671dabba56ea36583b1ca20a9a18
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=powerpc olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=powerpc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202305110949.RCcHYzkJ-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> lib/iov_iter.c:839:30: sparse: sparse: incompatible types in comparison expression (different type sizes):
>> lib/iov_iter.c:839:30: sparse:    unsigned int *
>> lib/iov_iter.c:839:30: sparse:    unsigned long *

vim +839 lib/iov_iter.c

   825	
   826	size_t copy_folio_from_iter_atomic(struct folio *folio, size_t offset,
   827					   size_t bytes, struct iov_iter *i)
   828	{
   829		size_t ret = 0;
   830	
   831		if (WARN_ON(offset + bytes > folio_size(folio)))
   832			return 0;
   833		if (WARN_ON_ONCE(!i->data_source))
   834			return 0;
   835	
   836	#ifdef CONFIG_HIGHMEM
   837		while (bytes) {
   838			struct page *page = folio_page(folio, offset >> PAGE_SHIFT);
 > 839			unsigned b = min(bytes, PAGE_SIZE - (offset & PAGE_MASK));
   840			unsigned r = __copy_page_from_iter_atomic(page, offset, b, i);
   841	
   842			offset	+= r;
   843			bytes	-= r;
   844			ret	+= r;
   845	
   846			if (r != b)
   847				break;
   848		}
   849	#else
   850		ret = __copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
   851	#endif
   852	
   853		return ret;
   854	}
   855	EXPORT_SYMBOL(copy_folio_from_iter_atomic);
   856	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 21:54         ` Kent Overstreet
@ 2023-05-11  5:33           ` Theodore Ts'o
  2023-05-11  5:44             ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Theodore Ts'o @ 2023-05-11  5:33 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Darrick J. Wong, Lorenzo Stoakes, Christoph Hellwig,
	linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, linux-mm

On Tue, May 09, 2023 at 05:54:26PM -0400, Kent Overstreet wrote:
> 
> I think it'd be much more practical to find some way of making
> vmalloc_exec() more palatable. What are the exact concerns?

Having a vmalloc_exec() function (whether it is not exported at all,
or exported as a GPL symbol) makes it *much* easier for an exploit
writer, since it's a super convenient gadget for use with
return-oriented programming[1] to create a writable, executable
space that could then be filled with whatever arbitrary code the
exploit author desires.

[1] https://en.wikipedia.org/wiki/Return-oriented_programming

The other thing I'll note from examining the code generator is that
it appears bcachefs *only* has support for x86_64.  This brings
me back to the days of my youth when all the world was a Vax[2].  :-)

   10.  Thou shalt foreswear, renounce, and abjure the vile heresy
        which claimeth that ``All the world's a VAX'', and have no commerce
	with the benighted heathens who cling to this barbarous belief, that
 	the days of thy program may be long even though the days of thy
	current machine be short.

	[ This particular heresy bids fair to be replaced by ``All the
	world's a Sun'' or ``All the world's a 386'' (this latter
	being a particularly revolting invention of Satan), but the
	words apply to all such without limitation. Beware, in
	particular, of the subtle and terrible ``All the world's a
	32-bit machine'', which is almost true today but shall cease
	to be so before thy resume grows too much longer.]

[2] The Ten Commandments for C Programmers (Annotated Edition)
    https://www.lysator.liu.se/c/ten-commandments.html

Seriously, does this mean that bcachefs won't work on Arm systems
(arm32 or arm64)?  Or RISC-V systems?  Or S/390s?  Or Power
architectures?  Or Itanium or PA-RISC systems?  (OK, I really don't
care all that much about those last two.  :-)


When people ask me why file systems are so hard to make enterprise
ready, I tell them to recall the general advice given to people to
write secure, robust systems: (a) avoid premature optimization, (b)
avoid fine-grained, multi-threaded programming, as much as possible,
because locking bugs are a b*tch, and (c) avoid unnecessary global
state as much as possible.

File systems tend to violate all of these precepts: (a) people chase
benchmark optimizations to the exclusion of all else, because people
have an unhealthy obsession with Phoronix benchmark articles, (b) file
systems tend to be inherently multi-threaded, with lots of locks, and
(c) file systems are all about managing global state in the form of
files, directories, etc.

However, hiding a miniature architecture-specific compiler inside a
file system seems to be a rather blatant example of "premature
optimization".

							- Ted

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-11  5:33           ` Theodore Ts'o
@ 2023-05-11  5:44             ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-11  5:44 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Darrick J. Wong, Lorenzo Stoakes, Christoph Hellwig,
	linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, linux-mm

On Thu, May 11, 2023 at 01:33:12AM -0400, Theodore Ts'o wrote:
> Seriously, does this mean that bcachefs won't work on Arm systems
> (arm32 or arm64)?  Or Risc V systems?  Or S/390's?  Or Power
> architectuers?  Or Itanium or PA-RISC systems?  (OK, I really don't
> care all that much about those last two.  :-)

No :)

My CI servers are arm64 servers. There's a bch2_bkey_unpack_key()
written in C that works on any architecture. But specializing for a
particular format is a not-insignificant performance improvement, so
writing an arm64 version has been on my todo list.

> When people ask me why file systems are so hard to make enterprise
> ready, I tell them to recall the general advice given to people to
> write secure, robust systems: (a) avoid premature optimization, (b)
> avoid fine-grained, multi-threaded programming, as much as possible,
> because locking bugs are a b*tch, and (c) avoid unnecessary global
> state as much as possible.
> 
> File systems tend to violate all of these precepts: (a) people chase
> benchmark optimizations to the exclusion of all else, because people
> have an unhealthy obsession with Phornix benchmark articles, (b) file
> systems tend to be inherently multi-threaded, with lots of locks, and
> (c) file systems are all about managing global state in the form of
> files, directories, etc.
> 
> However, hiding a miniature architecture-specific compiler inside a
> file system seems to be a rather blatent example of "premature
> optimization".

Ted, this project is _15_ years old.

I'm getting ready to write a full explanation of what this is for and
why it's important; I've just been busy with the conference, and I want
to write something good that provides all the context.

I've also been mulling over fallback options, but I don't see any good
ones. The unspecialized C version of unpack has branches (the absolute
minimum - I took my time when I was writing that code, too); the
specialized versions are branchless and _much_ smaller, and the only way
to do that specialization is with some form of dynamic codegen.

But I do owe you all a detailed walkthrough of what this is all about,
so you'll get it in the next day or so.

Cheers,
Kent

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-10 20:38         ` Kent Overstreet
@ 2023-05-11  8:25           ` Peter Zijlstra
  2023-05-11  9:32             ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Peter Zijlstra @ 2023-05-11  8:25 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Wed, May 10, 2023 at 04:38:15PM -0400, Kent Overstreet wrote:
> On Wed, May 10, 2023 at 10:59:05AM +0200, Peter Zijlstra wrote:

> > Have you read the ww_mutex code? If not, please do so, it does similar
> > things.
> > 
> > The way it gets around the self-nesting check is by using the nest_lock
> > annotation, the acquire context itself also has a dep_map for this
> > purpose.
> 
> This might work.
> 
> I was confused for a good bit when reading the code to figure out how
> it works - nest_lock seems to be a pretty bad name, it's really not a
> lock. acquire_ctx?

That's just how ww_mutex uses it, the annotation itself comes from
mm_take_all_locks() where mm->mmap_lock (the lock formerly known as
mmap_sem) is used to serialize multiple acquisitions of vma locks.

That is, no other code takes multiple vma locks (be it i_mmap_rwsem or
anonvma->root->rwsem) in any order. These locks nest inside mmap_lock
and therefore by holding mmap_lock you serialize the whole thing and can
take them in any order you like.
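
Roughly, the annotation looks like this - a simplified sketch of the
mm_take_all_locks() pattern, not the exact code (error handling and the
anon_vma side omitted):

	struct vm_area_struct *vma;
	VMA_ITERATOR(vmi, mm, 0);

	mmap_write_lock(mm);		/* the outer, serializing lock */

	for_each_vma(vmi, vma)
		if (vma->vm_file && vma->vm_file->f_mapping)
			/*
			 * nest_lock annotation: any number of locks in this
			 * class, in any order, because mmap_lock is held
			 */
			down_write_nest_lock(&vma->vm_file->f_mapping->i_mmap_rwsem,
					     &mm->mmap_lock);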

Perhaps now, all these many years later, another name would've made more
sense, but I don't think it's worth the hassle of a tree-wide rename
(there are a few other users since then).

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-11  8:25           ` Peter Zijlstra
@ 2023-05-11  9:32             ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-11  9:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Thu, May 11, 2023 at 10:25:44AM +0200, Peter Zijlstra wrote:
> On Wed, May 10, 2023 at 04:38:15PM -0400, Kent Overstreet wrote:
> > On Wed, May 10, 2023 at 10:59:05AM +0200, Peter Zijlstra wrote:
> 
> > > Have you read the ww_mutex code? If not, please do so, it does similar
> > > things.
> > > 
> > > The way it gets around the self-nesting check is by using the nest_lock
> > > annotation, the acquire context itself also has a dep_map for this
> > > purpose.
> > 
> > This might work.
> > 
> > I was confused for a good bit when reading the code to figure out how
> > it works - nest_lock seems to be a pretty bad name, it's really not a
> > lock. acquire_ctx?
> 
> That's just how ww_mutex uses it, the annotation itself comes from
> mm_take_all_locks() where mm->mmap_lock (the lock formerly known as
> mmap_sem) is used to serialize multi acquisition of vma locks.
> 
> That is, no other code takes multiple vma locks (be it i_mmap_rwsem or
> anonvma->root->rwsem) in any order. These locks nest inside mmap_lock
> and therefore by holding mmap_lock you serialize the whole thing and can
> take them in any order you like.
> 
> Perhaps, now, all these many years later another name would've made more
> sense, but I don't think it's worth the hassle of the tree-wide rename
> (there's a few other users since).

Thanks for the history lesson :)

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 04/32] locking: SIX locks (shared/intent/exclusive)
  2023-05-09 16:56 ` [PATCH 04/32] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
@ 2023-05-11 12:14   ` Jan Engelhardt
  2023-05-12 20:58     ` Kent Overstreet
  2023-05-14 12:15   ` Jeff Layton
  1 sibling, 1 reply; 186+ messages in thread
From: Jan Engelhardt @ 2023-05-11 12:14 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
	Boqun Feng


On Tuesday 2023-05-09 18:56, Kent Overstreet wrote:
>--- /dev/null
>+++ b/include/linux/six.h
>@@ -0,0 +1,210 @@
>+ * There are also operations that take the lock type as a parameter, where the
>+ * type is one of SIX_LOCK_read, SIX_LOCK_intent, or SIX_LOCK_write:
>+ *
>+ *   six_lock_type(lock, type)
>+ *   six_unlock_type(lock, type)
>+ *   six_relock(lock, type, seq)
>+ *   six_trylock_type(lock, type)
>+ *   six_trylock_convert(lock, from, to)
>+ *
>+ * A lock may be held multiple types by the same thread (for read or intent,

"multiple times"

>+// SPDX-License-Identifier: GPL-2.0

The current SPDX list only knows "GPL-2.0-only" or "GPL-2.0-or-later";
please edit.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-10 15:05   ` Johannes Thumshirn
@ 2023-05-11 22:28     ` Kees Cook
  2023-05-12 18:41       ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Kees Cook @ 2023-05-11 22:28 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	linux-hardening

On Wed, May 10, 2023 at 03:05:48PM +0000, Johannes Thumshirn wrote:
> On 09.05.23 18:56, Kent Overstreet wrote:
> > +/**
> > + * vmalloc_exec - allocate virtually contiguous, executable memory
> > + * @size:	  allocation size
> > + *
> > + * Kernel-internal function to allocate enough pages to cover @size
> > + * the page level allocator and map them into contiguous and
> > + * executable kernel virtual space.
> > + *
> > + * For tight control over page level allocator and protection flags
> > + * use __vmalloc() instead.
> > + *
> > + * Return: pointer to the allocated memory or %NULL on error
> > + */
> > +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> > +{
> > +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> > +			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> > +			NUMA_NO_NODE, __builtin_return_address(0));
> > +}
> > +EXPORT_SYMBOL_GPL(vmalloc_exec);
> 
> > Uh W+X memory regions.
> The 90s called, they want their shellcode back.

Just to clarify: the kernel must never create W+X memory regions. So,
no, do not reintroduce vmalloc_exec().

Dynamic code areas need to be constructed in a non-executable memory,
then switched to read-only and verified to still be what was expected,
and only then made executable.
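
Roughly the shape of that pattern, for reference (a sketch - the exact
helpers vary by arch and kernel version, and verify_blob() here is a
hypothetical hook, not a real API):

static void *alloc_exec_blob(const void *src, size_t len)
{
	unsigned int npages = DIV_ROUND_UP(len, PAGE_SIZE);
	void *p = module_alloc(len);

	if (!p)
		return NULL;

	memcpy(p, src, len);			/* construct: writable, not executable */

	set_memory_ro((unsigned long) p, npages);	/* then read-only */

	if (!verify_blob(p, src, len)) {	/* verify it's still what we wrote */
		set_memory_rw((unsigned long) p, npages);
		vfree(p);
		return NULL;
	}

	set_memory_x((unsigned long) p, npages);	/* only now executable */
	return p;
}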

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-10  6:48         ` Eric Biggers
@ 2023-05-12 18:36           ` Kent Overstreet
  2023-05-13  1:57             ` Eric Biggers
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-12 18:36 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Tue, May 09, 2023 at 11:48:49PM -0700, Eric Biggers wrote:
> What seems to be missing is any explanation for what we're actually getting from
> this extremely unusual solution that cannot be gained any other way.  What is
> unique about bcachefs that it really needs something like this?

Ok, as promised:

Background: all metadata in bcachefs is a structured as key/value pairs,
and there's a common key format for all keys.

struct bkey {
	/* 3 byte header */
	u8		u64s;		/* size of k/v in u64s */
	u8		format;		/* packed/unpacked, needs_whiteout */
	u8		type;		/* value type */
	u8		pad;

	/*
	 * Order of fields below is for little endian, they're in
	 * reverse order on big endian (and byte swabbed as necessary
	 * when reading foreign endian metadata)
	 * 
	 * Since field order matches byte order, the key can be treated
	 * as one large multi word integer for doing comparisons:
	 */
	u96		version;	/* nonces, send/recv support */
	u32		size;		/* size of extent keys */

	/* Below are the field used for ordering/comparison: */
	u32		snapshot;	
	u64		offset;
	u64		inode;

	/* Value is stored inline with key */
	struct bch_val	v;
};

sizeof(struct bkey) == 40.

An extent value that has one pointer and no checksum is 8 bytes, with
one pointer and one 32 bit checksum 16 bytes, for 56 bytes total (key
included).

But for a given btree node, most of the key fields will typically be
redundant. An extents leaf node might have extents all for one inode
number or a small range of inode numbers, snapshots may or may not be in
use, etc. - clearly some compression is desirable here.

The key observation is that key compression is possible if we have a
compression function that preserves order, and an order-preserving
compression function is possible if it's allowed to fail. Since order is
preserved, we can do comparisons directly on packed keys, which lets us
skip _most_ unpack operations, for btree node resorts and for lookups
within a node.

Packing works by defining a format with an offset and a bit width for
each field, so e.g. if all keys in a btree node have the same inode
number the packed format can specify that inode number and then a field
width of 0 bits for the inode field.

Continuing the arithmetic from before, a packed extent key will
typically be only 8 or 16 bytes, or 24-32 including the val, which means
bkey packing cuts our metadata size roughly in half.

(It also makes our key format somewhat self describing and gives us a
mechanism by which we could add or extend fields in the future).
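
To make that concrete, the format is essentially a per-field (offset, bit
width) pair, and packing is allowed to fail - a sketch of the idea, not
the exact bcachefs definitions:

struct bkey_format {
	u8	key_u64s;		/* size of a packed key, in u64s */
	u8	nr_fields;
	u8	bits_per_field[6];	/* 0 == field is constant in this node */
	u64	field_offset[6];	/* subtracted from the field when packing */
};

/* returns false if v can't be represented in this format */
static bool pack_field(u64 *out, u64 v, const struct bkey_format *f, unsigned i)
{
	if (v < f->field_offset[i])
		return false;

	v -= f->field_offset[i];

	if (f->bits_per_field[i] < 64 && (v >> f->bits_per_field[i]))
		return false;		/* doesn't fit in the field width */

	*out = v;
	return true;
}

(A key that can't be packed is just stored unpacked - the format byte in
the header says which.)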

-----------------------------------------------------

As mentioned before, since packed bkeys are still multi-word integers we
can do some important operations without unpacking, but to iterate over
keys, compare packed & unpacked keys in resort, etc. - we'll still need
to unpack, so we need this operation to be as fast as possible.

bkey.c __bch2_bkey_unpack_key() is the unspecialized version written in
C that works on any architecture. It loops over the fields in a
bkey_format, pulling them out of the input words and adding back the
field offsets. It's got the absolute minimum number of branches - one
per field, when deciding to advance to the next input word - but it
can't be branchless and it's a whole ton of shifts and bitops.
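
For reference, the per-field work in that loop looks something like this
(a simplified sketch that assumes an LSB-first bit stream and fields of
at most 64 bits - not the real layout, just to show where the one branch
per field comes from):

static u64 get_field(const u64 *in, unsigned *bit_offset,
		     unsigned bits, u64 field_offset)
{
	unsigned word = *bit_offset / 64, shift = *bit_offset % 64;
	u64 v = in[word] >> shift;

	if (bits + shift > 64)		/* the branch: field spans two words */
		v |= in[word + 1] << (64 - shift);

	if (bits < 64)
		v &= (1ULL << bits) - 1;

	*bit_offset += bits;
	return v + field_offset;	/* add back the format's field offset */
}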

Dynamic codegen lets us produce unpack functions that are fully
branchless and _much_ smaller. For any given btree node we'll have a
format where multiple fields have a 0-bit field width - i.e. those
fields are always constants. That code goes away, and if the format can
be byte aligned we can also eliminate shifts and bitops. Code size for
the dynamically compiled unpack functions is roughly 10% that of the
unspecialized C version.
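
As an illustration, for a node where inode and snapshot are constant and
the remaining fields happen to come out word/byte aligned, the generated
function is morally equivalent to this (constants made up; the real
versions are emitted as machine code, not C):

static void unpack_specialized(struct bkey *out, const u64 *in)
{
	out->inode	= 4096;		/* 0-bit field: a constant */
	out->snapshot	= 1;		/* 0-bit field: a constant */
	out->offset	= in[0];	/* word aligned: no shift, no mask */
	out->size	= (u32) in[1];	/* byte aligned: truncating load */
}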

I hope that addresses some of the "what is this even for" questions :)

Cheers,
Kent

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-11 22:28     ` Kees Cook
@ 2023-05-12 18:41       ` Kent Overstreet
  2023-05-16 21:02         ` Kees Cook
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-12 18:41 UTC (permalink / raw)
  To: Kees Cook
  Cc: Johannes Thumshirn, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	linux-hardening

On Thu, May 11, 2023 at 03:28:40PM -0700, Kees Cook wrote:
> On Wed, May 10, 2023 at 03:05:48PM +0000, Johannes Thumshirn wrote:
> > On 09.05.23 18:56, Kent Overstreet wrote:
> > > +/**
> > > + * vmalloc_exec - allocate virtually contiguous, executable memory
> > > + * @size:	  allocation size
> > > + *
> > > + * Kernel-internal function to allocate enough pages to cover @size
> > > + * the page level allocator and map them into contiguous and
> > > + * executable kernel virtual space.
> > > + *
> > > + * For tight control over page level allocator and protection flags
> > > + * use __vmalloc() instead.
> > > + *
> > > + * Return: pointer to the allocated memory or %NULL on error
> > > + */
> > > +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> > > +{
> > > +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> > > +			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> > > +			NUMA_NO_NODE, __builtin_return_address(0));
> > > +}
> > > +EXPORT_SYMBOL_GPL(vmalloc_exec);
> > 
> > Uh W+X memory regions.
> > The 90s called, they want their shellcode back.
> 
> Just to clarify: the kernel must never create W+X memory regions. So,
> no, do not reintroduce vmalloc_exec().
> 
> Dynamic code areas need to be constructed in a non-executable memory,
> then switched to read-only and verified to still be what was expected,
> and only then made executable.

So if we're opening this up to the topic of what an acceptable API would
look like - how hard is this requirement?

The reason is that the functions we're constructing are only ~50 bytes,
so we don't want to be burning a full page per function (particularly
for the 64kb page architectures...)

If we were to build an allocator for sub-page dynamically constructed
code, we'd have to flip the whole page to W+X while copying in the new
function. But, we could construct it in non executable memory and then
hand it off to this new allocator to do the copy, which would also do
the page permission flipping.
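
Something like this, say (completely hypothetical interface, names
invented just to illustrate the division of labour):

/*
 * Caller constructs the function in plain non-executable memory;
 * the allocator owns a shared RO+X pool, does the copy and the page
 * permission flipping, so nothing outside it ever sees a W+X mapping
 * and ~50 byte functions can share a page.
 */
void *jit_alloc(const void *src, size_t len);
void jit_free(const void *ptr, size_t len);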

It seems like this is something BPF might want eventually too, depending
on average BPF program size...

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-10  8:59       ` Peter Zijlstra
  2023-05-10 20:38         ` Kent Overstreet
@ 2023-05-12 20:49         ` Kent Overstreet
  1 sibling, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-12 20:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Wed, May 10, 2023 at 10:59:05AM +0200, Peter Zijlstra wrote:
> On Tue, May 09, 2023 at 04:18:59PM -0400, Kent Overstreet wrote:
> > On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > > > This adds a method to tell lockdep not to check lock ordering within a
> > > > lock class - but to still check lock ordering w.r.t. other lock types.
> > > > 
> > > > This is for bcachefs, where for btree node locks we have our own
> > > > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > > > detection), but we still want lockdep to check lock ordering w.r.t.
> > > > other lock types.
> > > > 
> > > 
> > > ISTR you had a much nicer version of this where you gave a custom order
> > > function -- what happend to that?
> > 
> > Actually, I spoke too soon; this patch and the other series with the
> > comparison function solve different problems.
> > 
> > For bcachefs btree node locks, we don't have a defined lock ordering at
> > all - we do full runtime cycle detection, so we don't want lockdep
> > checking for self deadlock because we're handling that but we _do_ want
> > lockdep checking lock ordering of btree node locks w.r.t. other locks in
> > the system.
> 
> Have you read the ww_mutex code? If not, please do so, it does similar
> things.
> 
> The way it gets around the self-nesting check is by using the nest_lock
> annotation, the acquire context itself also has a dep_map for this
> purpose.

So, spent some time plumbing this through the six lock code and seeing
how it'd work.

I like it in theory, it's got the right semantics and it would allow for
lockdep to check that we're not taking locks with more than one
btree_trans in the same thread. Unfortunately, we've got code paths that
are meant to be called from contexts that may or may not have a
btree_trans - and this is fine right now, because they just use
trylock(), but having to plumb nest_lock through is going to make a mess
of things.

(The relevant codepaths include shrinker callbacks, where we definitely
cannot just init a new btree_trans, and also the btree node write path,
which can be kicked off from all sorts of places.)

Can we go with lockdep_set_no_check_recursion() for now? It's a pretty
small addition to the lockdep code.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 04/32] locking: SIX locks (shared/intent/exclusive)
  2023-05-11 12:14   ` Jan Engelhardt
@ 2023-05-12 20:58     ` Kent Overstreet
  2023-05-12 22:39       ` Jan Engelhardt
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-12 20:58 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
	Boqun Feng

On Thu, May 11, 2023 at 02:14:08PM +0200, Jan Engelhardt wrote:
> 
> On Tuesday 2023-05-09 18:56, Kent Overstreet wrote:
> >--- /dev/null
> >+++ b/include/linux/six.h
> >@@ -0,0 +1,210 @@
> >+ * There are also operations that take the lock type as a parameter, where the
> >+ * type is one of SIX_LOCK_read, SIX_LOCK_intent, or SIX_LOCK_write:
> >+ *
> >+ *   six_lock_type(lock, type)
> >+ *   six_unlock_type(lock, type)
> >+ *   six_relock(lock, type, seq)
> >+ *   six_trylock_type(lock, type)
> >+ *   six_trylock_convert(lock, from, to)
> >+ *
> >+ * A lock may be held multiple types by the same thread (for read or intent,
> 
> "multiple times"

Thanks, fixed

> >+// SPDX-License-Identifier: GPL-2.0
> 
> The currently SPDX list only knows "GPL-2.0-only" or "GPL-2.0-or-later",
> please edit.

Where is that list?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 04/32] locking: SIX locks (shared/intent/exclusive)
  2023-05-12 20:58     ` Kent Overstreet
@ 2023-05-12 22:39       ` Jan Engelhardt
  2023-05-12 23:26         ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Jan Engelhardt @ 2023-05-12 22:39 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
	Boqun Feng


On Friday 2023-05-12 22:58, Kent Overstreet wrote:
>On Thu, May 11, 2023 at 02:14:08PM +0200, Jan Engelhardt wrote:
>> >+// SPDX-License-Identifier: GPL-2.0
>> 
>> The currently SPDX list only knows "GPL-2.0-only" or "GPL-2.0-or-later",
>> please edit.
>
>Where is that list?

I just went to spdx.org and then chose "License List" from the
horizontal top bar menu.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 04/32] locking: SIX locks (shared/intent/exclusive)
  2023-05-12 22:39       ` Jan Engelhardt
@ 2023-05-12 23:26         ` Kent Overstreet
  2023-05-12 23:49           ` Randy Dunlap
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-12 23:26 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
	Boqun Feng

On Sat, May 13, 2023 at 12:39:34AM +0200, Jan Engelhardt wrote:
> 
> On Friday 2023-05-12 22:58, Kent Overstreet wrote:
> >On Thu, May 11, 2023 at 02:14:08PM +0200, Jan Engelhardt wrote:
> >> >+// SPDX-License-Identifier: GPL-2.0
> >> 
> >> The currently SPDX list only knows "GPL-2.0-only" or "GPL-2.0-or-later",
> >> please edit.
> >
> >Where is that list?
> 
> I just went to spdx.org and then chose "License List" from the
> horizontal top bar menu.

Do we have anything more official? Quick grep through the source tree
says I'm following accepted usage.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 04/32] locking: SIX locks (shared/intent/exclusive)
  2023-05-12 23:26         ` Kent Overstreet
@ 2023-05-12 23:49           ` Randy Dunlap
  2023-05-13  0:17             ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Randy Dunlap @ 2023-05-12 23:49 UTC (permalink / raw)
  To: Kent Overstreet, Jan Engelhardt
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
	Boqun Feng



On 5/12/23 16:26, Kent Overstreet wrote:
> On Sat, May 13, 2023 at 12:39:34AM +0200, Jan Engelhardt wrote:
>>
>> On Friday 2023-05-12 22:58, Kent Overstreet wrote:
>>> On Thu, May 11, 2023 at 02:14:08PM +0200, Jan Engelhardt wrote:
>>>>> +// SPDX-License-Identifier: GPL-2.0
>>>>
>>>> The currently SPDX list only knows "GPL-2.0-only" or "GPL-2.0-or-later",
>>>> please edit.
>>>
>>> Where is that list?
>>
>> I just went to spdx.org and then chose "License List" from the
>> horizontal top bar menu.
> 
> Do we have anything more official? Quick grep through the source tree
> says I'm following accepted usage.

Documentation/process/license-rules.rst points to spdx.org for
further info.

or LICENSES/preferred/GPL-2.0 contains this:
Valid-License-Identifier: GPL-2.0
Valid-License-Identifier: GPL-2.0-only
Valid-License-Identifier: GPL-2.0+
Valid-License-Identifier: GPL-2.0-or-later
SPDX-URL: https://spdx.org/licenses/GPL-2.0.html


-- 
~Randy

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 04/32] locking: SIX locks (shared/intent/exclusive)
  2023-05-12 23:49           ` Randy Dunlap
@ 2023-05-13  0:17             ` Kent Overstreet
  2023-05-13  0:45               ` Eric Biggers
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-13  0:17 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Jan Engelhardt, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Waiman Long, Boqun Feng

On Fri, May 12, 2023 at 04:49:04PM -0700, Randy Dunlap wrote:
> 
> 
> On 5/12/23 16:26, Kent Overstreet wrote:
> > On Sat, May 13, 2023 at 12:39:34AM +0200, Jan Engelhardt wrote:
> >>
> >> On Friday 2023-05-12 22:58, Kent Overstreet wrote:
> >>> On Thu, May 11, 2023 at 02:14:08PM +0200, Jan Engelhardt wrote:
> >>>>> +// SPDX-License-Identifier: GPL-2.0
> >>>>
> >>>> The currently SPDX list only knows "GPL-2.0-only" or "GPL-2.0-or-later",
> >>>> please edit.
> >>>
> >>> Where is that list?
> >>
> >> I just went to spdx.org and then chose "License List" from the
> >> horizontal top bar menu.
> > 
> > Do we have anything more official? Quick grep through the source tree
> > says I'm following accepted usage.
> 
> Documentation/process/license-rules.rst points to spdx.org for
> further info.
> 
> or LICENSES/preferred/GPL-2.0 contains this:
> Valid-License-Identifier: GPL-2.0
> Valid-License-Identifier: GPL-2.0-only
> Valid-License-Identifier: GPL-2.0+
> Valid-License-Identifier: GPL-2.0-or-later
> SPDX-URL: https://spdx.org/licenses/GPL-2.0.html

Thanks, I'll leave it at GPL-2.0 then.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 04/32] locking: SIX locks (shared/intent/exclusive)
  2023-05-13  0:17             ` Kent Overstreet
@ 2023-05-13  0:45               ` Eric Biggers
  2023-05-13  0:51                 ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Eric Biggers @ 2023-05-13  0:45 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Randy Dunlap, Jan Engelhardt, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Fri, May 12, 2023 at 08:17:23PM -0400, Kent Overstreet wrote:
> On Fri, May 12, 2023 at 04:49:04PM -0700, Randy Dunlap wrote:
> > 
> > 
> > On 5/12/23 16:26, Kent Overstreet wrote:
> > > On Sat, May 13, 2023 at 12:39:34AM +0200, Jan Engelhardt wrote:
> > >>
> > >> On Friday 2023-05-12 22:58, Kent Overstreet wrote:
> > >>> On Thu, May 11, 2023 at 02:14:08PM +0200, Jan Engelhardt wrote:
> > >>>>> +// SPDX-License-Identifier: GPL-2.0
> > >>>>
> > >>>> The currently SPDX list only knows "GPL-2.0-only" or "GPL-2.0-or-later",
> > >>>> please edit.
> > >>>
> > >>> Where is that list?
> > >>
> > >> I just went to spdx.org and then chose "License List" from the
> > >> horizontal top bar menu.
> > > 
> > > Do we have anything more official? Quick grep through the source tree
> > > says I'm following accepted usage.
> > 
> > Documentation/process/license-rules.rst points to spdx.org for
> > further info.
> > 
> > or LICENSES/preferred/GPL-2.0 contains this:
> > Valid-License-Identifier: GPL-2.0
> > Valid-License-Identifier: GPL-2.0-only
> > Valid-License-Identifier: GPL-2.0+
> > Valid-License-Identifier: GPL-2.0-or-later
> > SPDX-URL: https://spdx.org/licenses/GPL-2.0.html
> 
> Thanks, I'll leave it at GPL-2.0 then.

https://spdx.org/licenses/GPL-2.0.html says that GPL-2.0 is deprecated.  Its
replacement is https://spdx.org/licenses/GPL-2.0-only.html.  Yes, they mean the
same thing, but the new names were introduced to be clearer than the old ones.

- Eric

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 04/32] locking: SIX locks (shared/intent/exclusive)
  2023-05-13  0:45               ` Eric Biggers
@ 2023-05-13  0:51                 ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-13  0:51 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Randy Dunlap, Jan Engelhardt, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Fri, May 12, 2023 at 05:45:18PM -0700, Eric Biggers wrote:
> On Fri, May 12, 2023 at 08:17:23PM -0400, Kent Overstreet wrote:
> > On Fri, May 12, 2023 at 04:49:04PM -0700, Randy Dunlap wrote:
> > > 
> > > 
> > > On 5/12/23 16:26, Kent Overstreet wrote:
> > > > On Sat, May 13, 2023 at 12:39:34AM +0200, Jan Engelhardt wrote:
> > > >>
> > > >> On Friday 2023-05-12 22:58, Kent Overstreet wrote:
> > > >>> On Thu, May 11, 2023 at 02:14:08PM +0200, Jan Engelhardt wrote:
> > > >>>>> +// SPDX-License-Identifier: GPL-2.0
> > > >>>>
> > > >>>> The currently SPDX list only knows "GPL-2.0-only" or "GPL-2.0-or-later",
> > > >>>> please edit.
> > > >>>
> > > >>> Where is that list?
> > > >>
> > > >> I just went to spdx.org and then chose "License List" from the
> > > >> horizontal top bar menu.
> > > > 
> > > > Do we have anything more official? Quick grep through the source tree
> > > > says I'm following accepted usage.
> > > 
> > > Documentation/process/license-rules.rst points to spdx.org for
> > > further info.
> > > 
> > > or LICENSES/preferred/GPL-2.0 contains this:
> > > Valid-License-Identifier: GPL-2.0
> > > Valid-License-Identifier: GPL-2.0-only
> > > Valid-License-Identifier: GPL-2.0+
> > > Valid-License-Identifier: GPL-2.0-or-later
> > > SPDX-URL: https://spdx.org/licenses/GPL-2.0.html
> > 
> > Thanks, I'll leave it at GPL-2.0 then.
> 
> https://spdx.org/licenses/GPL-2.0.html says that GPL-2.0 is deprecated.  Its
> replacement is https://spdx.org/licenses/GPL-2.0-only.html.  Yes, they mean the
> same thing, but the new names were introduced to be clearer than the old ones.

Perhaps updating LICENSES/preferred/GPL-2.0 is in order.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-12 18:36           ` Kent Overstreet
@ 2023-05-13  1:57             ` Eric Biggers
  2023-05-13 19:28               ` Kent Overstreet
  2023-05-14  5:45               ` Kent Overstreet
  0 siblings, 2 replies; 186+ messages in thread
From: Eric Biggers @ 2023-05-13  1:57 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

Hi Kent,

On Fri, May 12, 2023 at 02:36:13PM -0400, Kent Overstreet wrote:
> On Tue, May 09, 2023 at 11:48:49PM -0700, Eric Biggers wrote:
> > What seems to be missing is any explanation for what we're actually getting from
> > this extremely unusual solution that cannot be gained any other way.  What is
> > unique about bcachefs that it really needs something like this?
> 
> Ok, as promised:
> 
> Background: all metadata in bcachefs is a structured as key/value pairs,
> and there's a common key format for all keys.
> 
> struct bkey {
> 	/* 3 byte header */
> 	u8		u64s;		/* size of k/v in u64s */
> 	u8		format;		/* packed/unpacked, needs_whiteout */
> 	u8		type;		/* value type */
> 	u8		pad;
> 
> 	/*
> 	 * Order of fields below is for little endian, they're in
> 	 * reverse order on big endian (and byte swabbed as necessary
> 	 * when reading foreign endian metadata)
> 	 * 
> 	 * Since field order matches byte order, the key can be treated
> 	 * as one large multi word integer for doing comparisons:
> 	 */
> 	u96		version;	/* nonces, send/recv support */
> 	u32		size;		/* size of extent keys */
> 
> 	/* Below are the field used for ordering/comparison: */
> 	u32		snapshot;	
> 	u64		offset;
> 	u64		inode;
> 
> 	/* Value is stored inline with key */
> 	struct bch_val	v;
> };
> 
> sizeof(struct bkey) == 40.
> 
> An extent value that has one pointer and no checksum is 8 bytes, with
> one pointer and one 32 bit checksum 16 bytes, for 56 bytes total (key
> included).
> 
> But for a given btree node, most of the key fields will typically be
> redundandant. An extents leaf node might have extents for all one inode
> number or a small range of inode numbers, snapshots may or may not be in
> use, etc. - clearly some compression is desirable here.
> 
> The key observation is that key compression is possible if we have a
> compression function that preserves order, and an order-preserving
> compression function is possible if it's allowed to fail. That means we
> do comparisons on packed keys, which lets us skip _most_ unpack
> operations, for btree node resorts and for lookups within a node.
> 
> Packing works by defining a format with an offset and a bit width for
> each field, so e.g. if all keys in a btree node have the same inode
> number the packed format can specify that inode number and then a field
> width of 0 bits for the inode field.
> 
> Continuing the arithmetic from before, a packed extent key will
> typically be only 8 or 16 bytes, or 24-32 including the val, which means
> bkey packing cuts our metadata size roughly in half.
> 
> (It also makes our key format somewhat self describing and gives us a
> mechanism by which we could add or extend fields in the future).
> 
> -----------------------------------------------------
> 
> As mentioned before, since packed bkeys are still multi-word integers we
> can do some important operations without unpacking, but to iterate over
> keys, compare packed & unpacked keys in resort, etc. - we'll still need
> to unpack, so we need this operation to be as fast as possible.
> 
> bkey.c __bch2_bkey_unpack_key() is the unspecialized version written in
> C, that works on any archictecture. It loops over the fields in a
> bkey_format, pulling them out of the input words and adding back the
> field offsets. It's got the absolute minimum number of branches - one
> per field, when deciding to advance to the next input word - but it
> can't be branchless and it's a whole ton of shifts and bitops.
> 
> dynamic codegen lets us produce unpack functions that are fully
> branchless and _much_ smaller. For any given btree node we'll have a
> format where multiple fields have 0 field with - i.e. those fields are
> always constants. That code goes away, and also if the format can be
> byte aligned we can eliminate shifts and bitopts. Code size for the
> dynamically compiled unpack functions is roughly 10% that of the
> unspecialized C version.
> 
> I hope that addresses some of the "what is this even for" questions :)
> 
> Cheers,
> Kent

I don't think this response addresses all the possibilities for optimizing the C
implementation, so I'd like to bring up a few and make sure that you've explored
them.

To summarize, you need to decode 6 fields that are each a variable number of
bits (not known at compile time), and add an offset (also not known at compile
time) to each field.

I don't think the offset is particularly interesting.  Adding an offset to each
field is very cheap and trivially parallelizable by the CPU.

It's really the bit width that's "interesting", as it must be the serialized
decoding of variable-length fields that slows things down a lot.

First, I wanted to mention that decoding of variable-length fields has been
extensively studied for decompression algorithms, e.g. for Huffman decoding.
And it turns out that it can be done branchlessly.  The basic idea is that you
have a branchless refill step that looks like the following:

#define REFILL_BITS_BRANCHLESS()                    \
        bitbuf |= get_unaligned_u64(p) << bitsleft; \
        p += 7 - ((bitsleft >> 3) & 0x7);           \
        bitsleft |= 56;

That branchlessly ensures that 'bitbuf' contains '56 <= bitsleft <= 63' bits.
Then, the needed number of bits can be removed and returned:

#define READ_BITS(n)                          \
        REFILL_BITS_BRANCHLESS();             \
        tmp = bitbuf & (((u64)1 << (n)) - 1); \
        bitbuf >>= (n);                       \
        bitsleft -= (n);                      \
        tmp

If you're interested, I can give you some references about the above method.
But, I really just wanted to mention it for completeness, since I think you'd
actually want to go in a slightly different direction, since (a) you have all
the field widths available from the beginning, as opposed to being interleaved
into the bitstream itself (as is the case in Huffman decoding for example), so
you're not limited to serialized decoding of each field, (b) your fields are up
to 96 bits, and (c) you've selected a bitstream convention that seems to make it
such that your stream *must* be read in aligned units of u64, so I don't think
something like REFILL_BITS_BRANCHLESS() could work for you anyway.

What I would suggest instead is preprocessing the list of 6 field lengths to
create some information that can be used to extract all 6 fields branchlessly
with no dependencies between different fields.  (And you clearly *can* add a
preprocessing step, as you already have one -- the dynamic code generator.)

So, something like the following:

    const struct field_info *info = &format->fields[0];

    field0 = (in->u64s[info->word_idx] >> info->shift1) & info->mask;
    field0 |= in->u64s[info->word_idx - 1] >> info->shift2;

... but with the code for all 6 fields interleaved.

On modern CPUs, I think that would be faster than your current C code.

You could do better by creating variants that are specialized for specific
common sets of parameters.  During "preprocessing", you would select a variant
and set an enum accordingly.  During decoding, you would switch on that enum and
call the appropriate variant.  (This could also be done with a function pointer,
of course, but indirect calls are slow these days...)
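
i.e. roughly this shape (names invented here, and assuming a variant
field gets added to the preprocessed format):

    enum unpack_variant { UNPACK_GENERIC, UNPACK_1_U64, UNPACK_2_U64 };

    switch (format->variant) {          /* selected once, during preprocessing */
    case UNPACK_1_U64:  return unpack_1_u64(format, in);
    case UNPACK_2_U64:  return unpack_2_u64(format, in);
    default:            return unpack_generic(format, in);
    }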

For example, you mentioned that 8-byte packed keys is a common case.  In that
case there is only a single u64 to decode from, so you could create a function
that just handles that case:

    field0 = (word >> info->shift) & info->mask;

You could also create other variants, e.g.:

- 16-byte packed keys (which you mentioned are common)
- Some specific set of fields have zero width so don't need to be extracted
  (which it sounds like is common, or is it different fields each time?)
- All fields having specific lengths (are there any particularly common cases?)

Have you considered any of these ideas?

- Eric

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 21:12     ` Lorenzo Stoakes
  2023-05-09 21:29       ` Kent Overstreet
  2023-05-09 21:43       ` Darrick J. Wong
@ 2023-05-13 13:25       ` Lorenzo Stoakes
  2023-05-14 18:39         ` Christophe Leroy
  2 siblings, 1 reply; 186+ messages in thread
From: Lorenzo Stoakes @ 2023-05-13 13:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, linux-mm

On Tue, May 09, 2023 at 02:12:41PM -0700, Lorenzo Stoakes wrote:
> On Tue, May 09, 2023 at 01:46:09PM -0700, Christoph Hellwig wrote:
> > On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > > From: Kent Overstreet <kent.overstreet@gmail.com>
> > >
> > > This is needed for bcachefs, which dynamically generates per-btree node
> > > unpack functions.
> >
> > No, we will never add back a way for random code allocating executable
> > memory in kernel space.
>
> Yeah I think I glossed over this aspect a bit as it looks ostensibly like simply
> reinstating a helper function because the code is now used in more than one
> place (at lsf/mm so a little distracted :)
>
> But it being exported is a problem. Perhaps there's another way of acheving the
> same aim without having to do so?

Just to be abundantly clear, my original ack was a mistake (I overlooked
the _exporting_ of the function being as significant as it is and assumed
in an LSF/MM haze that it was simply a refactoring of _already available_
functionality rather than newly providing a means to allocate directly
executable kernel memory).

Exporting this is horrible for the numerous reasons expounded on in this
thread; we need a different solution.

Nacked-by: Lorenzo Stoakes <lstoakes@gmail.com>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-13  1:57             ` Eric Biggers
@ 2023-05-13 19:28               ` Kent Overstreet
  2023-05-14  5:45               ` Kent Overstreet
  1 sibling, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-13 19:28 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Fri, May 12, 2023 at 06:57:52PM -0700, Eric Biggers wrote:
> What I would suggest instead is preprocessing the list of 6 field lengths to
> create some information that can be used to extract all 6 fields branchlessly
> with no dependencies between different fields.  (And you clearly *can* add a
> preprocessing step, as you already have one -- the dynamic code generator.)
> 
> So, something like the following:
> 
>     const struct field_info *info = &format->fields[0];
> 
>     field0 = (in->u64s[info->word_idx] >> info->shift1) & info->mask;
>     field0 |= in->u64s[info->word_idx - 1] >> info->shift2;
> 
> ... but with the code for all 6 fields interleaved.
> 
> On modern CPUs, I think that would be faster than your current C code.
> 
> You could do better by creating variants that are specialized for specific
> common sets of parameters.  During "preprocessing", you would select a variant
> and set an enum accordingly.  During decoding, you would switch on that enum and
> call the appropriate variant.  (This could also be done with a function pointer,
> of course, but indirect calls are slow these days...)
> 
> For example, you mentioned that 8-byte packed keys is a common case.  In that
> case there is only a single u64 to decode from, so you could create a function
> that just handles that case:
> 
>     field0 = (word >> info->shift) & info->mask;
> 
> You could also create other variants, e.g.:
> 
> - 16-byte packed keys (which you mentioned are common)
> - Some specific set of fields have zero width so don't need to be extracted
>   (which it sounds like is common, or is it different fields each time?)
> - All fields having specific lengths (are there any particularly common cases?)
> 
> Have you considered any of these ideas?

I like that idea.

Gonna hack some code... :)

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-13  1:57             ` Eric Biggers
  2023-05-13 19:28               ` Kent Overstreet
@ 2023-05-14  5:45               ` Kent Overstreet
  2023-05-14 18:43                 ` Eric Biggers
  2023-05-15 10:29                 ` David Laight
  1 sibling, 2 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-14  5:45 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Fri, May 12, 2023 at 06:57:52PM -0700, Eric Biggers wrote:
> First, I wanted to mention that decoding of variable-length fields has been
> extensively studied for decompression algorithms, e.g. for Huffman decoding.
> And it turns out that it can be done branchlessly.  The basic idea is that you
> have a branchless refill step that looks like the following:
> 
> #define REFILL_BITS_BRANCHLESS()                    \
>         bitbuf |= get_unaligned_u64(p) << bitsleft; \
>         p += 7 - ((bitsleft >> 3) & 0x7);           \
>         bitsleft |= 56;
> 
> That branchlessly ensures that 'bitbuf' contains '56 <= bitsleft <= 63' bits.
> Then, the needed number of bits can be removed and returned:
> 
> #define READ_BITS(n)                          \
>         REFILL_BITS_BRANCHLESS();             \
>         tmp = bitbuf & (((u64)1 << (n)) - 1); \
>         bitbuf >>= (n);                       \
>         bitsleft -= (n);                      \
>         tmp
> 
> If you're interested, I can give you some references about the above method.

I might be interested in those references, new bit tricks and integer
encodings are always fun :)

> But, I really just wanted to mention it for completeness, since I think you'd
> actually want to go in a slightly different direction, since (a) you have all
> the field widths available from the beginning, as opposed to being interleaved
> into the bitstream itself (as is the case in Huffman decoding for example), so
> you're not limited to serialized decoding of each field, (b) your fields are up
> to 96 bits, and (c) you've selected a bitstream convention that seems to make it
> such that your stream *must* be read in aligned units of u64, so I don't think
> something like REFILL_BITS_BRANCHLESS() could work for you anyway.
> 
> What I would suggest instead is preprocessing the list of 6 field lengths to
> create some information that can be used to extract all 6 fields branchlessly
> with no dependencies between different fields.  (And you clearly *can* add a
> preprocessing step, as you already have one -- the dynamic code generator.)
> 
> So, something like the following:
> 
>     const struct field_info *info = &format->fields[0];
> 
>     field0 = (in->u64s[info->word_idx] >> info->shift1) & info->mask;
>     field0 |= in->u64s[info->word_idx - 1] >> info->shift2;
> 
> ... but with the code for all 6 fields interleaved.
> 
> On modern CPUs, I think that would be faster than your current C code.
> 
> You could do better by creating variants that are specialized for specific
> common sets of parameters.  During "preprocessing", you would select a variant
> and set an enum accordingly.  During decoding, you would switch on that enum and
> call the appropriate variant.  (This could also be done with a function pointer,
> of course, but indirect calls are slow these days...)

testing random btree updates:

dynamically generated unpack:
rand_insert: 20.0 MiB with 1 threads in    33 sec,  1609 nsec per iter, 607 KiB per sec

old C unpack:
rand_insert: 20.0 MiB with 1 threads in    35 sec,  1672 nsec per iter, 584 KiB per sec

the Eric Biggers special:
rand_insert: 20.0 MiB with 1 threads in    35 sec,  1676 nsec per iter, 583 KiB per sec

Tested two versions of your approach: one without a shift value, one
where we use a shift value to try to avoid unaligned access - the second
was perhaps 1% faster.

So it's not looking good. This benchmark doesn't even hit
unpack_key() quite as much as I thought, so the difference is
significant.

diff --git a/fs/bcachefs/bkey.c b/fs/bcachefs/bkey.c
index 6d3a1c096f..128d96766c 100644
--- a/fs/bcachefs/bkey.c
+++ b/fs/bcachefs/bkey.c
@@ -7,6 +7,8 @@
 #include "bset.h"
 #include "util.h"
 
+#include <asm/unaligned.h>
+
 #undef EBUG_ON
 
 #ifdef DEBUG_BKEYS
@@ -19,9 +21,35 @@ const struct bkey_format bch2_bkey_format_current = BKEY_FORMAT_CURRENT;
 
 struct bkey_format_processed bch2_bkey_format_postprocess(const struct bkey_format f)
 {
-	return (struct bkey_format_processed) {
-		.f = f,
-	};
+	struct bkey_format_processed ret = { .f = f, .aligned = true };
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	unsigned offset = f.key_u64s * 64;
+#else
+	unsigned offset = KEY_PACKED_BITS_START;
+#endif
+
+	for (unsigned i = 0; i < BKEY_NR_FIELDS; i++) {
+		unsigned bits = f.bits_per_field[i];
+
+		if (bits & 7) {
+			ret.aligned = false;
+			break;
+		}
+
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+		offset -= bits;
+#endif
+
+		ret.shift[i]	= min(offset & 63, 64 - bits);
+		ret.offset[i]	= (offset - ret.shift[i]) / 8;
+		ret.mask[i]	= bits ? ~0ULL >> (64 - bits) : 0;
+
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+		offset += bits;
+#endif
+	}
+
+	return ret;
 }
 
 void bch2_bkey_packed_to_binary_text(struct printbuf *out,
@@ -191,6 +219,19 @@ static u64 get_inc_field(struct unpack_state *state, unsigned field)
 	return v + offset;
 }
 
+__always_inline
+static u64 get_aligned_field(const struct bkey_format_processed *f,
+			     const struct bkey_packed *in,
+			     unsigned field_idx)
+{
+	u64 v = get_unaligned((u64 *) (((u8 *) in->_data) + f->offset[field_idx]));
+
+	v >>= f->shift[field_idx];
+	v &= f->mask[field_idx];
+
+	return v + le64_to_cpu(f->f.field_offset[field_idx]);
+}
+
 __always_inline
 static bool set_inc_field(struct pack_state *state, unsigned field, u64 v)
 {
@@ -269,45 +310,57 @@ bool bch2_bkey_transform(const struct bkey_format *out_f,
 	return true;
 }
 
-struct bkey __bch2_bkey_unpack_key(const struct bkey_format_processed *format_p,
+struct bkey __bch2_bkey_unpack_key(const struct bkey_format_processed *format,
 				   const struct bkey_packed *in)
 {
-	const struct bkey_format *format = &format_p->f;
-	struct unpack_state state = unpack_state_init(format, in);
 	struct bkey out;
 
-	EBUG_ON(format->nr_fields != BKEY_NR_FIELDS);
-	EBUG_ON(in->u64s < format->key_u64s);
+	EBUG_ON(format->f.nr_fields != BKEY_NR_FIELDS);
+	EBUG_ON(in->u64s < format->f.key_u64s);
 	EBUG_ON(in->format != KEY_FORMAT_LOCAL_BTREE);
-	EBUG_ON(in->u64s - format->key_u64s + BKEY_U64s > U8_MAX);
+	EBUG_ON(in->u64s - format->f.key_u64s + BKEY_U64s > U8_MAX);
 
-	out.u64s	= BKEY_U64s + in->u64s - format->key_u64s;
+	out.u64s	= BKEY_U64s + in->u64s - format->f.key_u64s;
 	out.format	= KEY_FORMAT_CURRENT;
 	out.needs_whiteout = in->needs_whiteout;
 	out.type	= in->type;
 	out.pad[0]	= 0;
 
+	if (likely(format->aligned)) {
+#define x(id, field)	out.field = get_aligned_field(format, in, id);
+		bkey_fields()
+#undef x
+	} else {
+		struct unpack_state state = unpack_state_init(&format->f, in);
+
 #define x(id, field)	out.field = get_inc_field(&state, id);
-	bkey_fields()
+		bkey_fields()
 #undef x
+	}
 
 	return out;
 }
 
-struct bpos __bkey_unpack_pos(const struct bkey_format_processed *format_p,
+struct bpos __bkey_unpack_pos(const struct bkey_format_processed *format,
 			      const struct bkey_packed *in)
 {
-	const struct bkey_format *format = &format_p->f;
-	struct unpack_state state = unpack_state_init(format, in);
 	struct bpos out;
 
-	EBUG_ON(format->nr_fields != BKEY_NR_FIELDS);
-	EBUG_ON(in->u64s < format->key_u64s);
+	EBUG_ON(format->f.nr_fields != BKEY_NR_FIELDS);
+	EBUG_ON(in->u64s < format->f.key_u64s);
 	EBUG_ON(in->format != KEY_FORMAT_LOCAL_BTREE);
 
-	out.inode	= get_inc_field(&state, BKEY_FIELD_INODE);
-	out.offset	= get_inc_field(&state, BKEY_FIELD_OFFSET);
-	out.snapshot	= get_inc_field(&state, BKEY_FIELD_SNAPSHOT);
+	if (likely(format->aligned)) {
+		out.inode	= get_aligned_field(format, in, BKEY_FIELD_INODE);
+		out.offset	= get_aligned_field(format, in, BKEY_FIELD_OFFSET);
+		out.snapshot	= get_aligned_field(format, in, BKEY_FIELD_SNAPSHOT);
+	} else {
+		struct unpack_state state = unpack_state_init(&format->f, in);
+
+		out.inode	= get_inc_field(&state, BKEY_FIELD_INODE);
+		out.offset	= get_inc_field(&state, BKEY_FIELD_OFFSET);
+		out.snapshot	= get_inc_field(&state, BKEY_FIELD_SNAPSHOT);
+	}
 
 	return out;
 }
diff --git a/fs/bcachefs/btree_types.h b/fs/bcachefs/btree_types.h
index 58ce60c37e..38c3ec6852 100644
--- a/fs/bcachefs/btree_types.h
+++ b/fs/bcachefs/btree_types.h
@@ -70,6 +70,10 @@ struct btree_bkey_cached_common {
 
 struct bkey_format_processed {
 	struct bkey_format	f;
+	bool			aligned;
+	u8			offset[6];
+	u8			shift[6];
+	u64			mask[6];
 };
 
 struct btree {
diff --git a/fs/bcachefs/btree_update_interior.h b/fs/bcachefs/btree_update_interior.h
index dcfd7ceacc..72aedc1e34 100644
--- a/fs/bcachefs/btree_update_interior.h
+++ b/fs/bcachefs/btree_update_interior.h
@@ -181,7 +181,11 @@ static inline void btree_node_reset_sib_u64s(struct btree *b)
 
 static inline void *btree_data_end(struct bch_fs *c, struct btree *b)
 {
-	return (void *) b->data + btree_bytes(c);
+	/*
+	 * __bch2_bkey_unpack_key() may read up to 8 bytes past the end of the
+	 * input buffer:
+	 */
+	return (void *) b->data + btree_bytes(c) - 8;
 }
 
 static inline struct bkey_packed *unwritten_whiteouts_start(struct bch_fs *c,

^ permalink raw reply related	[flat|nested] 186+ messages in thread

* Re: [PATCH 04/32] locking: SIX locks (shared/intent/exclusive)
  2023-05-09 16:56 ` [PATCH 04/32] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
  2023-05-11 12:14   ` Jan Engelhardt
@ 2023-05-14 12:15   ` Jeff Layton
  2023-05-15  2:39     ` Kent Overstreet
  1 sibling, 1 reply; 186+ messages in thread
From: Jeff Layton @ 2023-05-14 12:15 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Waiman Long, Boqun Feng

On Tue, 2023-05-09 at 12:56 -0400, Kent Overstreet wrote:
> From: Kent Overstreet <kent.overstreet@gmail.com>
> 
> New lock for bcachefs, like read/write locks but with a third state,
> intent.
> 
> Intent locks conflict with each other, but not with read locks; taking a
> write lock requires first holding an intent lock.
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> ---
>  include/linux/six.h     | 210 +++++++++++
>  kernel/Kconfig.locks    |   3 +
>  kernel/locking/Makefile |   1 +
>  kernel/locking/six.c    | 779 ++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 993 insertions(+)
>  create mode 100644 include/linux/six.h
>  create mode 100644 kernel/locking/six.c
> 
> diff --git a/include/linux/six.h b/include/linux/six.h
> new file mode 100644
> index 0000000000..41ddf63b74
> --- /dev/null
> +++ b/include/linux/six.h
> @@ -0,0 +1,210 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _LINUX_SIX_H
> +#define _LINUX_SIX_H
> +
> +/*
> + * Shared/intent/exclusive locks: sleepable read/write locks, much like rw
> + * semaphores, except with a third intermediate state, intent. Basic operations
> + * are:
> + *
> + * six_lock_read(&foo->lock);
> + * six_unlock_read(&foo->lock);
> + *
> + * six_lock_intent(&foo->lock);
> + * six_unlock_intent(&foo->lock);
> + *
> + * six_lock_write(&foo->lock);
> + * six_unlock_write(&foo->lock);
> + *
> + * Intent locks block other intent locks, but do not block read locks, and you
> + * must have an intent lock held before taking a write lock, like so:
> + *
> + * six_lock_intent(&foo->lock);
> + * six_lock_write(&foo->lock);
> + * six_unlock_write(&foo->lock);
> + * six_unlock_intent(&foo->lock);
> + *

So the idea is to create a fundamentally unfair rwsem? One that always
prefers readers over writers?

> + * Other operations:
> + *
> + *   six_trylock_read()
> + *   six_trylock_intent()
> + *   six_trylock_write()
> + *
> + *   six_lock_downgrade():	convert from intent to read
> + *   six_lock_tryupgrade():	attempt to convert from read to intent
> + *
> + * Locks also embed a sequence number, which is incremented when the lock is
> + * locked or unlocked for write. The current sequence number can be grabbed
> + * while a lock is held from lock->state.seq; then, if you drop the lock you can
> + * use six_relock_(read|intent_write)(lock, seq) to attempt to retake the lock
> + * iff it hasn't been locked for write in the meantime.
> + *

^^^
This is a cool idea.

> + * There are also operations that take the lock type as a parameter, where the
> + * type is one of SIX_LOCK_read, SIX_LOCK_intent, or SIX_LOCK_write:
> + *
> + *   six_lock_type(lock, type)
> + *   six_unlock_type(lock, type)
> + *   six_relock(lock, type, seq)
> + *   six_trylock_type(lock, type)
> + *   six_trylock_convert(lock, from, to)
> + *
> + * A lock may be held multiple types by the same thread (for read or intent,
> + * not write). However, the six locks code does _not_ implement the actual
> + * recursive checks itself though - rather, if your code (e.g. btree iterator
> + * code) knows that the current thread already has a lock held, and for the
> + * correct type, six_lock_increment() may be used to bump up the counter for
> + * that type - the only effect is that one more call to unlock will be required
> + * before the lock is unlocked.

These semantics are a bit confusing. Once you hold a read or intent lock,
you can take it as many times as you like. What happens if I take it in
one context and release it in another? Say, across a workqueue job for
instance?

Are intent locks "converted" to write locks, or do they stack? For
instance, suppose I take the intent lock 3 times and then take a write
lock. How many times do I have to call unlock to fully release it (3 or
4)? If I release it just once, do I still hold the write lock or am I
back to "intent" state?


> + */
> +
> +#include <linux/lockdep.h>
> +#include <linux/osq_lock.h>
> +#include <linux/sched.h>
> +#include <linux/types.h>
> +
> +#define SIX_LOCK_SEPARATE_LOCKFNS
> 
> 

Some basic info about the underlying design would be nice here. What
info is tracked in the union below? When are different members being
used? How does the code decide which way to cast this thing? etc.


> +
> +union six_lock_state {
> +	struct {
> +		atomic64_t	counter;
> +	};
> +
> +	struct {
> +		u64		v;
> +	};
> +
> +	struct {
> +		/* for waitlist_bitnr() */
> +		unsigned long	l;
> +	};
> +
> +	struct {
> +		unsigned	read_lock:27;
> +		unsigned	write_locking:1;
> +		unsigned	intent_lock:1;
> +		unsigned	waiters:3;


Ewww...bitfields. That seems a bit scary in a union. There is no
guarantee that the underlying arch will even pack that into a single
word, AIUI. It may be safer to do this with masking and shifting
instead.

> +		/*
> +		 * seq works much like in seqlocks: it's incremented every time
> +		 * we lock and unlock for write.
> +		 *
> +		 * If it's odd write lock is held, even unlocked.
> +		 *
> +		 * Thus readers can unlock, and then lock again later iff it
> +		 * hasn't been modified in the meantime.
> +		 */
> +		u32		seq;
> +	};
> +};
> +
> +enum six_lock_type {
> +	SIX_LOCK_read,
> +	SIX_LOCK_intent,
> +	SIX_LOCK_write,
> +};
> +
> +struct six_lock {
> +	union six_lock_state	state;
> +	unsigned		intent_lock_recurse;
> +	struct task_struct	*owner;
> +	struct optimistic_spin_queue osq;
> +	unsigned __percpu	*readers;
> +
> +	raw_spinlock_t		wait_lock;
> +	struct list_head	wait_list[2];
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +	struct lockdep_map	dep_map;
> +#endif
> +};
> +
> +typedef int (*six_lock_should_sleep_fn)(struct six_lock *lock, void *);
> +
> +static __always_inline void __six_lock_init(struct six_lock *lock,
> +					    const char *name,
> +					    struct lock_class_key *key)
> +{
> +	atomic64_set(&lock->state.counter, 0);
> +	raw_spin_lock_init(&lock->wait_lock);
> +	INIT_LIST_HEAD(&lock->wait_list[SIX_LOCK_read]);
> +	INIT_LIST_HEAD(&lock->wait_list[SIX_LOCK_intent]);
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +	debug_check_no_locks_freed((void *) lock, sizeof(*lock));
> +	lockdep_init_map(&lock->dep_map, name, key, 0);
> +#endif
> +}
> +
> +#define six_lock_init(lock)						\
> +do {									\
> +	static struct lock_class_key __key;				\
> +									\
> +	__six_lock_init((lock), #lock, &__key);				\
> +} while (0)
> +
> +#define __SIX_VAL(field, _v)	(((union six_lock_state) { .field = _v }).v)
> +
> +#define __SIX_LOCK(type)						\
> +bool six_trylock_##type(struct six_lock *);				\
> +bool six_relock_##type(struct six_lock *, u32);				\
> +int six_lock_##type(struct six_lock *, six_lock_should_sleep_fn, void *);\
> +void six_unlock_##type(struct six_lock *);
> +
> +__SIX_LOCK(read)
> +__SIX_LOCK(intent)
> +__SIX_LOCK(write)
> +#undef __SIX_LOCK
> +
> +#define SIX_LOCK_DISPATCH(type, fn, ...)			\
> +	switch (type) {						\
> +	case SIX_LOCK_read:					\
> +		return fn##_read(__VA_ARGS__);			\
> +	case SIX_LOCK_intent:					\
> +		return fn##_intent(__VA_ARGS__);		\
> +	case SIX_LOCK_write:					\
> +		return fn##_write(__VA_ARGS__);			\
> +	default:						\
> +		BUG();						\
> +	}
> +
> +static inline bool six_trylock_type(struct six_lock *lock, enum six_lock_type type)
> +{
> +	SIX_LOCK_DISPATCH(type, six_trylock, lock);
> +}
> +
> +static inline bool six_relock_type(struct six_lock *lock, enum six_lock_type type,
> +				   unsigned seq)
> +{
> +	SIX_LOCK_DISPATCH(type, six_relock, lock, seq);
> +}
> +
> +static inline int six_lock_type(struct six_lock *lock, enum six_lock_type type,
> +				six_lock_should_sleep_fn should_sleep_fn, void *p)
> +{
> +	SIX_LOCK_DISPATCH(type, six_lock, lock, should_sleep_fn, p);
> +}
> +
> +static inline void six_unlock_type(struct six_lock *lock, enum six_lock_type type)
> +{
> +	SIX_LOCK_DISPATCH(type, six_unlock, lock);
> +}
> +
> +void six_lock_downgrade(struct six_lock *);
> +bool six_lock_tryupgrade(struct six_lock *);
> +bool six_trylock_convert(struct six_lock *, enum six_lock_type,
> +			 enum six_lock_type);
> +
> +void six_lock_increment(struct six_lock *, enum six_lock_type);
> +
> +void six_lock_wakeup_all(struct six_lock *);
> +
> +void six_lock_pcpu_free_rcu(struct six_lock *);
> +void six_lock_pcpu_free(struct six_lock *);
> +void six_lock_pcpu_alloc(struct six_lock *);
> +
> +struct six_lock_count {
> +	unsigned read;
> +	unsigned intent;
> +};
> +
> +struct six_lock_count six_lock_counts(struct six_lock *);
> +
> +#endif /* _LINUX_SIX_H */
> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
> index 4198f0273e..b2abd9a5d9 100644
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks
> @@ -259,3 +259,6 @@ config ARCH_HAS_MMIOWB
>  config MMIOWB
>  	def_bool y if ARCH_HAS_MMIOWB
>  	depends on SMP
> +
> +config SIXLOCKS
> +	bool
> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
> index 0db4093d17..a095dbbf01 100644
> --- a/kernel/locking/Makefile
> +++ b/kernel/locking/Makefile
> @@ -32,3 +32,4 @@ obj-$(CONFIG_QUEUED_RWLOCKS) += qrwlock.o
>  obj-$(CONFIG_LOCK_TORTURE_TEST) += locktorture.o
>  obj-$(CONFIG_WW_MUTEX_SELFTEST) += test-ww_mutex.o
>  obj-$(CONFIG_LOCK_EVENT_COUNTS) += lock_events.o
> +obj-$(CONFIG_SIXLOCKS) += six.o
> diff --git a/kernel/locking/six.c b/kernel/locking/six.c
> new file mode 100644
> index 0000000000..5b2d92c6e9
> --- /dev/null
> +++ b/kernel/locking/six.c
> @@ -0,0 +1,779 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/export.h>
> +#include <linux/log2.h>
> +#include <linux/percpu.h>
> +#include <linux/preempt.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>
> +#include <linux/sched/rt.h>
> +#include <linux/six.h>
> +#include <linux/slab.h>
> +
> +#ifdef DEBUG
> +#define EBUG_ON(cond)		BUG_ON(cond)
> +#else
> +#define EBUG_ON(cond)		do {} while (0)
> +#endif
> +
> +#define six_acquire(l, t)	lock_acquire(l, 0, t, 0, 0, NULL, _RET_IP_)
> +#define six_release(l)		lock_release(l, _RET_IP_)
> +
> +struct six_lock_vals {
> +	/* Value we add to the lock in order to take the lock: */
> +	u64			lock_val;
> +
> +	/* If the lock has this value (used as a mask), taking the lock fails: */
> +	u64			lock_fail;
> +
> +	/* Value we add to the lock in order to release the lock: */
> +	u64			unlock_val;
> +
> +	/* Mask that indicates lock is held for this type: */
> +	u64			held_mask;
> +
> +	/* Waitlist we wakeup when releasing the lock: */
> +	enum six_lock_type	unlock_wakeup;
> +};
> +
> +#define __SIX_LOCK_HELD_read	__SIX_VAL(read_lock, ~0)
> +#define __SIX_LOCK_HELD_intent	__SIX_VAL(intent_lock, ~0)
> +#define __SIX_LOCK_HELD_write	__SIX_VAL(seq, 1)
> +
> +#define LOCK_VALS {							\
> +	[SIX_LOCK_read] = {						\
> +		.lock_val	= __SIX_VAL(read_lock, 1),		\
> +		.lock_fail	= __SIX_LOCK_HELD_write + __SIX_VAL(write_locking, 1),\
> +		.unlock_val	= -__SIX_VAL(read_lock, 1),		\
> +		.held_mask	= __SIX_LOCK_HELD_read,			\
> +		.unlock_wakeup	= SIX_LOCK_write,			\
> +	},								\
> +	[SIX_LOCK_intent] = {						\
> +		.lock_val	= __SIX_VAL(intent_lock, 1),		\
> +		.lock_fail	= __SIX_LOCK_HELD_intent,		\
> +		.unlock_val	= -__SIX_VAL(intent_lock, 1),		\
> +		.held_mask	= __SIX_LOCK_HELD_intent,		\
> +		.unlock_wakeup	= SIX_LOCK_intent,			\
> +	},								\
> +	[SIX_LOCK_write] = {						\
> +		.lock_val	= __SIX_VAL(seq, 1),			\
> +		.lock_fail	= __SIX_LOCK_HELD_read,			\
> +		.unlock_val	= __SIX_VAL(seq, 1),			\
> +		.held_mask	= __SIX_LOCK_HELD_write,		\
> +		.unlock_wakeup	= SIX_LOCK_read,			\
> +	},								\
> +}
> +
> +static inline void six_set_owner(struct six_lock *lock, enum six_lock_type type,
> +				 union six_lock_state old)
> +{
> +	if (type != SIX_LOCK_intent)
> +		return;
> +
> +	if (!old.intent_lock) {
> +		EBUG_ON(lock->owner);
> +		lock->owner = current;
> +	} else {
> +		EBUG_ON(lock->owner != current);
> +	}
> +}
> +
> +static inline unsigned pcpu_read_count(struct six_lock *lock)
> +{
> +	unsigned read_count = 0;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		read_count += *per_cpu_ptr(lock->readers, cpu);
> +	return read_count;
> +}
> +
> +struct six_lock_waiter {
> +	struct list_head	list;
> +	struct task_struct	*task;
> +};
> +
> +/* This is probably up there with the more evil things I've done */
> +#define waitlist_bitnr(id) ilog2((((union six_lock_state) { .waiters = 1 << (id) }).l))
> +
> +static inline void six_lock_wakeup(struct six_lock *lock,
> +				   union six_lock_state state,
> +				   unsigned waitlist_id)
> +{
> +	if (waitlist_id == SIX_LOCK_write) {
> +		if (state.write_locking && !state.read_lock) {
> +			struct task_struct *p = READ_ONCE(lock->owner);
> +			if (p)
> +				wake_up_process(p);
> +		}
> +	} else {
> +		struct list_head *wait_list = &lock->wait_list[waitlist_id];
> +		struct six_lock_waiter *w, *next;
> +
> +		if (!(state.waiters & (1 << waitlist_id)))
> +			return;
> +
> +		clear_bit(waitlist_bitnr(waitlist_id),
> +			  (unsigned long *) &lock->state.v);
> +
> +		raw_spin_lock(&lock->wait_lock);
> +
> +		list_for_each_entry_safe(w, next, wait_list, list) {
> +			list_del_init(&w->list);
> +
> +			if (wake_up_process(w->task) &&
> +			    waitlist_id != SIX_LOCK_read) {
> +				if (!list_empty(wait_list))
> +					set_bit(waitlist_bitnr(waitlist_id),
> +						(unsigned long *) &lock->state.v);
> +				break;
> +			}
> +		}
> +
> +		raw_spin_unlock(&lock->wait_lock);
> +	}
> +}
> +
> +static __always_inline bool do_six_trylock_type(struct six_lock *lock,
> +						enum six_lock_type type,
> +						bool try)
> +{
> +	const struct six_lock_vals l[] = LOCK_VALS;
> +	union six_lock_state old, new;
> +	bool ret;
> +	u64 v;
> +
> +	EBUG_ON(type == SIX_LOCK_write && lock->owner != current);
> +	EBUG_ON(type == SIX_LOCK_write && (lock->state.seq & 1));
> +
> +	EBUG_ON(type == SIX_LOCK_write && (try != !(lock->state.write_locking)));
> +
> +	/*
> +	 * Percpu reader mode:
> +	 *
> +	 * The basic idea behind this algorithm is that you can implement a lock
> +	 * between two threads without any atomics, just memory barriers:
> +	 *
> +	 * For two threads you'll need two variables, one variable for "thread a
> +	 * has the lock" and another for "thread b has the lock".
> +	 *
> +	 * To take the lock, a thread sets its variable indicating that it holds
> +	 * the lock, then issues a full memory barrier, then reads from the
> +	 * other thread's variable to check if the other thread thinks it has
> +	 * the lock. If we raced, we backoff and retry/sleep.
> +	 */
> +
> +	if (type == SIX_LOCK_read && lock->readers) {
> +retry:
> +		preempt_disable();
> +		this_cpu_inc(*lock->readers); /* signal that we own lock */
> +
> +		smp_mb();
> +
> +		old.v = READ_ONCE(lock->state.v);
> +		ret = !(old.v & l[type].lock_fail);
> +
> +		this_cpu_sub(*lock->readers, !ret);
> +		preempt_enable();
> +
> +		/*
> +		 * If we failed because a writer was trying to take the
> +		 * lock, issue a wakeup because we might have caused a
> +		 * spurious trylock failure:
> +		 */
> +		if (old.write_locking) {
> +			struct task_struct *p = READ_ONCE(lock->owner);
> +
> +			if (p)
> +				wake_up_process(p);
> +		}
> +
> +		/*
> +		 * If we failed from the lock path and the waiting bit wasn't
> +		 * set, set it:
> +		 */
> +		if (!try && !ret) {
> +			v = old.v;
> +
> +			do {
> +				new.v = old.v = v;
> +
> +				if (!(old.v & l[type].lock_fail))
> +					goto retry;
> +
> +				if (new.waiters & (1 << type))
> +					break;
> +
> +				new.waiters |= 1 << type;
> +			} while ((v = atomic64_cmpxchg(&lock->state.counter,
> +						       old.v, new.v)) != old.v);
> +		}
> +	} else if (type == SIX_LOCK_write && lock->readers) {
> +		if (try) {
> +			atomic64_add(__SIX_VAL(write_locking, 1),
> +				     &lock->state.counter);
> +			smp_mb__after_atomic();
> +		}
> +
> +		ret = !pcpu_read_count(lock);
> +
> +		/*
> +		 * On success, we increment lock->seq; also we clear
> +		 * write_locking unless we failed from the lock path:
> +		 */
> +		v = 0;
> +		if (ret)
> +			v += __SIX_VAL(seq, 1);
> +		if (ret || try)
> +			v -= __SIX_VAL(write_locking, 1);
> +
> +		if (try && !ret) {
> +			old.v = atomic64_add_return(v, &lock->state.counter);
> +			six_lock_wakeup(lock, old, SIX_LOCK_read);
> +		} else {
> +			atomic64_add(v, &lock->state.counter);
> +		}
> +	} else {
> +		v = READ_ONCE(lock->state.v);
> +		do {
> +			new.v = old.v = v;
> +
> +			if (!(old.v & l[type].lock_fail)) {
> +				new.v += l[type].lock_val;
> +
> +				if (type == SIX_LOCK_write)
> +					new.write_locking = 0;
> +			} else if (!try && type != SIX_LOCK_write &&
> +				   !(new.waiters & (1 << type)))
> +				new.waiters |= 1 << type;
> +			else
> +				break; /* waiting bit already set */
> +		} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
> +					old.v, new.v)) != old.v);
> +
> +		ret = !(old.v & l[type].lock_fail);
> +
> +		EBUG_ON(ret && !(lock->state.v & l[type].held_mask));
> +	}
> +
> +	if (ret)
> +		six_set_owner(lock, type, old);
> +
> +	EBUG_ON(type == SIX_LOCK_write && (try || ret) && (lock->state.write_locking));
> +
> +	return ret;
> +}
> +

^^^
I'd really like to see some more comments in the code above. It's pretty
complex.

> +__always_inline __flatten
> +static bool __six_trylock_type(struct six_lock *lock, enum six_lock_type type)
> +{
> +	if (!do_six_trylock_type(lock, type, true))
> +		return false;
> +
> +	if (type != SIX_LOCK_write)
> +		six_acquire(&lock->dep_map, 1);
> +	return true;
> +}
> +
> +__always_inline __flatten
> +static bool __six_relock_type(struct six_lock *lock, enum six_lock_type type,
> +			      unsigned seq)
> +{
> +	const struct six_lock_vals l[] = LOCK_VALS;
> +	union six_lock_state old;
> +	u64 v;
> +
> +	EBUG_ON(type == SIX_LOCK_write);
> +
> +	if (type == SIX_LOCK_read &&
> +	    lock->readers) {
> +		bool ret;
> +
> +		preempt_disable();
> +		this_cpu_inc(*lock->readers);
> +
> +		smp_mb();
> +
> +		old.v = READ_ONCE(lock->state.v);
> +		ret = !(old.v & l[type].lock_fail) && old.seq == seq;
> +
> +		this_cpu_sub(*lock->readers, !ret);
> +		preempt_enable();
> +
> +		/*
> +		 * Similar to the lock path, we may have caused a spurious write
> +		 * lock fail and need to issue a wakeup:
> +		 */
> +		if (old.write_locking) {
> +			struct task_struct *p = READ_ONCE(lock->owner);
> +
> +			if (p)
> +				wake_up_process(p);
> +		}
> +
> +		if (ret)
> +			six_acquire(&lock->dep_map, 1);
> +
> +		return ret;
> +	}
> +
> +	v = READ_ONCE(lock->state.v);
> +	do {
> +		old.v = v;
> +
> +		if (old.seq != seq || old.v & l[type].lock_fail)
> +			return false;
> +	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
> +				old.v,
> +				old.v + l[type].lock_val)) != old.v);
> +
> +	six_set_owner(lock, type, old);
> +	if (type != SIX_LOCK_write)
> +		six_acquire(&lock->dep_map, 1);
> +	return true;
> +}
> +
> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER
> +
> +static inline int six_can_spin_on_owner(struct six_lock *lock)
> +{
> +	struct task_struct *owner;
> +	int retval = 1;
> +
> +	if (need_resched())
> +		return 0;
> +
> +	rcu_read_lock();
> +	owner = READ_ONCE(lock->owner);
> +	if (owner)
> +		retval = owner->on_cpu;
> +	rcu_read_unlock();
> +	/*
> +	 * if lock->owner is not set, the mutex owner may have just acquired
> +	 * it and not set the owner yet or the mutex has been released.
> +	 */
> +	return retval;
> +}
> +
> +static inline bool six_spin_on_owner(struct six_lock *lock,
> +				     struct task_struct *owner)
> +{
> +	bool ret = true;
> +
> +	rcu_read_lock();
> +	while (lock->owner == owner) {
> +		/*
> +		 * Ensure we emit the owner->on_cpu, dereference _after_
> +		 * checking lock->owner still matches owner. If that fails,
> +		 * owner might point to freed memory. If it still matches,
> +		 * the rcu_read_lock() ensures the memory stays valid.
> +		 */
> +		barrier();
> +
> +		if (!owner->on_cpu || need_resched()) {
> +			ret = false;
> +			break;
> +		}
> +
> +		cpu_relax();
> +	}
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
> +{
> +	struct task_struct *task = current;
> +
> +	if (type == SIX_LOCK_write)
> +		return false;
> +
> +	preempt_disable();
> +	if (!six_can_spin_on_owner(lock))
> +		goto fail;
> +
> +	if (!osq_lock(&lock->osq))
> +		goto fail;
> +
> +	while (1) {
> +		struct task_struct *owner;
> +
> +		/*
> +		 * If there's an owner, wait for it to either
> +		 * release the lock or go to sleep.
> +		 */
> +		owner = READ_ONCE(lock->owner);
> +		if (owner && !six_spin_on_owner(lock, owner))
> +			break;
> +
> +		if (do_six_trylock_type(lock, type, false)) {
> +			osq_unlock(&lock->osq);
> +			preempt_enable();
> +			return true;
> +		}
> +
> +		/*
> +		 * When there's no owner, we might have preempted between the
> +		 * owner acquiring the lock and setting the owner field. If
> +		 * we're an RT task that will live-lock because we won't let
> +		 * the owner complete.
> +		 */
> +		if (!owner && (need_resched() || rt_task(task)))
> +			break;
> +
> +		/*
> +		 * The cpu_relax() call is a compiler barrier which forces
> +		 * everything in this loop to be re-loaded. We don't need
> +		 * memory barriers as we'll eventually observe the right
> +		 * values at the cost of a few extra spins.
> +		 */
> +		cpu_relax();
> +	}
> +
> +	osq_unlock(&lock->osq);
> +fail:
> +	preempt_enable();
> +
> +	/*
> +	 * If we fell out of the spin path because of need_resched(),
> +	 * reschedule now, before we try-lock again. This avoids getting
> +	 * scheduled out right after we obtained the lock.
> +	 */
> +	if (need_resched())
> +		schedule();
> +
> +	return false;
> +}
> +
> +#else /* CONFIG_LOCK_SPIN_ON_OWNER */
> +
> +static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
> +{
> +	return false;
> +}
> +
> +#endif
> +
> +noinline
> +static int __six_lock_type_slowpath(struct six_lock *lock, enum six_lock_type type,
> +				    six_lock_should_sleep_fn should_sleep_fn, void *p)
> +{
> +	union six_lock_state old;
> +	struct six_lock_waiter wait;
> +	int ret = 0;
> +
> +	if (type == SIX_LOCK_write) {
> +		EBUG_ON(lock->state.write_locking);
> +		atomic64_add(__SIX_VAL(write_locking, 1), &lock->state.counter);
> +		smp_mb__after_atomic();
> +	}
> +
> +	ret = should_sleep_fn ? should_sleep_fn(lock, p) : 0;
> +	if (ret)
> +		goto out_before_sleep;
> +
> +	if (six_optimistic_spin(lock, type))
> +		goto out_before_sleep;
> +
> +	lock_contended(&lock->dep_map, _RET_IP_);
> +
> +	INIT_LIST_HEAD(&wait.list);
> +	wait.task = current;
> +
> +	while (1) {
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +		if (type == SIX_LOCK_write)
> +			EBUG_ON(lock->owner != current);
> +		else if (list_empty_careful(&wait.list)) {
> +			raw_spin_lock(&lock->wait_lock);
> +			list_add_tail(&wait.list, &lock->wait_list[type]);
> +			raw_spin_unlock(&lock->wait_lock);
> +		}
> +
> +		if (do_six_trylock_type(lock, type, false))
> +			break;
> +
> +		ret = should_sleep_fn ? should_sleep_fn(lock, p) : 0;
> +		if (ret)
> +			break;
> +
> +		schedule();
> +	}
> +
> +	__set_current_state(TASK_RUNNING);
> +
> +	if (!list_empty_careful(&wait.list)) {
> +		raw_spin_lock(&lock->wait_lock);
> +		list_del_init(&wait.list);
> +		raw_spin_unlock(&lock->wait_lock);
> +	}
> +out_before_sleep:
> +	if (ret && type == SIX_LOCK_write) {
> +		old.v = atomic64_sub_return(__SIX_VAL(write_locking, 1),
> +					    &lock->state.counter);
> +		six_lock_wakeup(lock, old, SIX_LOCK_read);
> +	}
> +
> +	return ret;
> +}
> +
> +__always_inline
> +static int __six_lock_type(struct six_lock *lock, enum six_lock_type type,
> +			   six_lock_should_sleep_fn should_sleep_fn, void *p)
> +{
> +	int ret;
> +
> +	if (type != SIX_LOCK_write)
> +		six_acquire(&lock->dep_map, 0);
> +
> +	ret = do_six_trylock_type(lock, type, true) ? 0
> +		: __six_lock_type_slowpath(lock, type, should_sleep_fn, p);
> +
> +	if (ret && type != SIX_LOCK_write)
> +		six_release(&lock->dep_map);
> +	if (!ret)
> +		lock_acquired(&lock->dep_map, _RET_IP_);
> +
> +	return ret;
> +}
> +
> +__always_inline __flatten
> +static void __six_unlock_type(struct six_lock *lock, enum six_lock_type type)
> +{
> +	const struct six_lock_vals l[] = LOCK_VALS;
> +	union six_lock_state state;
> +
> +	EBUG_ON(type == SIX_LOCK_write &&
> +		!(lock->state.v & __SIX_LOCK_HELD_intent));
> +
> +	if (type != SIX_LOCK_write)
> +		six_release(&lock->dep_map);
> +
> +	if (type == SIX_LOCK_intent) {
> +		EBUG_ON(lock->owner != current);
> +
> +		if (lock->intent_lock_recurse) {
> +			--lock->intent_lock_recurse;
> +			return;
> +		}
> +
> +		lock->owner = NULL;
> +	}
> +
> +	if (type == SIX_LOCK_read &&
> +	    lock->readers) {
> +		smp_mb(); /* unlock barrier */
> +		this_cpu_dec(*lock->readers);
> +		smp_mb(); /* between unlocking and checking for waiters */
> +		state.v = READ_ONCE(lock->state.v);
> +	} else {
> +		EBUG_ON(!(lock->state.v & l[type].held_mask));
> +		state.v = atomic64_add_return_release(l[type].unlock_val,
> +						      &lock->state.counter);
> +	}
> +
> +	six_lock_wakeup(lock, state, l[type].unlock_wakeup);
> +}
> +
> +#define __SIX_LOCK(type)						\
> +bool six_trylock_##type(struct six_lock *lock)				\
> +{									\
> +	return __six_trylock_type(lock, SIX_LOCK_##type);		\
> +}									\
> +EXPORT_SYMBOL_GPL(six_trylock_##type);					\
> +									\
> +bool six_relock_##type(struct six_lock *lock, u32 seq)			\
> +{									\
> +	return __six_relock_type(lock, SIX_LOCK_##type, seq);		\
> +}									\
> +EXPORT_SYMBOL_GPL(six_relock_##type);					\
> +									\
> +int six_lock_##type(struct six_lock *lock,				\
> +		    six_lock_should_sleep_fn should_sleep_fn, void *p)	\
> +{									\
> +	return __six_lock_type(lock, SIX_LOCK_##type, should_sleep_fn, p);\
> +}									\
> +EXPORT_SYMBOL_GPL(six_lock_##type);					\
> +									\
> +void six_unlock_##type(struct six_lock *lock)				\
> +{									\
> +	__six_unlock_type(lock, SIX_LOCK_##type);			\
> +}									\
> +EXPORT_SYMBOL_GPL(six_unlock_##type);
> +
> +__SIX_LOCK(read)
> +__SIX_LOCK(intent)
> +__SIX_LOCK(write)
> +
> +#undef __SIX_LOCK
> +
> +/* Convert from intent to read: */
> +void six_lock_downgrade(struct six_lock *lock)
> +{
> +	six_lock_increment(lock, SIX_LOCK_read);
> +	six_unlock_intent(lock);
> +}
> +EXPORT_SYMBOL_GPL(six_lock_downgrade);
> +
> +bool six_lock_tryupgrade(struct six_lock *lock)
> +{
> +	union six_lock_state old, new;
> +	u64 v = READ_ONCE(lock->state.v);
> +
> +	do {
> +		new.v = old.v = v;
> +
> +		if (new.intent_lock)
> +			return false;
> +
> +		if (!lock->readers) {
> +			EBUG_ON(!new.read_lock);
> +			new.read_lock--;
> +		}
> +
> +		new.intent_lock = 1;
> +	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
> +				old.v, new.v)) != old.v);
> +
> +	if (lock->readers)
> +		this_cpu_dec(*lock->readers);
> +
> +	six_set_owner(lock, SIX_LOCK_intent, old);
> +
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(six_lock_tryupgrade);
> +
> +bool six_trylock_convert(struct six_lock *lock,
> +			 enum six_lock_type from,
> +			 enum six_lock_type to)
> +{
> +	EBUG_ON(to == SIX_LOCK_write || from == SIX_LOCK_write);
> +
> +	if (to == from)
> +		return true;
> +
> +	if (to == SIX_LOCK_read) {
> +		six_lock_downgrade(lock);
> +		return true;
> +	} else {
> +		return six_lock_tryupgrade(lock);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(six_trylock_convert);
> +
> +/*
> + * Increment read/intent lock count, assuming we already have it read or intent
> + * locked:
> + */
> +void six_lock_increment(struct six_lock *lock, enum six_lock_type type)
> +{
> +	const struct six_lock_vals l[] = LOCK_VALS;
> +
> +	six_acquire(&lock->dep_map, 0);
> +
> +	/* XXX: assert already locked, and that we don't overflow: */
> +
> +	switch (type) {
> +	case SIX_LOCK_read:
> +		if (lock->readers) {
> +			this_cpu_inc(*lock->readers);
> +		} else {
> +			EBUG_ON(!lock->state.read_lock &&
> +				!lock->state.intent_lock);
> +			atomic64_add(l[type].lock_val, &lock->state.counter);
> +		}
> +		break;
> +	case SIX_LOCK_intent:
> +		EBUG_ON(!lock->state.intent_lock);
> +		lock->intent_lock_recurse++;
> +		break;
> +	case SIX_LOCK_write:
> +		BUG();
> +		break;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(six_lock_increment);
> +
> +void six_lock_wakeup_all(struct six_lock *lock)
> +{
> +	struct six_lock_waiter *w;
> +
> +	raw_spin_lock(&lock->wait_lock);
> +
> +	list_for_each_entry(w, &lock->wait_list[0], list)
> +		wake_up_process(w->task);
> +	list_for_each_entry(w, &lock->wait_list[1], list)
> +		wake_up_process(w->task);
> +
> +	raw_spin_unlock(&lock->wait_lock);
> +}
> +EXPORT_SYMBOL_GPL(six_lock_wakeup_all);
> +
> +struct free_pcpu_rcu {
> +	struct rcu_head		rcu;
> +	void __percpu		*p;
> +};
> +
> +static void free_pcpu_rcu_fn(struct rcu_head *_rcu)
> +{
> +	struct free_pcpu_rcu *rcu =
> +		container_of(_rcu, struct free_pcpu_rcu, rcu);
> +
> +	free_percpu(rcu->p);
> +	kfree(rcu);
> +}
> +
> +void six_lock_pcpu_free_rcu(struct six_lock *lock)
> +{
> +	struct free_pcpu_rcu *rcu = kzalloc(sizeof(*rcu), GFP_KERNEL);
> +
> +	if (!rcu)
> +		return;
> +
> +	rcu->p = lock->readers;
> +	lock->readers = NULL;
> +
> +	call_rcu(&rcu->rcu, free_pcpu_rcu_fn);
> +}
> +EXPORT_SYMBOL_GPL(six_lock_pcpu_free_rcu);
> +
> +void six_lock_pcpu_free(struct six_lock *lock)
> +{
> +	BUG_ON(lock->readers && pcpu_read_count(lock));
> +	BUG_ON(lock->state.read_lock);
> +
> +	free_percpu(lock->readers);
> +	lock->readers = NULL;
> +}
> +EXPORT_SYMBOL_GPL(six_lock_pcpu_free);
> +
> +void six_lock_pcpu_alloc(struct six_lock *lock)
> +{
> +#ifdef __KERNEL__
> +	if (!lock->readers)
> +		lock->readers = alloc_percpu(unsigned);
> +#endif
> +}
> +EXPORT_SYMBOL_GPL(six_lock_pcpu_alloc);
> +
> +/*
> + * Returns lock held counts, for both read and intent
> + */
> +struct six_lock_count six_lock_counts(struct six_lock *lock)
> +{
> +	struct six_lock_count ret = { 0, lock->state.intent_lock };
> +
> +	if (!lock->readers)
> +		ret.read += lock->state.read_lock;
> +	else {
> +		int cpu;
> +
> +		for_each_possible_cpu(cpu)
> +			ret.read += *per_cpu_ptr(lock->readers, cpu);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(six_lock_counts);

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-13 13:25       ` Lorenzo Stoakes
@ 2023-05-14 18:39         ` Christophe Leroy
  2023-05-14 23:43           ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Christophe Leroy @ 2023-05-14 18:39 UTC (permalink / raw)
  To: Lorenzo Stoakes, Christoph Hellwig, Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, linux-mm



Le 13/05/2023 à 15:25, Lorenzo Stoakes a écrit :
> On Tue, May 09, 2023 at 02:12:41PM -0700, Lorenzo Stoakes wrote:
>> On Tue, May 09, 2023 at 01:46:09PM -0700, Christoph Hellwig wrote:
>>> On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
>>>> From: Kent Overstreet <kent.overstreet@gmail.com>
>>>>
>>>> This is needed for bcachefs, which dynamically generates per-btree node
>>>> unpack functions.
>>>
>>> No, we will never add back a way for random code allocating executable
>>> memory in kernel space.
>>
>> Yeah I think I glossed over this aspect a bit as it looks ostensibly like simply
>> reinstating a helper function because the code is now used in more than one
>> place (at lsf/mm so a little distracted :)
>>
>> But it being exported is a problem. Perhaps there's another way of achieving the
>> same aim without having to do so?
> 
> Just to be abundantly clear, my original ack was a mistake (I overlooked
> the _exporting_ of the function being as significant as it is and assumed
> in an LSF/MM haze that it was simply a refactoring of _already available_
> functionality rather than newly providing a means to allocate directly
> executable kernel memory).
> 
> Exporting this is horrible for the numerous reasons expounded on in this
> thread, we need a different solution.
> 
> Nacked-by: Lorenzo Stoakes <lstoakes@gmail.com>
> 

In addition to that, I still don't understand why you bring back 
vmalloc_exec() instead of using module_alloc().

As reminded in a previous response, some architectures like powerpc/32s 
cannot allocate exec memory in vmalloc space. On powerpc this is because 
exec protection is performed on 256Mbytes segments and vmalloc space is 
flagged non-exec. Some other architectures have a constraint on distance 
between kernel core text and other text.

Today you have for instance kprobes in the kernel that need dynamic exec 
memory. It uses module_alloc() to get it. On some architectures you also 
have ftrace that gets some exec memory with module_alloc().

So, I still don't understand why you cannot use module_alloc() and need 
vmalloc_exec() instead.

Thanks
Christophe

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-14  5:45               ` Kent Overstreet
@ 2023-05-14 18:43                 ` Eric Biggers
  2023-05-15  5:38                   ` Kent Overstreet
  2023-05-15 10:29                 ` David Laight
  1 sibling, 1 reply; 186+ messages in thread
From: Eric Biggers @ 2023-05-14 18:43 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Sun, May 14, 2023 at 01:45:29AM -0400, Kent Overstreet wrote:
> On Fri, May 12, 2023 at 06:57:52PM -0700, Eric Biggers wrote:
> > First, I wanted to mention that decoding of variable-length fields has been
> > extensively studied for decompression algorithms, e.g. for Huffman decoding.
> > And it turns out that it can be done branchlessly.  The basic idea is that you
> > have a branchless refill step that looks like the following:
> > 
> > #define REFILL_BITS_BRANCHLESS()                    \
> >         bitbuf |= get_unaligned_u64(p) << bitsleft; \
> >         p += 7 - ((bitsleft >> 3) & 0x7);           \
> >         bitsleft |= 56;
> > 
> > That branchlessly ensures that 'bitbuf' contains '56 <= bitsleft <= 63' bits.
> > Then, the needed number of bits can be removed and returned:
> > 
> > #define READ_BITS(n)                          \
> >         REFILL_BITS_BRANCHLESS();             \
> >         tmp = bitbuf & (((u64)1 << (n)) - 1); \
> >         bitbuf >>= (n);                       \
> >         bitsleft -= (n);                      \
> >         tmp
> > 
> > If you're interested, I can give you some references about the above method.
> 
> I might be interested in those references, new bit tricks and integer
> encodings are always fun :)

There are some good blog posts by Fabian Giese:

* https://fgiesen.wordpress.com/2018/02/19/reading-bits-in-far-too-many-ways-part-1/
* https://fgiesen.wordpress.com/2018/02/20/reading-bits-in-far-too-many-ways-part-2/
* https://fgiesen.wordpress.com/2018/09/27/reading-bits-in-far-too-many-ways-part-3/

And the examples I gave above are basically what I use in libdeflate:
https://github.com/ebiggers/libdeflate/blob/master/lib/deflate_decompress.c

> > But, I really just wanted to mention it for completeness, since I think you'd
> > actually want to go in a slightly different direction, since (a) you have all
> > the field widths available from the beginning, as opposed to being interleaved
> > into the bitstream itself (as is the case in Huffman decoding for example), so
> > you're not limited to serialized decoding of each field, (b) your fields are up
> > to 96 bits, and (c) you've selected a bitstream convention that seems to make it
> > such that your stream *must* be read in aligned units of u64, so I don't think
> > something like REFILL_BITS_BRANCHLESS() could work for you anyway.
> > 
> > What I would suggest instead is preprocessing the list of 6 field lengths to
> > create some information that can be used to extract all 6 fields branchlessly
> > with no dependencies between different fields.  (And you clearly *can* add a
> > preprocessing step, as you already have one -- the dynamic code generator.)
> > 
> > So, something like the following:
> > 
> >     const struct field_info *info = &format->fields[0];
> > 
> >     field0 = (in->u64s[info->word_idx] >> info->shift1) & info->mask;
> >     field0 |= in->u64s[info->word_idx - 1] >> info->shift2;
> > 
> > ... but with the code for all 6 fields interleaved.
> > 
> > On modern CPUs, I think that would be faster than your current C code.
> > 
> > You could do better by creating variants that are specialized for specific
> > common sets of parameters.  During "preprocessing", you would select a variant
> > and set an enum accordingly.  During decoding, you would switch on that enum and
> > call the appropriate variant.  (This could also be done with a function pointer,
> > of course, but indirect calls are slow these days...)
> 
> testing random btree updates:
> 
> dynamically generated unpack:
> rand_insert: 20.0 MiB with 1 threads in    33 sec,  1609 nsec per iter, 607 KiB per sec
> 
> old C unpack:
> rand_insert: 20.0 MiB with 1 threads in    35 sec,  1672 nsec per iter, 584 KiB per sec
> 
> the Eric Biggers special:
> rand_insert: 20.0 MiB with 1 threads in    35 sec,  1676 nsec per iter, 583 KiB per sec
> 
> Tested two versions of your approach, one without a shift value, one
> where we use a shift value to try to avoid unaligned access - second was
> perhaps 1% faster
> 
> so it's not looking good. This benchmark doesn't even hit on
> unpack_key() quite as much as I thought, so the difference is
> significant.
> 
> diff --git a/fs/bcachefs/bkey.c b/fs/bcachefs/bkey.c

I don't know what this patch applies to, so I can't properly review it.

I suggest checking the assembly and making sure it is what is expected.

In general, for this type of thing it's also helpful to put together a userspace
micro-benchmark program so that it's very fast to evaluate different options.
Building and booting a kernel and doing some I/O benchmark on a bcachefs sounds
much more time consuming and less precise.
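
Something along these lines is usually enough (just a skeleton -
decode_variant() is a placeholder for whichever unpack implementation is
being compared, not actual bcachefs code; compile with -O2 and check the
loop isn't optimized away):

	#include <stdint.h>
	#include <stdio.h>
	#include <time.h>

	static uint64_t sink;

	static void decode_variant(const uint64_t *in)
	{
		/* drop the candidate unpack code in here */
		sink += in[0] ^ in[1] ^ in[2];
	}

	int main(void)
	{
		uint64_t in[3] = { 0x123456789abcdef0, 0xfedcba9876543210, 42 };
		struct timespec t0, t1;
		long i, iters = 100000000;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < iters; i++)
			decode_variant(in);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		/* nanoseconds per iteration; print sink so nothing is elided */
		printf("%.2f ns/iter (%llu)\n",
		       ((t1.tv_sec - t0.tv_sec) * 1e9 +
			(t1.tv_nsec - t0.tv_nsec)) / iters,
		       (unsigned long long)sink);
		return 0;
	}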

> -struct bkey __bch2_bkey_unpack_key(const struct bkey_format_processed *format_p,
> +struct bkey __bch2_bkey_unpack_key(const struct bkey_format_processed *format,
>  				   const struct bkey_packed *in)
>  {
> -	const struct bkey_format *format = &format_p->f;
> -	struct unpack_state state = unpack_state_init(format, in);
>  	struct bkey out;
>  
> -	EBUG_ON(format->nr_fields != BKEY_NR_FIELDS);
> -	EBUG_ON(in->u64s < format->key_u64s);
> +	EBUG_ON(format->f.nr_fields != BKEY_NR_FIELDS);
> +	EBUG_ON(in->u64s < format->f.key_u64s);
>  	EBUG_ON(in->format != KEY_FORMAT_LOCAL_BTREE);
> -	EBUG_ON(in->u64s - format->key_u64s + BKEY_U64s > U8_MAX);
> +	EBUG_ON(in->u64s - format->f.key_u64s + BKEY_U64s > U8_MAX);
>  
> -	out.u64s	= BKEY_U64s + in->u64s - format->key_u64s;
> +	out.u64s	= BKEY_U64s + in->u64s - format->f.key_u64s;
>  	out.format	= KEY_FORMAT_CURRENT;
>  	out.needs_whiteout = in->needs_whiteout;
>  	out.type	= in->type;
>  	out.pad[0]	= 0;
>  
> +	if (likely(format->aligned)) {
> +#define x(id, field)	out.field = get_aligned_field(format, in, id);
> +		bkey_fields()
> +#undef x
> +	} else {
> +		struct unpack_state state = unpack_state_init(&format->f, in);
> +
>  #define x(id, field)	out.field = get_inc_field(&state, id);
> -	bkey_fields()
> +		bkey_fields()
>  #undef x
> +	}

It looks like you didn't change the !aligned case.  How often is the 'aligned'
case taken?

I think it would also help if the generated assembly had the handling of the
fields interleaved.  To achieve that, it might be necessary to interleave the C
code.

- Eric

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-14 18:39         ` Christophe Leroy
@ 2023-05-14 23:43           ` Kent Overstreet
  2023-05-15  4:45             ` Christophe Leroy
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-14 23:43 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Sun, May 14, 2023 at 06:39:00PM +0000, Christophe Leroy wrote:
> In addition to that, I still don't understand why you bring back 
> vmalloc_exec() instead of using module_alloc().
> 
> As reminded in a previous response, some architectures like powerpc/32s 
> cannot allocate exec memory in vmalloc space. On powerpc this is because 
> exec protection is performed on 256Mbytes segments and vmalloc space is 
> flagged non-exec. Some other architectures have a constraint on distance 
> between kernel core text and other text.
> 
> Today you have for instance kprobes in the kernel that need dynamic exec 
> memory. It uses module_alloc() to get it. On some architectures you also 
> have ftrace that gets some exec memory with module_alloc().
> 
> So, I still don't understand why you cannot use module_alloc() and need 
> vmalloc_exec() instead.

Because I didn't know about it :)

Looks like that is indeed the appropriate interface (if a bit poorly
named), I'll switch to using that, thanks.

It'll still need to be exported, but it looks like the W|X attribute
discussion is not really germane here since it's what other in kernel
users are using, and there's nothing particularly special about how
bcachefs is using it compared to them.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 04/32] locking: SIX locks (shared/intent/exclusive)
  2023-05-14 12:15   ` Jeff Layton
@ 2023-05-15  2:39     ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-15  2:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
	Boqun Feng

On Sun, May 14, 2023 at 08:15:20AM -0400, Jeff Layton wrote:
> So the idea is to create a fundamentally unfair rwsem? One that always
> prefers readers over writers?

No, not sure where you're getting that from. It's unfair, but writers are
preferred over readers :)

> 
> > + * Other operations:
> > + *
> > + *   six_trylock_read()
> > + *   six_trylock_intent()
> > + *   six_trylock_write()
> > + *
> > + *   six_lock_downgrade():	convert from intent to read
> > + *   six_lock_tryupgrade():	attempt to convert from read to intent
> > + *
> > + * Locks also embed a sequence number, which is incremented when the lock is
> > + * locked or unlocked for write. The current sequence number can be grabbed
> > + * while a lock is held from lock->state.seq; then, if you drop the lock you can
> > + * use six_relock_(read|intent_write)(lock, seq) to attempt to retake the lock
> > + * iff it hasn't been locked for write in the meantime.
> > + *
> 
> ^^^
> This is a cool idea.

It's used heavily in bcachefs so we can drop locks if we might be
blocking - and then relock and continue, at the cost of a transaction
restart if the relock fails. It's a huge win for tail latency.
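
The usage pattern looks roughly like this (just a sketch, not actual
btree iterator code; do_blocking_work() is a stand-in):

	u32 seq = lock->state.seq;

	six_unlock_read(lock);

	do_blocking_work();

	if (!six_relock_read(lock, seq)) {
		/* lock was taken for write in the meantime:
		 * restart the transaction */
	}

	/* seq unchanged - everything read under the lock is still valid */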

> > + * type is one of SIX_LOCK_read, SIX_LOCK_intent, or SIX_LOCK_write:
> > + *
> > + *   six_lock_type(lock, type)
> > + *   six_unlock_type(lock, type)
> > + *   six_relock(lock, type, seq)
> > + *   six_trylock_type(lock, type)
> > + *   six_trylock_convert(lock, from, to)
> > + *
> > + * A lock may be held multiple types by the same thread (for read or intent,
> > + * not write). However, the six locks code does _not_ implement the actual
> > + * recursive checks itself though - rather, if your code (e.g. btree iterator
> > + * code) knows that the current thread already has a lock held, and for the
> > + * correct type, six_lock_increment() may be used to bump up the counter for
> > + * that type - the only effect is that one more call to unlock will be required
> > + * before the lock is unlocked.
> 
> These semantics are a bit confusing. Once you hold a read or intent lock,
> you can take it as many times as you like. What happens if I take it in
> one context and release it in another? Say, across a workqueue job for
> instance?

Not allowed because of lockdep, same as with other locks.

> Are intent locks "converted" to write locks, or do they stack? For
> instance, suppose I take the intent lock 3 times and then take a write
> lock. How many times do I have to call unlock to fully release it (3 or
> 4)? If I release it just once, do I still hold the write lock or am I
> back to "intent" state?

They stack. You'd call unlock_write() once and unlock_intent() three
times.
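
I.e., roughly (sketch; recursive acquires go through
six_lock_increment(), as described in the header comment):

	six_lock_intent(lock, NULL, NULL);
	six_lock_increment(lock, SIX_LOCK_intent);	/* held 2x */
	six_lock_increment(lock, SIX_LOCK_intent);	/* held 3x */

	six_lock_write(lock, NULL, NULL);	/* needs intent held */
	six_unlock_write(lock);			/* back to intent only */

	six_unlock_intent(lock);
	six_unlock_intent(lock);
	six_unlock_intent(lock);		/* now fully unlocked */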

> Some basic info about the underlying design would be nice here. What
> info is tracked in the union below? When are different members being
> used? How does the code decide which way to cast this thing? etc.

The field names seem pretty descriptive to me.

counter, v are just for READ_ONCE/atomic64 cmpxchg ops.
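
I.e., the three ways the union actually gets used in the code above:

	/* plain reads/snapshots: */
	v = READ_ONCE(lock->state.v);

	/* atomic updates: */
	atomic64_cmpxchg(&lock->state.counter, old.v, new.v);

	/* bit ops (the .l member exists for waitlist_bitnr()): */
	clear_bit(waitlist_bitnr(waitlist_id), (unsigned long *) &lock->state.v);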

> Ewww...bitfields. That seems a bit scary in a union. There is no
> guarantee that the underlying arch will even pack that into a single
> word, AIUI. It may be safer to do this with masking and shifting
> instead.

It wouldn't hurt to add a BUILD_BUG_ON() for the size, but I don't find
anything "scary" about unions and bitfields :)

And it makes the code more descriptive and readable than masking and
shifting.
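
E.g. (in __six_lock_init(), say):

	BUILD_BUG_ON(sizeof(union six_lock_state) != sizeof(u64));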

> > +static __always_inline bool do_six_trylock_type(struct six_lock *lock,
> > +						enum six_lock_type type,
> > +						bool try)
> > +{
> > +	const struct six_lock_vals l[] = LOCK_VALS;
> > +	union six_lock_state old, new;
> > +	bool ret;
> > +	u64 v;
> > +
> > +	EBUG_ON(type == SIX_LOCK_write && lock->owner != current);
> > +	EBUG_ON(type == SIX_LOCK_write && (lock->state.seq & 1));
> > +
> > +	EBUG_ON(type == SIX_LOCK_write && (try != !(lock->state.write_locking)));
> > +
> > +	/*
> > +	 * Percpu reader mode:
> > +	 *
> > +	 * The basic idea behind this algorithm is that you can implement a lock
> > +	 * between two threads without any atomics, just memory barriers:
> > +	 *
> > +	 * For two threads you'll need two variables, one variable for "thread a
> > +	 * has the lock" and another for "thread b has the lock".
> > +	 *
> > +	 * To take the lock, a thread sets its variable indicating that it holds
> > +	 * the lock, then issues a full memory barrier, then reads from the
> > +	 * other thread's variable to check if the other thread thinks it has
> > +	 * the lock. If we raced, we backoff and retry/sleep.
> > +	 */
> > +
> > +	if (type == SIX_LOCK_read && lock->readers) {
> > +retry:
> > +		preempt_disable();
> > +		this_cpu_inc(*lock->readers); /* signal that we own lock */
> > +
> > +		smp_mb();
> > +
> > +		old.v = READ_ONCE(lock->state.v);
> > +		ret = !(old.v & l[type].lock_fail);
> > +
> > +		this_cpu_sub(*lock->readers, !ret);
> > +		preempt_enable();
> > +
> > +		/*
> > +		 * If we failed because a writer was trying to take the
> > +		 * lock, issue a wakeup because we might have caused a
> > +		 * spurious trylock failure:
> > +		 */
> > +		if (old.write_locking) {
> > +			struct task_struct *p = READ_ONCE(lock->owner);
> > +
> > +			if (p)
> > +				wake_up_process(p);
> > +		}
> > +
> > +		/*
> > +		 * If we failed from the lock path and the waiting bit wasn't
> > +		 * set, set it:
> > +		 */
> > +		if (!try && !ret) {
> > +			v = old.v;
> > +
> > +			do {
> > +				new.v = old.v = v;
> > +
> > +				if (!(old.v & l[type].lock_fail))
> > +					goto retry;
> > +
> > +				if (new.waiters & (1 << type))
> > +					break;
> > +
> > +				new.waiters |= 1 << type;
> > +			} while ((v = atomic64_cmpxchg(&lock->state.counter,
> > +						       old.v, new.v)) != old.v);
> > +		}
> > +	} else if (type == SIX_LOCK_write && lock->readers) {
> > +		if (try) {
> > +			atomic64_add(__SIX_VAL(write_locking, 1),
> > +				     &lock->state.counter);
> > +			smp_mb__after_atomic();
> > +		}
> > +
> > +		ret = !pcpu_read_count(lock);
> > +
> > +		/*
> > +		 * On success, we increment lock->seq; also we clear
> > +		 * write_locking unless we failed from the lock path:
> > +		 */
> > +		v = 0;
> > +		if (ret)
> > +			v += __SIX_VAL(seq, 1);
> > +		if (ret || try)
> > +			v -= __SIX_VAL(write_locking, 1);
> > +
> > +		if (try && !ret) {
> > +			old.v = atomic64_add_return(v, &lock->state.counter);
> > +			six_lock_wakeup(lock, old, SIX_LOCK_read);
> > +		} else {
> > +			atomic64_add(v, &lock->state.counter);
> > +		}
> > +	} else {
> > +		v = READ_ONCE(lock->state.v);
> > +		do {
> > +			new.v = old.v = v;
> > +
> > +			if (!(old.v & l[type].lock_fail)) {
> > +				new.v += l[type].lock_val;
> > +
> > +				if (type == SIX_LOCK_write)
> > +					new.write_locking = 0;
> > +			} else if (!try && type != SIX_LOCK_write &&
> > +				   !(new.waiters & (1 << type)))
> > +				new.waiters |= 1 << type;
> > +			else
> > +				break; /* waiting bit already set */
> > +		} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
> > +					old.v, new.v)) != old.v);
> > +
> > +		ret = !(old.v & l[type].lock_fail);
> > +
> > +		EBUG_ON(ret && !(lock->state.v & l[type].held_mask));
> > +	}
> > +
> > +	if (ret)
> > +		six_set_owner(lock, type, old);
> > +
> > +	EBUG_ON(type == SIX_LOCK_write && (try || ret) && (lock->state.write_locking));
> > +
> > +	return ret;
> > +}
> > +
> 
> ^^^
> I'd really like to see some more comments in the code above. It's pretty
> complex.

It's already got more comments than is typical for kernel locking code :)

But if there's specific things you'd like to see clarified, please do
point them out.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-14 23:43           ` Kent Overstreet
@ 2023-05-15  4:45             ` Christophe Leroy
  2023-05-15  5:02               ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Christophe Leroy @ 2023-05-15  4:45 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm



Le 15/05/2023 à 01:43, Kent Overstreet a écrit :
> On Sun, May 14, 2023 at 06:39:00PM +0000, Christophe Leroy wrote:
>> In addition to that, I still don't understand why you bring back
>> vmalloc_exec() instead of using module_alloc().
>>
>> As reminded in a previous response, some architectures like powerpc/32s
>> cannot allocate exec memory in vmalloc space. On powerpc this is because
>> exec protection is performed on 256Mbytes segments and vmalloc space is
>> flagged non-exec. Some other architectures have a constraint on distance
>> between kernel core text and other text.
>>
>> Today you have for instance kprobes in the kernel that need dynamic exec
>> memory. It uses module_alloc() to get it. On some architectures you also
>> have ftrace that gets some exec memory with module_alloc().
>>
>> So, I still don't understand why you cannot use module_alloc() and need
>> vmalloc_exec() instead.
> 
> Because I didn't know about it :)
> 
> Looks like that is indeed the appropriate interface (if a bit poorly
> named), I'll switch to using that, thanks.
> 
> It'll still need to be exported, but it looks like the W|X attribute
> discussion is not really germane here since it's what other in kernel
> users are using, and there's nothing particularly special about how
> bcachefs is using it compared to them.

The W|X subject is applicable.

If you look into powerpc's module_alloc(), you'll see that when 
CONFIG_STRICT_MODULE_RWX is selected, module_alloc() allocate 
PAGE_KERNEL memory. It is then up to the consumer to change it to RO-X.

See for instance in arch/powerpc/kernel/kprobes.c:

void *alloc_insn_page(void)
{
	void *page;

	page = module_alloc(PAGE_SIZE);
	if (!page)
		return NULL;

	if (strict_module_rwx_enabled())
		set_memory_rox((unsigned long)page, 1);

	return page;
}


Christophe

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-15  4:45             ` Christophe Leroy
@ 2023-05-15  5:02               ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-15  5:02 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Mon, May 15, 2023 at 04:45:42AM +0000, Christophe Leroy wrote:
> 
> 
> Le 15/05/2023 à 01:43, Kent Overstreet a écrit :
> > On Sun, May 14, 2023 at 06:39:00PM +0000, Christophe Leroy wrote:
> >> In addition to that, I still don't understand why you bring back
> >> vmalloc_exec() instead of using module_alloc().
> >>
> >> As reminded in a previous response, some architectures like powerpc/32s
> >> cannot allocate exec memory in vmalloc space. On powerpc this is because
> >> exec protection is performed on 256Mbytes segments and vmalloc space is
> >> flagged non-exec. Some other architectures have a constraint on distance
> >> between kernel core text and other text.
> >>
> >> Today you have for instance kprobes in the kernel that need dynamic exec
> >> memory. It uses module_alloc() to get it. On some architectures you also
> >> have ftrace that gets some exec memory with module_alloc().
> >>
> >> So, I still don't understand why you cannot use module_alloc() and need
> >> vmalloc_exec() instead.
> > 
> > Because I didn't know about it :)
> > 
> > Looks like that is indeed the appropriate interface (if a bit poorly
> > named), I'll switch to using that, thanks.
> > 
> > It'll still need to be exported, but it looks like the W|X attribute
> > discussion is not really germane here since it's what other in kernel
> > users are using, and there's nothing particularly special about how
> > bcachefs is using it compared to them.
> 
> The W|X subject is applicable.
> 
> If you look into powerpc's module_alloc(), you'll see that when 
> CONFIG_STRICT_MODULE_RWX is selected, module_alloc() allocate 
> PAGE_KERNEL memory. It is then up to the consumer to change it to RO-X.
> 
> See for instance in arch/powerpc/kernel/kprobes.c:
> 
> void *alloc_insn_page(void)
> {
> 	void *page;
> 
> 	page = module_alloc(PAGE_SIZE);
> 	if (!page)
> 		return NULL;
> 
> 	if (strict_module_rwx_enabled())
> 		set_memory_rox((unsigned long)page, 1);
> 
> 	return page;
> }

Yeah.

I'm looking at the bpf code now.

<RANT MODE, YOU ARE WARNED>

Can I just say, for the record - god damn this situation is starting to
piss me off? This really nicely encapsulates everything I hate about
kernel development processes and culture and the fscking messes that get
foisted upon people as a result.

All I'm trying to do is write a fucking filesystem here people, I've got
enough on my plate. Dealing with the fallout of a kernel interface going
away without a proper replacement was NOT WHAT I FUCKING HAD IN MIND?

5% performance regression without this. That's just not acceptable, I
can't produce a filesystem that people will in the end want to use by
leaving performance on the table, it's death of a thousand cuts if I
take that attitude. Every 1% needs to be accounted for, a 5% performance
regression is flat out not going to happen.

And the real icing on this motherfucking turd sandwich of a cake, is
that I'm not the first person to have to solve this particular technical
problem.

BPF has the code I need.

But, in true kernel fashion, did they recognize that this was a
subproblem they could write as a library, both making their code more
modular and easier to understand, as well as, oh I don't know, not
leaving a giant steaming turd for the next person to come along?

Nope.

I'd be embarrassed if I was responsible for this.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-14 18:43                 ` Eric Biggers
@ 2023-05-15  5:38                   ` Kent Overstreet
  2023-05-15  6:13                     ` Eric Biggers
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-15  5:38 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Sun, May 14, 2023 at 11:43:25AM -0700, Eric Biggers wrote:
> I think it would also help if the generated assembly had the handling of the
> fields interleaved.  To achieve that, it might be necessary to interleave the C
> code.

No, that has negligible effect on performance - as expected, for an out
of order processor. < 1% improvement.

It doesn't look like this approach is going to work here. Sadly.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-15  5:38                   ` Kent Overstreet
@ 2023-05-15  6:13                     ` Eric Biggers
  2023-05-15  6:18                       ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Eric Biggers @ 2023-05-15  6:13 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Mon, May 15, 2023 at 01:38:51AM -0400, Kent Overstreet wrote:
> On Sun, May 14, 2023 at 11:43:25AM -0700, Eric Biggers wrote:
> > I think it would also help if the generated assembly had the handling of the
> > fields interleaved.  To achieve that, it might be necessary to interleave the C
> > code.
> 
> No, that has negligible effect on performance - as expected, for an out
> of order processor. < 1% improvement.
> 
> It doesn't look like this approach is going to work here. Sadly.

I'd be glad to take a look at the code you actually tried.  It would be helpful
if you actually provided it, instead of just this "I tried it, I'm giving up
now" sort of thing.

I was also hoping you'd take the time to split this out into a userspace
micro-benchmark program that we could quickly try different approaches on.

BTW, even if people are okay with dynamic code generation (which seems
unlikely?), you'll still need a C version for architectures that you haven't
implemented the dynamic code generation for.

- Eric

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-15  6:13                     ` Eric Biggers
@ 2023-05-15  6:18                       ` Kent Overstreet
  2023-05-15  7:13                         ` Eric Biggers
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-15  6:18 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Sun, May 14, 2023 at 11:13:46PM -0700, Eric Biggers wrote:
> On Mon, May 15, 2023 at 01:38:51AM -0400, Kent Overstreet wrote:
> > On Sun, May 14, 2023 at 11:43:25AM -0700, Eric Biggers wrote:
> > > I think it would also help if the generated assembly had the handling of the
> > > fields interleaved.  To achieve that, it might be necessary to interleave the C
> > > code.
> > 
> > No, that has negligible effect on performance - as expected, for an out
> > of order processor. < 1% improvement.
> > 
> > It doesn't look like this approach is going to work here. Sadly.
> 
> I'd be glad to take a look at the code you actually tried.  It would be helpful
> if you actually provided it, instead of just this "I tried it, I'm giving up
> now" sort of thing.

https://evilpiepirate.org/git/bcachefs.git/log/?h=bkey_unpack

> I was also hoping you'd take the time to split this out into a userspace
> micro-benchmark program that we could quickly try different approaches on.

I don't need to, because I already have this:
https://evilpiepirate.org/git/ktest.git/tree/tests/bcachefs/perf.ktest

> BTW, even if people are okay with dynamic code generation (which seems
> unlikely?), you'll still need a C version for architectures that you haven't
> implemented the dynamic code generation for.

Excuse me? There already is a C version, and we've been discussing it.
Your approach wasn't any faster than the existing C version.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-15  6:18                       ` Kent Overstreet
@ 2023-05-15  7:13                         ` Eric Biggers
  2023-05-15  7:26                           ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Eric Biggers @ 2023-05-15  7:13 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Mon, May 15, 2023 at 02:18:14AM -0400, Kent Overstreet wrote:
> On Sun, May 14, 2023 at 11:13:46PM -0700, Eric Biggers wrote:
> > On Mon, May 15, 2023 at 01:38:51AM -0400, Kent Overstreet wrote:
> > > On Sun, May 14, 2023 at 11:43:25AM -0700, Eric Biggers wrote:
> > > > I think it would also help if the generated assembly had the handling of the
> > > > fields interleaved.  To achieve that, it might be necessary to interleave the C
> > > > code.
> > > 
> > > No, that has negligible effect on performance - as expected, for an out
> > > of order processor. < 1% improvement.
> > > 
> > > It doesn't look like this approach is going to work here. Sadly.
> > 
> > I'd be glad to take a look at the code you actually tried.  It would be helpful
> > if you actually provided it, instead of just this "I tried it, I'm giving up
> > now" sort of thing.
> 
> https://evilpiepirate.org/git/bcachefs.git/log/?h=bkey_unpack
> 
> > I was also hoping you'd take the time to split this out into a userspace
> > micro-benchmark program that we could quickly try different approaches on.
> 
> I don't need to, because I already have this:
> https://evilpiepirate.org/git/ktest.git/tree/tests/bcachefs/perf.ktest

Sure, given that this is an optimization problem with a very small scope
(decoding 6 fields from a bitstream), I was hoping for something easier and
faster to iterate on than setting up a full kernel + bcachefs test environment
and reverse engineering 500 lines of shell script.  But sure, I can look into
that when I have a chance.

> Your approach wasn't any faster than the existing C version.

Well, it's your implementation of what you thought was "my approach".  It
doesn't quite match what I had suggested.  As I mentioned in my last email, it's
also unclear that your new code is ever actually executed, since you made it
conditional on all fields being byte-aligned...

- Eric

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-15  7:13                         ` Eric Biggers
@ 2023-05-15  7:26                           ` Kent Overstreet
  2023-05-21 21:33                             ` Eric Biggers
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-15  7:26 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Mon, May 15, 2023 at 12:13:43AM -0700, Eric Biggers wrote:
> Sure, given that this is an optimization problem with a very small scope
> (decoding 6 fields from a bitstream), I was hoping for something easier and
> faster to iterate on than setting up a full kernel + bcachefs test environment
> and reverse engineering 500 lines of shell script.  But sure, I can look into
> that when I have a chance.

If you were actually wanting to help, that repository is the tool I use
for kernel development and testing - it's got documentation.

It builds a kernel, boots a VM and runs a test in about 15 seconds, no
need for lifting that code out to userspace.

> > Your approach wasn't any faster than the existing C version.
> 
> Well, it's your implementation of what you thought was "my approach".  It
> doesn't quite match what I had suggested.  As I mentioned in my last email, it's
> also unclear that your new code is ever actually executed, since you made it
> conditional on all fields being byte-aligned...

Eric, I'm not an idiot, that was one of the first things I checked. No
unaligned bkey formats were generated in my tests.

The latest iteration of your approach that I looked at compiled to ~250
bytes of code, vs. ~50 bytes for the dynamically generated unpack
functions. I'm sure it's possible to shave a bit off with some more
work, but looking at the generated code it's clear it's not going to be
competitive.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-14  5:45               ` Kent Overstreet
  2023-05-14 18:43                 ` Eric Biggers
@ 2023-05-15 10:29                 ` David Laight
  1 sibling, 0 replies; 186+ messages in thread
From: David Laight @ 2023-05-15 10:29 UTC (permalink / raw)
  To: 'Kent Overstreet', Eric Biggers
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

From: Kent Overstreet
> Sent: 14 May 2023 06:45
...
> dynamically generated unpack:
> rand_insert: 20.0 MiB with 1 threads in    33 sec,  1609 nsec per iter, 607 KiB per sec
> 
> old C unpack:
> rand_insert: 20.0 MiB with 1 threads in    35 sec,  1672 nsec per iter, 584 KiB per sec
> 
> the Eric Biggers special:
> rand_insert: 20.0 MiB with 1 threads in    35 sec,  1676 nsec per iter, 583 KiB per sec
> 
> Tested two versions of your approach, one without a shift value, one
> where we use a shift value to try to avoid unaligned access - second was
> perhaps 1% faster

You won't notice any effect of avoiding unaligned accesses on x86.
I think they then get split into 64bit accesses and again on 64 byte
boundaries (that is what I see for uncached access to PCIe).
The kernel won't be doing >64bit and the 'out of order'
pipeline will tend to cover the others (especially since you
get 2 reads/clock).

> so it's not looking good. This benchmark doesn't even hit on
> unpack_key() quite as much as I thought, so the difference is
> significant.

Beware: unless you manage to lock the cpu frequency (which is ~impossible
on some CPUs) timings in nanoseconds are pretty useless.
You can use the performance counter to get accurate cycle times
(provided there isn't a cpu switch in the middle of a micro-benchmark).
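
For reference, a minimal userspace sketch of doing that with the cycles
performance counter (an assumption, not code from this thread - and the
thread should be pinned to one CPU while it runs):

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <string.h>
	#include <stdint.h>

	static int cycle_counter_open(void)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.type   = PERF_TYPE_HARDWARE;
		attr.size   = sizeof(attr);
		attr.config = PERF_COUNT_HW_CPU_CYCLES;

		/* count this process, on whatever CPU it runs on */
		return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
	}

	/*
	 * usage: read() the fd into a uint64_t before and after the code
	 * under test; the difference is elapsed core cycles, which - unlike
	 * nanoseconds - doesn't move around with the clock frequency.
	 */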

	David



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-05-10  4:45   ` Dave Chinner
@ 2023-05-16 15:45     ` Christian Brauner
  2023-05-16 16:17       ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Christian Brauner @ 2023-05-16 15:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs,
	Dave Chinner, Alexander Viro

On Wed, May 10, 2023 at 02:45:57PM +1000, Dave Chinner wrote:
> On Tue, May 09, 2023 at 12:56:47PM -0400, Kent Overstreet wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Because scalability of the global inode_hash_lock really, really
> > sucks.
> > 
> > 32-way concurrent create on a couple of different filesystems
> > before:
> > 
> > -   52.13%     0.04%  [kernel]            [k] ext4_create
> >    - 52.09% ext4_create
> >       - 41.03% __ext4_new_inode
> >          - 29.92% insert_inode_locked
> >             - 25.35% _raw_spin_lock
> >                - do_raw_spin_lock
> >                   - 24.97% __pv_queued_spin_lock_slowpath
> > 
> > -   72.33%     0.02%  [kernel]            [k] do_filp_open
> >    - 72.31% do_filp_open
> >       - 72.28% path_openat
> >          - 57.03% bch2_create
> >             - 56.46% __bch2_create
> >                - 40.43% inode_insert5
> >                   - 36.07% _raw_spin_lock
> >                      - do_raw_spin_lock
> >                           35.86% __pv_queued_spin_lock_slowpath
> >                     4.02% find_inode
> > 
> > Convert the inode hash table to a RCU-aware hash-bl table just like
> > the dentry cache. Note that we need to store a pointer to the
> > hlist_bl_head the inode has been added to in the inode so that when
> > it comes to unhash the inode we know what list to lock. We need to
> > do this because the hash value that is used to hash the inode is
> > generated from the inode itself - filesystems can provide this
> > themselves so we have to either store the hash or the head pointer
> > in the inode to be able to find the right list head for removal...
> > 
> > Same workload after:
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: linux-fsdevel@vger.kernel.org
> > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> 
> I have been maintaining this patchset up to date in my own local trees
> and the code in this patch looks the same. The commit message above,
> however, has been mangled. The full commit message should be:
> 
> vfs: inode cache conversion to hash-bl
> 
> Because scalability of the global inode_hash_lock really, really
> sucks and prevents me from doing scalability characterisation and
> analysis of bcachefs algorithms.
> 
> Profiles of a 32-way concurrent create of 51.2m inodes with fsmark
> on a couple of different filesystems on a 5.10 kernel:
> 
> -   52.13%     0.04%  [kernel]            [k] ext4_create
>    - 52.09% ext4_create
>       - 41.03% __ext4_new_inode
>          - 29.92% insert_inode_locked
>             - 25.35% _raw_spin_lock
>                - do_raw_spin_lock
>                   - 24.97% __pv_queued_spin_lock_slowpath
> 
> 
> -   72.33%     0.02%  [kernel]            [k] do_filp_open
>    - 72.31% do_filp_open
>       - 72.28% path_openat
>          - 57.03% bch2_create
>             - 56.46% __bch2_create
>                - 40.43% inode_insert5
>                   - 36.07% _raw_spin_lock
>                      - do_raw_spin_lock
>                           35.86% __pv_queued_spin_lock_slowpath
>                     4.02% find_inode
> 
> btrfs was tested but it is limited by internal lock contention at
> >=2 threads on this workload, so never hammers the inode cache lock
> hard enough for this change to matter to its performance.
> 
> However, both bcachefs and ext4 demonstrate poor scaling at >=8
> threads on concurrent lookup or create workloads.
> 
> Hence convert the inode hash table to a RCU-aware hash-bl table just
> like the dentry cache. Note that we need to store a pointer to the
> hlist_bl_head the inode has been added to in the inode so that when
> it comes to unhash the inode we know what list to lock. We need to
> do this because, unlike the dentry cache, the hash value that is
> used to hash the inode is not generated from the inode itself. i.e.
> filesystems can provide this themselves so we have to either store
> the hashval or the hlist head pointer in the inode to be able to
> find the right list head for removal...
> 
> Concurrent create with varying thread count (files/s):
> 
>                 ext4                    bcachefs
> threads         vanilla  patched        vanilla patched
> 2               117k     112k            80k     85k
> 4               185k     190k           133k    145k
> 8               303k     346k           185k    255k
> 16              389k     465k           190k    420k
> 32              360k     437k           142k    481k
> 
> CPU usage for both bcachefs and ext4 at 16 and 32 threads has been
> halved on the patched kernel, while performance has increased
> marginally on ext4 and massively on bcachefs. Internal filesystem
> algorithms now limit performance on these workloads, not the global
> inode_hash_lock.
> 
> Profile of the workloads on the patched kernels:
> 
> -   35.94%     0.07%  [kernel]                  [k] ext4_create
>    - 35.87% ext4_create
>       - 20.45% __ext4_new_inode
> ...
>            3.36% insert_inode_locked
> 
>    - 78.43% do_filp_open
>       - 78.36% path_openat
>          - 53.95% bch2_create
>             - 47.99% __bch2_create
> ....
>               - 7.57% inode_insert5
>                     6.94% find_inode
> 
> Spinlock contention is largely gone from the inode hash operations
> and the filesystems are limited by contention in their internal
> algorithms.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> 
> Other than that, the diffstat is the same and I don't see any obvious
> differences in the code compared to what I've been running locally.

There's a bit of a backlog before I get around to looking at this but
it'd be great if we'd have a few reviewers for this change.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-05-16 15:45     ` Christian Brauner
@ 2023-05-16 16:17       ` Kent Overstreet
  2023-05-16 23:15         ` Dave Chinner
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-16 16:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, linux-bcachefs,
	Dave Chinner, Alexander Viro

On Tue, May 16, 2023 at 05:45:19PM +0200, Christian Brauner wrote:
> On Wed, May 10, 2023 at 02:45:57PM +1000, Dave Chinner wrote:
> There's a bit of a backlog before I get around to looking at this but
> it'd be great if we'd have a few reviewers for this change.

It is well tested - it's been in the bcachefs tree for ages with zero
issues. I'm pulling it out of the bcachefs-prerequisites series though
since Dave's still got it in his tree, he's got a newer version with
better commit messages.

It's a significant performance boost on metadata heavy workloads for any
non-XFS filesystem, we should definitely get it in.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-12 18:41       ` Kent Overstreet
@ 2023-05-16 21:02         ` Kees Cook
  2023-05-16 21:20           ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Kees Cook @ 2023-05-16 21:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Johannes Thumshirn, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	linux-hardening

On Fri, May 12, 2023 at 02:41:50PM -0400, Kent Overstreet wrote:
> On Thu, May 11, 2023 at 03:28:40PM -0700, Kees Cook wrote:
> > On Wed, May 10, 2023 at 03:05:48PM +0000, Johannes Thumshirn wrote:
> > > On 09.05.23 18:56, Kent Overstreet wrote:
> > > > +/**
> > > > + * vmalloc_exec - allocate virtually contiguous, executable memory
> > > > + * @size:	  allocation size
> > > > + *
> > > > + * Kernel-internal function to allocate enough pages to cover @size
> > > > + * from the page level allocator and map them into contiguous and
> > > > + * executable kernel virtual space.
> > > > + *
> > > > + * For tight control over page level allocator and protection flags
> > > > + * use __vmalloc() instead.
> > > > + *
> > > > + * Return: pointer to the allocated memory or %NULL on error
> > > > + */
> > > > +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> > > > +{
> > > > +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> > > > +			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> > > > +			NUMA_NO_NODE, __builtin_return_address(0));
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vmalloc_exec);
> > > 
> > > Uh W+X memory regions.
> > > The 90s called, they want their shellcode back.
> > 
> > Just to clarify: the kernel must never create W+X memory regions. So,
> > no, do not reintroduce vmalloc_exec().
> > 
> > Dynamic code areas need to be constructed in a non-executable memory,
> > then switched to read-only and verified to still be what was expected,
> > and only then made executable.
> 
> So if we're opening this up to the topic of what an acceptable API would
> look like - how hard is this requirement?
> 
> The reason is that the functions we're constructing are only ~50 bytes,
> so we don't want to be burning a full page per function (particularly
> for the 64kb page architectures...)

For something that small, why not use the text_poke API?

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-16 21:02         ` Kees Cook
@ 2023-05-16 21:20           ` Kent Overstreet
  2023-05-16 21:47             ` Matthew Wilcox
  2023-06-17  4:13             ` Andy Lutomirski
  0 siblings, 2 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-16 21:20 UTC (permalink / raw)
  To: Kees Cook
  Cc: Johannes Thumshirn, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	linux-hardening

On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> For something that small, why not use the text_poke API?

This looks like it's meant for patching existing kernel text, which
isn't what I want - I'm generating new functions on the fly, one per
btree node.

I'm working up a new allocator - a (very simple) slab allocator where
you pass a buffer, and it gives you a copy of that buffer mapped
executable, but not writeable.

It looks like we'll be able to convert bpf, kprobes, and ftrace
trampolines to it; it'll consolidate a fair amount of code (particularly
in bpf), and they won't have to burn a full page per allocation anymore.

bpf has a neat trick where it maps the same page in two different
locations, one is the executable location and the other is the writeable
location - I'm stealing that.

external api will be:

void *jit_alloc(void *buf, size_t len, gfp_t gfp);
void jit_free(void *buf);
void jit_update(void *buf, void *new_code, size_t len); /* update an existing allocation */
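
For illustration, a hypothetical caller of that interface (a sketch against
the prototypes above, not code from the series):

	#include <linux/gfp.h>
	#include <linux/jitalloc.h>	/* proposed header */

	/*
	 * Copy @code into an executable, non-writeable allocation; only the
	 * executable address is ever handed back to the caller.
	 */
	static void *publish_jit_fn(const void *code, size_t len)
	{
		void *fn = jit_alloc((void *) code, len, GFP_KERNEL);

		if (!fn)
			return NULL;

		/*
		 * later: jit_update(fn, new_code, len) to patch it in place,
		 * and jit_free(fn) when the owner goes away.
		 */
		return fn;
	}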

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-16 21:20           ` Kent Overstreet
@ 2023-05-16 21:47             ` Matthew Wilcox
  2023-05-16 21:57               ` Kent Overstreet
  2023-05-17  5:28               ` Kent Overstreet
  2023-06-17  4:13             ` Andy Lutomirski
  1 sibling, 2 replies; 186+ messages in thread
From: Matthew Wilcox @ 2023-05-16 21:47 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Kees Cook, Johannes Thumshirn, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, linux-hardening

On Tue, May 16, 2023 at 05:20:33PM -0400, Kent Overstreet wrote:
> On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> > For something that small, why not use the text_poke API?
> 
> This looks like it's meant for patching existing kernel text, which
> isn't what I want - I'm generating new functions on the fly, one per
> btree node.
> 
> I'm working up a new allocator - a (very simple) slab allocator where
> you pass a buffer, and it gives you a copy of that buffer mapped
> executable, but not writeable.
> 
> It looks like we'll be able to convert bpf, kprobes, and ftrace
> trampolines to it; it'll consolidate a fair amount of code (particularly
> in bpf), and they won't have to burn a full page per allocation anymore.
> 
> bpf has a neat trick where it maps the same page in two different
> locations, one is the executable location and the other is the writeable
> location - I'm stealing that.

How does that avoid the problem of being able to construct an arbitrary
gadget that somebody else will then execute?  IOW, what bpf has done
seems like it's working around & undoing the security improvements.

I suppose it's an improvement that only the executable address is
passed back to the caller, and not the writable address.

> external api will be:
> 
> void *jit_alloc(void *buf, size_t len, gfp_t gfp);
> void jit_free(void *buf);
> void jit_update(void *buf, void *new_code, size_t len); /* update an existing allocation */
> 

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-16 21:47             ` Matthew Wilcox
@ 2023-05-16 21:57               ` Kent Overstreet
  2023-05-17  5:28               ` Kent Overstreet
  1 sibling, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-16 21:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kees Cook, Johannes Thumshirn, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, linux-hardening

On Tue, May 16, 2023 at 10:47:13PM +0100, Matthew Wilcox wrote:
> On Tue, May 16, 2023 at 05:20:33PM -0400, Kent Overstreet wrote:
> > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> > > For something that small, why not use the text_poke API?
> > 
> > This looks like it's meant for patching existing kernel text, which
> > isn't what I want - I'm generating new functions on the fly, one per
> > btree node.
> > 
> > I'm working up a new allocator - a (very simple) slab allocator where
> > you pass a buffer, and it gives you a copy of that buffer mapped
> > executable, but not writeable.
> > 
> > It looks like we'll be able to convert bpf, kprobes, and ftrace
> > trampolines to it; it'll consolidate a fair amount of code (particularly
> > in bpf), and they won't have to burn a full page per allocation anymore.
> > 
> > bpf has a neat trick where it maps the same page in two different
> > locations, one is the executable location and the other is the writeable
> > location - I'm stealing that.
> 
> How does that avoid the problem of being able to construct an arbitrary
> gadget that somebody else will then execute?  IOW, what bpf has done
> seems like it's working around & undoing the security improvements.
> 
> I suppose it's an improvement that only the executable address is
> passed back to the caller, and not the writable address.

That's my thinking; grepping around finds several uses of module_alloc()
that are all doing different variations on the page permissions dance.
Let's just do it once and do it right...
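
The "dance" in question is roughly this pattern, open-coded with small
variations by each caller today (a sketch, not lifted from any particular
user):

	void *p = module_alloc(len);

	if (p) {
		memcpy(p, insns, len);		/* fill in code while still writable */
		set_vm_flush_reset_perms(p);
		set_memory_rox((unsigned long) p,	/* then flip to RO + exec */
			       DIV_ROUND_UP(len, PAGE_SIZE));
	}

i.e. allocate, write the code while the mapping is still writable and
non-executable, then switch it to read-only + executable in one step.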

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-05-16 16:17       ` Kent Overstreet
@ 2023-05-16 23:15         ` Dave Chinner
  2023-05-22 13:04           ` Christian Brauner
  0 siblings, 1 reply; 186+ messages in thread
From: Dave Chinner @ 2023-05-16 23:15 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christian Brauner, linux-kernel, linux-fsdevel, linux-bcachefs,
	Dave Chinner, Alexander Viro

On Tue, May 16, 2023 at 12:17:04PM -0400, Kent Overstreet wrote:
> On Tue, May 16, 2023 at 05:45:19PM +0200, Christian Brauner wrote:
> > On Wed, May 10, 2023 at 02:45:57PM +1000, Dave Chinner wrote:
> > There's a bit of a backlog before I get around to looking at this but
> > it'd be great if we'd have a few reviewers for this change.
> 
> It is well tested - it's been in the bcachefs tree for ages with zero
> issues. I'm pulling it out of the bcachefs-prerequisites series though
> since Dave's still got it in his tree, he's got a newer version with
> better commit messages.
> 
> It's a significant performance boost on metadata heavy workloads for any
> non-XFS filesystem, we should definitely get it in.

I've got an up to date vfs-scale tree here (6.4-rc1) but I have not
been able to test it effectively right now because my local
performance test server is broken. I'll do what I can on the old
small machine that I have to validate it when I get time, but that
might be a few weeks away....

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale

As it is, the inode hash-bl changes have zero impact on XFS because
it has its own highly scalable lockless, sharded inode cache. So
unless I'm explicitly testing ext4 or btrfs scalability (rare) it's
not getting a lot of scalability exercise. It is being used by the
root filesystems on all those test VMs, but that's about it...

That said, my vfs-scale tree also has Waiman Long's old dlist code
(per cpu linked list) which converts the sb inode list and removes
the global lock there. This does make a huge impact for XFS - the
current code limits inode cache cycling to about 600,000 inodes/sec
on >=16p machines. With dlists, however:

| 5.17.0 on a XFS filesystem with 50 million inodes in it on a 32p
| machine with a 1.6MIOPS/6.5GB/s block device.
| 
| Fully concurrent full filesystem bulkstat:
| 
| 		wall time	sys time	IOPS	BW	rate
| unpatched:	1m56.035s	56m12.234s	 8k     200MB/s	0.4M/s
| patched:	0m15.710s	 3m45.164s	70k	1.9GB/s 3.4M/s
| 
| Unpatched flat kernel profile:
| 
|   81.97%  [kernel]  [k] __pv_queued_spin_lock_slowpath
|    1.84%  [kernel]  [k] do_raw_spin_lock
|    1.33%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
|    0.50%  [kernel]  [k] memset_erms
|    0.42%  [kernel]  [k] do_raw_spin_unlock
|    0.42%  [kernel]  [k] xfs_perag_get
|    0.40%  [kernel]  [k] xfs_buf_find
|    0.39%  [kernel]  [k] __raw_spin_lock_init
| 
| Patched flat kernel profile:
| 
|   10.90%  [kernel]  [k] do_raw_spin_lock
|    7.21%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
|    3.16%  [kernel]  [k] xfs_buf_find
|    3.06%  [kernel]  [k] rcu_segcblist_enqueue
|    2.73%  [kernel]  [k] memset_erms
|    2.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
|    2.15%  [kernel]  [k] __raw_spin_lock_init
|    2.15%  [kernel]  [k] do_raw_spin_unlock
|    2.12%  [kernel]  [k] xfs_perag_get
|    1.93%  [kernel]  [k] xfs_btree_lookup
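
(For anyone not familiar with the dlist idea: conceptually it replaces the
single global list + lock with one list and lock per CPU, roughly along
these lines - a sketch built from generic kernel primitives, not Waiman
Long's actual code:

	struct pcpu_list {
		spinlock_t		lock;
		struct list_head	head;
	} ____cacheline_aligned_in_smp;

	static DEFINE_PER_CPU(struct pcpu_list, sb_inode_lists);

	/* add to whichever CPU's list we're running on; remember the list
	 * so the node can be removed from the right one later */
	static void pcpu_list_add(struct list_head *node, struct pcpu_list **owner)
	{
		struct pcpu_list *l = get_cpu_ptr(&sb_inode_lists);

		spin_lock(&l->lock);
		list_add(node, &l->head);
		*owner = l;
		spin_unlock(&l->lock);
		put_cpu_ptr(&sb_inode_lists);
	}

with iteration walking all the per-CPU lists; lock and list head
initialisation at boot is omitted here.)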

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-16 21:47             ` Matthew Wilcox
  2023-05-16 21:57               ` Kent Overstreet
@ 2023-05-17  5:28               ` Kent Overstreet
  2023-05-17 14:04                 ` Mike Rapoport
  1 sibling, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-17  5:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kees Cook, Johannes Thumshirn, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, linux-hardening

On Tue, May 16, 2023 at 10:47:13PM +0100, Matthew Wilcox wrote:
> On Tue, May 16, 2023 at 05:20:33PM -0400, Kent Overstreet wrote:
> > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> > > For something that small, why not use the text_poke API?
> > 
> > This looks like it's meant for patching existing kernel text, which
> > isn't what I want - I'm generating new functions on the fly, one per
> > btree node.
> > 
> > I'm working up a new allocator - a (very simple) slab allocator where
> > you pass a buffer, and it gives you a copy of that buffer mapped
> > executable, but not writeable.
> > 
> > It looks like we'll be able to convert bpf, kprobes, and ftrace
> > trampolines to it; it'll consolidate a fair amount of code (particularly
> > in bpf), and they won't have to burn a full page per allocation anymore.
> > 
> > bpf has a neat trick where it maps the same page in two different
> > locations, one is the executable location and the other is the writeable
> > location - I'm stealing that.
> 
> How does that avoid the problem of being able to construct an arbitrary
> gadget that somebody else will then execute?  IOW, what bpf has done
> seems like it's working around & undoing the security improvements.
> 
> I suppose it's an improvement that only the executable address is
> passed back to the caller, and not the writable address.

Ok, here's what I came up with. Have not tested all corner cases, still
need to write docs - but I think this gives us a nicer interface than
what bpf/kprobes/etc. have been doing, and it does the sub-page sized
allocations I need.

With an additional tweak to module_alloc() (not done in this patch yet)
we avoid ever mapping in pages both writeable and executable:

-->--

From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001
From: Kent Overstreet <kent.overstreet@linux.dev>
Date: Wed, 17 May 2023 01:22:06 -0400
Subject: [PATCH] mm: jit/text allocator

This provides a new, very simple slab allocator for jit/text, i.e. bpf,
ftrace trampolines, or bcachefs unpack functions.

With this API we can avoid ever mapping pages both writeable and
executable (not implemented in this patch: need to tweak
module_alloc()), and it also supports sub-page sized allocations.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
new file mode 100644
index 0000000000..f1549d60e8
--- /dev/null
+++ b/include/linux/jitalloc.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_JITALLOC_H
+#define _LINUX_JITALLOC_H
+
+void jit_update(void *buf, void *new_buf, size_t len);
+void jit_free(void *buf);
+void *jit_alloc(void *buf, size_t len);
+
+#endif /* _LINUX_JITALLOC_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 4751031f3f..ff26a4f0c9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1202,6 +1202,9 @@ config LRU_GEN_STATS
 	  This option has a per-memcg and per-node memory overhead.
 # }
 
+config JITALLOC
+	bool
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index c03e1e5859..25e82db9e8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_JITALLOC) += jitalloc.o
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
new file mode 100644
index 0000000000..7c4d621802
--- /dev/null
+++ b/mm/jitalloc.c
@@ -0,0 +1,187 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/gfp.h>
+#include <linux/highmem.h>
+#include <linux/jitalloc.h>
+#include <linux/mm.h>
+#include <linux/moduleloader.h>
+#include <linux/mutex.h>
+#include <linux/set_memory.h>
+#include <linux/vmalloc.h>
+
+#include <asm/text-patching.h>
+
+static DEFINE_MUTEX(jit_alloc_lock);
+
+struct jit_cache {
+	unsigned		obj_size_bits;
+	unsigned		objs_per_slab;
+	struct list_head	partial;
+};
+
+#define JITALLOC_MIN_SIZE	16
+#define NR_JIT_CACHES		ilog2(PAGE_SIZE / JITALLOC_MIN_SIZE)
+
+static struct jit_cache jit_caches[NR_JIT_CACHES];
+
+struct jit_slab {
+	unsigned long		__page_flags;
+
+	struct jit_cache	*cache;
+	void			*executably_mapped;
+	unsigned long		*objs_allocated; /* bitmap of allocated objects */
+	struct list_head	list;
+};
+
+#define folio_jit_slab(folio)		(_Generic((folio),			\
+	const struct folio *:		(const struct jit_slab *)(folio),	\
+	struct folio *:			(struct jit_slab *)(folio)))
+
+#define jit_slab_folio(s)		(_Generic((s),				\
+	const struct jit_slab *:	(const struct folio *)s,		\
+	struct jit_slab *:		(struct folio *)s))
+
+static struct jit_slab *jit_slab_alloc(struct jit_cache *cache)
+{
+	void *executably_mapped = module_alloc(PAGE_SIZE);
+	struct page *page;
+	struct folio *folio;
+	struct jit_slab *slab;
+	unsigned long *objs_allocated;
+
+	if (!executably_mapped)
+		return NULL;
+
+	objs_allocated = kcalloc(BITS_TO_LONGS(cache->objs_per_slab), sizeof(unsigned long), GFP_KERNEL);
+	if (!objs_allocated ) {
+		vfree(executably_mapped);
+		return NULL;
+	}
+
+	set_vm_flush_reset_perms(executably_mapped);
+	set_memory_rox((unsigned long) executably_mapped, 1);
+
+	page = vmalloc_to_page(executably_mapped);
+	folio = page_folio(page);
+
+	__folio_set_slab(folio);
+	slab			= folio_jit_slab(folio);
+	slab->cache		= cache;
+	slab->executably_mapped	= executably_mapped;
+	slab->objs_allocated = objs_allocated;
+	INIT_LIST_HEAD(&slab->list);
+
+	return slab;
+}
+
+static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache)
+{
+	struct jit_slab *s =
+		list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?:
+		jit_slab_alloc(cache);
+	unsigned obj_idx, nr_allocated;
+
+	if (!s)
+		return NULL;
+
+	obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab);
+
+	BUG_ON(obj_idx >= cache->objs_per_slab);
+	__set_bit(obj_idx, s->objs_allocated);
+
+	nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
+
+	if (nr_allocated == s->cache->objs_per_slab) {
+		list_del_init(&s->list);
+	} else if (nr_allocated == 1) {
+		list_del(&s->list);
+		list_add(&s->list, &s->cache->partial);
+	}
+
+	return s->executably_mapped + (obj_idx << cache->obj_size_bits);
+}
+
+void jit_update(void *buf, void *new_buf, size_t len)
+{
+	text_poke_copy(buf, new_buf, len);
+}
+EXPORT_SYMBOL_GPL(jit_update);
+
+void jit_free(void *buf)
+{
+	struct page *page;
+	struct folio *folio;
+	struct jit_slab *s;
+	unsigned obj_idx, nr_allocated;
+	size_t offset;
+
+	if (!buf)
+		return;
+
+	page	= vmalloc_to_page(buf);
+	folio	= page_folio(page);
+	offset	= offset_in_folio(folio, buf);
+
+	if (!folio_test_slab(folio)) {
+		vfree(buf);
+		return;
+	}
+
+	s = folio_jit_slab(folio);
+
+	mutex_lock(&jit_alloc_lock);
+	obj_idx = offset >> s->cache->obj_size_bits;
+
+	__clear_bit(obj_idx, s->objs_allocated);
+
+	nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
+
+	if (nr_allocated == 0) {
+		list_del(&s->list);
+		kfree(s->objs_allocated);
+		folio_put(folio);
+	} else if (nr_allocated + 1 == s->cache->objs_per_slab) {
+		list_del(&s->list);
+		list_add(&s->list, &s->cache->partial);
+	}
+
+	mutex_unlock(&jit_alloc_lock);
+}
+EXPORT_SYMBOL_GPL(jit_free);
+
+void *jit_alloc(void *buf, size_t len)
+{
+	unsigned jit_cache_idx = ilog2(roundup_pow_of_two(len) / 16);
+	void *p;
+
+	if (jit_cache_idx < NR_JIT_CACHES) {
+		mutex_lock(&jit_alloc_lock);
+		p = jit_cache_alloc(buf, len, &jit_caches[jit_cache_idx]);
+		mutex_unlock(&jit_alloc_lock);
+	} else {
+		p = module_alloc(len);
+		if (p) {
+			set_vm_flush_reset_perms(p);
+			set_memory_rox((unsigned long) p, DIV_ROUND_UP(len, PAGE_SIZE));
+		}
+	}
+
+	if (p && buf)
+		jit_update(p, buf, len);
+
+	return p;
+}
+EXPORT_SYMBOL_GPL(jit_alloc);
+
+static int __init jit_alloc_init(void)
+{
+	for (unsigned i = 0; i < ARRAY_SIZE(jit_caches); i++) {
+		jit_caches[i].obj_size_bits	= ilog2(JITALLOC_MIN_SIZE) + i;
+		jit_caches[i].objs_per_slab	= PAGE_SIZE >> jit_caches[i].obj_size_bits;
+
+		INIT_LIST_HEAD(&jit_caches[i].partial);
+	}
+
+	return 0;
+}
+core_initcall(jit_alloc_init);

^ permalink raw reply related	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-17  5:28               ` Kent Overstreet
@ 2023-05-17 14:04                 ` Mike Rapoport
  2023-05-17 14:18                   ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Mike Rapoport @ 2023-05-17 14:04 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Kees Cook, Johannes Thumshirn, linux-kernel,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, linux-hardening

On Wed, May 17, 2023 at 01:28:43AM -0400, Kent Overstreet wrote:
> On Tue, May 16, 2023 at 10:47:13PM +0100, Matthew Wilcox wrote:
> > On Tue, May 16, 2023 at 05:20:33PM -0400, Kent Overstreet wrote:
> > > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> > > > For something that small, why not use the text_poke API?
> > > 
> > > This looks like it's meant for patching existing kernel text, which
> > > isn't what I want - I'm generating new functions on the fly, one per
> > > btree node.
> > > 
> > > I'm working up a new allocator - a (very simple) slab allocator where
> > > you pass a buffer, and it gives you a copy of that buffer mapped
> > > executable, but not writeable.
> > > 
> > > It looks like we'll be able to convert bpf, kprobes, and ftrace
> > > trampolines to it; it'll consolidate a fair amount of code (particularly
> > > in bpf), and they won't have to burn a full page per allocation anymore.
> > > 
> > > bpf has a neat trick where it maps the same page in two different
> > > locations, one is the executable location and the other is the writeable
> > > location - I'm stealing that.
> > 
> > How does that avoid the problem of being able to construct an arbitrary
> > gadget that somebody else will then execute?  IOW, what bpf has done
> > seems like it's working around & undoing the security improvements.
> > 
> > I suppose it's an improvement that only the executable address is
> > passed back to the caller, and not the writable address.
> 
> Ok, here's what I came up with. Have not tested all corner cases, still
> need to write docs - but I think this gives us a nicer interface than
> what bpf/kprobes/etc. have been doing, and it does the sub-page sized
> allocations I need.
> 
> With an additional tweak to module_alloc() (not done in this patch yet)
> we avoid ever mapping in pages both writeable and executable:
> 
> -->--
> 
> From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001
> From: Kent Overstreet <kent.overstreet@linux.dev>
> Date: Wed, 17 May 2023 01:22:06 -0400
> Subject: [PATCH] mm: jit/text allocator
> 
> This provides a new, very simple slab allocator for jit/text, i.e. bpf,
> ftrace trampolines, or bcachefs unpack functions.
> 
> With this API we can avoid ever mapping pages both writeable and
> executable (not implemented in this patch: need to tweak
> module_alloc()), and it also supports sub-page sized allocations.

This looks like yet another workaround for the fact that module_alloc() was
not designed to handle permission changes. Rather than create more and more
wrappers for module_alloc(), we need a core API for code allocation,
apparently on top of vmalloc, and then use that API for modules, bpf,
tracing and whatnot.

There was quite lengthy discussion about how to handle code allocations
here:

https://lore.kernel.org/linux-mm/20221107223921.3451913-1-song@kernel.org/
 
and Song is already working on improvements for module_alloc(), e.g. see
commit ac3b43283923 ("module: replace module_layout with module_memory")

Another thing, the code below will not even compile on !x86.

> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> 
> diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
> new file mode 100644
> index 0000000000..f1549d60e8
> --- /dev/null
> +++ b/include/linux/jitalloc.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_JITALLOC_H
> +#define _LINUX_JITALLOC_H
> +
> +void jit_update(void *buf, void *new_buf, size_t len);
> +void jit_free(void *buf);
> +void *jit_alloc(void *buf, size_t len);
> +
> +#endif /* _LINUX_JITALLOC_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 4751031f3f..ff26a4f0c9 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1202,6 +1202,9 @@ config LRU_GEN_STATS
>  	  This option has a per-memcg and per-node memory overhead.
>  # }
>  
> +config JITALLOC
> +	bool
> +
>  source "mm/damon/Kconfig"
>  
>  endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index c03e1e5859..25e82db9e8 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
>  obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
>  obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
>  obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
> +obj-$(CONFIG_JITALLOC) += jitalloc.o
> diff --git a/mm/jitalloc.c b/mm/jitalloc.c
> new file mode 100644
> index 0000000000..7c4d621802
> --- /dev/null
> +++ b/mm/jitalloc.c
> @@ -0,0 +1,187 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/gfp.h>
> +#include <linux/highmem.h>
> +#include <linux/jitalloc.h>
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/mutex.h>
> +#include <linux/set_memory.h>
> +#include <linux/vmalloc.h>
> +
> +#include <asm/text-patching.h>
> +
> +static DEFINE_MUTEX(jit_alloc_lock);
> +
> +struct jit_cache {
> +	unsigned		obj_size_bits;
> +	unsigned		objs_per_slab;
> +	struct list_head	partial;
> +};
> +
> +#define JITALLOC_MIN_SIZE	16
> +#define NR_JIT_CACHES		ilog2(PAGE_SIZE / JITALLOC_MIN_SIZE)
> +
> +static struct jit_cache jit_caches[NR_JIT_CACHES];
> +
> +struct jit_slab {
> +	unsigned long		__page_flags;
> +
> +	struct jit_cache	*cache;
> +	void			*executably_mapped;
> +	unsigned long		*objs_allocated; /* bitmap of allocated objects */
> +	struct list_head	list;
> +};
> +
> +#define folio_jit_slab(folio)		(_Generic((folio),			\
> +	const struct folio *:		(const struct jit_slab *)(folio),	\
> +	struct folio *:			(struct jit_slab *)(folio)))
> +
> +#define jit_slab_folio(s)		(_Generic((s),				\
> +	const struct jit_slab *:	(const struct folio *)s,		\
> +	struct jit_slab *:		(struct folio *)s))
> +
> +static struct jit_slab *jit_slab_alloc(struct jit_cache *cache)
> +{
> +	void *executably_mapped = module_alloc(PAGE_SIZE);
> +	struct page *page;
> +	struct folio *folio;
> +	struct jit_slab *slab;
> +	unsigned long *objs_allocated;
> +
> +	if (!executably_mapped)
> +		return NULL;
> +
> +	objs_allocated = kcalloc(BITS_TO_LONGS(cache->objs_per_slab), sizeof(unsigned long), GFP_KERNEL);
> +	if (!objs_allocated ) {
> +		vfree(executably_mapped);
> +		return NULL;
> +	}
> +
> +	set_vm_flush_reset_perms(executably_mapped);
> +	set_memory_rox((unsigned long) executably_mapped, 1);
> +
> +	page = vmalloc_to_page(executably_mapped);
> +	folio = page_folio(page);
> +
> +	__folio_set_slab(folio);
> +	slab			= folio_jit_slab(folio);
> +	slab->cache		= cache;
> +	slab->executably_mapped	= executably_mapped;
> +	slab->objs_allocated = objs_allocated;
> +	INIT_LIST_HEAD(&slab->list);
> +
> +	return slab;
> +}
> +
> +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache)
> +{
> +	struct jit_slab *s =
> +		list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?:
> +		jit_slab_alloc(cache);
> +	unsigned obj_idx, nr_allocated;
> +
> +	if (!s)
> +		return NULL;
> +
> +	obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab);
> +
> +	BUG_ON(obj_idx >= cache->objs_per_slab);
> +	__set_bit(obj_idx, s->objs_allocated);
> +
> +	nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
> +
> +	if (nr_allocated == s->cache->objs_per_slab) {
> +		list_del_init(&s->list);
> +	} else if (nr_allocated == 1) {
> +		list_del(&s->list);
> +		list_add(&s->list, &s->cache->partial);
> +	}
> +
> +	return s->executably_mapped + (obj_idx << cache->obj_size_bits);
> +}
> +
> +void jit_update(void *buf, void *new_buf, size_t len)
> +{
> +	text_poke_copy(buf, new_buf, len);
> +}
> +EXPORT_SYMBOL_GPL(jit_update);
> +
> +void jit_free(void *buf)
> +{
> +	struct page *page;
> +	struct folio *folio;
> +	struct jit_slab *s;
> +	unsigned obj_idx, nr_allocated;
> +	size_t offset;
> +
> +	if (!buf)
> +		return;
> +
> +	page	= vmalloc_to_page(buf);
> +	folio	= page_folio(page);
> +	offset	= offset_in_folio(folio, buf);
> +
> +	if (!folio_test_slab(folio)) {
> +		vfree(buf);
> +		return;
> +	}
> +
> +	s = folio_jit_slab(folio);
> +
> +	mutex_lock(&jit_alloc_lock);
> +	obj_idx = offset >> s->cache->obj_size_bits;
> +
> +	__clear_bit(obj_idx, s->objs_allocated);
> +
> +	nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
> +
> +	if (nr_allocated == 0) {
> +		list_del(&s->list);
> +		kfree(s->objs_allocated);
> +		folio_put(folio);
> +	} else if (nr_allocated + 1 == s->cache->objs_per_slab) {
> +		list_del(&s->list);
> +		list_add(&s->list, &s->cache->partial);
> +	}
> +
> +	mutex_unlock(&jit_alloc_lock);
> +}
> +EXPORT_SYMBOL_GPL(jit_free);
> +
> +void *jit_alloc(void *buf, size_t len)
> +{
> +	unsigned jit_cache_idx = ilog2(roundup_pow_of_two(len) / 16);
> +	void *p;
> +
> +	if (jit_cache_idx < NR_JIT_CACHES) {
> +		mutex_lock(&jit_alloc_lock);
> +		p = jit_cache_alloc(buf, len, &jit_caches[jit_cache_idx]);
> +		mutex_unlock(&jit_alloc_lock);
> +	} else {
> +		p = module_alloc(len);
> +		if (p) {
> +			set_vm_flush_reset_perms(p);
> +			set_memory_rox((unsigned long) p, DIV_ROUND_UP(len, PAGE_SIZE));
> +		}
> +	}
> +
> +	if (p && buf)
> +		jit_update(p, buf, len);
> +
> +	return p;
> +}
> +EXPORT_SYMBOL_GPL(jit_alloc);
> +
> +static int __init jit_alloc_init(void)
> +{
> +	for (unsigned i = 0; i < ARRAY_SIZE(jit_caches); i++) {
> +		jit_caches[i].obj_size_bits	= ilog2(JITALLOC_MIN_SIZE) + i;
> +		jit_caches[i].objs_per_slab	= PAGE_SIZE >> jit_caches[i].obj_size_bits;
> +
> +		INIT_LIST_HEAD(&jit_caches[i].partial);
> +	}
> +
> +	return 0;
> +}
> +core_initcall(jit_alloc_init);
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-17 14:04                 ` Mike Rapoport
@ 2023-05-17 14:18                   ` Kent Overstreet
  2023-05-17 15:44                     ` Mike Rapoport
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-17 14:18 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Matthew Wilcox, Kees Cook, Johannes Thumshirn, linux-kernel,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, linux-hardening, song

On Wed, May 17, 2023 at 05:04:27PM +0300, Mike Rapoport wrote:
> On Wed, May 17, 2023 at 01:28:43AM -0400, Kent Overstreet wrote:
> > On Tue, May 16, 2023 at 10:47:13PM +0100, Matthew Wilcox wrote:
> > > On Tue, May 16, 2023 at 05:20:33PM -0400, Kent Overstreet wrote:
> > > > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> > > > > For something that small, why not use the text_poke API?
> > > > 
> > > > This looks like it's meant for patching existing kernel text, which
> > > > isn't what I want - I'm generating new functions on the fly, one per
> > > > btree node.
> > > > 
> > > > I'm working up a new allocator - a (very simple) slab allocator where
> > > > you pass a buffer, and it gives you a copy of that buffer mapped
> > > > executable, but not writeable.
> > > > 
> > > > It looks like we'll be able to convert bpf, kprobes, and ftrace
> > > > trampolines to it; it'll consolidate a fair amount of code (particularly
> > > > in bpf), and they won't have to burn a full page per allocation anymore.
> > > > 
> > > > bpf has a neat trick where it maps the same page in two different
> > > > locations, one is the executable location and the other is the writeable
> > > > location - I'm stealing that.
> > > 
> > > How does that avoid the problem of being able to construct an arbitrary
> > > gadget that somebody else will then execute?  IOW, what bpf has done
> > > seems like it's working around & undoing the security improvements.
> > > 
> > > I suppose it's an improvement that only the executable address is
> > > passed back to the caller, and not the writable address.
> > 
> > Ok, here's what I came up with. Have not tested all corner cases, still
> > need to write docs - but I think this gives us a nicer interface than
> > what bpf/kprobes/etc. have been doing, and it does the sub-page sized
> > allocations I need.
> > 
> > With an additional tweak to module_alloc() (not done in this patch yet)
> > we avoid ever mapping in pages both writeable and executable:
> > 
> > -->--
> > 
> > From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001
> > From: Kent Overstreet <kent.overstreet@linux.dev>
> > Date: Wed, 17 May 2023 01:22:06 -0400
> > Subject: [PATCH] mm: jit/text allocator
> > 
> > This provides a new, very simple slab allocator for jit/text, i.e. bpf,
> > ftrace trampolines, or bcachefs unpack functions.
> > 
> > With this API we can avoid ever mapping pages both writeable and
> > executable (not implemented in this patch: need to tweak
> > module_alloc()), and it also supports sub-page sized allocations.
> 
> This looks like yet another workaround for the fact that module_alloc() was
> not designed to handle permission changes. Rather than create more and more
> wrappers for module_alloc(), we need a core API for code allocation,
> apparently on top of vmalloc, and then use that API for modules, bpf,
> tracing and whatnot.
> 
> There was quite lengthy discussion about how to handle code allocations
> here:
> 
> https://lore.kernel.org/linux-mm/20221107223921.3451913-1-song@kernel.org/

Thanks for the link!

Added Song to the CC.

Song, I'm looking at your code now - switching to hugepages is great,
but I wonder if we might be able to combine our two approaches - with
the slab allocator I did, do we have to bother with VMAs at all? And
then it gets us sub-page sized allocations.

> and Song is already working on improvements for module_alloc(), e.g. see
> commit ac3b43283923 ("module: replace module_layout with module_memory")
> 
> Another thing, the code below will not even compile on !x86.

Due to text_poke(), which I see is abstracted better in that patchset.

I'm very curious why text_poke() does tlb flushing at all; it seems like
flush_icache_range() is actually what's needed?

text_poke() only touching up to two pages, without that being
documented, is also a footgun...

And I'm really curious why text_poke() is needed at all. Seems like we
could just use kmap_local() to create a temporary writeable mapping,
except in my testing that got me a RO mapping. Odd.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-17 14:18                   ` Kent Overstreet
@ 2023-05-17 15:44                     ` Mike Rapoport
  2023-05-17 15:59                       ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Mike Rapoport @ 2023-05-17 15:44 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Kees Cook, Johannes Thumshirn, linux-kernel,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, linux-hardening, song

On Wed, May 17, 2023 at 10:18:11AM -0400, Kent Overstreet wrote:
> On Wed, May 17, 2023 at 05:04:27PM +0300, Mike Rapoport wrote:
> 
> And I'm really curious why text_poke() is needed at all. Seems like we
> could just use kmap_local() to create a temporary writeable mapping,

On 64 bit kmap_local_page() is aliased to page_address() and does not map
anything. text_poke() is needed to actually create a temporary writable
mapping without touching page tables in vmalloc and/or direct map.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-17 15:44                     ` Mike Rapoport
@ 2023-05-17 15:59                       ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-17 15:59 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Matthew Wilcox, Kees Cook, Johannes Thumshirn, linux-kernel,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, linux-hardening, song

On Wed, May 17, 2023 at 06:44:12PM +0300, Mike Rapoport wrote:
> On Wed, May 17, 2023 at 10:18:11AM -0400, Kent Overstreet wrote:
> > On Wed, May 17, 2023 at 05:04:27PM +0300, Mike Rapoport wrote:
> > 
> > And I'm really curious why text_poke() is needed at all. Seems like we
> > could just use kmap_local() to create a temporary writeable mapping,
> 
> On 64 bit kmap_local_page() is aliased to page_address() and does not map
> anything. text_poke() is needed to actually create a temporary writable
> mapping without touching page tables in vmalloc and/or direct map.

Duh - thanks!

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-15  7:26                           ` Kent Overstreet
@ 2023-05-21 21:33                             ` Eric Biggers
  2023-05-21 22:04                               ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Eric Biggers @ 2023-05-21 21:33 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Mon, May 15, 2023 at 03:26:43AM -0400, Kent Overstreet wrote:
> On Mon, May 15, 2023 at 12:13:43AM -0700, Eric Biggers wrote:
> > Sure, given that this is an optimization problem with a very small scope
> > (decoding 6 fields from a bitstream), I was hoping for something easier and
> > faster to iterate on than setting up a full kernel + bcachefs test environment
> > and reverse engineering 500 lines of shell script.  But sure, I can look into
> > that when I have a chance.
> 
> If you were actually wanting to help, that repository is the tool I use
> for kernel development and testing - it's got documentation.
> 
> It builds a kernel, boots a VM and runs a test in about 15 seconds, no
> need for lifting that code out to userspace.
> 

FYI, I had a go with your test framework today, but I ran into too many problems
to really bother with it.  In case you want to improve it, these are the
problems I ran into (so far).  The command I was trying to run, after having run
'./root_image create' as root as the README says to do, was
'build-test-kernel run -I ~/src/ktest/tests/bcachefs/perf.ktest':

- Error due to ~/src/ktest/tests/bcachefs/bcachefs-tools not existing.  Worked
  around by cloning the bcachefs-tools repo to this location.  (Note, it's not a
  git submodule, so updating the git submodules didn't help.)

- Error due to "Root image not found".  Worked around by recursively chown'ing
  /var/lib/ktest from root to my user.  (Note, the README says to run
  'root_image create' as root, which results in root ownership.)

- Error due to "cannot set up guest memory 'pc.ram': Cannot allocate memory".
  Worked around by editing tests/bcachefs/perf.ktest to change config-mem from
  32G to 16G.  (I have 32G memory total on this computer.)

- Error due to "failed to open /dev/vfio/10: No such file or directory".
  Enabling CONFIG_VFIO and CONFIG_VFIO_PCI in my host kernel didn't help.  It
  seems the test is hardcoded to expect PCI passthrough to be set up with a
  specific device.  I'd have expected it to just set up a standard virtual disk.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-21 21:33                             ` Eric Biggers
@ 2023-05-21 22:04                               ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-21 22:04 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Sun, May 21, 2023 at 02:33:34PM -0700, Eric Biggers wrote:
> FYI, I had a go with your test framework today, but I ran into too many problems
> to really bother with it.  In case you want to improve it, these are the
> problems I ran into (so far).  The command I was trying to run, after having run
> './root_image create' as root as the README says to do, was
> 'build-test-kernel run -I ~/src/ktest/tests/bcachefs/perf.ktest':

Thanks for giving it a shot...

> - Error due to ~/src/ktest/tests/bcachefs/bcachefs-tools not existing.  Worked
>   around by cloning the bcachefs-tools repo to this location.  (Note, it's not a
>   git submodule, so updating the git submodules didn't help.)

a require-git line was missing, fixed that...

> - Error due to "Root image not found".  Worked around by recursively chown'ing
>   /var/lib/ktest from root to my user.  (Note, the README says to run
>   'root_image create' as root, which results in root ownership.)

Not sure about this one - root ownership is supposed to be fine because
qemu opens the root image read only; we use qemu's block device
in-memory snapshot mode. Was it just not readable by your user?

> - Error due to "cannot set up guest memory 'pc.ram': Cannot allocate memory".
>   Worked around by editing tests/bcachefs/perf.ktest to change config-mem from
>   32G to 16G.  (I have 32G memory total on this computer.)
 
I think 32G is excessive for the tests that actually need to be in this
file, dropping that back to 16G.

> - Error due to "failed to open /dev/vfio/10: No such file or directory".
>   Enabling CONFIG_VFIO and CONFIG_VFIO_PCI in my host kernel didn't help.  It
>   seems the test is hardcoded to expect PCI passthrough to be set up with a
>   specific device.  I'd have expected it to just set up a standard virtual disk.

Some of the tests in that file do need a fast device, but the tests
we're interested in do not - I'll split that up.

I just pushed fixes for everything except the root_image issue if you
want to give it another go.

Cheers,
Kent

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-05-16 23:15         ` Dave Chinner
@ 2023-05-22 13:04           ` Christian Brauner
  0 siblings, 0 replies; 186+ messages in thread
From: Christian Brauner @ 2023-05-22 13:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs,
	Dave Chinner, Alexander Viro

On Wed, May 17, 2023 at 09:15:34AM +1000, Dave Chinner wrote:
> On Tue, May 16, 2023 at 12:17:04PM -0400, Kent Overstreet wrote:
> > On Tue, May 16, 2023 at 05:45:19PM +0200, Christian Brauner wrote:
> > > On Wed, May 10, 2023 at 02:45:57PM +1000, Dave Chinner wrote:
> > > There's a bit of a backlog before I get around to looking at this but
> > > it'd be great if we'd have a few reviewers for this change.
> > 
> > It is well tested - it's been in the bcachefs tree for ages with zero
> > issues. I'm pulling it out of the bcachefs-prerequisites series though
> > since Dave's still got it in his tree, he's got a newer version with
> > better commit messages.
> > 
> > It's a significant performance boost on metadata heavy workloads for any
> > non-XFS filesystem, we should definitely get it in.
> 
> I've got an up to date vfs-scale tree here (6.4-rc1) but I have not
> been able to test it effectively right now because my local
> performance test server is broken. I'll do what I can on the old
> small machine that I have to validate it when I get time, but that
> might be a few weeks away....
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale
> 
> As it is, the inode hash-bl changes have zero impact on XFS because
> it has its own highly scalable lockless, sharded inode cache. So
> unless I'm explicitly testing ext4 or btrfs scalability (rare) it's
> not getting a lot of scalability exercise. It is being used by the
> root filesystems on all those test VMs, but that's about it...

I think there's a bunch of perf tests being run on -next. So we can
stuff it into a vfs.unstable.* branch and see what -next thinks of this
performance wise.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 20/32] vfs: factor out inode hash head calculation
  2023-05-09 16:56 ` [PATCH 20/32] vfs: factor out inode hash head calculation Kent Overstreet
@ 2023-05-23  9:27   ` Christian Brauner
  2023-05-23 22:53     ` Dave Chinner
  0 siblings, 1 reply; 186+ messages in thread
From: Christian Brauner @ 2023-05-23  9:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Alexander Viro

On Tue, 09 May 2023 12:56:45 -0400, Kent Overstreet wrote:
> In preparation for changing the inode hash table implementation.
>
>

This is interesting completely independent of bcachefs so we should give
it some testing.

---

Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git tree.
Patches in the vfs.unstable.inode-hash branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.unstable.inode-hash

[20/32] vfs: factor out inode hash head calculation
        https://git.kernel.org/vfs/vfs/c/b54a4516146d

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 21/32] hlist-bl: add hlist_bl_fake()
  2023-05-09 16:56 ` [PATCH 21/32] hlist-bl: add hlist_bl_fake() Kent Overstreet
  2023-05-10  4:48   ` Dave Chinner
@ 2023-05-23  9:27   ` Christian Brauner
  1 sibling, 0 replies; 186+ messages in thread
From: Christian Brauner @ 2023-05-23  9:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet

On Tue, 09 May 2023 12:56:46 -0400, Kent Overstreet wrote:
> In preparation for switching the VFS inode cache over to hlist_bl
> lists, we need to be able to fake a list node that looks like it is
> hashed, for correct operation of filesystems that don't directly use
> the VFS inode cache.
> 
> 

This is interesting completely independent of bcachefs so we should give
it some testing.

---

Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git tree.
Patches in the vfs.unstable.inode-hash branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.unstable.inode-hash

[21/32] hlist-bl: add hlist_bl_fake()
        https://git.kernel.org/vfs/vfs/c/0ef99590b01f

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-05-09 16:56 ` [PATCH 22/32] vfs: inode cache conversion to hash-bl Kent Overstreet
  2023-05-10  4:45   ` Dave Chinner
@ 2023-05-23  9:28   ` Christian Brauner
  2023-10-19 15:30     ` Mateusz Guzik
  1 sibling, 1 reply; 186+ messages in thread
From: Christian Brauner @ 2023-05-23  9:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Alexander Viro

On Tue, 09 May 2023 12:56:47 -0400, Kent Overstreet wrote:
> Because scalability of the global inode_hash_lock really, really
> sucks.
> 
> 32-way concurrent create on a couple of different filesystems
> before:
> 
> -   52.13%     0.04%  [kernel]            [k] ext4_create
>    - 52.09% ext4_create
>       - 41.03% __ext4_new_inode
>          - 29.92% insert_inode_locked
>             - 25.35% _raw_spin_lock
>                - do_raw_spin_lock
>                   - 24.97% __pv_queued_spin_lock_slowpath
> 
> [...]

This is interesting completely independent of bcachefs so we should give
it some testing.

I updated a few places that had outdated comments.

---

Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git tree.
Patches in the vfs.unstable.inode-hash branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.unstable.inode-hash

[22/32] vfs: inode cache conversion to hash-bl
        https://git.kernel.org/vfs/vfs/c/e3e92d47e6b1

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-10  6:18     ` Kent Overstreet
@ 2023-05-23 13:34       ` Jan Kara
  2023-05-23 16:21         ` [Cluster-devel] " Christoph Hellwig
                           ` (2 more replies)
  0 siblings, 3 replies; 186+ messages in thread
From: Jan Kara @ 2023-05-23 13:34 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Darrick J . Wong, dhowells, Andreas Gruenbacher,
	cluster-devel, Bob Peterson

On Wed 10-05-23 02:18:45, Kent Overstreet wrote:
> On Wed, May 10, 2023 at 03:07:37AM +0200, Jan Kara wrote:
> > On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> > > From: Kent Overstreet <kent.overstreet@gmail.com>
> > > 
> > > This is used by bcachefs to fix a page cache coherency issue with
> > > O_DIRECT writes.
> > > 
> > > Also relevant: mapping->invalidate_lock, see below.
> > > 
> > > O_DIRECT writes (and other filesystem operations that modify file data
> > > while bypassing the page cache) need to shoot down ranges of the page
> > > cache - and additionally, need locking to prevent those pages from being
> > > pulled back in.
> > > 
> > > But O_DIRECT writes invoke the page fault handler (via get_user_pages),
> > > and the page fault handler will need to take that same lock - this is a
> > > classic recursive deadlock if userspace has mmaped the file they're DIO
> > > writing to and uses those pages for the buffer to write from, and it's a
> > > lock ordering deadlock in general.
> > > 
> > > Thus we need a way to signal from the dio code to the page fault handler
> > > when we already are holding the pagecache add lock on an address space -
> > > this patch just adds a member to task_struct for this purpose. For now
> > > only bcachefs is implementing this locking, though it may be moved out
> > > of bcachefs and made available to other filesystems in the future.
> > 
> > It would be nice to have at least a link to the code that's actually using
> > the field you are adding.
> 
> Bit of a trick to link to a _later_ patch in the series from a commit
> message, but...
> 
> https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n975
> https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n2454

Thanks and I'm sorry for the delay.

> > Also I think we were already through this discussion [1] and we ended up
> > agreeing that your scheme actually solves only the AA deadlock but a
> > malicious userspace can easily create AB BA deadlock by running direct IO
> > to file A using mapped file B as a buffer *and* direct IO to file B using
> > mapped file A as a buffer.
> 
> No, that's definitely handled (and you can see it in the code I linked),
> and I wrote a torture test for fstests as well.

I've checked the code and AFAICT it is all indeed handled. BTW, I've now
remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
way (by prefaulting pages from the iter before grabbing the problematic
lock and then disabling page faults for the iomap_dio_rw() call). I guess
we should somehow unify these schemes so that we don't have two mechanisms
for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
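
(For readers who haven't looked at the gfs2 code, that retry loop has roughly
the following shape - a simplified sketch of the idea in b01b2d72da25, not the
actual gfs2 code; fs_lock_inode(), fs_do_dio() and fs_unlock_inode() are
placeholders for the glock and iomap_dio_rw() plumbing:)

/*
 * Page faults are disabled across the direct IO, so gup() can't recurse
 * into the fault handler while the inode lock is held.  If the IO hits
 * -EFAULT, drop the lock, fault the user buffer in manually with no fs
 * locks held, and retry.
 */
static ssize_t dio_write_with_prefault(struct kiocb *iocb, struct iov_iter *from)
{
        ssize_t ret;

retry:
        ret = fs_lock_inode(iocb->ki_filp);     /* placeholder: glock/inode lock */
        if (ret)
                return ret;

        pagefault_disable();
        ret = fs_do_dio(iocb, from);            /* placeholder: iomap_dio_rw() */
        pagefault_enable();

        fs_unlock_inode(iocb->ki_filp);         /* placeholder */

        if (ret == -EFAULT && iov_iter_count(from)) {
                /* Retry only if at least part of the buffer could be faulted in. */
                if (fault_in_iov_iter_readable(from, iov_iter_count(from)) !=
                    iov_iter_count(from))
                        goto retry;
        }
        return ret;
}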

Also good that you've written a fstest for this, that is definitely a useful
addition, although I suspect GFS2 guys added a test for this not so long
ago when testing their stuff. Maybe they have a pointer handy?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-23 13:34       ` Jan Kara
@ 2023-05-23 16:21         ` Christoph Hellwig
  2023-05-23 16:35           ` Kent Overstreet
  2023-05-25 22:25           ` Andreas Grünbacher
  2023-05-23 16:49         ` Kent Overstreet
  2023-05-25 22:04         ` Andreas Grünbacher
  2 siblings, 2 replies; 186+ messages in thread
From: Christoph Hellwig @ 2023-05-23 16:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kent Overstreet, cluster-devel, Darrick J . Wong, linux-kernel,
	dhowells, linux-bcachefs, linux-fsdevel, Kent Overstreet

On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> way (by prefaulting pages from the iter before grabbing the problematic
> lock and then disabling page faults for the iomap_dio_rw() call). I guess
> we should somehow unify these schemes so that we don't have two mechanisms
> for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> 
> Also good that you've written a fstest for this, that is definitely a useful
> addition, although I suspect GFS2 guys added a test for this not so long
> ago when testing their stuff. Maybe they have a pointer handy?

generic/708 is the btrfs version of this.

But I think all of the file systems that have this deadlock are actually
fundamentally broken because they have a messed-up locking hierarchy
where page faults take the same lock that is held over the direct I/O
operation.  And the right thing is to fix this.  I have work in progress
for btrfs, and something similar should apply to gfs2, with the added
complication that it probably means a revision to their network
protocol.

I'm absolutely not in favour of adding workarounds for these kinds of locking
problems to the core kernel.  I already feel bad for allowing the
small workaround in iomap for btrfs, as just fixing the locking back
then would have avoided massive ratholing.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-23 16:21         ` [Cluster-devel] " Christoph Hellwig
@ 2023-05-23 16:35           ` Kent Overstreet
  2023-05-24  6:43             ` Christoph Hellwig
  2023-05-25 22:25           ` Andreas Grünbacher
  1 sibling, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-23 16:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, cluster-devel, Darrick J . Wong, linux-kernel,
	dhowells, linux-bcachefs, linux-fsdevel, Kent Overstreet

On Tue, May 23, 2023 at 09:21:56AM -0700, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > way (by prefaulting pages from the iter before grabbing the problematic
> > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > we should somehow unify these schemes so that we don't have two mechanisms
> > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> > 
> > Also good that you've written a fstest for this, that is definitely a useful
> > addition, although I suspect GFS2 guys added a test for this not so long
> > ago when testing their stuff. Maybe they have a pointer handy?
> 
> generic/708 is the btrfs version of this.
> 
> But I think all of the file systems that have this deadlock are actually
> fundamentally broken because they have a messed-up locking hierarchy
> where page faults take the same lock that is held over the direct I/O
> operation.  And the right thing is to fix this.  I have work in progress
> for btrfs, and something similar should apply to gfs2, with the added
> complication that it probably means a revision to their network
> protocol.

No, this is fundamentally because userspace controls the ordering of
locking because the buffer passed to dio can point into any address
space. You can't solve this by changing the locking hierarchy.

If you want to be able to have locking around adding things to the
pagecache so that things that bypass the pagecache can prevent
inconsistencies (and we do, the big one is fcollapse), and if you want
dio to be able to use that same locking (because otherwise dio will also
cause page cache inconsistency), this is the way to do it.
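
(To make the "userspace controls the ordering" point concrete, here's a
minimal sketch of the triggering pattern - plain syscalls, nothing filesystem
specific; whether it actually deadlocks depends on the filesystem's locking,
and the size/alignment choices are just illustrative:)

#define _GNU_SOURCE     /* O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * The O_DIRECT write to file A holds A's "block pagecache adds" lock,
 * while gup() on the buffer faults file B's pages back in, which wants
 * B's lock.  Run two of these with A and B swapped and you have the
 * classic ABBA; pass the same file twice and it's the recursive AA case.
 */
int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <file_a> <file_b>\n", argv[0]);
                return 1;
        }

        int fd_a = open(argv[1], O_RDWR | O_DIRECT);
        int fd_b = open(argv[2], O_RDWR);
        size_t len = 1 << 20;

        if (fd_a < 0 || fd_b < 0 || ftruncate(fd_b, len))
                return 1;

        /* The buffer for the direct write is a shared mapping of the other file. */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd_b, 0);
        if (buf == MAP_FAILED)
                return 1;

        ssize_t ret = pwrite(fd_a, buf, len, 0);        /* gup() faults fd_b's pages here */
        printf("pwrite returned %zd\n", ret);
        return 0;
}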

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-23 13:34       ` Jan Kara
  2023-05-23 16:21         ` [Cluster-devel] " Christoph Hellwig
@ 2023-05-23 16:49         ` Kent Overstreet
  2023-05-25  8:47           ` Jan Kara
  2023-05-25 22:04         ` Andreas Grünbacher
  2 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-23 16:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Darrick J . Wong, dhowells, Andreas Gruenbacher, cluster-devel,
	Bob Peterson

> > No, that's definitely handled (and you can see it in the code I linked),
> > and I wrote a torture test for fstests as well.
> 
> I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> way (by prefaulting pages from the iter before grabbing the problematic
> lock and then disabling page faults for the iomap_dio_rw() call). I guess
> we should somehow unify these schemes so that we don't have two mechanisms
> for avoiding exactly the same deadlock. Adding GFS2 guys to CC.

Oof, that sounds a bit sketchy. What happens if the dio call passes in
an address from the same address space? What happens if we race with the
pages we faulted in being evicted?

> Also good that you've written a fstest for this, that is definitely a useful
> addition, although I suspect GFS2 guys added a test for this not so long
> ago when testing their stuff. Maybe they have a pointer handy?

More tests more good.

So if we want to lift this scheme to the VFS layer, we'd start by
replacing the lock you added (grepping for it, the name escapes me) with
a different type of lock - two_state_shared_lock in my code, it's like a
rw lock except writers don't exclude other writers. That way the DIO
path can use it without singlethreading writes to a single file.
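
(For anyone who hasn't read the bcachefs code: the idea is a lock with two
"directions" instead of readers and writers - holders of the same direction
share it, the two directions exclude each other. A minimal userspace sketch of
the concept follows; it is not the bcachefs implementation, and it has no
fairness - see Jan's reply further down the thread:)

#include <pthread.h>

/*
 * Toy two-state shared lock: state 0 and state 1 each admit any number
 * of concurrent holders, but the two states exclude each other.  For
 * the DIO case, one state is "page faults / buffered IO adding pages",
 * the other is "direct IO blocking pagecache adds", so parallel DIOs to
 * the same file don't serialize against each other.
 */
struct two_state_lock {
        pthread_mutex_t lock;
        pthread_cond_t  wait;
        int             state;          /* which side currently holds it */
        int             holders;        /* how many holders of that state */
};

#define TWO_STATE_LOCK_INIT \
        { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

static void two_state_lock(struct two_state_lock *l, int state)
{
        pthread_mutex_lock(&l->lock);
        /* Wait until the lock is free, or already held in our state. */
        while (l->holders && l->state != state)
                pthread_cond_wait(&l->wait, &l->lock);
        l->state = state;
        l->holders++;
        pthread_mutex_unlock(&l->lock);
}

static void two_state_unlock(struct two_state_lock *l)
{
        pthread_mutex_lock(&l->lock);
        if (!--l->holders)
                pthread_cond_broadcast(&l->wait);
        pthread_mutex_unlock(&l->lock);
}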

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 20/32] vfs: factor out inode hash head calculation
  2023-05-23  9:27   ` (subset) " Christian Brauner
@ 2023-05-23 22:53     ` Dave Chinner
  2023-05-24  6:44       ` Christoph Hellwig
  0 siblings, 1 reply; 186+ messages in thread
From: Dave Chinner @ 2023-05-23 22:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Alexander Viro

On Tue, May 23, 2023 at 11:27:06AM +0200, Christian Brauner wrote:
> On Tue, 09 May 2023 12:56:45 -0400, Kent Overstreet wrote:
> > In preparation for changing the inode hash table implementation.
> >
> >
> 
> This is interesting completely independent of bcachefs so we should give
> it some testing.
> 
> ---
> 
> Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git tree.
> Patches in the vfs.unstable.inode-hash branch should appear in linux-next soon.
> 
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
> 
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
> 
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs.unstable.inode-hash
> 
> [20/32] vfs: factor out inode hash head calculation
>         https://git.kernel.org/vfs/vfs/c/b54a4516146d

Hi Christian - I suspect you should pull the latest version of these
patches from:

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale

The commit messages are more recent and complete, and I've been
testing the branch in all my test kernels since 6.4-rc1 without
issues.

There's also the dlist-lock stuff for avoiding s_inode_list_lock
contention in that branch. Once the global hash lock is removed,
the s_inode_list_lock is the only global lock in the inode
instantiation and reclaim paths. It nests inside the hash locks, so
all the contention is currently taken on the hash locks - remove the
global hash locks and we just contend on the next global cache
line and the workload doesn't go any faster.

i.e. to see the full benefit of the inode hash lock contention
reduction, we also need the sb->s_inode_list_lock contention to be
fixed....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-23 16:35           ` Kent Overstreet
@ 2023-05-24  6:43             ` Christoph Hellwig
  2023-05-24  8:09               ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Christoph Hellwig @ 2023-05-24  6:43 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, Jan Kara, cluster-devel, Darrick J . Wong,
	linux-kernel, dhowells, linux-bcachefs, linux-fsdevel,
	Kent Overstreet

On Tue, May 23, 2023 at 12:35:35PM -0400, Kent Overstreet wrote:
> No, this is fundamentally because userspace controls the ordering of
> locking because the buffer passed to dio can point into any address
> space. You can't solve this by changing the locking hierarchy.
> 
> If you want to be able to have locking around adding things to the
> pagecache so that things that bypass the pagecache can prevent
> inconsistencies (and we do, the big one is fcollapse), and if you want
> dio to be able to use that same locking (because otherwise dio will also
> cause page cache inconsistency), this is the way to do it.

Well, it seems like you are talking about something else than the
existing cases in gfs2 and btrfs, that is you want full consistency
between direct I/O and buffered I/O.  That's something nothing in the
kernel has ever provided, so I'd be curious why you think you need it
and want different semantics from everyone else?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 20/32] vfs: factor out inode hash head calculation
  2023-05-23 22:53     ` Dave Chinner
@ 2023-05-24  6:44       ` Christoph Hellwig
  2023-05-24  7:35         ` Dave Chinner
  0 siblings, 1 reply; 186+ messages in thread
From: Christoph Hellwig @ 2023-05-24  6:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, Dave Chinner, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Alexander Viro

On Wed, May 24, 2023 at 08:53:22AM +1000, Dave Chinner wrote:
> Hi Christian - I suspect you should pull the latest version of these
> patches from:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale
> 
> The commit messages are more recent and complete, and I've been
> testing the branch in all my test kernels since 6.4-rc1 without
> issues.

Can you please send the series to linux-fsdevel for review?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 20/32] vfs: factor out inode hash head calculation
  2023-05-24  6:44       ` Christoph Hellwig
@ 2023-05-24  7:35         ` Dave Chinner
  2023-05-24  8:31           ` Christian Brauner
  0 siblings, 1 reply; 186+ messages in thread
From: Dave Chinner @ 2023-05-24  7:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Dave Chinner, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Alexander Viro

On Tue, May 23, 2023 at 11:44:03PM -0700, Christoph Hellwig wrote:
> On Wed, May 24, 2023 at 08:53:22AM +1000, Dave Chinner wrote:
> > Hi Christian - I suspect you should pull the latest version of these
> > patches from:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale
> > 
> > The commit messages are more recent and complete, and I've been
> > testing the branch in all my test kernels since 6.4-rc1 without
> > issues.
> 
> Can you please send the series to linux-fsdevel for review?

When it gets back to the top of my priority pile. Last time I sent
it there was zero interest in reviewing it from fs/vfs developers
but it attracted lots of obnoxious shouting from some RTPREEMPT
people about using bit locks. If there's interest in getting it
merged, then I can add it to my backlog of stuff to do...

As it is, I'm buried layers deep right now, so I really have no
bandwidth to deal with this in the foreseeable future. The code is
there, it works just fine, if you want to push it through the
process of getting it merged, you're more than welcome to do so.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-24  6:43             ` Christoph Hellwig
@ 2023-05-24  8:09               ` Kent Overstreet
  2023-05-25  8:58                 ` Christoph Hellwig
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-24  8:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, cluster-devel, Darrick J . Wong, linux-kernel,
	dhowells, linux-bcachefs, linux-fsdevel, Kent Overstreet

On Tue, May 23, 2023 at 11:43:32PM -0700, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 12:35:35PM -0400, Kent Overstreet wrote:
> > No, this is fundamentally because userspace controls the ordering of
> > locking because the buffer passed to dio can point into any address
> > space. You can't solve this by changing the locking hierarchy.
> > 
> > If you want to be able to have locking around adding things to the
> > pagecache so that things that bypass the pagecache can prevent
> > inconsistencies (and we do, the big one is fcollapse), and if you want
> > dio to be able to use that same locking (because otherwise dio will also
> > cause page cache inconsistency), this is the way to do it.
> 
> Well, it seems like you are talking about something else than the
> existing cases in gfs2 and btrfs, that is you want full consistency
> between direct I/O and buffered I/O.  That's something nothing in the
> kernel has ever provided, so I'd be curious why you think you need it
> and want different semantics from everyone else?

Because I like code that is correct.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 20/32] vfs: factor out inode hash head calculation
  2023-05-24  7:35         ` Dave Chinner
@ 2023-05-24  8:31           ` Christian Brauner
  2023-05-24  8:41             ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Christian Brauner @ 2023-05-24  8:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Dave Chinner, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Alexander Viro

On Wed, May 24, 2023 at 05:35:02PM +1000, Dave Chinner wrote:
> On Tue, May 23, 2023 at 11:44:03PM -0700, Christoph Hellwig wrote:
> > On Wed, May 24, 2023 at 08:53:22AM +1000, Dave Chinner wrote:
> > > Hi Christian - I suspect you should pull the latest version of these
> > > patches from:
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale
> > > 
> > > The commit messages are more recent and complete, and I've been
> > > testing the branch in all my test kernels since 6.4-rc1 without
> > > issues.
> > 
> > Can you please send the series to linux-fsdevel for review?
> 
> When it gets back to the top of my priority pile. Last time I sent
> it there was zero interest in reviewing it from fs/vfs developers
> but it attracted lots of obnoxious shouting from some RTPREEMPT
> people about using bit locks. If there's interest in getting it

I think there is given that it seems to have nice perf gains.

> merged, then I can add it to my backlog of stuff to do...
> 
> As it is, I'm buried layers deep right now, so I really have no
> bandwidth to deal with this in the foreseeable future. The code is
> there, it works just fine, if you want to push it through the
> process of getting it merged, you're more than welcome to do so.

I'm here to help get more review done and pick stuff up. I won't be able
to do it without additional reviewers such as Christoph helping of
course as this isn't a one-man show.

Let's see if we can get this reviewed. If you have the bandwidth to send
it to fsdevel that'd be great. If it takes you a while to get back to it
then that's fine too.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 20/32] vfs: factor out inode hash head calculation
  2023-05-24  8:31           ` Christian Brauner
@ 2023-05-24  8:41             ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-24  8:41 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Dave Chinner, Christoph Hellwig, Dave Chinner, linux-kernel,
	linux-fsdevel, linux-bcachefs, Alexander Viro

On Wed, May 24, 2023 at 10:31:14AM +0200, Christian Brauner wrote:
> I'm here to help get more review done and pick stuff up. I won't be able
> to do it without additional reviewers such as Christoph helping of
> course as this isn't a one-man show.
> 
> Let's see if we can get this reviewed. If you have the bandwidth to send
> it to fsdevel that'd be great. If it takes you a while to get back to it
> then that's fine too.

These patches really should have my reviewed-by on them already; I
stared at them quite a bit (and fs/inode.c in general) a while back.

(I was attempting to convert fs/inode.c to an rhashtable up until I
realized the inode lifetime rules are completely insane, so when I saw
Dave's much simpler approach I was _more_ than happy to not have to
contemplate that mess anymore...)

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-23 16:49         ` Kent Overstreet
@ 2023-05-25  8:47           ` Jan Kara
  2023-05-25 21:36             ` Kent Overstreet
  2023-05-25 22:45             ` Andreas Grünbacher
  0 siblings, 2 replies; 186+ messages in thread
From: Jan Kara @ 2023-05-25  8:47 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Darrick J . Wong, dhowells, Andreas Gruenbacher,
	cluster-devel, Bob Peterson

On Tue 23-05-23 12:49:06, Kent Overstreet wrote:
> > > No, that's definitely handled (and you can see it in the code I linked),
> > > and I wrote a torture test for fstests as well.
> > 
> > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > way (by prefaulting pages from the iter before grabbing the problematic
> > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > we should somehow unify these schemes so that we don't have two mechanisms
> > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> 
> Oof, that sounds a bit sketchy. What happens if the dio call passes in
> an address from the same address space?

If we submit direct IO that uses mapped file F at offset O as a buffer for
direct IO from file F, offset O, it will currently livelock in an
indefinite retry loop. It should rather return error or fall back to
buffered IO. But that should be fixable. Andreas?

But if the buffer and direct IO range does not overlap, it will just
happily work - iomap_dio_rw() invalidates only the range direct IO is done
to.

> What happens if we race with the pages we faulted in being evicted?

We fault them in again and retry.

> > Also good that you've written a fstest for this, that is definitely a useful
> > addition, although I suspect GFS2 guys added a test for this not so long
> > ago when testing their stuff. Maybe they have a pointer handy?
> 
> More tests more good.
> 
> So if we want to lift this scheme to the VFS layer, we'd start by
> replacing the lock you added (grepping for it, the name escapes me) with
> a different type of lock - two_state_shared_lock in my code, it's like a
> rw lock except writers don't exclude other writers. That way the DIO
> path can use it without singlethreading writes to a single file.

Yes, I've noticed that you are introducing in bcachefs a lock with very
similar semantics to mapping->invalidate_lock, just with this special lock
type. What I'm kind of worried about with two_state_shared_lock as
implemented in bcachefs is the fairness. AFAICS so far if someone is e.g.
heavily faulting pages on a file, direct IO to that file can be starved
indefinitely. That is IMHO not a good thing and I would not like to use
this type of lock in VFS until this problem is resolved. But it should be
fixable e.g. by introducing some kind of deadline for a waiter after which
it will block acquisitions of the other lock state.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-24  8:09               ` Kent Overstreet
@ 2023-05-25  8:58                 ` Christoph Hellwig
  2023-05-25 20:50                   ` Kent Overstreet
  2023-05-25 21:40                   ` Kent Overstreet
  0 siblings, 2 replies; 186+ messages in thread
From: Christoph Hellwig @ 2023-05-25  8:58 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, Jan Kara, cluster-devel, Darrick J . Wong,
	linux-kernel, dhowells, linux-bcachefs, linux-fsdevel,
	Kent Overstreet

On Wed, May 24, 2023 at 04:09:02AM -0400, Kent Overstreet wrote:
> > Well, it seems like you are talking about something else than the
> > existing cases in gfs2 and btrfs, that is you want full consistency
> > between direct I/O and buffered I/O.  That's something nothing in the
> > kernel has ever provided, so I'd be curious why you think you need it
> > and want different semantics from everyone else?
> 
> Because I like code that is correct.

Well, start with explaining your definition of correctness, why everyone
else is "not correct", an how you can help fixing this correctness
problem in the existing kernel.  Thanks for your cooperation!

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-25  8:58                 ` Christoph Hellwig
@ 2023-05-25 20:50                   ` Kent Overstreet
  2023-05-26  8:06                     ` Christoph Hellwig
  2023-05-25 21:40                   ` Kent Overstreet
  1 sibling, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-05-25 20:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, cluster-devel, Darrick J . Wong, linux-kernel,
	dhowells, linux-bcachefs, linux-fsdevel, Kent Overstreet

On Thu, May 25, 2023 at 01:58:13AM -0700, Christoph Hellwig wrote:
> On Wed, May 24, 2023 at 04:09:02AM -0400, Kent Overstreet wrote:
> > > Well, it seems like you are talking about something else than the
> > > existing cases in gfs2 and btrfs, that is you want full consistency
> > > between direct I/O and buffered I/O.  That's something nothing in the
> > > kernel has ever provided, so I'd be curious why you think you need it
> > > and want different semantics from everyone else?
> > 
> > Because I like code that is correct.
> 
> Well, start with explaining your definition of correctness, why everyone
> else is "not correct", an how you can help fixing this correctness
> problem in the existing kernel.  Thanks for your cooperation!

A cache that isn't actually consistent is a _bug_. You're being
obsequious. And any time this has come up in previous discussions
(including at LSF), that was never up for debate; the only question has
been whether it was even possible to practically fix it.

The DIO code recognizes cache incoherency as something to be avoided by
shooting down the page cache both at the beginning of the IO _and again
at the end_.  That's the kind of obvious hackery for a race condition
that we would like to avoid.

Regarding the consequences of this kind of bug - stale data exposed to
userspace, possibly stale data overwriting a write we acked, and worse
any filesystem state that hangs off the page cache being inconsistent
with the data on disk.

And look, we've been over all this before, so I don't see what this adds
to the discussion.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-25  8:47           ` Jan Kara
@ 2023-05-25 21:36             ` Kent Overstreet
  2023-05-25 22:45             ` Andreas Grünbacher
  1 sibling, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-25 21:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Darrick J . Wong, dhowells, Andreas Gruenbacher, cluster-devel,
	Bob Peterson

On Thu, May 25, 2023 at 10:47:31AM +0200, Jan Kara wrote:
> If we submit direct IO that uses mapped file F at offset O as a buffer for
> direct IO from file F, offset O, it will currently livelock in an
> indefinite retry loop. It should rather return error or fall back to
> buffered IO. But that should be fixable. Andreas?
> 
> But if the buffer and direct IO range does not overlap, it will just
> happily work - iomap_dio_rw() invalidates only the range direct IO is done
> to.

*nod*

readahead triggered from the page fault path is another consideration.
No idea how that interacts with the gfs2 method; IIRC there's a hack in
the page fault path that says somewhere "we may be getting called via
gup(), don't invoke readahead".

We could potentially kill that hack if we lifted this to the VFS layer.

> 
> > What happens if we race with the pages we faulted in being evicted?
> 
> We fault them in again and retry.
> 
> > > Also good that you've written a fstest for this, that is definitely a useful
> > > addition, although I suspect GFS2 guys added a test for this not so long
> > > ago when testing their stuff. Maybe they have a pointer handy?
> > 
> > More tests more good.
> > 
> > So if we want to lift this scheme to the VFS layer, we'd start by
> > replacing the lock you added (grepping for it, the name escapes me) with
> > a different type of lock - two_state_shared_lock in my code, it's like a
> > rw lock except writers don't exclude other writers. That way the DIO
> > path can use it without singlethreading writes to a single file.
> 
> Yes, I've noticed that you are introducing in bcachefs a lock with very
> similar semantics to mapping->invalidate_lock, just with this special lock
> type. What I'm kind of worried about with two_state_shared_lock as
> implemented in bcachefs is the fairness. AFAICS so far if someone is e.g.
> heavily faulting pages on a file, direct IO to that file can be starved
> indefinitely. That is IMHO not a good thing and I would not like to use
> this type of lock in VFS until this problem is resolved. But it should be
> fixable e.g. by introducing some kind of deadline for a waiter after which
> it will block acquisitions of the other lock state.

Yeah, my two_state_shared_lock is definitely at the quick and dirty
prototype level, the implementation would need work. Lockdep support
would be another hard requirement.

The deadline might be a good idea, OTOH it'd want tuning. Maybe
something like what rwsem does where we block new read acquirers if
there's a writer waiting would work.
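
(In terms of the toy sketch earlier in the thread, that rwsem-style tweak
amounts to tracking waiters per state and refusing to let new acquirers pile
onto the currently-held state once the other side is waiting - again just an
illustration, not a proposal for the actual implementation:)

#include <pthread.h>

/* As the earlier sketch, but with per-state waiter counts for rough fairness. */
struct two_state_lock_fair {
        pthread_mutex_t lock;
        pthread_cond_t  wait;
        int             state;
        int             holders;
        int             waiters[2];
};

static void two_state_lock_fair(struct two_state_lock_fair *l, int state)
{
        pthread_mutex_lock(&l->lock);
        l->waiters[state]++;
        /*
         * Block if the other state holds the lock, or if the other state
         * has waiters - so a stream of page faults can't starve direct IO
         * indefinitely, and vice versa.  Whoever gets the mutex first after
         * the lock fully drains is let through, which guarantees progress.
         */
        while (l->holders && (l->state != state || l->waiters[!state]))
                pthread_cond_wait(&l->wait, &l->lock);
        l->waiters[state]--;
        l->state = state;
        l->holders++;
        pthread_mutex_unlock(&l->lock);
}

static void two_state_unlock_fair(struct two_state_lock_fair *l)
{
        pthread_mutex_lock(&l->lock);
        if (!--l->holders)
                pthread_cond_broadcast(&l->wait);
        pthread_mutex_unlock(&l->lock);
}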

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-25  8:58                 ` Christoph Hellwig
  2023-05-25 20:50                   ` Kent Overstreet
@ 2023-05-25 21:40                   ` Kent Overstreet
  1 sibling, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-25 21:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, cluster-devel, Darrick J . Wong, linux-kernel,
	dhowells, linux-bcachefs, linux-fsdevel, Kent Overstreet

On Thu, May 25, 2023 at 01:58:13AM -0700, Christoph Hellwig wrote:
> On Wed, May 24, 2023 at 04:09:02AM -0400, Kent Overstreet wrote:
> > > Well, it seems like you are talking about something else than the
> > > existing cases in gfs2 and btrfs, that is you want full consistency
> > > between direct I/O and buffered I/O.  That's something nothing in the
> > > kernel has ever provided, so I'd be curious why you think you need it
> > > and want different semantics from everyone else?
> > 
> > Because I like code that is correct.
> 
> Well, start with explaining your definition of correctness, why everyone
> else is "not correct", an how you can help fixing this correctness
> problem in the existing kernel.  Thanks for your cooperation!

BTW, if you wanted a more serious answer, just asking for the commit
message to be expanded would be a better way to ask...

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-23 13:34       ` Jan Kara
  2023-05-23 16:21         ` [Cluster-devel] " Christoph Hellwig
  2023-05-23 16:49         ` Kent Overstreet
@ 2023-05-25 22:04         ` Andreas Grünbacher
  2 siblings, 0 replies; 186+ messages in thread
From: Andreas Grünbacher @ 2023-05-25 22:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Darrick J . Wong, dhowells, Andreas Gruenbacher,
	cluster-devel, Bob Peterson

On Tue, May 23, 2023 at 3:37 PM Jan Kara <jack@suse.cz> wrote:
> On Wed 10-05-23 02:18:45, Kent Overstreet wrote:
> > On Wed, May 10, 2023 at 03:07:37AM +0200, Jan Kara wrote:
> > > On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> > > > From: Kent Overstreet <kent.overstreet@gmail.com>
> > > >
> > > > This is used by bcachefs to fix a page cache coherency issue with
> > > > O_DIRECT writes.
> > > >
> > > > Also relevant: mapping->invalidate_lock, see below.
> > > >
> > > > O_DIRECT writes (and other filesystem operations that modify file data
> > > > while bypassing the page cache) need to shoot down ranges of the page
> > > > cache - and additionally, need locking to prevent those pages from being
> > > > pulled back in.
> > > >
> > > > But O_DIRECT writes invoke the page fault handler (via get_user_pages),
> > > > and the page fault handler will need to take that same lock - this is a
> > > > classic recursive deadlock if userspace has mmaped the file they're DIO
> > > > writing to and uses those pages for the buffer to write from, and it's a
> > > > lock ordering deadlock in general.
> > > >
> > > > Thus we need a way to signal from the dio code to the page fault handler
> > > > when we already are holding the pagecache add lock on an address space -
> > > > this patch just adds a member to task_struct for this purpose. For now
> > > > only bcachefs is implementing this locking, though it may be moved out
> > > > of bcachefs and made available to other filesystems in the future.
> > >
> > > It would be nice to have at least a link to the code that's actually using
> > > the field you are adding.
> >
> > Bit of a trick to link to a _later_ patch in the series from a commit
> > message, but...
> >
> > https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n975
> > https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n2454
>
> Thanks and I'm sorry for the delay.
>
> > > Also I think we were already through this discussion [1] and we ended up
> > > agreeing that your scheme actually solves only the AA deadlock but a
> > > malicious userspace can easily create AB BA deadlock by running direct IO
> > > to file A using mapped file B as a buffer *and* direct IO to file B using
> > > mapped file A as a buffer.
> >
> > No, that's definitely handled (and you can see it in the code I linked),
> > and I wrote a torture test for fstests as well.
>
> I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> way (by prefaulting pages from the iter before grabbing the problematic
> lock and then disabling page faults for the iomap_dio_rw() call). I guess
> we should somehow unify these schemes so that we don't have two mechanisms
> for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
>
> Also good that you've written a fstest for this, that is definitely a useful
> addition, although I suspect GFS2 guys added a test for this not so long
> ago when testing their stuff. Maybe they have a pointer handy?

Ah yes, that's xfstests commit d3cbdabf ("generic: Test page faults
during read and write").

Thanks,
Andreas

>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-23 16:21         ` [Cluster-devel] " Christoph Hellwig
  2023-05-23 16:35           ` Kent Overstreet
@ 2023-05-25 22:25           ` Andreas Grünbacher
  2023-05-25 23:20             ` Kent Overstreet
  1 sibling, 1 reply; 186+ messages in thread
From: Andreas Grünbacher @ 2023-05-25 22:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Kent Overstreet, cluster-devel, Darrick J . Wong,
	linux-kernel, dhowells, linux-bcachefs, linux-fsdevel,
	Kent Overstreet

On Tue, May 23, 2023 at 6:28 PM Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > way (by prefaulting pages from the iter before grabbing the problematic
> > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > we should somehow unify these schemes so that we don't have two mechanisms
> > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> >
> > Also good that you've written a fstest for this, that is definitely a useful
> > addition, although I suspect GFS2 guys added a test for this not so long
> > ago when testing their stuff. Maybe they have a pointer handy?
>
> generic/708 is the btrfs version of this.
>
> But I think all of the file systems that have this deadlock are actually
> fundamentally broken because they have a messed-up locking hierarchy
> where page faults take the same lock that is held over the direct I/O
> operation.  And the right thing is to fix this.  I have work in progress
> for btrfs, and something similar should apply to gfs2, with the added
> complication that it probably means a revision to their network
> protocol.

We do disable page faults, and there can be deadlocks in page fault
handlers while no page faults are allowed.

I'm roughly aware of the locking hierarchy that other filesystems use,
and that's something we want to avoid because of two reasons: (1) it
would be an incompatible change, and (2) we want to avoid cluster-wide
locking operations as much as possible because they are very slow.

These kinds of locking conflicts are so rare in practice that the
theoretical inefficiency of having to retry the operation doesn't
matter.

> I'm absolutely not in favour of adding workarounds for these kinds of locking
> problems to the core kernel.  I already feel bad for allowing the
> small workaround in iomap for btrfs, as just fixing the locking back
> then would have avoided massive ratholing.

Please let me know when those btrfs changes are in a presentable shape ...

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-25  8:47           ` Jan Kara
  2023-05-25 21:36             ` Kent Overstreet
@ 2023-05-25 22:45             ` Andreas Grünbacher
  1 sibling, 0 replies; 186+ messages in thread
From: Andreas Grünbacher @ 2023-05-25 22:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Darrick J . Wong, dhowells, Andreas Gruenbacher,
	cluster-devel, Bob Peterson

On Thu, May 25, 2023 at 10:56 AM Jan Kara <jack@suse.cz> wrote:
> On Tue 23-05-23 12:49:06, Kent Overstreet wrote:
> > > > No, that's definitely handled (and you can see it in the code I linked),
> > > > and I wrote a torture test for fstests as well.
> > >
> > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > > way (by prefaulting pages from the iter before grabbing the problematic
> > > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > > we should somehow unify these schemes so that we don't have two mechanisms
> > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> >
> > Oof, that sounds a bit sketchy. What happens if the dio call passes in
> > an address from the same address space?
>
> If we submit direct IO that uses mapped file F at offset O as a buffer for
> direct IO from file F, offset O, it will currently livelock in an
> indefinite retry loop. It should rather return error or fall back to
> buffered IO. But that should be fixable. Andreas?

Yes, I guess. Thanks for the heads-up.

Andreas

> But if the buffer and direct IO range does not overlap, it will just
> happily work - iomap_dio_rw() invalidates only the range direct IO is done
> to.
>
> > What happens if we race with the pages we faulted in being evicted?
>
> We fault them in again and retry.
>
> > > Also good that you've written a fstest for this, that is definitely a useful
> > > addition, although I suspect GFS2 guys added a test for this not so long
> > > ago when testing their stuff. Maybe they have a pointer handy?
> >
> > More tests more good.
> >
> > So if we want to lift this scheme to the VFS layer, we'd start by
> > replacing the lock you added (grepping for it, the name escapes me) with
> > a different type of lock - two_state_shared_lock in my code, it's like a
> > rw lock except writers don't exclude other writers. That way the DIO
> > path can use it without singlethreading writes to a single file.
>
> Yes, I've noticed that you are introducing in bcachefs a lock with very
> similar semantics to mapping->invalidate_lock, just with this special lock
> type. What I'm kind of worried about with two_state_shared_lock as
> implemented in bcachefs is the fairness. AFAICS so far if someone is e.g.
> heavily faulting pages on a file, direct IO to that file can be starved
> indefinitely. That is IMHO not a good thing and I would not like to use
> this type of lock in VFS until this problem is resolved. But it should be
> fixable e.g. by introducing some kind of deadline for a waiter after which
> it will block acquisitions of the other lock state.
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-25 22:25           ` Andreas Grünbacher
@ 2023-05-25 23:20             ` Kent Overstreet
  2023-05-26  0:05               ` Andreas Grünbacher
  2023-05-26  8:10               ` Christoph Hellwig
  0 siblings, 2 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-25 23:20 UTC (permalink / raw)
  To: Andreas Grünbacher
  Cc: Christoph Hellwig, Jan Kara, cluster-devel, Darrick J . Wong,
	linux-kernel, dhowells, linux-bcachefs, linux-fsdevel,
	Kent Overstreet

On Fri, May 26, 2023 at 12:25:31AM +0200, Andreas Grünbacher wrote:
> On Tue, May 23, 2023 at 6:28 PM Christoph Hellwig <hch@infradead.org> wrote:
> > On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > > way (by prefaulting pages from the iter before grabbing the problematic
> > > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > > we should somehow unify these schemes so that we don't have two mechanisms
> > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> > >
> > > Also good that you've written a fstest for this, that is definitely a useful
> > > addition, although I suspect GFS2 guys added a test for this not so long
> > > ago when testing their stuff. Maybe they have a pointer handy?
> >
> > generic/708 is the btrfs version of this.
> >
> > But I think all of the file systems that have this deadlock are actually
> > fundamentally broken because they have a messed-up locking hierarchy
> > where page faults take the same lock that is held over the direct I/O
> > operation.  And the right thing is to fix this.  I have work in progress
> > for btrfs, and something similar should apply to gfs2, with the added
> > complication that it probably means a revision to their network
> > protocol.
> 
> We do disable page faults, and there can be deadlocks in page fault
> handlers while no page faults are allowed.
> 
> I'm roughly aware of the locking hierarchy that other filesystems use,
> and that's something we want to avoid because of two reasons: (1) it
> would be an incompatible change, and (2) we want to avoid cluster-wide
> locking operations as much as possible because they are very slow.
> 
> These kinds of locking conflicts are so rare in practice that the
> theoretical inefficiency of having to retry the operation doesn't
> matter.

Would you be willing to expand on that? I'm wondering if this would
simplify things for gfs2, but you mention locking hierarchy being an
incompatible change - how does that work?

> 
> > I'm absolutely not in favour of adding workarounds for these kinds of locking
> > problems to the core kernel.  I already feel bad for allowing the
> > small workaround in iomap for btrfs, as just fixing the locking back
> > then would have avoided massive ratholing.
> 
> Please let me know when those btrfs changes are in a presentable shape ...

I would also be curious to know what btrfs needs and what the approach
is there.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-25 23:20             ` Kent Overstreet
@ 2023-05-26  0:05               ` Andreas Grünbacher
  2023-05-26  0:39                 ` Kent Overstreet
  2023-05-26  8:10               ` Christoph Hellwig
  1 sibling, 1 reply; 186+ messages in thread
From: Andreas Grünbacher @ 2023-05-26  0:05 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, Jan Kara, cluster-devel, Darrick J . Wong,
	linux-kernel, dhowells, linux-bcachefs, linux-fsdevel,
	Kent Overstreet

On Fri, May 26, 2023 at 1:20 AM Kent Overstreet <kent.overstreet@linux.dev> wrote:
> On Fri, May 26, 2023 at 12:25:31AM +0200, Andreas Grünbacher wrote:
> > On Tue, May 23, 2023 at 6:28 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > > > way (by prefaulting pages from the iter before grabbing the problematic
> > > > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > > > we should somehow unify these schemes so that we don't have two mechanisms
> > > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> > > >
> > > > Also good that you've written a fstest for this, that is definitely a useful
> > > > addition, although I suspect GFS2 guys added a test for this not so long
> > > > ago when testing their stuff. Maybe they have a pointer handy?
> > >
> > > generic/708 is the btrfs version of this.
> > >
> > > But I think all of the file systems that have this deadlock are actually
> > > fundamentally broken because they have a messed-up locking hierarchy
> > > where page faults take the same lock that is held over the direct I/O
> > > operation.  And the right thing is to fix this.  I have work in progress
> > > for btrfs, and something similar should apply to gfs2, with the added
> > > complication that it probably means a revision to their network
> > > protocol.
> >
> > We do disable page faults, and there can be deadlocks in page fault
> > handlers while no page faults are allowed.
> >
> > I'm roughly aware of the locking hierarchy that other filesystems use,
> > and that's something we want to avoid because of two reasons: (1) it
> > would be an incompatible change, and (2) we want to avoid cluster-wide
> > locking operations as much as possible because they are very slow.
> >
> > These kinds of locking conflicts are so rare in practice that the
> > theoretical inefficiency of having to retry the operation doesn't
> > matter.
>
> Would you be willing to expand on that? I'm wondering if this would
> simplify things for gfs2, but you mention locking hierarchy being an
> incompatible change - how does that work?

Oh, it's just that gfs2 uses one dlm lock per inode to control access
to that inode. In the code, this is called the "inode glock" ---
glocks being an abstraction above dlm locks --- but it boils down to
dlm locks in the end. An additional layer of locking will only work
correctly if all cluster nodes use the new locks consistently, so old
cluster nodes will become incompatible. Those kinds of changes are
hard.

But the additional lock taking would also hurt performance, forever,
and I'd really like to avoid taking that hit.

It may not be obvious to everyone, but allowing page faults during
reads and writes (i.e., while holding dlm locks) can lead to
distributed deadlocks. We cannot just pretend to be a local
filesystem.

Thanks,
Andreas

> > > I'm absolutely not in favour of adding workarounds for these kinds of locking
> > > problems to the core kernel.  I already feel bad for allowing the
> > > small workaround in iomap for btrfs, as just fixing the locking back
> > > then would have avoided massive ratholing.
> >
> > Please let me know when those btrfs changes are in a presentable shape ...
>
> I would also be curious to know what btrfs needs and what the approach
> is there.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-26  0:05               ` Andreas Grünbacher
@ 2023-05-26  0:39                 ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-26  0:39 UTC (permalink / raw)
  To: Andreas Grünbacher
  Cc: Christoph Hellwig, Jan Kara, cluster-devel, Darrick J . Wong,
	linux-kernel, dhowells, linux-bcachefs, linux-fsdevel,
	Kent Overstreet

On Fri, May 26, 2023 at 02:05:26AM +0200, Andreas Grünbacher wrote:
> Oh, it's just that gfs2 uses one dlm lock per inode to control access
> to that inode. In the code, this is called the "inode glock" ---
> glocks being an abstraction above dlm locks --- but it boils down to
> dlm locks in the end. An additional layer of locking will only work
> correctly if all cluster nodes use the new locks consistently, so old
> cluster nodes will become incompatible. Those kinds of changes are
> hard.
> 
> But the additional lock taking would also hurt performance, forever,
> and I'd really like to avoid taking that hit.
> 
> It may not be obvious to everyone, but allowing page faults during
> reads and writes (i.e., while holding dlm locks) can lead to
> distributed deadlocks. We cannot just pretend to be a local
> filesystem.

Ah, gotcha. Same basic issue then, dio -> page fault means you're taking
two DLM locks simultaneously, so without lock ordering you get deadlock.

So that means promoting this to the VFS won't be useful to you
(because the VFS lock will be a local lock, not your DLM lock).

Do you have trylock() and a defined lock ordering you can check for
glocks (inode number)? Here's the new and expanded commit message, in
case my approach is of interest to you:

commit 32858d0fb90b503c8c39e899ad5155e44639d3b5
Author: Kent Overstreet <kent.overstreet@gmail.com>
Date:   Wed Oct 16 15:03:50 2019 -0400

    sched: Add task_struct->faults_disabled_mapping
    
    There has been a long-standing page cache coherence bug with direct IO.
    This provides part of a mechanism to fix it, currently just used by
    bcachefs but potentially worth promoting to the VFS.
    
    Direct IO evicts the range of the pagecache being read or written to.
    
    For reads, we need dirty pages to be written to disk, so that the read
    doesn't return stale data. For writes, we need to evict that range of
    the pagecache so that it's not stale after the write completes.
    
    However, without a locking mechanism to prevent those pages from being
    re-added to the pagecache - by a buffered read or page fault - page
    cache inconsistency is still possible.
    
    This isn't necessarily just an issue for userspace when they're playing
    games; filesystems may hang arbitrary state off the pagecache, and so
    page cache inconsistency may cause real filesystem bugs, depending on
    the filesystem. This is less of an issue for iomap based filesystems,
    but e.g. buffer heads caches disk block mappings (!) and attaches them
    to the pagecache, and bcachefs attaches disk reservations to pagecache
    pages.
    
    This issue has been hard to fix, because
     - we need to add a lock (henceforth called pagecache_add_lock), which
       would be held for the duration of the direct IO
     - page faults add pages to the page cache, thus need to take the same
       lock
     - dio -> gup -> page fault thus can deadlock
    
    And we cannot enforce a lock ordering with this lock, since userspace
    will be controlling the lock ordering (via the fd and buffer arguments
    to direct IOs), so we need a different method of deadlock avoidance.
    
    We need to tell the page fault handler that we're already holding a
    pagecache_add_lock, and since plumbing it through the entire gup() path
    would be highly impractical, this adds a field to task_struct.
    
    Then the full method is:
     - in the dio path, when we take first pagecache_add_lock, note the
       mapping in task_struct
     - in the page fault handler, if faults_disabled_mapping is set, we
       check whether it's the same mapping as the one we're taking a page
       fault for, and if so return an error.
    
       Then we check lock ordering: if there's a lock ordering violation and
       trylock fails, we'll have to cycle the locks and return an error that
       tells the DIO path to retry: faults_disabled_mapping is also used for
       signalling "locks were dropped, please retry".
    
    Also relevant to this patch: mapping->invalidate_lock.
    mapping->invalidate_lock provides most of the required semantics - it's
    used by truncate/fallocate to block pages being added to the pagecache.
    However, since it's a rwsem, direct IOs would need to take the write
    side in order to block page cache adds, and would then be exclusive with
    each other - we'll need a new type of lock to pair with this approach.
    
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Darrick J. Wong <djwong@kernel.org>
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Andreas Grünbacher <andreas.gruenbacher@gmail.com>
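
To make the scheme described above concrete, here is a compact userspace
model of the deadlock-avoidance protocol; everything in it (struct mapping,
the add_lock, the helper names) is an illustrative stand-in, not the actual
kernel or bcachefs code:

#include <pthread.h>
#include <stdbool.h>

struct mapping {
	pthread_mutex_t add_lock;	/* models pagecache_add_lock */
	unsigned long	inum;		/* lock ordering key (inode number) */
};

/* models task_struct->faults_disabled_mapping */
static __thread struct mapping *faults_disabled_mapping;

/* DIO side: take the add lock and note the mapping in the "task". */
static void dio_start(struct mapping *m)
{
	pthread_mutex_lock(&m->add_lock);
	faults_disabled_mapping = m;
}

/*
 * Page-fault side: returns false when the fault has to fail so that the
 * DIO path can drop its lock and retry (the "locks were dropped" signal).
 */
static bool fault_add_page(struct mapping *m)
{
	struct mapping *held = faults_disabled_mapping;

	if (!held) {			/* ordinary fault, no DIO in flight */
		pthread_mutex_lock(&m->add_lock);
		return true;
	}
	if (held == m)			/* same mapping as the DIO: bail out */
		return false;

	/* Different mapping: only trylock respects the inode-number ordering. */
	if (held->inum < m->inum) {
		pthread_mutex_lock(&m->add_lock);
		return true;
	}
	return pthread_mutex_trylock(&m->add_lock) == 0;
}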

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-25 20:50                   ` Kent Overstreet
@ 2023-05-26  8:06                     ` Christoph Hellwig
  2023-05-26  8:34                       ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Christoph Hellwig @ 2023-05-26  8:06 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, Jan Kara, cluster-devel, Darrick J . Wong,
	linux-kernel, dhowells, linux-bcachefs, linux-fsdevel,
	Kent Overstreet

On Thu, May 25, 2023 at 04:50:39PM -0400, Kent Overstreet wrote:
> A cache that isn't actually consistent is a _bug_. You're being
> obsequious. And any time this has come up in previous discussions
> (including at LSF), that was never up for debate; the only question has
> been whether it was even possible to practically fix it.

That is not my impression.  But again, if you think it is useful,
go ahead and sell people on the idea.  But please prepare a series
that includes the rationale, performance tradeoffs and real-life
implications for it.  And do it on the existing code that people use
and not just your shiny new thing.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-25 23:20             ` Kent Overstreet
  2023-05-26  0:05               ` Andreas Grünbacher
@ 2023-05-26  8:10               ` Christoph Hellwig
  2023-05-26  8:38                 ` Kent Overstreet
  1 sibling, 1 reply; 186+ messages in thread
From: Christoph Hellwig @ 2023-05-26  8:10 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Andreas Grünbacher, Christoph Hellwig, Jan Kara,
	cluster-devel, Darrick J . Wong, linux-kernel, dhowells,
	linux-bcachefs, linux-fsdevel, Kent Overstreet

On Thu, May 25, 2023 at 07:20:46PM -0400, Kent Overstreet wrote:
> > > I'm absolutely not in favour of adding workarounds for these kinds of locking
> > > problems to the core kernel.  I already feel bad for allowing the
> > > small workaround in iomap for btrfs, as just fixing the locking back
> > > then would have avoided massive ratholing.
> > 
> > Please let me know when those btrfs changes are in a presentable shape ...
> 
> I would also be curious to know what btrfs needs and what the approach
> is there.

btrfs has the extent locked, where "extent locked" is a somewhat magic
range lock that actually includes different lock bits.  It does so
because it clears the page writeback bit when the data made it to the
media, but before the metadata required to find it is committed, and
the extent lock prevents it from trying to do a readpage on something
that has actually very recently been written back but not fully
committed.  Once btrfs is changed to only clear the page writeback bit
once the write is fully committed, like in other file systems, this extra
level of locking can go away, and there are no more locks in the
readpage path that are also taken by the direct I/O code.  With that
a lot of code in btrfs working around this can go away, including the
no-fault direct I/O code.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-26  8:06                     ` Christoph Hellwig
@ 2023-05-26  8:34                       ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-26  8:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, cluster-devel, Darrick J . Wong, linux-kernel,
	dhowells, linux-bcachefs, linux-fsdevel, Kent Overstreet

On Fri, May 26, 2023 at 01:06:46AM -0700, Christoph Hellwig wrote:
> On Thu, May 25, 2023 at 04:50:39PM -0400, Kent Overstreet wrote:
> > A cache that isn't actually consistent is a _bug_. You're being
> > obsequious. And any time this has come up in previous discussions
> > (including at LSF), that was never up for debate; the only question has
> > been whether it was even possible to practically fix it.
> 
> That is not my impression.  But again, if you think it is useful,
> go ahead and sell people on the idea.  But please prepare a series
> that includes the rationale, performance tradeoffs and real-life
> implications for it.  And do it on the existing code that people use
> and not just your shiny new thing.

When I'm ready to lift this to the VFS level I will; it should simplify
locking overall and it'll be one less thing for people to worry about.

(i.e. the fact that even _readahead_ can pull in pages that a dio is
invalidating is a really nice footgun if not worked around).

Right now though I've got more than enough on my plate just trying to
finally get bcachefs merged, I'm happy to explain what this is for but
I'm not ready for additional headaches or projects yet.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-26  8:10               ` Christoph Hellwig
@ 2023-05-26  8:38                 ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-05-26  8:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Grünbacher, Jan Kara, cluster-devel,
	Darrick J . Wong, linux-kernel, dhowells, linux-bcachefs,
	linux-fsdevel, Kent Overstreet

On Fri, May 26, 2023 at 01:10:43AM -0700, Christoph Hellwig wrote:
> On Thu, May 25, 2023 at 07:20:46PM -0400, Kent Overstreet wrote:
> > > > I'm absolutely not in favour of adding workarounds for these kinds of locking
> > > > problems to the core kernel.  I already feel bad for allowing the
> > > > small workaround in iomap for btrfs, as just fixing the locking back
> > > > then would have avoided massive ratholing.
> > > 
> > > Please let me know when those btrfs changes are in a presentable shape ...
> > 
> > I would also be curious to know what btrfs needs and what the approach
> > is there.
> 
> btrfs has the extent locked, where "extent locked" is a somewhat magic
> range lock that actually includes different lock bits.  It does so
> because it clears the page writeback bit when the data made it to the
> media, but before the metadata required to find it is committed, and
> the extent lock prevents it from trying to do a readpage on something
> that has actually very recently been written back but not fully
> committed.  Once btrfs is changed to only clear the page writeback bit
> once the write is fully committed, like in other file systems, this extra
> level of locking can go away, and there are no more locks in the
> readpage path that are also taken by the direct I/O code.  With that
> a lot of code in btrfs working around this can go away, including the
> no-fault direct I/O code.

wow, yeah, that is not how that is supposed to work...

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 00/32] bcachefs - a new COW filesystem
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (31 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 32/32] MAINTAINERS: Add entry for bcachefs Kent Overstreet
@ 2023-06-15 20:41 ` Pavel Machek
  2023-06-15 21:26   ` Kent Overstreet
  32 siblings, 1 reply; 186+ messages in thread
From: Pavel Machek @ 2023-06-15 20:41 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-block, linux-mm,
	linux-bcachefs, viro, akpm, boqun.feng, brauner, hch, colyli,
	djwong, mingo, jack, axboe, willy, ojeda, ming.lei, ndesaulniers,
	peterz, phillip, urezki, longman, will

Hi!

> I'm submitting the bcachefs filesystem for review and inclusion.
> 
> Included in this patch series are all the non fs/bcachefs/ patches. The
> entire tree, based on v6.3, may be found at:
> 
>   http://evilpiepirate.org/git/bcachefs.git bcachefs-for-upstream
> 
> ----------------------------------------------------------------
> 
> bcachefs overview, status:
> 
> Features:
>  - too many to list
> 
> Known bugs:
>  - too many to list


Documentation: missing.

Dunno. I guess it would help review if the feature and known-bug lists were included.

BR,
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 00/32] bcachefs - a new COW filesystem
  2023-06-15 20:41 ` [PATCH 00/32] bcachefs - a new COW filesystem Pavel Machek
@ 2023-06-15 21:26   ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-06-15 21:26 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, linux-fsdevel, linux-block, linux-mm,
	linux-bcachefs, viro, akpm, boqun.feng, brauner, hch, colyli,
	djwong, mingo, jack, axboe, willy, ojeda, ming.lei, ndesaulniers,
	peterz, phillip, urezki, longman, will

On Thu, Jun 15, 2023 at 10:41:56PM +0200, Pavel Machek wrote:
> Hi!
> 
> > I'm submitting the bcachefs filesystem for review and inclusion.
> > 
> > Included in this patch series are all the non fs/bcachefs/ patches. The
> > entire tree, based on v6.3, may be found at:
> > 
> >   http://evilpiepirate.org/git/bcachefs.git bcachefs-for-upstream
> > 
> > ----------------------------------------------------------------
> > 
> > bcachefs overview, status:
> > 
> > Features:
> >  - too many to list
> > 
> > Known bugs:
> >  - too many to list
> 
> 
> Documentation: missing.

https://bcachefs.org/bcachefs-principles-of-operation.pdf

> Dunno. I guess it would help review if the feature and known-bug lists were included.

https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs

https://github.com/koverstreet/bcachefs/issues/

Hope that helps...

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-16 21:20           ` Kent Overstreet
  2023-05-16 21:47             ` Matthew Wilcox
@ 2023-06-17  4:13             ` Andy Lutomirski
  2023-06-17 15:34               ` Kent Overstreet
  1 sibling, 1 reply; 186+ messages in thread
From: Andy Lutomirski @ 2023-06-17  4:13 UTC (permalink / raw)
  To: Kent Overstreet, Kees Cook
  Cc: Johannes Thumshirn, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	linux-hardening

On 5/16/23 14:20, Kent Overstreet wrote:
> On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
>> For something that small, why not use the text_poke API?
> 
> This looks like it's meant for patching existing kernel text, which
> isn't what I want - I'm generating new functions on the fly, one per
> btree node.

Dynamically generating code is a giant can of worms.

Kees touched on a basic security thing: a linear address mapped W+X is a big
no-no.  And that's just scratching the surface -- ideally we would have a
strong protocol for generating code: the code is generated in some
extra-secure context, then it's made immutable and double-checked, then
it becomes live.  (And we would offer this to userspace, some day.)
Just having a different address for the W and X aliases is pretty weak.

(When x86 modifies itself at boot or for static keys, it changes out the
page tables temporarily.)

And even beyond security, we have correctness.  x86 is a fairly 
forgiving architecture.  If you go back in time about 20 years, modify
some code *at the same linear address at which you intend to execute 
it*, and jump to it, it works.  It may even work if you do it through
an alias (the manual is vague).  But it's not 20 years ago, and you have
multiple cores.  This does *not* work with multiple CPUs -- you need to 
serialize on the CPU executing the modified code.  On all the but the 
very newest CPUs, you need to kludge up the serialization, and that's
sloooooooooooooow.  Very new CPUs have the SERIALIZE instruction, which
is merely sloooooow.

(The manual is terrible.  It's clear that a way to do this without 
serializing must exist, because that's what happens when code is paged 
in from a user program.)

And remember that x86 is the forgiving architecture.  Other 
architectures have their own rules that may involve all kinds of 
terrifying cache management.  IIRC ARM (32-bit) is really quite nasty in 
this regard.  I've seen some references suggesting that RISC-V has a 
broken design of its cache management and this is a real mess.

x86 low level stuff on Linux gets away with it because the 
implementation is conservative and very slow, but it's very rarely invoked.

eBPF gets away with it in ways that probably no one really likes, but 
also no one expects eBPF to load programs particularly quickly.

You are proposing doing this when a btree node is loaded.  You could 
spend 20 *thousand* cycles, on *each CPU*, the first time you access 
that node, not to mention the extra branch to decide whether you need to 
spend those 20k cycles.  Or you could use IPIs.

Or you could just not do this.  I think you should just remove all this 
dynamic codegen stuff, at least for now.

> 
> I'm working up a new allocator - a (very simple) slab allocator where
> you pass a buffer, and it gives you a copy of that buffer mapped
> executable, but not writeable.
> 
> It looks like we'll be able to convert bpf, kprobes, and ftrace
> trampolines to it; it'll consolidate a fair amount of code (particularly
> in bpf), and they won't have to burn a full page per allocation anymore.
> 
> bpf has a neat trick where it maps the same page in two different
> locations, one is the executable location and the other is the writeable
> location - I'm stealing that.
> 
> external api will be:
> 
> void *jit_alloc(void *buf, size_t len, gfp_t gfp);
> void jit_free(void *buf);
> void jit_update(void *buf, void *new_code, size_t len); /* update an existing allocation */
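
(For illustration only, a hypothetical caller of the proposed API might look
like the sketch below; jit_alloc() and friends are just the names proposed
above, not an existing kernel interface, and "image" is an opaque blob
standing in for a generated unpack function.)

struct node_unpack {
	void	*fn;		/* executable copy, from jit_alloc() */
	size_t	len;
};

static int node_unpack_compile(struct node_unpack *u,
			       const void *image, size_t len)
{
	u->fn  = jit_alloc((void *) image, len, GFP_KERNEL);
	u->len = len;
	return u->fn ? 0 : -ENOMEM;
}

/* re-generate after the node's packed format changes */
static void node_unpack_recompile(struct node_unpack *u, const void *image)
{
	jit_update(u->fn, (void *) image, u->len);
}

static void node_unpack_free(struct node_unpack *u)
{
	jit_free(u->fn);
	u->fn = NULL;
}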

Based on the above, I regret to inform you that jit_update() will either 
need to sync all cores via IPI or all cores will need to check whether a 
sync is needed and do it themselves.

That IPI could be, I dunno, 500k cycles?  1M cycles?  Depends on what 
cores are asleep at the time.  (I have some old Sandy Bridge machines 
where, if you tick all the boxes wrong, you might spend tens of 
milliseconds doing this due to power savings gone wrong.)  Or are you 
planning to implement a fancy mostly-lockless thing to track which cores 
actually need the IPI so you can avoid waking up sleeping cores?

Sorry to be a party pooper.

--Andy

P.S. I have given some thought to how to make a JIT API that was 
actually (somewhat) performant.  It's nontrivial, and it would involve 
having at least phone calls and possibly actual meetings with people who 
understand the microarchitecture of various CPUs to get all the details 
hammered out and documented properly.

I don't think it would be efficient for teeny little functions like 
bcachefs wants, but maybe?  That would be even more complex and messy.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-17  4:13             ` Andy Lutomirski
@ 2023-06-17 15:34               ` Kent Overstreet
  2023-06-17 19:19                 ` Andy Lutomirski
  2023-06-19 19:45                 ` Kees Cook
  0 siblings, 2 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-06-17 15:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Johannes Thumshirn, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, linux-hardening

On Fri, Jun 16, 2023 at 09:13:22PM -0700, Andy Lutomirski wrote:
> On 5/16/23 14:20, Kent Overstreet wrote:
> > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> > > For something that small, why not use the text_poke API?
> > 
> > This looks like it's meant for patching existing kernel text, which
> > isn't what I want - I'm generating new functions on the fly, one per
> > btree node.
> 
> Dynamically generating code is a giant can of worms.
> 
> Kees touched on a basic security thing: a linear address mapped W+X is a big
> no-no.  And that's just scratching the surface -- ideally we would have a
> strong protocol for generating code: the code is generated in some
> extra-secure context, then it's made immutable and double-checked, then
> it becomes live.

"Double checking" arbitrary code is fantasy. You can't "prove the
security" of arbitrary code post compilation.

Rice's theorem states that any nontrivial property of a program is
either a direct consequence of the syntax, or is undecidable. It's why
programs in statically typed languages are easier to reason about, and
it's also why the borrow checker in Rust is a syntactic construct.

You just have to be able to trust the code that generates the code. Just
like you have to be able to trust any other code that lives in kernel
space.

This is far safer and easier to reason about than what BPF is doing
because we're not compiling arbitrary code, the actual codegen part is
200 loc and the input is just a single table.

> 
> (When x86 modifies itself at boot or for static keys, it changes out the
> page tables temporarily.)
> 
> And even beyond security, we have correctness.  x86 is a fairly forgiving
> architecture.  If you go back in time about 20 years, modify
> some code *at the same linear address at which you intend to execute it*,
> and jump to it, it works.  It may even work if you do it through
> an alias (the manual is vague).  But it's not 20 years ago, and you have
> multiple cores.  This does *not* work with multiple CPUs -- you need to
> serialize on the CPU executing the modified code.  On all but the very
> newest CPUs, you need to kludge up the serialization, and that's
> sloooooooooooooow.  Very new CPUs have the SERIALIZE instruction, which
> is merely sloooooow.

If what you were saying was true, it would be an issue any time we
mapped in new executable code for userspace - minor page faults would be
stupidly slow.

This code has been running on thousands of machines for years, and the
only issues that have come up have been due to the recent introduction
of indirect branch tracking. x86 doesn't have such broken caches, and
architectures that do have utterly broken caches (because that's what
you're describing: you're describing caches that _are not coherent
across cores_) are not high on my list of things I care about.

Also, SERIALIZE is a spectre thing. Not relevant here.

> Based on the above, I regret to inform you that jit_update() will either
> need to sync all cores via IPI or all cores will need to check whether a
> sync is needed and do it themselves.

text_poke() doesn't even send IPIs.

I think you've been misled about some things :)

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-17 15:34               ` Kent Overstreet
@ 2023-06-17 19:19                 ` Andy Lutomirski
  2023-06-17 20:08                   ` Kent Overstreet
  2023-06-19 19:45                 ` Kees Cook
  1 sibling, 1 reply; 186+ messages in thread
From: Andy Lutomirski @ 2023-06-17 19:19 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Kees Cook, Johannes Thumshirn, Linux Kernel Mailing List,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, linux-hardening

On Sat, Jun 17, 2023, at 8:34 AM, Kent Overstreet wrote:
> On Fri, Jun 16, 2023 at 09:13:22PM -0700, Andy Lutomirski wrote:
>> On 5/16/23 14:20, Kent Overstreet wrote:
>> > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
>> > > For something that small, why not use the text_poke API?
>> > 
>> > This looks like it's meant for patching existing kernel text, which
>> > isn't what I want - I'm generating new functions on the fly, one per
>> > btree node.
>> 
>> Dynamically generating code is a giant can of worms.
>> 
>> Kees touched on a basic security thing: a linear address mapped W+X is a big
>> no-no.  And that's just scratching the surface -- ideally we would have a
>> strong protocol for generating code: the code is generated in some
>> extra-secure context, then it's made immutable and double-checked, then
>> it becomes live.
>
> "Double checking" arbitrary code is fantasy. You can't "prove the
> security" of arbitrary code post compilation.
>
> Rice's theorem states that any nontrivial property of a program is
> either a direct consequence of the syntax, or is undecidable. It's why
> programs in statically typed languages are easier to reason about, and
> it's also why the borrow checker in Rust is a syntactic construct.

If you want security in some theoretical sense, sure, you're probably right.  But that doesn't stop people from double-checking executable code to quite good effect.  For example:

https://www.bitdefender.com/blog/businessinsights/bitdefender-releases-landmark-open-source-software-project-hypervisor-based-memory-introspection/

(I have no personal experience with this, but I know people who do.  It's obviously not perfect, but I think it provides meaningful benefits.)

I'm not saying Linux should do this internally, but it might not be a terrible idea some day.

>
> You just have to be able to trust the code that generates the code. Just
> like you have to be able to trust any other code that lives in kernel
> space.
>
> This is far safer and easier to reason about than what BPF is doing
> because we're not compiling arbitrary code, the actual codegen part is
> 200 loc and the input is just a single table.

Great, then propose a model where the codegen operates in an extra-safe protected context.  Or pre-generate the most common variants, have them pull their constants from memory instead of immediates, and use that.
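
(A sketch of that second alternative, with a made-up field layout rather than
the real bcachefs bkey format: the per-node "code" becomes plain data that a
single fixed unpack routine reads, so nothing executable is ever generated at
runtime.)

#include <stdint.h>

struct field_desc {
	uint8_t  bit_offset;	/* where the packed field starts */
	uint8_t  bits;		/* packed width; 0 means constant field */
	uint64_t base;		/* value added back after unpacking */
};

struct unpack_params {
	struct field_desc field[6];	/* filled in when the node is loaded */
};

static uint64_t unpack_field(const struct unpack_params *p,
			     const uint64_t *packed, unsigned idx)
{
	const struct field_desc *f = &p->field[idx];
	uint64_t v = 0;

	if (f->bits) {
		unsigned word  = f->bit_offset / 64;
		unsigned shift = f->bit_offset % 64;

		v = packed[word] >> shift;
		if (shift + f->bits > 64)
			v |= packed[word + 1] << (64 - shift);
		v &= ~0ULL >> (64 - f->bits);
	}
	return v + f->base;
}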

>
>> 
>> (When x86 modifies itself at boot or for static keys, it changes out the
>> page tables temporarily.)
>> 
>> And even beyond security, we have correctness.  x86 is a fairly forgiving
>> architecture.  If you go back in time about 20 years, modify
>> some code *at the same linear address at which you intend to execute it*,
>> and jump to it, it works.  It may even work if you do it through
>> an alias (the manual is vague).  But it's not 20 years ago, and you have
>> multiple cores.  This does *not* work with multiple CPUs -- you need to
>> serialize on the CPU executing the modified code.  On all but the very
>> newest CPUs, you need to kludge up the serialization, and that's
>> sloooooooooooooow.  Very new CPUs have the SERIALIZE instruction, which
>> is merely sloooooow.
>
> If what you were saying was true, it would be an issue any time we
> mapped in new executable code for userspace - minor page faults would be
> stupidly slow.

I literally mentioned this in the email.

I don't know _precisely_ what's going on, but I assume it's that it's impossible (assuming the kernel gets TLB invalidation right) for a CPU to have anything buffered for a linear address that is unmapped, so when it gets mapped, the CPU can't have anything stale in its buffers.  (By buffers, I mean any sort of instruction or decoded instruction cache.)

Having *this* conversation is what I was talking about in regard to possible fancy future optimization.

>
> This code has been running on thousands of machines for years, and the
> only issues that have come up have been due to the recent introduction
> of indirect branch tracking. x86 doesn't have such broken caches, and
> architectures that do have utterly broken caches (because that's what
> you're describing: you're describing caches that _are not coherent
> across cores_) are not high on my list of things I care about.

I care.  And a bunch of people who haven't gotten their filesystem corrupted because of a missed serialization.

>
> Also, SERIALIZE is a spectre thing. Not relevant here.

Nope, try again.  SERIALIZE "serializes" in the rather vague sense in the Intel SDM.  I don't think it's terribly useful for Spectre.

(Yes, I know what I'm talking about.)

>
>> Based on the above, I regret to inform you that jit_update() will either
>> need to sync all cores via IPI or all cores will need to check whether a
>> sync is needed and do it themselves.
>
> text_poke() doesn't even send IPIs.

text_poke() and the associated machinery is unbelievably complicated.  

Also, arch/x86/kernel/alternative.c contains:

void text_poke_sync(void)
{
	on_each_cpu(do_sync_core, NULL, 1);
}

The magic in text_poke() was developed over the course of years, and Intel architects were involved.

(And I think some text_poke() stuff uses RCU, which is another way to sync without IPI.  I doubt the performance characteristics are appropriate for bcachefs, but I could be wrong.)

>
> I think you've been misled about some things :)

I wish.


I like bcachefs.  I really don't want to have to put on my maintainer hat here, and I do indeed generally stay in the background.  (And I haven't had nearly as much time for this kind of the work in the last couple years as I'd like, sigh.) But I personally have a fairly strict opinion that, if someone (including myself!) wants to merge something that plays clever games that may cause x86 architecture code (especially mm code) to do things it shouldn't in corner cases, even if no one has directly observed that corner case or even knows how to get it to misbehave, then they had better have a very convincing argument that it's safe.  No one likes debugging bugs when something that should be coherent becomes incoherent.

So, if you really really want self-modifying code in bcachefs, its correctness needs to be very, very well argued, and it needs to be maintainable.  Otherwise I will NAK it.  Sorry.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-17 19:19                 ` Andy Lutomirski
@ 2023-06-17 20:08                   ` Kent Overstreet
  2023-06-17 20:35                     ` Andy Lutomirski
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-06-17 20:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Johannes Thumshirn, Linux Kernel Mailing List,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, linux-hardening

On Sat, Jun 17, 2023 at 12:19:41PM -0700, Andy Lutomirski wrote:
> On Sat, Jun 17, 2023, at 8:34 AM, Kent Overstreet wrote:
> > On Fri, Jun 16, 2023 at 09:13:22PM -0700, Andy Lutomirski wrote:
> >> On 5/16/23 14:20, Kent Overstreet wrote:
> >> > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> >> > > For something that small, why not use the text_poke API?
> >> > 
> >> > This looks like it's meant for patching existing kernel text, which
> >> > isn't what I want - I'm generating new functions on the fly, one per
> >> > btree node.
> >> 
> >> Dynamically generating code is a giant can of worms.
> >> 
> >> Kees touched on a basic security thing: a linear address mapped W+X is a big
> >> no-no.  And that's just scratching the surface -- ideally we would have a
> >> strong protocol for generating code: the code is generated in some
> >> extra-secure context, then it's made immutable and double-checked, then
> >> it becomes live.
> >
> > "Double checking" arbitrary code is fantasy. You can't "prove the
> > security" of arbitrary code post compilation.
> >
> > Rice's theorem states that any nontrivial property of a program is
> > either a direct consequence of the syntax, or is undecidable. It's why
> > programs in statically typed languages are easier to reason about, and
> > it's also why the borrow checker in Rust is a syntactic construct.
> 
> If you want security in some theoretical sense, sure, you're probably right.  But that doesn't stop people from double-checking executable code to quite good effect.  For example:
> 
> https://www.bitdefender.com/blog/businessinsights/bitdefender-releases-landmark-open-source-software-project-hypervisor-based-memory-introspection/
> 
> (I have no personal experience with this, but I know people who do.  It's obviously not perfect, but I think it provides meaningful benefits.)
> 
> I'm not saying Linux should do this internally, but it might not be a terrible idea some day.

So you want to pull a virus scanner into the kernel.

> > You just have to be able to trust the code that generates the code. Just
> > like you have to be able to trust any other code that lives in kernel
> > space.
> >
> > This is far safer and easier to reason about than what BPF is doing
> > because we're not compiling arbitrary code, the actual codegen part is
> > 200 loc and the input is just a single table.
> 
> Great, then propose a model where the codegen operates in an
> extra-safe protected context.  Or pre-generate the most common
> variants, have them pull their constants from memory instead of
> immediates, and use that.

I'll do no such nonsense.

> > If what you were saying was true, it would be an issue any time we
> > mapped in new executable code for userspace - minor page faults would be
> > stupidly slow.
> 
> I literally mentioned this in the email.

No, you didn't. Feel free to link or cite if you think otherwise.

> 
> I don't know _precisely_ what's going on, but I assume it's that it's impossible (assuming the kernel gets TLB invalidation right) for a CPU to have anything buffered for a linear address that is unmapped, so when it gets mapped, the CPU can't have anything stale in its buffers.  (By buffers, I mean any sort of instruction or decoded instruction cache.)
> 
> Having *this* conversation is what I was talking about in regard to possible fancy future optimization.
> 
> >
> > This code has been running on thousands of machines for years, and the
> > only issues that have come up have been due to the recent introduction
> > of indirect branch tracking. x86 doesn't have such broken caches, and
> > architectures that do have utterly broken caches (because that's what
> > you're describing: you're describing caches that _are not coherent
> > across cores_) are not high on my list of things I care about.
> 
> I care.  And a bunch of people who haven't gotten their filesystem corrupted because of a missed serialization.
> 
> >
> > Also, SERIALIZE is a spectre thing. Not relevant here.
> 
> Nope, try again.  SERIALIZE "serializes" in the rather vague sense in the Intel SDM.  I don't think it's terribly useful for Spectre.
> 
> (Yes, I know what I'm talking about.)
> 
> >
> >> Based on the above, I regret to inform you that jit_update() will either
> >> need to sync all cores via IPI or all cores will need to check whether a
> >> sync is needed and do it themselves.
> >
> > text_poke() doesn't even send IPIs.
> 
> text_poke() and the associated machinery is unbelievably complicated.  

It's not that bad.

The only reference to IPIs in text_poke() is the comment that indicates
that flush_tlb_mm_range() may sometimes do IPIs, but explicitly
indicates that it does _not_ do IPIs the way text_poke() is using it.

> Also, arch/x86/kernel/alternative.c contains:
> 
> void text_poke_sync(void)
> {
> 	on_each_cpu(do_sync_core, NULL, 1);
> }

...which is for modifying code that is currently being executed, not the
text_poke() or text_poke_copy() paths.

> 
> The magic in text_poke() was developed over the course of years, and
> Intel architects were involved.
> 
> (And I think some text_poke() stuff uses RCU, which is another way to
> sync without IPI.  I doubt the performance characteristics are
> appropriate for bcachefs, but I could be wrong.)

No, it doesn't use RCU.

> > I think you've been misled about some things :)
> 
> I wish.

Given your comments on text_poke(), I think you are. You're confusing
synchronization requirements for _self-modifying_ code with the
synchronization requirements for writing new code to memory, and then
executing it.

And given that bcachefs is not doing anything new here - we're doing a
more limited form of what BPF is already doing - I don't think this is
even the appropriate place for this discussion. There is a new
executable memory allocator being developed and posted, which is
expected to wrap text_poke() in an arch-independent way so that
allocations can share pages, and so that we can remove the need to have
pages mapped both writeable and executable.

If you've got knowledge you wish to share on how to get cache coherency
right, I think that might be a more appropriate thread.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-17 20:08                   ` Kent Overstreet
@ 2023-06-17 20:35                     ` Andy Lutomirski
  0 siblings, 0 replies; 186+ messages in thread
From: Andy Lutomirski @ 2023-06-17 20:35 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Kees Cook, Johannes Thumshirn, Linux Kernel Mailing List,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, linux-hardening



On Sat, Jun 17, 2023, at 1:08 PM, Kent Overstreet wrote:
> On Sat, Jun 17, 2023 at 12:19:41PM -0700, Andy Lutomirski wrote:
>> On Sat, Jun 17, 2023, at 8:34 AM, Kent Overstreet wrote:

>> Great, then propose a model where the codegen operates in an
>> extra-safe protected context.  Or pre-generate the most common
>> variants, have them pull their constants from memory instead of
>> immediates, and use that.
>
> I'll do no such nonsense.

You can avoid generating any code beyond what gcc generates, or you can pre-generate code (but not on an ongoing basis at runtime), or you can generate code at runtime correctly.  I don't think there are many other options.

>
>> > If what you were saying was true, it would be an issue any time we
>> > mapped in new executable code for userspace - minor page faults would be
>> > stupidly slow.
>> 
>> I literally mentioned this in the email.
>
> No, you didn't. Feel free to link or cite if you think otherwise.

"It's clear that a way to do this without 
serializing must exist, because that's what happens when code is paged 
in from a user program."

>> > text_poke() doesn't even send IPIs.
>> 
>> text_poke() and the associated machinery is unbelievably complicated.  
>
> It's not that bad.

This is a useless discussion.

>
> The only reference to IPIs in text_poke() is the comment that indicates
> that flush_tlb_mm_range() may sometimes do IPIs, but explicitly
> indicates that it does _not_ do IPIs the way text_poke() is using it.
>
>> Also, arch/x86/kernel/alternative.c contains:
>> 
>> void text_poke_sync(void)
>> {
>> 	on_each_cpu(do_sync_core, NULL, 1);
>> }
>
> ...which is for modifying code that is currently being executed, not the
> text_poke() or text_poke_copy() paths.
>
>> 
>> The magic in text_poke() was developed over the course of years, and
>> Intel architects were involved.
>> 
>> (And I think some text_poke() stuff uses RCU, which is another way to
>> sync without IPI.  I doubt the performance characteristics are
>> appropriate for bcachefs, but I could be wrong.)
>
> No, it doesn't use RCU.

It literally says in alternative.c:

 * Not safe against concurrent execution; useful for JITs to dump
 * new code blocks into unused regions of RX memory. Can be used in
 * conjunction with synchronize_rcu_tasks() to wait for existing
 * execution to quiesce after having made sure no existing functions
 * pointers are live.

I don't know whether any callers actually do this.  I didn't look.

>
>> > I think you've been misled about some things :)
>> 
>> I wish.
>
> Given your comments on text_poke(), I think you are. You're confusing
> synchronization requirements for _self modifying_ code with the
> synchronization requirements for writing new code to memory, and then
> executing it.

No, you are misunderstanding the difference.

Version A:

User mmap()s an executable file (DSO, whatever).  At first, there is either no PTE or a not-present PTE.  At some point, in response to a page fault or just the kernel prefetching, the kernel fills in the backing page and then creates the PTE.  From the CPU's perspective, the virtual address atomically transitions from having nothing there to having the final code there.  It works (despite the manual having nothing to say about this case).  It's also completely unavoidable.

Version B:

Kernel vmallocs some space *and populates the pagetables*.  There is backing storage, that is executable (or it's a non-NX system, although those are quite rare these days).

Because the CPU hates you, it speculatively executes that code.  (Maybe you're under attack.  Maybe you're just unlucky.  Doesn't matter.)  It populates the instruction cache, remembers the decoded instructions, etc.  It does all the things that make the manual say scary things about serialization.  It notices that the speculative execution was wrong and backs it out, but nothing is invalidated.

Now you write code into there.  Either you do this from a different CPU or you do it at a different linear address, so the magic hardware that invalidates for you does not trigger.

Now you jump into that code, and you tell yourself that it was new code because it was all zeros before and you never intentionally executed it.  But the CPU could not care less what you think, and you lose.

>
> And given that bcachefs is not doing anything new here - we're doing a
> more limited form of what BPF is already doing - I don't think this is
> even the appropriate place for this discussion. There is a new
> executable memory allocator being developed and posted, which is
> expected to wrap text_poke() in an arch-independent way so that
> allocations can share pages, and so that we can remove the need to have
> pages mapped both writeable and executable.

I don't really care what BPF is doing, and BPF may well have the same problem.

But if I understood what bcachefs is doing, it's creating code vastly more frequently than BPF, in response to entirely unprivileged operations from usermode.  It's a whole different amount of exposure.

>
> If you've got knowledge you wish to share on how to get cache coherency
> right, I think that might be a more appropriate thread.

I'll look.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 28/32] stacktrace: Export stack_trace_save_tsk
  2023-05-09 16:56 ` [PATCH 28/32] stacktrace: Export stack_trace_save_tsk Kent Overstreet
@ 2023-06-19  9:10   ` Mark Rutland
  2023-06-19 11:16     ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Mark Rutland @ 2023-06-19  9:10 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs,
	Christopher James Halse Rogers

On Tue, May 09, 2023 at 12:56:53PM -0400, Kent Overstreet wrote:
> From: Christopher James Halse Rogers <raof@ubuntu.com>
> 
> The bcachefs module wants it, and there doesn't seem to be any
> reason it shouldn't be exported like the other functions.

What is the bcachefs module using this for?

Is that just for debug purposes? Assuming so, mentioning that in the commit
message would be helpful.

Thanks,
Mark.

> Signed-off-by: Christopher James Halse Rogers <raof@ubuntu.com>
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> ---
>  kernel/stacktrace.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
> index 9ed5ce9894..4f65824879 100644
> --- a/kernel/stacktrace.c
> +++ b/kernel/stacktrace.c
> @@ -151,6 +151,7 @@ unsigned int stack_trace_save_tsk(struct task_struct *tsk, unsigned long *store,
>  	put_task_stack(tsk);
>  	return c.len;
>  }
> +EXPORT_SYMBOL_GPL(stack_trace_save_tsk);
>  
>  /**
>   * stack_trace_save_regs - Save a stack trace based on pt_regs into a storage array
> @@ -301,6 +302,7 @@ unsigned int stack_trace_save_tsk(struct task_struct *task,
>  	save_stack_trace_tsk(task, &trace);
>  	return trace.nr_entries;
>  }
> +EXPORT_SYMBOL_GPL(stack_trace_save_tsk);
>  
>  /**
>   * stack_trace_save_regs - Save a stack trace based on pt_regs into a storage array
> -- 
> 2.40.1
> 
> 

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
                     ` (3 preceding siblings ...)
  2023-05-10 15:05   ` Johannes Thumshirn
@ 2023-06-19  9:19   ` Mark Rutland
  2023-06-19 10:47     ` Kent Overstreet
  4 siblings, 1 reply; 186+ messages in thread
From: Mark Rutland @ 2023-06-19  9:19 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kees Cook, Andy Lutomirski

On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> From: Kent Overstreet <kent.overstreet@gmail.com>
> 
> This is needed for bcachefs, which dynamically generates per-btree node
> unpack functions.

Much like Kees and Andy, I have concerns with adding new code generators to the
kernel. Even ignoring the actual code generation, there are a bunch of subtle
ordering/maintenance/synchronization concerns across architectures, and we
already have a fair amount of pain with the existing cases.

Can you share more detail on how you want to use this?

From a quick scan of your gitweb for the bcachefs-for-upstream branch I
couldn't spot the relevant patches.

Thanks,
Mark.

> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Uladzislau Rezki <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: linux-mm@kvack.org
> ---
>  include/linux/vmalloc.h |  1 +
>  kernel/module/main.c    |  4 +---
>  mm/nommu.c              | 18 ++++++++++++++++++
>  mm/vmalloc.c            | 21 +++++++++++++++++++++
>  4 files changed, 41 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 69250efa03..ff147fe115 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -145,6 +145,7 @@ extern void *vzalloc(unsigned long size) __alloc_size(1);
>  extern void *vmalloc_user(unsigned long size) __alloc_size(1);
>  extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
>  extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
> +extern void *vmalloc_exec(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
>  extern void *vmalloc_32(unsigned long size) __alloc_size(1);
>  extern void *vmalloc_32_user(unsigned long size) __alloc_size(1);
>  extern void *__vmalloc(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index d3be89de70..9eaa89e84c 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -1607,9 +1607,7 @@ static void dynamic_debug_remove(struct module *mod, struct _ddebug_info *dyndbg
>  
>  void * __weak module_alloc(unsigned long size)
>  {
> -	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> -			GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> -			NUMA_NO_NODE, __builtin_return_address(0));
> +	return vmalloc_exec(size, GFP_KERNEL);
>  }
>  
>  bool __weak module_init_section(const char *name)
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 57ba243c6a..8d9ab19e39 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -280,6 +280,24 @@ void *vzalloc_node(unsigned long size, int node)
>  }
>  EXPORT_SYMBOL(vzalloc_node);
>  
> +/**
> + *	vmalloc_exec  -  allocate virtually contiguous, executable memory
> + *	@size:		allocation size
> + *
> + *	Kernel-internal function to allocate enough pages to cover @size
> + *	the page level allocator and map them into contiguous and
> + *	executable kernel virtual space.
> + *
> + *	For tight control over page level allocator and protection flags
> + *	use __vmalloc() instead.
> + */
> +
> +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> +{
> +	return __vmalloc(size, gfp_mask);
> +}
> +EXPORT_SYMBOL_GPL(vmalloc_exec);
> +
>  /**
>   * vmalloc_32  -  allocate virtually contiguous memory (32bit addressable)
>   *	@size:		allocation size
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 31ff782d36..2ebb9ea7f0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3401,6 +3401,27 @@ void *vzalloc_node(unsigned long size, int node)
>  }
>  EXPORT_SYMBOL(vzalloc_node);
>  
> +/**
> + * vmalloc_exec - allocate virtually contiguous, executable memory
> + * @size:	  allocation size
> + *
> + * Kernel-internal function to allocate enough pages to cover @size
> + * the page level allocator and map them into contiguous and
> + * executable kernel virtual space.
> + *
> + * For tight control over page level allocator and protection flags
> + * use __vmalloc() instead.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> +{
> +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> +			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> +			NUMA_NO_NODE, __builtin_return_address(0));
> +}
> +EXPORT_SYMBOL_GPL(vmalloc_exec);
> +
>  #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
>  #define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
>  #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
> -- 
> 2.40.1
> 
> 

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-19  9:19   ` Mark Rutland
@ 2023-06-19 10:47     ` Kent Overstreet
  2023-06-19 12:47       ` Mark Rutland
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-06-19 10:47 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kees Cook, Andy Lutomirski

On Mon, Jun 19, 2023 at 10:19:00AM +0100, Mark Rutland wrote:
> On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <kent.overstreet@gmail.com>
> > 
> > This is needed for bcachefs, which dynamically generates per-btree node
> > unpack functions.
> 
> Much like Kees and Andy, I have concerns with adding new code generators to the
> kernel. Even ignoring the actual code generation, there are a bunch of subtle
> ordering/maintenance/synchronization concerns across architectures, and we
> already have a fair amount of pain with the existing cases.

Look, JITs are just not that unusual. I'm not going to be responding to
vague concerns that don't have any actual engineering rationale.

> Can you share more detail on how you want to use this?
> 
> From a quick scan of your gitweb for the bcachefs-for-upstream branch I
> couldn't spot the relevant patches.

I've already written extensively in this thread.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 28/32] stacktrace: Export stack_trace_save_tsk
  2023-06-19  9:10   ` Mark Rutland
@ 2023-06-19 11:16     ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-06-19 11:16 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs,
	Christopher James Halse Rogers

On Mon, Jun 19, 2023 at 10:10:56AM +0100, Mark Rutland wrote:
> On Tue, May 09, 2023 at 12:56:53PM -0400, Kent Overstreet wrote:
> > From: Christopher James Halse Rogers <raof@ubuntu.com>
> > 
> > The bcachefs module wants it, and there doesn't seem to be any
> > reason it shouldn't be exported like the other functions.
> 
> What is the bcachefs module using this for?
> 
> Is that just for debug purposes? Assuming so, mentioning that in the commit
> message would be helpful.

Yes, the output ends up in debugfs.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-19 10:47     ` Kent Overstreet
@ 2023-06-19 12:47       ` Mark Rutland
  2023-06-19 19:17         ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Mark Rutland @ 2023-06-19 12:47 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kees Cook, Andy Lutomirski

On Mon, Jun 19, 2023 at 06:47:17AM -0400, Kent Overstreet wrote:
> On Mon, Jun 19, 2023 at 10:19:00AM +0100, Mark Rutland wrote:
> > On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > > From: Kent Overstreet <kent.overstreet@gmail.com>
> > > 
> > > This is needed for bcachefs, which dynamically generates per-btree node
> > > unpack functions.
> > 
> > Much like Kees and Andy, I have concerns with adding new code generators to the
> > kernel. Even ignoring the actual code generation, there are a bunch of subtle
> > ordering/maintenance/synchronization concerns across architectures, and we
> > already have a fair amount of pain with the existing cases.
> 
> Look, JITs are just not that unusual. I'm not going to be responding to
> vague concerns that don't have any actual engineering rationale.

Sorry, but I do have an engineering rationale here: I want to make sure that
this actually works, on architectures that I care about, and will be
maintainable long-term.

We've had a bunch of problems with other JITs ranging from JIT-local "we got
the encoding wrong" to major kernel infrastructure changes like tasks RCU rude
synchronization. I'm trying to figure out whether any of those are likely to
apply and/or whether we should be refactoring other infrastructure for use here
(e.g. factoring the actual instruction generation out of arch code, or
perhaps reusing eBPF so this can be arch-neutral).

I appreciate that's not clear from my initial mail, but please don't jump
straight to assuming I'm adversarial here.

> > Can you share more detail on how you want to use this?
> > 
> > From a quick scan of your gitweb for the bcachefs-for-upstream branch I
> > couldn't spot the relevant patches.
> 
> I've already written extensively in this thread.

Sorry, I hadn't seen that.

For the benefit of others, the codegen is at:

  https://lore.kernel.org/lkml/ZFq7JhrhyrMTNfd%2F@moria.home.lan/
  https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/bkey.c#n727

... and the rationale is at:

  https://lore.kernel.org/lkml/ZF6HHRDeUWLNtuL7@moria.home.lan/

One thing I note immediately is that HAVE_BCACHEFS_COMPILED_UNPACK seems to be
x86-only. If this is important, that'll need some rework to either be
arch-neutral or allow for arch-specific implementations.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-19 12:47       ` Mark Rutland
@ 2023-06-19 19:17         ` Kent Overstreet
  2023-06-20 17:42           ` Andy Lutomirski
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-06-19 19:17 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kees Cook, Andy Lutomirski

On Mon, Jun 19, 2023 at 01:47:18PM +0100, Mark Rutland wrote:
> Sorry, but I do have an engineering rationale here: I want to make sure that
> this actually works, on architectures that I care about, and will be
> maintainable long-term.
> 
> We've had a bunch of problems with other JITs ranging from JIT-local "we got
> the encoding wrong" to major kernel infrastructure changes like tasks RCU rude
> synchronization. I'm trying to figure out whether any of those are likely to
> apply and/or whether we should be refactoring other infrastructure for use here
> (e.g. factoring the actual instruction generation out of arch code, or
> perhaps reusing eBPF so this can be arch-neutral).
> 
> I appreciate that's not clear from my initial mail, but please don't jump
> straight to assuming I'm adversarial here.

I know you're not trying to be adversarial, but vague negative feedback
_is_ hostile, because productive technical discussions can't happen
without specifics and you're putting all the onus on the other person to
make that happen.

When you're raising an issue, try to be specific - don't make people dig.
If you're unable to be specific, perhaps you're not the right person to
be raising the issue.

I'm of course happy to answer questions that haven't already been asked.

This code is pretty simple as JITs go. With the existing vmalloc_exec()-based
code, there aren't any fancy secondary mappings going on, so no crazy
cache coherency games, and no crazy synchronization issues to worry
about: the JIT functions are protected by the per-btree-node locks.

vmalloc_exec() isn't being upstreamed however, since people don't want
WX mappings.

The infrastructure changes we need (and not just for bcachefs) are
 - a better executable memory allocation API, with support for sub-page
   allocations: this is already being worked on; the prototype slab
   allocator I posted is probably going to be the basis for part of this

 - an arch-independent version of text_poke(): we don't want user code
   to be flipping page permissions to update text; text_poke() is the
   proper API, but it's x86-only. No one has volunteered for this yet.

Re-using eBPF for bcachefs's unpack functions is not outside the realm
of possibility, but BPF is a heavy, complex dependency - it's not
something I'll be looking at unless the BPF people are volunteering to
refactor their stuff to provide a suitable API.

> One thing I note immediately is that HAVE_BCACHEFS_COMPILED_UNPACK seems to be
> x86-only. If this is important, that'll need some rework to either be
> arch-neutral or allow for arch-specific implementations.

Correct. Arm will happen at some point, but it's not an immediate
priority.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-17 15:34               ` Kent Overstreet
  2023-06-17 19:19                 ` Andy Lutomirski
@ 2023-06-19 19:45                 ` Kees Cook
  2023-06-20  0:39                   ` Kent Overstreet
  1 sibling, 1 reply; 186+ messages in thread
From: Kees Cook @ 2023-06-19 19:45 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Andy Lutomirski, Johannes Thumshirn, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, linux-hardening

On Sat, Jun 17, 2023 at 11:34:31AM -0400, Kent Overstreet wrote:
> On Fri, Jun 16, 2023 at 09:13:22PM -0700, Andy Lutomirski wrote:
> > On 5/16/23 14:20, Kent Overstreet wrote:
> > > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> > > > For something that small, why not use the text_poke API?
> > > 
> > > This looks like it's meant for patching existing kernel text, which
> > > isn't what I want - I'm generating new functions on the fly, one per
> > > btree node.
> > 
> > Dynamically generating code is a giant can of worms.
> > 
> > Kees touched on a basic security thing: a linear address mapped W+X is a big
> > no-no.  And that's just scratching the surface -- ideally we would have a
> > strong protocol for generating code: the code is generated in some
> > extra-secure context, then it's made immutable and double-checked, then
> > it becomes live.
> 
> "Double checking" arbitrary code is is fantasy. You can't "prove the
> security" of arbitrary code post compilation.

I think there's a misunderstanding here about the threat model I'm
interested in protecting against for JITs. While making sure the VM of a
JIT is safe in itself, that's separate from what I'm concerned about.

The threat model is about flaws _elsewhere_ in the kernel that can
leverage the JIT machinery to convert a "write anything anywhere anytime"
exploit primitive into an "execute anything" primitive. Arguments can
be made to say "a write anything flaw means the total collapse of the
security model so there's no point defending against it", but both that
type of flaw and the slippery slope argument don't stand up well to
real-world situations.

The kinds of flaws we've seen are frequently limited in scope (write
1 byte, write only NULs, write only in a specific range, etc), but
when chained together, the weakest link is what ultimately compromises
the kernel. As such, "W^X" is a basic building block of the kernel's
self-defense methods, because it is such a potent target for
write->execute attack upgrades.

Since a JIT constructs something that will become executable, it needs
to defend itself against stray writes from other threads. Since Linux
doesn't (really) use per-CPU page tables, the workspace for a JIT can be
targeted by something that isn't the JIT. To deal with this, JITs need
to use 3 phases: a writing pass (into W memory), then switch it to RO
and perform a verification pass (construct it again, but compare results
to the RO version), and finally switch it executable. Or, it can use
writes to memory that only the local CPU can perform (i.e. text_poke(),
which uses a different set of page tables with different permissions).
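
In rough pseudo-kernel terms, the three phases look something like the
sketch below (a minimal illustration only: the emit()/verify() callbacks
are invented for this example, and it assumes set_memory_ro() /
set_memory_x() can be applied to the region on the target architecture):

#include <linux/types.h>
#include <asm/set_memory.h>

/* Minimal sketch of the three-phase pattern; not real bcachefs code. */
static void *jit_publish(void *buf, int npages,
			 void (*emit)(void *dst),
			 bool (*verify)(const void *ro))
{
	/* phase 1: generate into writable, non-executable memory */
	emit(buf);

	/* phase 2: drop write permission, then re-check the RO copy */
	set_memory_ro((unsigned long)buf, npages);
	if (!verify(buf))
		return NULL;

	/* phase 3: only now make it executable */
	set_memory_x((unsigned long)buf, npages);
	return buf;
}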

Without basic W^X, it becomes extremely difficult to build further
defenses (e.g. protecting page tables themselves, etc) since WX will
remain the easiest target.

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-19 19:45                 ` Kees Cook
@ 2023-06-20  0:39                   ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-06-20  0:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Johannes Thumshirn, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, linux-hardening

On Mon, Jun 19, 2023 at 12:45:43PM -0700, Kees Cook wrote:
> I think there's a misunderstanding here about the threat model I'm
> interested in protecting against for JITs. While making sure the VM of a
> JIT is safe in itself, that's separate from what I'm concerned about.
> 
> The threat model is about flaws _elsewhere_ in the kernel that can
> leverage the JIT machinery to convert a "write anything anywhere anytime"
> exploit primitive into an "execute anything" primitive. Arguments can
> be made to say "a write anything flaw means the total collapse of the
> security model so there's no point defending against it", but both that
> type of flaw and the slippery slope argument don't stand up well to
> real-world situations.

Hey Kees, thanks for the explanation - I don't think this is a concern
for what bcachefs is doing, since we're not doing a full jit. The unpack
functions we generate only write to the 40 bytes pointed to by rsi; not
terribly useful as an execute anything primitive :)

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-19 19:17         ` Kent Overstreet
@ 2023-06-20 17:42           ` Andy Lutomirski
  2023-06-20 18:08             ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Andy Lutomirski @ 2023-06-20 17:42 UTC (permalink / raw)
  To: Kent Overstreet, Mark Rutland
  Cc: Linux Kernel Mailing List, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	Kees Cook, the arch/x86 maintainers

On Mon, Jun 19, 2023, at 12:17 PM, Kent Overstreet wrote:
> On Mon, Jun 19, 2023 at 01:47:18PM +0100, Mark Rutland wrote:
>> Sorry, but I do have an engineering rationale here: I want to make sure that
>> this actually works, on architectures that I care about, and will be
>> maintainable long-term.
>> 
>> We've had a bunch of problems with other JITs ranging from JIT-local "we got
>> the encoding wrong" to major kernel infrastructure changes like tasks RCU rude
>> synchronization. I'm trying to figure out whether any of those are likely to
>> apply and/or whether we should be refactoring other infrastructure for use here
>> (e.g. factoring out the actual instruction generation from arch code, or
>> perhaps reusing eBPF so this can be arch-neutral).
>> 
>> I appreciate that's not clear from my initial mail, but please don't jump
>> straight to assuming I'm adversarial here.
>
> I know you're not trying to be adversarial, but vague negative feedback
> _is_ hostile, because productive technical discussions can't happen
> without specifics and you're putting all the onus on the other person to
> make that happen.

I'm sorry, but this isn't how correct code gets written, and this isn't how at least x86 maintenance operates.

Code is either correct, and comes with an explanation as to how it is correct, or it doesn't go in.  Saying that something is like BPF is not an explanation as to how it's correct.  Saying that someone has not come up with the chain of events that causes a mere violation of architecture rules to result in actual incorrect execution is not an explanation as to how something is correct.

So, without intending any particular hostility:

<puts on maintainer hat>

bcachefs's x86 JIT is:
Nacked-by: Andy Lutomirski <luto@kernel.org> # for x86

<takes off maintainer hat>

This makes me sad, because I like bcachefs.  But you can get it merged without worrying about my NAK by removing the x86 part.

>
> When you're raising an issue, try to be specific - don't make people dig.
> If you're unable to be specific, perhaps you're not the right person to
> be raising the issue.
>
> I'm of course happy to answer questions that haven't already been asked.
>
> This code is pretty simple as JITs go. With the existing, vmalloc_exec()
> based code, there aren't any fancy secondary mappings going on, so no
> crazy cache coherency games, and no crazy synchronization issues to worry
> about: the jit functions are protected by the per-btree-node locks.
>
> vmalloc_exec() isn't being upstreamed however, since people don't want
> WX mappings.
>
> The infrastructure changes we need (and not just for bcachefs) are
>  - better executable memory allocation API, with support for sub-page
>    allocations: this is already being worked on, the prototype slab
>    allocator I posted is probably going to be the basis for part of this
>
>  - an arch-independent version of text_poke(): we don't want user code
>    to be flipping page permissions to update text, text_poke() is the
>    proper API but it's x86 only. No one has volunteered for this yet.
>

text_poke() by itself is *not* the proper API, as discussed.  It doesn't serialize adequately, even on x86.  We have text_poke_sync() for that.

--Andy

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-20 17:42           ` Andy Lutomirski
@ 2023-06-20 18:08             ` Kent Overstreet
  2023-06-20 18:15               ` Andy Lutomirski
  0 siblings, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-06-20 18:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mark Rutland, Linux Kernel Mailing List, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, Kees Cook, the arch/x86 maintainers

On Tue, Jun 20, 2023 at 10:42:02AM -0700, Andy Lutomirski wrote:
> Code is either correct, and comes with an explanation as to how it is
> correct, or it doesn't go in.  Saying that something is like BPF is
> not an explanation as to how it's correct.  Saying that someone has
> not come up with the chain of events that causes a mere violation of
> architecture rules to result in actual incorrect execution is not an
> explanation as to how something is correct.

No, I'm saying your concerns are baseless and too vague to address.

> text_poke() by itself is *not* the proper API, as discussed.  It
> doesn't serialize adequately, even on x86.  We have text_poke_sync()
> for that.

Andy, I replied explaining the difference between text_poke() and
text_poke_sync(). It's clear you have no idea what you're talking about,
so I'm not going to be wasting my time on further communications with
you.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-20 18:08             ` Kent Overstreet
@ 2023-06-20 18:15               ` Andy Lutomirski
  2023-06-20 18:48                 ` Dave Hansen
  0 siblings, 1 reply; 186+ messages in thread
From: Andy Lutomirski @ 2023-06-20 18:15 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Mark Rutland, Linux Kernel Mailing List, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, Kees Cook, the arch/x86 maintainers

On Tue, Jun 20, 2023, at 11:08 AM, Kent Overstreet wrote:
> On Tue, Jun 20, 2023 at 10:42:02AM -0700, Andy Lutomirski wrote:
>> Code is either correct, and comes with an explanation as to how it is
>> correct, or it doesn't go in.  Saying that something is like BPF is
>> not an explanation as to how it's correct.  Saying that someone has
>> not come up with the chain of events that causes a mere violation of
>> architecture rules to result in actual incorrect execution is not an
>> explanation as to how something is correct.
>
> No, I'm saying your concerns are baseless and too vague to address.

If you don't address them, the NAK will stand forever, or at least until a different group of people take over x86 maintainership.  That's fine with me.

I'm generally pretty happy about working with people to get their Linux code right.  But no one is obligated to listen to me.

>
>> text_poke() by itself is *not* the proper API, as discussed.  It
>> doesn't serialize adequately, even on x86.  We have text_poke_sync()
>> for that.
>
> Andy, I replied explaining the difference between text_poke() and
> text_poke_sync(). It's clear you have no idea what you're talking about,
> so I'm not going to be wasting my time on further communications with
> you.

No problem.  Then your x86 code will not be merged upstream.

Best of luck with the actual filesystem parts!

--Andy

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-20 18:15               ` Andy Lutomirski
@ 2023-06-20 18:48                 ` Dave Hansen
  2023-06-20 20:18                   ` Kent Overstreet
  2023-06-20 20:42                   ` Andy Lutomirski
  0 siblings, 2 replies; 186+ messages in thread
From: Dave Hansen @ 2023-06-20 18:48 UTC (permalink / raw)
  To: Andy Lutomirski, Kent Overstreet
  Cc: Mark Rutland, Linux Kernel Mailing List, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, Kees Cook, the arch/x86 maintainers

>> No, I'm saying your concerns are baseless and too vague to
>> address.
> If you don't address them, the NAK will stand forever, or at least
> until a different group of people take over x86 maintainership.
> That's fine with me.

I've got a specific concern: I don't see vmalloc_exec() used in this
series anywhere.  I also don't see any of the actual assembly that's
being generated, or the glue code that's calling into the generated
assembly.

I grepped around a bit in your git trees, but I also couldn't find it in
there.  Any chance you could help a guy out and point us to some of the
specifics of this new, tiny JIT?

>> Andy, I replied explaining the difference between text_poke() and
>> text_poke_sync(). It's clear you have no idea what you're talking about,
>> so I'm not going to be wasting my time on further communications with
>> you.

One more specific concern: This comment made me very uncomfortable and
it read to me very much like a personal attack, something which is
contrary to our code of conduct.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-20 18:48                 ` Dave Hansen
@ 2023-06-20 20:18                   ` Kent Overstreet
  2023-06-20 20:42                   ` Andy Lutomirski
  1 sibling, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-06-20 20:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Mark Rutland, Linux Kernel Mailing List,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, Kees Cook,
	the arch/x86 maintainers

On Tue, Jun 20, 2023 at 11:48:26AM -0700, Dave Hansen wrote:
> >> No, I'm saying your concerns are baseless and too vague to
> >> address.
> > If you don't address them, the NAK will stand forever, or at least
> > until a different group of people take over x86 maintainership.
> > That's fine with me.
> 
> I've got a specific concern: I don't see vmalloc_exec() used in this
> series anywhere.  I also don't see any of the actual assembly that's
> being generated, or the glue code that's calling into the generated
> assembly.
>
> I grepped around a bit in your git trees, but I also couldn't find it in
> there.  Any chance you could help a guy out and point us to some of the
> specifics of this new, tiny JIT?

vmalloc_exec() has already been dropped from the patchset - I'll switch
to the new jit allocator when that's available and doing sub-page
allocations.

I can however point you at the code that generates the unpack functions:

https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/bkey.c#n727

> >> Andy, I replied explaining the difference between text_poke() and
> >> text_poke_sync(). It's clear you have no idea what you're talking about,
> >> so I'm not going to be wasting my time on further communications with
> >> you.
> 
> One more specific concern: This comment made me very uncomfortable and
> it read to me very much like a personal attack, something which is
> contrary to our code of conduct.

It's not; I prefer to be direct rather than passive-aggressive, and if I have
to bow out of a discussion that isn't going anywhere I feel I owe an
explanation of _why_. Too much conflict avoidance means things don't get
resolved.

And Andy and I are talking on IRC now, so things are proceeding in a
better direction.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-20 18:48                 ` Dave Hansen
  2023-06-20 20:18                   ` Kent Overstreet
@ 2023-06-20 20:42                   ` Andy Lutomirski
  2023-06-20 22:32                     ` Andy Lutomirski
  1 sibling, 1 reply; 186+ messages in thread
From: Andy Lutomirski @ 2023-06-20 20:42 UTC (permalink / raw)
  To: Dave Hansen, Kent Overstreet
  Cc: Mark Rutland, Linux Kernel Mailing List, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, Kees Cook, the arch/x86 maintainers

Hi all-

On Tue, Jun 20, 2023, at 11:48 AM, Dave Hansen wrote:
>>> No, I'm saying your concerns are baseless and too vague to
>>> address.
>> If you don't address them, the NAK will stand forever, or at least
>> until a different group of people take over x86 maintainership.
>> That's fine with me.
>
> I've got a specific concern: I don't see vmalloc_exec() used in this
> series anywhere.  I also don't see any of the actual assembly that's
> being generated, or the glue code that's calling into the generated
> assembly.
>
> I grepped around a bit in your git trees, but I also couldn't find it in
> there.  Any chance you could help a guy out and point us to some of the
> specifics of this new, tiny JIT?
>

So I had a nice discussion with Kent on IRC, and, for the benefit of everyone else reading along, I *think* the JITted code can be replaced by a table-driven approach like this:

typedef unsigned int u32;
typedef unsigned long u64;

struct uncompressed
{
    u32 a;
    u32 b;
    u64 c;
    u64 d;
    u64 e;
    u64 f;
};

struct bitblock
{
    u64 source;
    u64 target;
    u64 mask;
    int shift;
};

// out needs to be zeroed first
void unpack(struct uncompressed *out, const u64 *in, const struct bitblock *blocks, int nblocks)
{
    u64 *out_as_words = (u64*)out;
    for (int i = 0; i < nblocks; i++) {
        const struct bitblock *b;
        out_as_words[b->target] |= (in[b->source] & b->mask) << b->shift;
    }
}

void apply_offsets(struct uncompressed *out, const struct uncompressed *offsets)
{
    out->a += offsets->a;
    out->b += offsets->b;
    out->c += offsets->c;
    out->d += offsets->d;
    out->e += offsets->e;
    out->f += offsets->f;
}

Which generates nice code: https://godbolt.org/z/3fEq37hf5

It would need spectre protection in two places, I think, because it's almost certainly a great gadget if the attacker can speculatively control the 'blocks' table.  This could be mitigated (I think) by hardcoding nblocks as 12 and by masking b->target.
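
A rough, untested sketch of what those two mitigations could look like on
top of the loop above (it reuses the struct definitions from the snippet,
hardcodes the block count, and assumes array_index_nospec() is the right
clamping primitive here):

#include <linux/nospec.h>

#define NBLOCKS   12
#define OUT_WORDS (sizeof(struct uncompressed) / sizeof(u64))

// Same loop as above, but with a fixed trip count and the target index
// clamped so a mis-speculated iteration can't be steered at an arbitrary
// offset. The source index could be clamped the same way if it's
// attacker-controlled.
void unpack_hardened(struct uncompressed *out, const u64 *in,
                     const struct bitblock *blocks)
{
    u64 *out_as_words = (u64*)out;
    for (int i = 0; i < NBLOCKS; i++) {
        const struct bitblock *b = &blocks[i];
        u64 t = array_index_nospec(b->target, OUT_WORDS);
        out_as_words[t] |= (in[b->source] & b->mask) << b->shift;
    }
}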

In contrast, the JIT approach needs a retpoline on each call, which could be more expensive than my entire function :)  I haven't benchmarked them lately.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-20 20:42                   ` Andy Lutomirski
@ 2023-06-20 22:32                     ` Andy Lutomirski
  2023-06-20 22:43                       ` Nadav Amit
  0 siblings, 1 reply; 186+ messages in thread
From: Andy Lutomirski @ 2023-06-20 22:32 UTC (permalink / raw)
  To: Dave Hansen, Kent Overstreet
  Cc: Mark Rutland, Linux Kernel Mailing List, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, Kees Cook, the arch/x86 maintainers



On Tue, Jun 20, 2023, at 1:42 PM, Andy Lutomirski wrote:
> Hi all-
>
> On Tue, Jun 20, 2023, at 11:48 AM, Dave Hansen wrote:
>>>> No, I'm saying your concerns are baseless and too vague to
>>>> address.
>>> If you don't address them, the NAK will stand forever, or at least
>>> until a different group of people take over x86 maintainership.
>>> That's fine with me.
>>
>> I've got a specific concern: I don't see vmalloc_exec() used in this
>> series anywhere.  I also don't see any of the actual assembly that's
>> being generated, or the glue code that's calling into the generated
>> assembly.
>>
>> I grepped around a bit in your git trees, but I also couldn't find it in
>> there.  Any chance you could help a guy out and point us to some of the
>> specifics of this new, tiny JIT?
>>
>
> So I had a nice discussion with Kent on IRC, and, for the benefit of 
> everyone else reading along, I *think* the JITted code can be replaced 
> by a table-driven approach like this:
>
> typedef unsigned int u32;
> typedef unsigned long u64;
>
> struct uncompressed
> {
>     u32 a;
>     u32 b;
>     u64 c;
>     u64 d;
>     u64 e;
>     u64 f;
> };
>
> struct bitblock
> {
>     u64 source;
>     u64 target;
>     u64 mask;
>     int shift;
> };
>
> // out needs to be zeroed first
> void unpack(struct uncompressed *out, const u64 *in, const struct 
> bitblock *blocks, int nblocks)
> {
>     u64 *out_as_words = (u64*)out;
>     for (int i = 0; i < nblocks; i++) {
>         const struct bitblock *b;
>         out_as_words[b->target] |= (in[b->source] & b->mask) << 
> b->shift;
>     }
> }
>
> void apply_offsets(struct uncompressed *out, const struct uncompressed *offsets)
> {
>     out->a += offsets->a;
>     out->b += offsets->b;
>     out->c += offsets->c;
>     out->d += offsets->d;
>     out->e += offsets->e;
>     out->f += offsets->f;
> }
>
> Which generates nice code: https://godbolt.org/z/3fEq37hf5

Thinking about this a bit more, I think the only real performance issue with my code is that it does 12 read-xor-write operations in memory, which all depend on each other in horrible ways.

If it's reversed so the stores are all in order, then this issue would go away.

typedef unsigned int u32;
typedef unsigned long u64;

struct uncompressed
{
    u32 a;
    u32 b;
    u64 c;
    u64 d;
    u64 e;
    u64 f;
};

struct field_piece {
    int source;
    int shift;
    u64 mask;
};

struct field_pieces {
    struct field_piece pieces[2];
    u64 offset;
};

u64 unpack_one(const u64 *in, const struct field_pieces *pieces)
{
    const struct field_piece *p = pieces->pieces;
    return (((in[p[0].source] & p[0].mask) << p[0].shift) |
        ((in[p[1].source] & p[1].mask) << p[1].shift)) +
        pieces->offset;
}

struct encoding {
    struct field_pieces a, b, c, d, e, f;
};

void unpack(struct uncompressed *out, const u64 *in, const struct encoding *encoding)
{
    out->a = unpack_one(in, &encoding->a);
    out->b = unpack_one(in, &encoding->b);
    out->c = unpack_one(in, &encoding->c);
    out->d = unpack_one(in, &encoding->d);
    out->e = unpack_one(in, &encoding->e);
    out->f = unpack_one(in, &encoding->f);
}

https://godbolt.org/z/srsfcGK4j

Could be faster.  Probably worth testing.
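
For illustration, a table for the scheme above might be populated like the
sketch below. The layout is invented purely for the example (field 'a' as
the low 32 bits of word 0 biased by 100, 'c' as a constant), not
bcachefs's actual bkey format:

static const struct encoding example_encoding = {
    .a = { .pieces = { { .source = 0, .shift = 0, .mask = 0xffffffffULL } },
           .offset = 100 },
    .c = { .offset = 42 },    /* no pieces: unpacks to the constant 42 */
    /* b, d, e, f default to all-zero pieces and unpack to 0 */
};

void demo(struct uncompressed *out, const u64 *packed_key)
{
    unpack(out, packed_key, &example_encoding);
}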

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-20 22:32                     ` Andy Lutomirski
@ 2023-06-20 22:43                       ` Nadav Amit
  2023-06-21  1:27                         ` Andy Lutomirski
  0 siblings, 1 reply; 186+ messages in thread
From: Nadav Amit @ 2023-06-20 22:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Kent Overstreet, Mark Rutland,
	Linux Kernel Mailing List, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	Kees Cook, the arch/x86 maintainers


> On Jun 20, 2023, at 3:32 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
>> // out needs to be zeroed first
>> void unpack(struct uncompressed *out, const u64 *in, const struct 
>> bitblock *blocks, int nblocks)
>> {
>>    u64 *out_as_words = (u64*)out;
>>    for (int i = 0; i < nblocks; i++) {
>>        const struct bitblock *b;
>>        out_as_words[b->target] |= (in[b->source] & b->mask) << 
>> b->shift;
>>    }
>> }
>> 
>> void apply_offsets(struct uncompressed *out, const struct uncompressed *offsets)
>> {
>>    out->a += offsets->a;
>>    out->b += offsets->b;
>>    out->c += offsets->c;
>>    out->d += offsets->d;
>>    out->e += offsets->e;
>>    out->f += offsets->f;
>> }
>> 
>> Which generates nice code: https://godbolt.org/z/3fEq37hf5
> 
> Thinking about this a bit more, I think the only real performance issue with my code is that it does 12 read-xor-write operations in memory, which all depend on each other in horrible ways.

If you compare the generated code, just notice that you forgot to initialize b in unpack() in this version.

I presume you wanted it to say "b = &blocks[i]".


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-20 22:43                       ` Nadav Amit
@ 2023-06-21  1:27                         ` Andy Lutomirski
  0 siblings, 0 replies; 186+ messages in thread
From: Andy Lutomirski @ 2023-06-21  1:27 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Dave Hansen, Kent Overstreet, Mark Rutland,
	Linux Kernel Mailing List, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	Kees Cook, the arch/x86 maintainers



On Tue, Jun 20, 2023, at 3:43 PM, Nadav Amit wrote:
>> On Jun 20, 2023, at 3:32 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> 
>>> // out needs to be zeroed first
>>> void unpack(struct uncompressed *out, const u64 *in, const struct 
>>> bitblock *blocks, int nblocks)
>>> {
>>>    u64 *out_as_words = (u64*)out;
>>>    for (int i = 0; i < nblocks; i++) {
>>>        const struct bitblock *b;
>>>        out_as_words[b->target] |= (in[b->source] & b->mask) << 
>>> b->shift;
>>>    }
>>> }
>>> 
>>> void apply_offsets(struct uncompressed *out, const struct uncompressed *offsets)
>>> {
>>>    out->a += offsets->a;
>>>    out->b += offsets->b;
>>>    out->c += offsets->c;
>>>    out->d += offsets->d;
>>>    out->e += offsets->e;
>>>    out->f += offsets->f;
>>> }
>>> 
>>> Which generates nice code: https://godbolt.org/z/3fEq37hf5
>> 
>> Thinking about this a bit more, I think the only real performance issue with my code is that it does 12 read-xor-write operations in memory, which all depend on each other in horrible ways.
>
> If you compare the generated code, just notice that you forgot to 
> initialize b in unpack() in this version.
>
> I presume you wanted it to say "b = &blocks[i]”.

Indeed.  I also didn't notice that -Wall wasn't set.  Oops.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote
  2023-05-09 16:56 ` [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote Kent Overstreet
@ 2023-07-12 19:58   ` Kees Cook
  2023-07-12 20:19     ` Kent Overstreet
  2023-07-12 20:23     ` Kent Overstreet
  0 siblings, 2 replies; 186+ messages in thread
From: Kees Cook @ 2023-07-12 19:58 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	linux-hardening

On Tue, May 09, 2023 at 12:56:54PM -0400, Kent Overstreet wrote:
> From: Kent Overstreet <kent.overstreet@gmail.com>
> 
> printbuf now needs to know the number of characters that would have been
> written if the buffer was too small, like snprintf(); this changes
> string_get_size() to return the return value of snprintf().

Unfortunately, snprintf doesn't return the characters written; it returns
what it TRIED to write, which can cause a lot of problems[1]. This patch
would be fine with me if the snprintf was also replaced by scnprintf,
which will return the actual string length copied (or 0) *not* including
the trailing %NUL.
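
To illustrate the difference with a deliberately small buffer (a sketch of
the semantics described above, not code from this patch):

	void example(void)
	{
		char buf[4];
		/* reports the 8 characters it wanted to write */
		int want = snprintf(buf, sizeof(buf), "%s", "bcachefs");
		/* reports the 3 characters that actually fit (NUL excluded) */
		int got = scnprintf(buf, sizeof(buf), "%s", "bcachefs");
	}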

> [...]
> @@ -126,8 +126,8 @@ void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
>  	else
>  		unit = units_str[units][i];
>  
> -	snprintf(buf, len, "%u%s %s", (u32)size,
> -		 tmp, unit);
> +	return snprintf(buf, len, "%u%s %s", (u32)size,
> +			tmp, unit);

-Kees

[1] https://github.com/KSPP/linux/issues/105

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote
  2023-07-12 19:58   ` Kees Cook
@ 2023-07-12 20:19     ` Kent Overstreet
  2023-07-12 22:38       ` Kees Cook
  2023-07-12 20:23     ` Kent Overstreet
  1 sibling, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-07-12 20:19 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	linux-hardening

On Wed, Jul 12, 2023 at 12:58:54PM -0700, Kees Cook wrote:
> On Tue, May 09, 2023 at 12:56:54PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <kent.overstreet@gmail.com>
> > 
> > printbuf now needs to know the number of characters that would have been
> > written if the buffer was too small, like snprintf(); this changes
> > string_get_size() to return the return value of snprintf().
> 
> Unfortunately, snprintf doesn't return the characters written; it returns
> what it TRIED to write, which can cause a lot of problems[1]. This patch
> would be fine with me if the snprintf was also replaced by scnprintf,
> which will return the actual string length copied (or 0) *not* including
> the trailing %NUL.

...All of which would be solved if we were converting code away from raw
char * buffers to a proper string building type.

Which I tried to address when I tried to push printbufs upstream, but
that turned into a giant exercise in frustration in dealing with
maintainers.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote
  2023-07-12 19:58   ` Kees Cook
  2023-07-12 20:19     ` Kent Overstreet
@ 2023-07-12 20:23     ` Kent Overstreet
  1 sibling, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-07-12 20:23 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	linux-hardening

On Wed, Jul 12, 2023 at 12:58:54PM -0700, Kees Cook wrote:
> On Tue, May 09, 2023 at 12:56:54PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <kent.overstreet@gmail.com>
> > 
> > printbuf now needs to know the number of characters that would have been
> > written if the buffer was too small, like snprintf(); this changes
> > string_get_size() to return the return value of snprintf().
> 
> Unfortunately, snprintf doesn't return the characters written; it returns
> what it TRIED to write, which can cause a lot of problems[1]. This patch
> would be fine with me if the snprintf was also replaced by scnprintf,
> which will return the actual string length copied (or 0) *not* including
> the trailing %NUL.

Anyways, I can't use scnprintf here, printbufs/seq_buf both need the
number of characters that would have been written, but I'll update the
comment.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote
  2023-07-12 20:19     ` Kent Overstreet
@ 2023-07-12 22:38       ` Kees Cook
  2023-07-12 23:53         ` Kent Overstreet
  0 siblings, 1 reply; 186+ messages in thread
From: Kees Cook @ 2023-07-12 22:38 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	linux-hardening

On Wed, Jul 12, 2023 at 04:19:31PM -0400, Kent Overstreet wrote:
> On Wed, Jul 12, 2023 at 12:58:54PM -0700, Kees Cook wrote:
> > On Tue, May 09, 2023 at 12:56:54PM -0400, Kent Overstreet wrote:
> > > From: Kent Overstreet <kent.overstreet@gmail.com>
> > > 
> > > printbuf now needs to know the number of characters that would have been
> > > written if the buffer was too small, like snprintf(); this changes
> > > string_get_size() to return the return value of snprintf().
> > 
> > Unfortunately, snprintf doesn't return the characters written; it returns
> > what it TRIED to write, which can cause a lot of problems[1]. This patch
> > would be fine with me if the snprintf was also replaced by scnprintf,
> > which will return the actual string length copied (or 0) *not* including
> > the trailing %NUL.
> 
> ...All of which would be solved if we were converting code away from raw
> char * buffers to a proper string building type.
> 
> Which I tried to address when I tried to push printbufs upstream, but
> that turned into a giant exercise in frustration in dealing with
> maintainers.

Heh, yeah, I've been trying to aim people at using seq_buf instead of
a long series of snprintf/strlcat/etc calls. Where can I look at how
you wired this up to seq_buf/printbuf? I had trouble finding it when I
looked before. I'd really like to find a way to do it without leaving
around foot-guns for future callers of string_get_size(). :)

I found the printbuf series:
https://lore.kernel.org/lkml/20220808024128.3219082-1-willy@infradead.org/
It seems there are some nice improvements in there. It'd be really nice
if seq_buf could just grow those changes. Adding a static version of
seq_buf_init to be used like you have PRINTBUF_EXTERN would be nice
(or even a statically sized initializer). And much of the conversion
is just changing types and functions. If we can leave all that alone,
things become MUCH easier to review, etc, etc. I'd *love* to see an
incremental improvement for seq_buf, especially the heap-allocation
part.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote
  2023-07-12 22:38       ` Kees Cook
@ 2023-07-12 23:53         ` Kent Overstreet
  0 siblings, 0 replies; 186+ messages in thread
From: Kent Overstreet @ 2023-07-12 23:53 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	linux-hardening

On Wed, Jul 12, 2023 at 03:38:44PM -0700, Kees Cook wrote:
> Heh, yeah, I've been trying to aim people at using seq_buf instead of
> a long series of snprintf/strlcat/etc calls. Where can I look at how
> you wired this up to seq_buf/printbuf? I had trouble finding it when I
> looked before. I'd really like to find a way to do it without leaving
> around foot-guns for future callers of string_get_size(). :)
> 
> I found the printbuf series:
> https://lore.kernel.org/lkml/20220808024128.3219082-1-willy@infradead.org/
> It seems there are some nice improvements in there. It'd be really nice
> if seq_buf could just grow those changes. Adding a static version of
> seq_buf_init to be used like you have PRINTBUF_EXTERN would be nice
> (or even a statically sized initializer). And much of the conversion
> is just changing types and functions. If we can leave all that alone,
> things become MUCH easier to review, etc, etc. I'd *love* to see an
> incremental improvement for seq_buf, especially the heap-allocation
> part.

Well, I raised that with Steve way back when I was starting on the
conversions of existing code, and I couldn't get any communication out
of him regarding making those changes to seq_buf.

So, I'd _love_ to resurrect that patch series and get it in after the
bcachefs merge, but don't expect me to go back and redo everything :)
The amount of code in existing seq_buf users is fairly small compared to
bcachefs's printbuf usage, and to what that patch series does in the rest
of the kernel anyways.

I'd rather save that energy for ditching the seq_file interface and
making that just use a printbuf - clean up that bit of API
fragmentation.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-05-23  9:28   ` (subset) " Christian Brauner
@ 2023-10-19 15:30     ` Mateusz Guzik
  2023-10-19 15:59       ` Mateusz Guzik
  0 siblings, 1 reply; 186+ messages in thread
From: Mateusz Guzik @ 2023-10-19 15:30 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Alexander Viro

On Tue, May 23, 2023 at 11:28:38AM +0200, Christian Brauner wrote:
> On Tue, 09 May 2023 12:56:47 -0400, Kent Overstreet wrote:
> > Because scalability of the global inode_hash_lock really, really
> > sucks.
> > 
> > 32-way concurrent create on a couple of different filesystems
> > before:
> > 
> > -   52.13%     0.04%  [kernel]            [k] ext4_create
> >    - 52.09% ext4_create
> >       - 41.03% __ext4_new_inode
> >          - 29.92% insert_inode_locked
> >             - 25.35% _raw_spin_lock
> >                - do_raw_spin_lock
> >                   - 24.97% __pv_queued_spin_lock_slowpath
> > 
> > [...]
> 
> This is interesting completely independent of bcachefs so we should give
> it some testing.
> 
> I updated a few places that had outdated comments.
> 
> ---
> 
> Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git tree.
> Patches in the vfs.unstable.inode-hash branch should appear in linux-next soon.
> 
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
> 
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
> 
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs.unstable.inode-hash
> 
> [22/32] vfs: inode cache conversion to hash-bl
>         https://git.kernel.org/vfs/vfs/c/e3e92d47e6b1

What, if anything, is blocking this? It is over 5 months now, I don't
see it in master nor -next.

To be clear there is no urgency as far as I'm concerned, but I did run
into something which is primarily bottlenecked by inode hash lock and
looks like the above should sort it out.

Looks like the patch was simply forgotten.

tl;dr can this land in -next please

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-10-19 15:30     ` Mateusz Guzik
@ 2023-10-19 15:59       ` Mateusz Guzik
  2023-10-20 11:38         ` Dave Chinner
  0 siblings, 1 reply; 186+ messages in thread
From: Mateusz Guzik @ 2023-10-19 15:59 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Alexander Viro

On Thu, Oct 19, 2023 at 05:30:40PM +0200, Mateusz Guzik wrote:
> On Tue, May 23, 2023 at 11:28:38AM +0200, Christian Brauner wrote:
> > On Tue, 09 May 2023 12:56:47 -0400, Kent Overstreet wrote:
> > > Because scalability of the global inode_hash_lock really, really
> > > sucks.
> > > 
> > > 32-way concurrent create on a couple of different filesystems
> > > before:
> > > 
> > > -   52.13%     0.04%  [kernel]            [k] ext4_create
> > >    - 52.09% ext4_create
> > >       - 41.03% __ext4_new_inode
> > >          - 29.92% insert_inode_locked
> > >             - 25.35% _raw_spin_lock
> > >                - do_raw_spin_lock
> > >                   - 24.97% __pv_queued_spin_lock_slowpath
> > > 
> > > [...]
> > 
> > This is interesting completely independent of bcachefs so we should give
> > it some testing.
> > 
> > I updated a few places that had outdated comments.
> > 
> > ---
> > 
> > Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git tree.
> > Patches in the vfs.unstable.inode-hash branch should appear in linux-next soon.
> > 
> > Please report any outstanding bugs that were missed during review in a
> > new review to the original patch series allowing us to drop it.
> > 
> > It's encouraged to provide Acked-bys and Reviewed-bys even though the
> > patch has now been applied. If possible patch trailers will be updated.
> > 
> > tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> > branch: vfs.unstable.inode-hash
> > 
> > [22/32] vfs: inode cache conversion to hash-bl
> >         https://git.kernel.org/vfs/vfs/c/e3e92d47e6b1
> 
> What, if anything, is blocking this? It is over 5 months now, I don't
> see it in master nor -next.
> 
> To be clear there is no urgency as far as I'm concerned, but I did run
> into something which is primarily bottlenecked by inode hash lock and
> looks like the above should sort it out.
> 
> Looks like the patch was simply forgotten.
> 
> tl;dr can this land in -next please

In case you can't be arsed, here is something funny which may convince
you to expedite. ;)

I did some benching by running 20 processes in parallel, each doing stat
on a tree of 1 million files (one tree per proc, 1000 dirs x 1000 files,
so 20 mln inodes in total).  Box had 24 cores and 24G RAM.

Best times:
Linux:          7.60s user 1306.90s system 1863% cpu 1:10.55 total
FreeBSD:        3.49s user 345.12s system 1983% cpu 17.573 total
OpenBSD:        5.01s user 6463.66s system 2000% cpu 5:23.42 total
DragonflyBSD:   11.73s user 1316.76s system 1023% cpu 2:09.78 total
OmniosCE:       9.17s user 516.53s system 1550% cpu 33.905 total

NetBSD failed to complete the run, OOM-killing workers:
http://mail-index.netbsd.org/tech-kern/2023/10/19/msg029242.html
OpenBSD is shafted by a big kernel lock, so no surprise it takes a long
time.

So what I find funny is that Linux needed more time than OmniosCE (an
Illumos variant, fork of Solaris).

It also needed more time than FreeBSD, which is not necessarily funny
but not that great either.

All systems were mostly busy contending on locks and in particular Linux
was almost exclusively busy waiting on inode hash lock.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-10-19 15:59       ` Mateusz Guzik
@ 2023-10-20 11:38         ` Dave Chinner
  2023-10-20 17:49           ` Mateusz Guzik
  0 siblings, 1 reply; 186+ messages in thread
From: Dave Chinner @ 2023-10-20 11:38 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Christian Brauner, Dave Chinner, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Alexander Viro

On Thu, Oct 19, 2023 at 05:59:58PM +0200, Mateusz Guzik wrote:
> On Thu, Oct 19, 2023 at 05:30:40PM +0200, Mateusz Guzik wrote:
> > On Tue, May 23, 2023 at 11:28:38AM +0200, Christian Brauner wrote:
> > > On Tue, 09 May 2023 12:56:47 -0400, Kent Overstreet wrote:
> > > > Because scalability of the global inode_hash_lock really, really
> > > > sucks.
> > > > 
> > > > 32-way concurrent create on a couple of different filesystems
> > > > before:
> > > > 
> > > > -   52.13%     0.04%  [kernel]            [k] ext4_create
> > > >    - 52.09% ext4_create
> > > >       - 41.03% __ext4_new_inode
> > > >          - 29.92% insert_inode_locked
> > > >             - 25.35% _raw_spin_lock
> > > >                - do_raw_spin_lock
> > > >                   - 24.97% __pv_queued_spin_lock_slowpath
> > > > 
> > > > [...]
> > > 
> > > This is interesting completely independent of bcachefs so we should give
> > > it some testing.
> > > 
> > > I updated a few places that had outdated comments.
> > > 
> > > ---
> > > 
> > > Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git tree.
> > > Patches in the vfs.unstable.inode-hash branch should appear in linux-next soon.
> > > 
> > > Please report any outstanding bugs that were missed during review in a
> > > new review to the original patch series allowing us to drop it.
> > > 
> > > It's encouraged to provide Acked-bys and Reviewed-bys even though the
> > > patch has now been applied. If possible patch trailers will be updated.
> > > 
> > > tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> > > branch: vfs.unstable.inode-hash
> > > 
> > > [22/32] vfs: inode cache conversion to hash-bl
> > >         https://git.kernel.org/vfs/vfs/c/e3e92d47e6b1
> > 
> > What, if anything, is blocking this? It is over 5 months now, I don't
> > see it in master nor -next.

Not having a test machine that can validate my current vfs-scale
patchset for 4 of the 5 months makes it hard to measure and
demonstrate the efficacy of the changes on a current kernel....

> > To be clear there is no urgency as far as I'm concerned, but I did run
> > into something which is primarily bottlenecked by inode hash lock and
> > looks like the above should sort it out.
> > 
> > Looks like the patch was simply forgotten.
> > 
> > tl;dr can this land in -next please
> 
> In case you can't be arsed, here is something funny which may convince
> you to expedite. ;)
> 
> I did some benching by running 20 processes in parallel, each doing stat
> on a tree of 1 million files (one tree per proc, 1000 dirs x 1000 files,
> so 20 mln inodes in total).  Box had 24 cores and 24G RAM.
> 
> Best times:
> Linux:          7.60s user 1306.90s system 1863% cpu 1:10.55 total
> FreeBSD:        3.49s user 345.12s system 1983% cpu 17.573 total
> OpenBSD:        5.01s user 6463.66s system 2000% cpu 5:23.42 total
> DragonflyBSD:   11.73s user 1316.76s system 1023% cpu 2:09.78 total
> OmniosCE:       9.17s user 516.53s system 1550% cpu 33.905 total
> 
> NetBSD failed to complete the run, OOM-killing workers:
> http://mail-index.netbsd.org/tech-kern/2023/10/19/msg029242.html
> OpenBSD is shafted by a big kernel lock, so no surprise it takes a long
> time.
> 
> So what I find funny is that Linux needed more time than OmniosCE (an
> Illumos variant, fork of Solaris).
> 
> It also needed more time than FreeBSD, which is not necessarily funny
> but not that great either.
> 
> All systems were mostly busy contending on locks and in particular Linux
> was almost exclusively busy waiting on inode hash lock.

Did you bother to test the patch, or are you just complaining
that nobody has already done the work for you?

Because if you tested the patch, you'd have realised that by itself
it does nothing to improve performance of the concurrent find+stat
workload. The lock contention simply moves to the sb_inode_list_lock
instead.

Patches to get rid of the  sb_inode_list_lock contention were
written by smarter people than me long before I wrote the hash-bl
patches. However, the problem the dlist stuff was written to address
(sync iterating all inodes causing sb inode list lock contention
when we have a hundred million cached inodes in memory) was better
fixed by changing the sync algorithm not to iterate all cached
inodes just to find the handful of dirty ones it needed to sync.

IOWs, those sb_inode_list_lock changes haven't been included for the
same reason as the hash-bl patches: outside micro-benchmarks, these
locks just don't show up in profiles on production machines.
Hence there's no urgency to "fix" these lock contention
problems despite the ease with which micro-benchmarks can reproduce
it...

I've kept the patches current for years, even though there hasn't
been a pressing need for them. The last "vfs-scale" version I did
some validation on is here:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/log/?h=vfs-scale

5.17 was the last kernel I did any serious validation and
measurement against, and that all needs to be repeated before
proposing it for inclusion because lots of stuff has changed since I
last did some serious multi-filesystem a/b testing of this code....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-10-20 11:38         ` Dave Chinner
@ 2023-10-20 17:49           ` Mateusz Guzik
  2023-10-21 12:13             ` Mateusz Guzik
  2023-10-23  5:10             ` Dave Chinner
  0 siblings, 2 replies; 186+ messages in thread
From: Mateusz Guzik @ 2023-10-20 17:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, Dave Chinner, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Alexander Viro

On 10/20/23, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Oct 19, 2023 at 05:59:58PM +0200, Mateusz Guzik wrote:
>> On Thu, Oct 19, 2023 at 05:30:40PM +0200, Mateusz Guzik wrote:
>> > On Tue, May 23, 2023 at 11:28:38AM +0200, Christian Brauner wrote:
>> > > On Tue, 09 May 2023 12:56:47 -0400, Kent Overstreet wrote:
>> > > > Because scalability of the global inode_hash_lock really, really
>> > > > sucks.
>> > > >
>> > > > 32-way concurrent create on a couple of different filesystems
>> > > > before:
>> > > >
>> > > > -   52.13%     0.04%  [kernel]            [k] ext4_create
>> > > >    - 52.09% ext4_create
>> > > >       - 41.03% __ext4_new_inode
>> > > >          - 29.92% insert_inode_locked
>> > > >             - 25.35% _raw_spin_lock
>> > > >                - do_raw_spin_lock
>> > > >                   - 24.97% __pv_queued_spin_lock_slowpath
>> > > >
>> > > > [...]
>> > >
>> > > This is interesting completely independent of bcachefs so we should
>> > > give
>> > > it some testing.
>> > >
>> > > I updated a few places that had outdated comments.
>> > >
>> > > ---
>> > >
>> > > Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git
>> > > tree.
>> > > Patches in the vfs.unstable.inode-hash branch should appear in
>> > > linux-next soon.
>> > >
>> > > Please report any outstanding bugs that were missed during review in
>> > > a
>> > > new review to the original patch series allowing us to drop it.
>> > >
>> > > It's encouraged to provide Acked-bys and Reviewed-bys even though the
>> > > patch has now been applied. If possible patch trailers will be
>> > > updated.
>> > >
>> > > tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
>> > > branch: vfs.unstable.inode-hash
>> > >
>> > > [22/32] vfs: inode cache conversion to hash-bl
>> > >         https://git.kernel.org/vfs/vfs/c/e3e92d47e6b1
>> >
>> > What, if anything, is blocking this? It is over 5 months now, I don't
>> > see it in master nor -next.
>
> Not having a test machine that can validate my current vfs-scale
> patchset for 4 of the 5 months makes it hard to measure and
> demonstrate the efficacy of the changes on a current kernel....
>

Ok, see below.

>> > To be clear there is no urgency as far as I'm concerned, but I did run
>> > into something which is primarily bottlenecked by inode hash lock and
>> > looks like the above should sort it out.
>> >
>> > Looks like the patch was simply forgotten.
>> >
>> > tl;dr can this land in -next please
>>
>> In case you can't be arsed, here is something funny which may convince
>> you to expedite. ;)
>>
>> I did some benching by running 20 processes in parallel, each doing stat
>> on a tree of 1 million files (one tree per proc, 1000 dirs x 1000 files,
>> so 20 mln inodes in total).  Box had 24 cores and 24G RAM.
>>
>> Best times:
>> Linux:          7.60s user 1306.90s system 1863% cpu 1:10.55 total
>> FreeBSD:        3.49s user 345.12s system 1983% cpu 17.573 total
>> OpenBSD:        5.01s user 6463.66s system 2000% cpu 5:23.42 total
>> DragonflyBSD:   11.73s user 1316.76s system 1023% cpu 2:09.78 total
>> OmniosCE:       9.17s user 516.53s system 1550% cpu 33.905 total
>>
>> NetBSD failed to complete the run, OOM-killing workers:
>> http://mail-index.netbsd.org/tech-kern/2023/10/19/msg029242.html
>> OpenBSD is shafted by a big kernel lock, so no surprise it takes a long
>> time.
>>
>> So what I find funny is that Linux needed more time than OmniosCE (an
>> Illumos variant, fork of Solaris).
>>
>> It also needed more time than FreeBSD, which is not necessarily funny
>> but not that great either.
>>
>> All systems were mostly busy contending on locks and in particular Linux
>> was almost exclusively busy waiting on inode hash lock.
>
> Did you bother to test the patch, or are you just complaining
> that nobody has already done the work for you?
>

Why are you giving me attitude?

I ran a test, found the major bottleneck and it turned out there is a
patch which takes care of it, but its inclusion is stalled without
further communication. So I asked about it.

> Because if you tested the patch, you'd have realised that by itself
> it does nothing to improve performance of the concurrent find+stat
> workload. The lock contention simply moves to the sb_inode_list_lock
> instead.
>

Is that something you benched? While it may be that there is no change,
going from one bottleneck to another does not automatically mean there
are no gains in performance.

For example, this thing on FreeBSD used to take over one minute (just
like on Linux right now), the vast majority of which was spent on
multicore issues. I massaged it down to ~18 seconds, despite it still
being mostly bottlenecked on locks.

So I benched the hashbl change and it provides a marked improvement:
stock:          7.60s user 1306.90s system 1863% cpu 1:10.55 total
patched:        6.34s user 453.87s system 1312% cpu 35.052 total

But indeed as expected it is still bottlenecked on locks.

> IOWs, those sb_inode_list_lock changes haven't been included for the
> same reason as the hash-bl patches: outside micro-benchmarks, these
> locks just don't show up in profiles on production machines.
> Hence there's no urgency to "fix" these lock contention
> problems despite the ease with which micro-benchmarks can reproduce
> it...
>

The above is not a made-up microbenchmark though.

I got someone running FreeBSD whose workload mostly consists of
stating tens of millions of files in parallel and which was suffering
a lot from a perf standpoint -- flamegraphs show that contending on
locks due to memory reclamation induced by stat calls is almost
everything that was going on at the time. Said workload probably
should not do that to begin with (instead have a db with everything it
normally stats for?), but here we are.

That is to say, while I would not be in a position to test Linux in the
above workload, the problem (high inode turnover in memory) is very
much real.

All that said, if a real deployment which runs into the problem is
needed to justify the change, then I can't help (wrong system).

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-10-20 17:49           ` Mateusz Guzik
@ 2023-10-21 12:13             ` Mateusz Guzik
  2023-10-23  5:10             ` Dave Chinner
  1 sibling, 0 replies; 186+ messages in thread
From: Mateusz Guzik @ 2023-10-21 12:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, Dave Chinner, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Alexander Viro

It was bugging me that find_inode_fast is at the top of the profile
(modulo the locking routine).

Internals don't look too bad (it skips collisions without taking
locks), so I started wondering if hashing is any good.

I re-ran the scan of 20 mln and started counting visited inodes for
each call, got this:

[0, 1)             58266 |                                                    |
[1, 2)            385228 |@@@                                                 |
[2, 3)           1252480 |@@@@@@@@@@                                          |
[3, 4)           2710082 |@@@@@@@@@@@@@@@@@@@@@@@                             |
[4, 5)           4385945 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[5, 6)           5662628 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
[6, 7)           6074390 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[7, 8)           5575381 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
[8, 9)           4475706 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@              |
[9, 10)          3183676 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[10, 11)         2041743 |@@@@@@@@@@@@@@@@@                                   |
[11, 12)         1189850 |@@@@@@@@@@                                          |
[12, 13)          637683 |@@@@@                                               |
[13, 14)          313830 |@@                                                  |
[14, 15)          143277 |@                                                   |
[15, 16)           61501 |                                                    |
[16, 17)           25116 |                                                    |
[17, 18)            9693 |                                                    |
[18, 19)            3435 |                                                    |
[19, 20)            1120 |                                                    |
[20, 21)             385 |                                                    |
[21, 22)              99 |                                                    |
[22, 23)              45 |                                                    |
[23, 24)              15 |                                                    |
[24, 25)               2 |                                                    |
[25, 26)               2 |                                                    |
[26, 27)               2 |                                                    |

I compared this to literally just taking ino & i_hash_mask as the
hash value, and got this:
[0, 1)            119800 |                                                    |
[1, 2)            508063 |@@@                                                 |
[2, 3)           1576390 |@@@@@@@@@@@                                         |
[3, 4)           2763163 |@@@@@@@@@@@@@@@@@@@                                 |
[4, 5)           3696348 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[5, 6)           5975274 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          |
[6, 7)           7253615 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[7, 8)           6563736 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
[8, 9)           5012728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
[9, 10)          3495208 |@@@@@@@@@@@@@@@@@@@@@@@@@                           |
[10, 11)         1606659 |@@@@@@@@@@@                                         |
[11, 12)          459458 |@@@                                                 |
[12, 13)            3940 |                                                    |
[13, 14)              21 |                                                    |
[14, 15)               6 |                                                    |
[15, 16)               2 |                                                    |

That is to say, the distribution is much better; in particular, the
longest chain is 15 (as opposed to 26).

While I'm not saying just taking ino is any good, I am saying there is
room for improvement here as far as one mount point is concerned.

A side note is that the sb could have a randomly generated seed
instead of its address, which should help distribute this better.

So tl;dr hash distribution leaves some room for improvement and
*maybe* I'll prod it some time next month. One can also grow the table
of course, but that's for later.
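
For illustration, here is a minimal userspace sketch of that kind of
comparison: chain-length distribution for an sb+ino mixing hash versus
plain "ino & mask" versus mixing with a per-sb random seed. The mix()
function, the fake superblock "addresses" and the table size below are
made up for the sketch -- it only approximates the flavour of the
in-kernel hash, so the numbers will not match the histograms above.

/*
 * Sketch: compare chain-length distribution of an inode-hash-style
 * mixing function against plain "ino & mask", and against mixing with
 * a per-superblock random seed instead of the sb pointer.  The mixing
 * function and the fake sb "addresses" are made up; this is not
 * fs/inode.c.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASH_BITS       20
#define NBUCKETS        (1UL << HASH_BITS)
#define HASH_MASK       (NBUCKETS - 1)
#define NSB             20              /* 20 trees, as in the benchmark */
#define PER_SB          1000000UL       /* 1 mln inodes per tree */

/* roughly the shape of an sb+ino mixing hash: multiply, then fold down */
static unsigned long mix(unsigned long sbkey, unsigned long ino)
{
        unsigned long tmp = (ino * sbkey) ^ (0x9e3779b97f4a7c15UL + ino);

        tmp ^= (tmp ^ 0x9e3779b97f4a7c15UL) >> HASH_BITS;
        return tmp & HASH_MASK;
}

static void report(const char *name, const unsigned *chain)
{
        unsigned long max = 0, used = 0;

        for (unsigned long i = 0; i < NBUCKETS; i++) {
                if (chain[i] > max)
                        max = chain[i];
                if (chain[i])
                        used++;
        }
        printf("%-12s longest chain %lu, buckets used %lu/%lu\n",
               name, max, used, NBUCKETS);
}

int main(void)
{
        unsigned *chain = calloc(NBUCKETS, sizeof(*chain));
        unsigned long sbkey[NSB], seed[NSB];

        if (!chain)
                return 1;
        srandom(1);
        for (int s = 0; s < NSB; s++) {
                /* stand-in for a slab-allocated, aligned sb pointer */
                sbkey[s] = 0xffff888100000000UL + s * 0x600;
                seed[s] = ((unsigned long)random() << 32) | random();
        }

        for (int s = 0; s < NSB; s++)
                for (unsigned long ino = 1; ino <= PER_SB; ino++)
                        chain[mix(sbkey[s], ino)]++;
        report("sb pointer", chain);

        memset(chain, 0, NBUCKETS * sizeof(*chain));
        for (int s = 0; s < NSB; s++)
                for (unsigned long ino = 1; ino <= PER_SB; ino++)
                        chain[ino & HASH_MASK]++;
        report("ino & mask", chain);

        memset(chain, 0, NBUCKETS * sizeof(*chain));
        for (int s = 0; s < NSB; s++)
                for (unsigned long ino = 1; ino <= PER_SB; ino++)
                        chain[mix(seed[s], ino)]++;
        report("random seed", chain);

        free(chain);
        return 0;
}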

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-10-20 17:49           ` Mateusz Guzik
  2023-10-21 12:13             ` Mateusz Guzik
@ 2023-10-23  5:10             ` Dave Chinner
  2023-10-27 17:13               ` Mateusz Guzik
  1 sibling, 1 reply; 186+ messages in thread
From: Dave Chinner @ 2023-10-23  5:10 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Christian Brauner, Dave Chinner, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Alexander Viro

On Fri, Oct 20, 2023 at 07:49:18PM +0200, Mateusz Guzik wrote:
> On 10/20/23, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Oct 19, 2023 at 05:59:58PM +0200, Mateusz Guzik wrote:
> >> > To be clear there is no urgency as far as I'm concerned, but I did run
> >> > into something which is primarily bottlenecked by inode hash lock and
> >> > looks like the above should sort it out.
> >> >
> >> > Looks like the patch was simply forgotten.
> >> >
> >> > tl;dr can this land in -next please
> >>
> >> In case you can't be arsed, here is something funny which may convince
> >> you to expedite. ;)
> >>
> >> I did some benching by running 20 processes in parallel, each doing stat
> >> on a tree of 1 million files (one tree per proc, 1000 dirs x 1000 files,
> >> so 20 mln inodes in total).  Box had 24 cores and 24G RAM.
> >>
> >> Best times:
> >> Linux:          7.60s user 1306.90s system 1863% cpu 1:10.55 total
> >> FreeBSD:        3.49s user 345.12s system 1983% cpu 17.573 total
> >> OpenBSD:        5.01s user 6463.66s system 2000% cpu 5:23.42 total
> >> DragonflyBSD:   11.73s user 1316.76s system 1023% cpu 2:09.78 total
> >> OmniosCE:       9.17s user 516.53s system 1550% cpu 33.905 total
> >>
> >> NetBSD failed to complete the run, OOM-killing workers:
> >> http://mail-index.netbsd.org/tech-kern/2023/10/19/msg029242.html
> >> OpenBSD is shafted by a big kernel lock, so no surprise it takes a long
> >> time.
> >>
> >> So what I find funny is that Linux needed more time than OmniosCE (an
> >> Illumos variant, fork of Solaris).
> >>
> >> It also needed more time than FreeBSD, which is not necessarily funny
> >> but not that great either.
> >>
> >> All systems were mostly busy contending on locks and in particular Linux
> >> was almost exclusively busy waiting on inode hash lock.
> >
> > Did you bother to test the patch, or are you just complaining
> > that nobody has already done the work for you?
> 
> Why are you giving me attitude?

Look in the mirror, mate.

Starting off with a derogatory statement like:

"In case you can't be arsed, ..."

is a really good way to start a fight.

I don't think anyone working on this stuff couldn't be bothered to
get their lazy arses off their couches to get it merged. Though you
may not have intended it that way, that's exactly what "can't be
arsed" means. 

I have not asked for this code to be merged because I'm not ready to
ask for it to be merged. I'm trying to be careful and cautious about
changing core kernel code that every linux installation out there
uses because I care about this code being robust and stable. That's
the exact opposite of "can't be arsed"....

Further, you have asked for code that is not ready to be merged to
be merged without reviewing it or even testing it to see if it
solved your reported problem. This is pretty basic stuff - if you
want it merged, then *you also need to put effort into getting it
merged* regardless of who wrote the code. TANSTAAFL.

But you've done neither - you've just made demands and thrown
hypocritical shade implying busy people working on complex code are
lazy arses.

Perhaps you should consider your words more carefully in future?

> > Because if you tested the patch, you'd have realised that by itself
> > it does nothing to improve performance of the concurrent find+stat
> > workload. The lock contention simply moves to the sb_inode_list_lock
> > instead.
> >
> 
> Is that something you benched? While it may be there is no change,
> going from one bottleneck to another does not automatically mean there
> are no gains in performance.

Of course I have. I wouldn't have said anything if this wasn't a
subject I have specific knowledge and expertise in. As I've already
said, I've been running this specific "will it scale" find+stat
micro-benchmark for well over a decade. For example:

https://lore.kernel.org/linux-xfs/20130603074452.GZ29466@dastard/

That's dated June 2013, and the workload is:

"8-way 50 million zero-length file create, 8-way
find+stat of all the files, 8-unlink of all the files:"

Yeah, this workload only scaled to a bit over 4 CPUs a decade ago,
hence I only tested to 8-way....
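
(For readers unfamiliar with the workload, the shape of such an N-way
parallel find+stat micro-benchmark is roughly the sketch below. This is
a generic illustration, not the actual test scripts; the "tree.N" paths
are made up and pre-creating the file trees is not shown.)

/*
 * Sketch of an N-way parallel find+stat micro-benchmark: fork one
 * worker per pre-created directory tree and have each worker lstat()
 * every entry via nftw().
 */
#define _XOPEN_SOURCE 700
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

static int visit(const char *path, const struct stat *sb,
                 int type, struct FTW *ftwbuf)
{
        (void)path; (void)sb; (void)type; (void)ftwbuf;
        return 0;       /* nftw() already stat()ed the entry for us */
}

int main(int argc, char **argv)
{
        int nproc = argc > 1 ? atoi(argv[1]) : 8;

        for (int i = 0; i < nproc; i++) {
                if (fork() == 0) {
                        char tree[64];

                        snprintf(tree, sizeof(tree), "tree.%d", i);
                        /* FTW_PHYS: lstat, don't follow symlinks */
                        if (nftw(tree, visit, 128, FTW_PHYS) < 0)
                                perror(tree);
                        _exit(0);
                }
        }
        while (wait(NULL) > 0)
                ;
        return 0;
}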

> For example, this thing on FreeBSD used to take over one minute (just
> like on Linux right now), vast majority of which was spent on
> multicore issues. I massaged it down to ~18 seconds, despite it still
> being mostly bottlenecked on locks.
> 
> So I benched the hashbl change and it provides a marked improvement:
> stock:          7.60s user 1306.90s system 1863% cpu 1:10.55 total
> patched:        6.34s user 453.87s system 1312% cpu 35.052 total
> 
> But indeed as expected it is still bottlenecked on locks.

That's better than I expected, but then again I haven't looked at
this code in detail since around 5.17 and lots has changed since
then.  What filesystem was this? What kernel?  What locks is it
bottlenecked on now?  Did you test the vfs-scale branch I pointed
you at, or just the hash-bl patches?

> > IOWs, those sb_inode_list_lock changes haven't been included for the
> > same reason as the hash-bl patches: outside micro-benchmarks, these
> > locks just don't show up in profiles on production machines.
> > Hence there's no urgency to "fix" these lock contention
> > problems despite the ease with which micro-benchmarks can reproduce
> > it...
> 
> The above is not a made-up microbenchmark though.

I didn't say anything about it being "made up".

There's typically a huge difference in behaviour between a
microbenchmark which immediately discards the retrieved data and has
no memory footprint to speak of, versus an application that compares
the retrieved data with an in-memory index of inodes held in a
memory-constrained environment to determine whether anything has
changed, and then does different work when it has.

IOWs, while microbenchmarks can easily produce contention, it's much
less obvious that applications doing substantial userspace work
between similar data retrieval operations will experience similar kernel
level contention problems.

What is lacking here is real world evidence showing this is a
production level problem that needs to be solved immediately....

> I got someone running FreeBSD whose workload mostly consists of
> stating tens of millions of files in parallel and which was suffering
> a lot from a perf standpoint -- flamegraphs show that contending on
> locks due to memory reclamation induced by stat calls is almost
> everything that was going on at the time.

.... and "one person's workload on FreeBSD" is not significant
evidence there's a Linux kernel problem that needs to be solved
immediately.

> Said workload probably should not do that to begin with (instead
> have a db with everything it normally stats for?), but here we
> are.

As you state, the right fix for the application is to avoid scanning
tens of millions of inodes repeatedly.  We have functionality in
Linux like fanotify to watch and report changes to individual files
in a huge filesystem, so even if this were running on Linux the
push-back would be to use fanotify and avoid repeatedly polling the
entire filesystem to find individual file changes.
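
(A minimal sketch of that approach might look like the following: one
filesystem-wide fanotify mark replaces re-stat()ing every file. The
mount point is a placeholder, error handling is trimmed, and decoding
the file-handle records that FAN_REPORT_FID attaches to events is
omitted; it needs CAP_SYS_ADMIN and a v5.1+ kernel.)

/*
 * Sketch: watch a whole filesystem for changes with fanotify instead
 * of repeatedly stat()ing every file.  "/mnt/data" is a placeholder.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/fanotify.h>
#include <unistd.h>

int main(void)
{
        char buf[8192];
        ssize_t len;
        int fd;

        fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, 0);
        if (fd < 0) {
                perror("fanotify_init");
                return 1;
        }

        /* one mark covers every object on the fs backing /mnt/data */
        if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                          FAN_CREATE | FAN_DELETE | FAN_MODIFY |
                          FAN_MOVED_FROM | FAN_MOVED_TO | FAN_ONDIR,
                          AT_FDCWD, "/mnt/data") < 0) {
                perror("fanotify_mark");
                return 1;
        }

        while ((len = read(fd, buf, sizeof(buf))) > 0) {
                struct fanotify_event_metadata *ev =
                        (struct fanotify_event_metadata *)buf;

                while (FAN_EVENT_OK(ev, len)) {
                        printf("event mask 0x%llx\n",
                               (unsigned long long)ev->mask);
                        ev = FAN_EVENT_NEXT(ev, len);
                }
        }
        return 0;
}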

> That is to say, while I would not be in position to test Linux in the
> above workload, the problem (high inode turnover in memory) is very
> much real.

Yup, XFS currently bottlenecks at about 800,000 inodes/s being
streamed through memory on my old 32p test machine - it's largely the
sb->s_inode_list_lock that is the limitation. The vfs-scale branch
I've pointed to brings that up to about 1.5 million inodes/s before
the next set of limits is hit - the system is CPU bound due to the
aggregate memory demand of ~10GB/s being allocated and freed by the
mm subsystem (not lock contention). Hence further improvements are
all about improving per-lookup operation CPU and memory efficiency...

> All that said, if a real deployment which runs into the problem is
> needed to justify the change, then I can't help (wrong system).

Well, that's kind of the point, though - if users and customers are
not reporting that they have production workloads where 800,000
inodes/s throughput through the inode cache is the performance
limiting factor, then why risk destabilising core code by changing
it?

Yes, we can go faster (as the vfs-scale branch shows), but if
applications aren't limited by the existing code, why risk breaking
every linux installation out there by pushing something that isn't
100% baked?  Nobody wins if the new code is faster for a few but has
bugs that many people hit, so if there's no urgency to change the
code I won't hurry to push the change. I've carried this code for
years, a few months here or there isn't going to change anything
material....

If you think that's wrong or want it faster than I might address it,
then by all means you can take the vfs-scale branch and do the
validation work needed to get it pushed upstream sooner.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-10-23  5:10             ` Dave Chinner
@ 2023-10-27 17:13               ` Mateusz Guzik
  2023-10-27 18:36                 ` Darrick J. Wong
  2023-10-31 11:02                 ` Christian Brauner
  0 siblings, 2 replies; 186+ messages in thread
From: Mateusz Guzik @ 2023-10-27 17:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, Dave Chinner, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Alexander Viro

On 10/23/23, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Oct 20, 2023 at 07:49:18PM +0200, Mateusz Guzik wrote:
>> On 10/20/23, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Oct 19, 2023 at 05:59:58PM +0200, Mateusz Guzik wrote:
>> >> > To be clear there is no urgency as far as I'm concerned, but I did
>> >> > run
>> >> > into something which is primarily bottlenecked by inode hash lock
>> >> > and
>> >> > looks like the above should sort it out.
>> >> >
>> >> > Looks like the patch was simply forgotten.
>> >> >
>> >> > tl;dr can this land in -next please
>> >>
>> >> In case you can't be arsed, here is something funny which may convince
>> >> you to expedite. ;)
>> >>
>> >> I did some benching by running 20 processes in parallel, each doing
>> >> stat
>> >> on a tree of 1 million files (one tree per proc, 1000 dirs x 1000
>> >> files,
>> >> so 20 mln inodes in total).  Box had 24 cores and 24G RAM.
>> >>
>> >> Best times:
>> >> Linux:          7.60s user 1306.90s system 1863% cpu 1:10.55 total
>> >> FreeBSD:        3.49s user 345.12s system 1983% cpu 17.573 total
>> >> OpenBSD:        5.01s user 6463.66s system 2000% cpu 5:23.42 total
>> >> DragonflyBSD:   11.73s user 1316.76s system 1023% cpu 2:09.78 total
>> >> OmniosCE:       9.17s user 516.53s system 1550% cpu 33.905 total
>> >>
>> >> NetBSD failed to complete the run, OOM-killing workers:
>> >> http://mail-index.netbsd.org/tech-kern/2023/10/19/msg029242.html
>> >> OpenBSD is shafted by a big kernel lock, so no surprise it takes a
>> >> long
>> >> time.
>> >>
>> >> So what I find funny is that Linux needed more time than OmniosCE (an
>> >> Illumos variant, fork of Solaris).
>> >>
>> >> It also needed more time than FreeBSD, which is not necessarily funny
>> >> but not that great either.
>> >>
>> >> All systems were mostly busy contending on locks and in particular
>> >> Linux
>> >> was almost exclusively busy waiting on inode hash lock.
>> >
>> > Did you bother to test the patch, or are you just complaining
>> > that nobody has already done the work for you?
>>
>> Why are you giving me attitude?
>
> Look in the mirror, mate.
>
> Starting off with a derogatory statement like:
>
> "In case you can't be arsed, ..."
>
> is a really good way to start a fight.
>
> I don't think anyone working on this stuff couldn't be bothered to
> get their lazy arses off their couches to get it merged. Though you
> may not have intended it that way, that's exactly what "can't be
> arsed" means.
>
> I have not asked for this code to be merged because I'm not ready to
> ask for it to be merged. I'm trying to be careful and cautious about
> changing core kernel code that every linux installation out there
> uses because I care about this code being robust and stable. That's
> the exact opposite of "can't be arsed"....
>
> Further, you have asked for code that is not ready to be merged to
> be merged without reviewing it or even testing it to see if it
> solved your reported problem. This is pretty basic stuff - if you
> want it merged, then *you also need to put effort into getting it
> merged* regardless of who wrote the code. TANSTAAFL.
>
> But you've done neither - you've just made demands and thrown
> hypocritical shade implying busy people working on complex code are
> lazy arses.
>

So I took a few days to take a look at this with a fresh eye and I see
where the major disconnect is coming from, albeit I still don't see how
it came to be nor why it persists.

To my understanding your understanding is that I demand you carry the
hash bl patch over the finish line and I'm rude about it as well.

That is not my position here though.

For starters my opening e-mail was to Christian, not you. You are
CC'ed as the patch author. It is responding to an e-mail which claimed
the patch would land in -next, which to my poking around did not
happen (and I checked it's not in master either). Since there was no
other traffic about it that I could find, I figured it was probably
forgotten. You may also notice the e-mail explicitly states:
1. I have a case which runs into inode hash being a problem
2. *there is no urgency*, I'm just asking what's up with the patch not
getting anywhere.

The follow up including a statement about "being arsed" once more was
to Christian, not you and was rather "tongue in cheek".

If you know about Illumos, it is mostly slow and any serious
performance work stopped there when Oracle closed the codebase over a
decade ago. Or to put it differently, one has to be doing something
really bad to not be faster today. And there was this bad -- the inode
hash. I found it amusing and decided to share in addition to asking
about the patch.

So no Dave, I'm not claiming the patch is not in because anyone is lazy.

Whether the patch is ready for reviews and whatnot is your call to
make as the author.

To repeat from my previous e-mail I note the lock causes real problems
in a real-world setting, it's not just microbenchmarks, but I'm in no
position to test it against the actual workload (only the part I
carved out into a benchmark, where it does help -- gets rid of the
nasty back-to-back lock acquire, first to search for the inode and
then to insert a new one).
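
(As a generic illustration of the lookup-or-insert shape being
described -- not the kernel's actual iget path, and every name below is
made up: with per-bucket locks a miss still searches and then inserts,
but both acquisitions only cover a single bucket, so unrelated lookups
no longer contend on one global lock.)

/*
 * Generic sketch of the miss path: search the bucket, allocate outside
 * the lock on a miss, then re-take the bucket lock, re-check for a
 * racing insert, and insert.  Illustration only -- not the kernel's
 * inode hash code.
 */
#include <pthread.h>
#include <stdlib.h>

#define NBUCKETS 1024

struct obj {
        unsigned long key;
        struct obj *next;
};

static struct {
        pthread_mutex_t lock;
        struct obj *head;
} table[NBUCKETS];

void table_init(void)           /* call once at startup */
{
        for (int i = 0; i < NBUCKETS; i++)
                pthread_mutex_init(&table[i].lock, NULL);
}

static struct obj *bucket_find(struct obj *head, unsigned long key)
{
        for (struct obj *o = head; o; o = o->next)
                if (o->key == key)
                        return o;
        return NULL;
}

struct obj *lookup_or_insert(unsigned long key)
{
        unsigned long b = key % NBUCKETS;
        struct obj *o, *new;

        pthread_mutex_lock(&table[b].lock);     /* first acquire: search */
        o = bucket_find(table[b].head, key);
        pthread_mutex_unlock(&table[b].lock);
        if (o)
                return o;

        new = calloc(1, sizeof(*new));          /* allocate outside the lock */
        if (!new)
                return NULL;
        new->key = key;

        pthread_mutex_lock(&table[b].lock);     /* second acquire: insert */
        o = bucket_find(table[b].head, key);    /* re-check for a racing insert */
        if (!o) {
                new->next = table[b].head;
                table[b].head = new;
                o = new;
        } else {
                free(new);
        }
        pthread_mutex_unlock(&table[b].lock);
        return o;
}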

If your assessment is that more testing is needed, that makes sense
and is again your call to make. I repeat again I can't help with this
bit though. And if you don't think the effort is justified at the
moment (or there are other things with higher priority), so be it.

It may be I'll stick around in general and if so it may be I'm going
to run into you again.
With this in mind:

> Perhaps you should consider your words more carefully in future?
>

On that front perhaps you could refrain from assuming someone is
trying to call you names or whatnot. But more importantly if you
consider an e-mail to be rude, you can call it out instead of
escalating or responding in what you consider to be the same tone.

All that said I'm bailing from this patchset.

Cheers,
-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-10-27 17:13               ` Mateusz Guzik
@ 2023-10-27 18:36                 ` Darrick J. Wong
  2023-10-31 11:02                 ` Christian Brauner
  1 sibling, 0 replies; 186+ messages in thread
From: Darrick J. Wong @ 2023-10-27 18:36 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Dave Chinner, Christian Brauner, Dave Chinner, linux-kernel,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Alexander Viro

On Fri, Oct 27, 2023 at 07:13:11PM +0200, Mateusz Guzik wrote:
> On 10/23/23, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Oct 20, 2023 at 07:49:18PM +0200, Mateusz Guzik wrote:
> >> On 10/20/23, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Thu, Oct 19, 2023 at 05:59:58PM +0200, Mateusz Guzik wrote:
> >> >> > To be clear there is no urgency as far as I'm concerned, but I did
> >> >> > run
> >> >> > into something which is primarily bottlenecked by inode hash lock
> >> >> > and
> >> >> > looks like the above should sort it out.
> >> >> >
> >> >> > Looks like the patch was simply forgotten.
> >> >> >
> >> >> > tl;dr can this land in -next please
> >> >>
> >> >> In case you can't be arsed, here is something funny which may convince
> >> >> you to expedite. ;)
> >> >>
> >> >> I did some benching by running 20 processes in parallel, each doing
> >> >> stat
> >> >> on a tree of 1 million files (one tree per proc, 1000 dirs x 1000
> >> >> files,
> >> >> so 20 mln inodes in total).  Box had 24 cores and 24G RAM.
> >> >>
> >> >> Best times:
> >> >> Linux:          7.60s user 1306.90s system 1863% cpu 1:10.55 total
> >> >> FreeBSD:        3.49s user 345.12s system 1983% cpu 17.573 total
> >> >> OpenBSD:        5.01s user 6463.66s system 2000% cpu 5:23.42 total
> >> >> DragonflyBSD:   11.73s user 1316.76s system 1023% cpu 2:09.78 total
> >> >> OmniosCE:       9.17s user 516.53s system 1550% cpu 33.905 total
> >> >>
> >> >> NetBSD failed to complete the run, OOM-killing workers:
> >> >> http://mail-index.netbsd.org/tech-kern/2023/10/19/msg029242.html
> >> >> OpenBSD is shafted by a big kernel lock, so no surprise it takes a
> >> >> long
> >> >> time.
> >> >>
> >> >> So what I find funny is that Linux needed more time than OmniosCE (an
> >> >> Illumos variant, fork of Solaris).
> >> >>
> >> >> It also needed more time than FreeBSD, which is not necessarily funny
> >> >> but not that great either.
> >> >>
> >> >> All systems were mostly busy contending on locks and in particular
> >> >> Linux
> >> >> was almost exclusively busy waiting on inode hash lock.
> >> >
> >> > Did you bother to test the patch, or are you just complaining
> >> > that nobody has already done the work for you?
> >>
> >> Why are you giving me attitude?
> >
> > Look in the mirror, mate.
> >
> > Starting off with a derogatory statement like:
> >
> > "In case you can't be arsed, ..."
> >
> > is a really good way to start a fight.
> >
> > I don't think anyone working on this stuff couldn't be bothered to
> > get their lazy arses off their couches to get it merged. Though you
> > may not have intended it that way, that's exactly what "can't be
> > arsed" means.
> >
> > I have not asked for this code to be merged because I'm not ready to
> > ask for it to be merged. I'm trying to be careful and cautious about
> > changing core kernel code that every linux installation out there
> > uses because I care about this code being robust and stable. That's
> > the exact opposite of "can't be arsed"....
> >
> > Further, you have asked for code that is not ready to be merged to
> > be merged without reviewing it or even testing it to see if it
> > solved your reported problem. This is pretty basic stuff - if you
> > want it merged, then *you also need to put effort into getting it
> > merged* regardless of who wrote the code. TANSTAAFL.
> >
> > But you've done neither - you've just made demands and thrown
> > hypocritical shade implying busy people working on complex code are
> > lazy arses.
> >
> 
> So I took a few days to take a look at this with a fresh eye and I see
> where the major disconnect is coming from, albeit I still don't see how
> it came to be nor why it persists.
> 
> To my understanding your understanding is that I demand you carry the
> hash bl patch over the finish line and I'm rude about it as well.
> 
> That is not my position here though.
> 
> For starters my opening e-mail was to Christian, not you. You are
> CC'ed as the patch author. It is responding to an e-mail which claimed
> the patch would land in -next, which to my poking around did not
> happen (and I checked it's not in master either). Since there was no
> other traffic about it that I could find, I figured it was probably
> forgotten. You may also notice the e-mail explicitly states:
> 1. I have a case which runs into inode hash being a problem
> 2. *there is no urgency*, I'm just asking what's up with the patch not
> getting anywhere.
> 
> The follow up including a statement about "being arsed" once more was
> to Christian, not you and was rather "tongue in cheek".

I thought that was a very rude way to address Christian.

Notice how he hasn't even given you a response?

"Hello, this patch improves performance for me on _______ workload.
What needs to be done to get this ready for merging?  I'd like to
take on that work."

--D

> If you know about Illumos, it is mostly slow and any serious
> performance work stopped there when Oracle closed the codebase over a
> decade ago. Or to put it differently, one has to be doing something
> really bad to not be faster today. And there was this bad -- the inode
> hash. I found it amusing and decided to share in addition to asking
> about the patch.
> 
> So no Dave, I'm not claiming the patch is not in because anyone is lazy.
> 
> Whether the patch is ready for reviews and whatnot is your call to
> make as the author.
> 
> To repeat from my previous e-mail I note the lock causes real problems
> in a real-world setting, it's not just microbenchmarks, but I'm in no
> position to test it against the actual workload (only the part I
> carved out into a benchmark, where it does help -- gets rid of the
> nasty back-to-back lock acquire, first to search for the inode and
> then to insert a new one).
> 
> If your assessment is that more testing is needed, that makes sense
> and is again your call to make. I repeat again I can't help with this
> bit though. And if you don't think the effort is justified at the
> moment (or there are other things with higher priority), so be it.
> 
> It may be I'll stick around in general and if so it may be I'm going
> to run into you again.
> With this in mind:
> 
> > Perhaps you should consider your words more carefully in future?
> >
> 
> On that front perhaps you could refrain from assuming someone is
> trying to call you names or whatnot. But more importantly if you
> consider an e-mail to be rude, you can call it out instead of
> escalating or responding in what you consider to be the same tone.
> 
> All that said I'm bailing from this patchset.
> 
> Cheers,
> -- 
> Mateusz Guzik <mjguzik gmail.com>
> 

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-10-27 17:13               ` Mateusz Guzik
  2023-10-27 18:36                 ` Darrick J. Wong
@ 2023-10-31 11:02                 ` Christian Brauner
  2023-10-31 11:31                   ` Mateusz Guzik
  2023-11-02  2:36                   ` Kent Overstreet
  1 sibling, 2 replies; 186+ messages in thread
From: Christian Brauner @ 2023-10-31 11:02 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Dave Chinner, Dave Chinner, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Alexander Viro

> The follow up including a statement about "being arsed" once more was
> to Christian, not you and was rather "tongue in cheek".

Fyi, I can't be arsed to be talked to like that.

> Whether the patch is ready for reviews and whatnot is your call to
> make as the author.

This is basically why that patch never stayed in -next. Dave said this
patch is meaningless without his other patches and I had no reason to
doubt that claim, nor do I currently have the cycles to benchmark and
disprove it.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-10-31 11:02                 ` Christian Brauner
@ 2023-10-31 11:31                   ` Mateusz Guzik
  2023-11-02  2:36                   ` Kent Overstreet
  1 sibling, 0 replies; 186+ messages in thread
From: Mateusz Guzik @ 2023-10-31 11:31 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Dave Chinner, Dave Chinner, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Alexander Viro

On 10/31/23, Christian Brauner <brauner@kernel.org> wrote:
>> The follow up including a statement about "being arsed" once more was
>> to Christian, not you and was rather "tongue in cheek".
>
> Fyi, I can't be arsed to be talked to like that.
>

Maybe there is a language or cultural barrier at play here and the
above comes off to you as inflammatory.

I assumed the tone here is rather informal. For example here is an
excerpt from your response to me in another thread:
> You're all driving me nuts. ;)

I assumed this was not a serious statement and the "being arsed" thing
was written by me in the same spirit. I find it surprising there is a
strong reaction to it, but I can only explain why I wrote it the way
I did.

All that said, I have some patches I intend to submit in the
foreseeable future. I am going to make sure to stick to a more
professional tone.

>> Whether the patch is ready for reviews and whatnot is your call to
>> make as the author.
>
> This is basically why that patch never stayed in -next. Dave said this
> patch is meaningless without his other patches and I had no reason to
> doubt that claim, nor do I currently have the cycles to benchmark and
> disprove it.
>

That makes sense, thanks.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-10-31 11:02                 ` Christian Brauner
  2023-10-31 11:31                   ` Mateusz Guzik
@ 2023-11-02  2:36                   ` Kent Overstreet
  2023-11-04 20:51                     ` Dave Chinner
  1 sibling, 1 reply; 186+ messages in thread
From: Kent Overstreet @ 2023-11-02  2:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Mateusz Guzik, Dave Chinner, Dave Chinner, linux-kernel,
	linux-fsdevel, linux-bcachefs, Alexander Viro

On Tue, Oct 31, 2023 at 12:02:47PM +0100, Christian Brauner wrote:
> > The follow up including a statement about "being arsed" once more was
> > to Christian, not you and was rather "tongue in cheek".
> 
> Fyi, I can't be arsed to be talked to like that.
> 
> > Whether the patch is ready for reviews and whatnot is your call to
> > make as the author.
> 
> This is basically why that patch never stayed in -next. Dave said this
> patch is meaningless without his other patches and I had no reason to
> doubt that claim, nor do I currently have the cycles to benchmark and
> disprove it.

It was a big benefit to bcachefs performance, and I've had it in my tree
for quite some time. Was there any other holdup?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-11-02  2:36                   ` Kent Overstreet
@ 2023-11-04 20:51                     ` Dave Chinner
  0 siblings, 0 replies; 186+ messages in thread
From: Dave Chinner @ 2023-11-04 20:51 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christian Brauner, Mateusz Guzik, Dave Chinner, linux-kernel,
	linux-fsdevel, linux-bcachefs, Alexander Viro

On Wed, Nov 01, 2023 at 10:36:15PM -0400, Kent Overstreet wrote:
> On Tue, Oct 31, 2023 at 12:02:47PM +0100, Christian Brauner wrote:
> > > The follow up including a statement about "being arsed" once more was
> > > to Christian, not you and was rather "tongue in cheek".
> > 
> > Fyi, I can't be arsed to be talked to like that.
> > 
> > > Whether the patch is ready for reviews and whatnot is your call to
> > > make as the author.
> > 
> > This is basically why that patch never stayed in -next. Dave said this
> > patch is meaningless without his other patches and I had no reason to
> > doubt that claim, nor do I currently have the cycles to benchmark and
> > disprove it.
> 
> It was a big benefit to bcachefs performance, and I've had it in my tree
> for quite some time. Was there any other holdup?

Plenty.

- A lack of recent validation against ext4, btrfs and other
  filesystems.
- the loss of lockdep coverage by moving to bit locks
- it breaks CONFIG_PREEMPT_RT=y because we nest other spinlocks
  inside the inode_hash_lock and we can't do that if we convert the
  inode hash to bit locks because RT makes spinlocks sleeping locks.
- There have been additions for lockless RCU inode hash lookups from
  AFS and ext4 in weird, uncommon corner cases, and I have no idea
  how to validate that they still work correctly with hash-bl. I
  suspect they should just go away with hash-bl, but....

There's more, but these are the big ones.
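
(For context on the lockdep and RT points above: a bit lock packs the
lock into the low bit of the bucket's head pointer, so there is no
spinlock_t object left for lockdep to instrument, and nothing RT can
convert into a sleeping lock -- which is why nesting ordinary, on RT
sleeping, spinlocks inside it is a problem. A userspace sketch of the
idea with C11 atomics -- not the kernel's hlist_bl/bit_spin_lock
implementation -- looks roughly like this:)

/*
 * Userspace illustration of a bit lock: the lock is one bit stolen
 * from the bucket's head pointer, so there is no separate lock object.
 * Sketch with C11 atomics, not the kernel's hlist_bl implementation.
 */
#include <stdatomic.h>
#include <stdint.h>

#define LOCK_BIT 1UL            /* low bit of the head pointer is the lock */

struct node {
        struct node *next;
        unsigned long key;
};

struct bl_head {
        _Atomic(uintptr_t) first;  /* pointer bits and lock bit packed together */
};

static void bl_lock(struct bl_head *h)
{
        for (;;) {
                uintptr_t old = atomic_fetch_or_explicit(&h->first, LOCK_BIT,
                                                         memory_order_acquire);
                if (!(old & LOCK_BIT))
                        return;         /* bit was clear, we now own the lock */
                /* spin until the holder clears the bit, then retry */
                while (atomic_load_explicit(&h->first,
                                            memory_order_relaxed) & LOCK_BIT)
                        ;
        }
}

static void bl_unlock(struct bl_head *h)
{
        atomic_fetch_and_explicit(&h->first, ~LOCK_BIT, memory_order_release);
}

/* walking or modifying the chain is only legal while the bit is held */
static struct node *bl_first(struct bl_head *h)
{
        return (struct node *)(atomic_load_explicit(&h->first,
                                memory_order_relaxed) & ~LOCK_BIT);
}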

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 186+ messages in thread

end of thread, other threads:[~2023-11-04 20:51 UTC | newest]

Thread overview: 186+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
2023-05-09 16:56 ` [PATCH 01/32] Compiler Attributes: add __flatten Kent Overstreet
2023-05-09 17:04   ` Miguel Ojeda
2023-05-09 17:24     ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 02/32] locking/lockdep: lock_class_is_held() Kent Overstreet
2023-05-09 19:30   ` Peter Zijlstra
2023-05-09 20:11     ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion() Kent Overstreet
2023-05-09 19:31   ` Peter Zijlstra
2023-05-09 19:57     ` Kent Overstreet
2023-05-09 20:18     ` Kent Overstreet
2023-05-09 20:27       ` Waiman Long
2023-05-09 20:35         ` Kent Overstreet
2023-05-09 21:37           ` Waiman Long
2023-05-10  8:59       ` Peter Zijlstra
2023-05-10 20:38         ` Kent Overstreet
2023-05-11  8:25           ` Peter Zijlstra
2023-05-11  9:32             ` Kent Overstreet
2023-05-12 20:49         ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 04/32] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
2023-05-11 12:14   ` Jan Engelhardt
2023-05-12 20:58     ` Kent Overstreet
2023-05-12 22:39       ` Jan Engelhardt
2023-05-12 23:26         ` Kent Overstreet
2023-05-12 23:49           ` Randy Dunlap
2023-05-13  0:17             ` Kent Overstreet
2023-05-13  0:45               ` Eric Biggers
2023-05-13  0:51                 ` Kent Overstreet
2023-05-14 12:15   ` Jeff Layton
2023-05-15  2:39     ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 05/32] MAINTAINERS: Add entry for six locks Kent Overstreet
2023-05-09 16:56 ` [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping Kent Overstreet
2023-05-10  1:07   ` Jan Kara
2023-05-10  6:18     ` Kent Overstreet
2023-05-23 13:34       ` Jan Kara
2023-05-23 16:21         ` [Cluster-devel] " Christoph Hellwig
2023-05-23 16:35           ` Kent Overstreet
2023-05-24  6:43             ` Christoph Hellwig
2023-05-24  8:09               ` Kent Overstreet
2023-05-25  8:58                 ` Christoph Hellwig
2023-05-25 20:50                   ` Kent Overstreet
2023-05-26  8:06                     ` Christoph Hellwig
2023-05-26  8:34                       ` Kent Overstreet
2023-05-25 21:40                   ` Kent Overstreet
2023-05-25 22:25           ` Andreas Grünbacher
2023-05-25 23:20             ` Kent Overstreet
2023-05-26  0:05               ` Andreas Grünbacher
2023-05-26  0:39                 ` Kent Overstreet
2023-05-26  8:10               ` Christoph Hellwig
2023-05-26  8:38                 ` Kent Overstreet
2023-05-23 16:49         ` Kent Overstreet
2023-05-25  8:47           ` Jan Kara
2023-05-25 21:36             ` Kent Overstreet
2023-05-25 22:45             ` Andreas Grünbacher
2023-05-25 22:04         ` Andreas Grünbacher
2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
2023-05-09 18:19   ` Lorenzo Stoakes
2023-05-09 20:15     ` Kent Overstreet
2023-05-09 20:46   ` Christoph Hellwig
2023-05-09 21:12     ` Lorenzo Stoakes
2023-05-09 21:29       ` Kent Overstreet
2023-05-10  6:48         ` Eric Biggers
2023-05-12 18:36           ` Kent Overstreet
2023-05-13  1:57             ` Eric Biggers
2023-05-13 19:28               ` Kent Overstreet
2023-05-14  5:45               ` Kent Overstreet
2023-05-14 18:43                 ` Eric Biggers
2023-05-15  5:38                   ` Kent Overstreet
2023-05-15  6:13                     ` Eric Biggers
2023-05-15  6:18                       ` Kent Overstreet
2023-05-15  7:13                         ` Eric Biggers
2023-05-15  7:26                           ` Kent Overstreet
2023-05-21 21:33                             ` Eric Biggers
2023-05-21 22:04                               ` Kent Overstreet
2023-05-15 10:29                 ` David Laight
2023-05-10 11:56         ` David Laight
2023-05-09 21:43       ` Darrick J. Wong
2023-05-09 21:54         ` Kent Overstreet
2023-05-11  5:33           ` Theodore Ts'o
2023-05-11  5:44             ` Kent Overstreet
2023-05-13 13:25       ` Lorenzo Stoakes
2023-05-14 18:39         ` Christophe Leroy
2023-05-14 23:43           ` Kent Overstreet
2023-05-15  4:45             ` Christophe Leroy
2023-05-15  5:02               ` Kent Overstreet
2023-05-10 14:18   ` Christophe Leroy
2023-05-10 15:05   ` Johannes Thumshirn
2023-05-11 22:28     ` Kees Cook
2023-05-12 18:41       ` Kent Overstreet
2023-05-16 21:02         ` Kees Cook
2023-05-16 21:20           ` Kent Overstreet
2023-05-16 21:47             ` Matthew Wilcox
2023-05-16 21:57               ` Kent Overstreet
2023-05-17  5:28               ` Kent Overstreet
2023-05-17 14:04                 ` Mike Rapoport
2023-05-17 14:18                   ` Kent Overstreet
2023-05-17 15:44                     ` Mike Rapoport
2023-05-17 15:59                       ` Kent Overstreet
2023-06-17  4:13             ` Andy Lutomirski
2023-06-17 15:34               ` Kent Overstreet
2023-06-17 19:19                 ` Andy Lutomirski
2023-06-17 20:08                   ` Kent Overstreet
2023-06-17 20:35                     ` Andy Lutomirski
2023-06-19 19:45                 ` Kees Cook
2023-06-20  0:39                   ` Kent Overstreet
2023-06-19  9:19   ` Mark Rutland
2023-06-19 10:47     ` Kent Overstreet
2023-06-19 12:47       ` Mark Rutland
2023-06-19 19:17         ` Kent Overstreet
2023-06-20 17:42           ` Andy Lutomirski
2023-06-20 18:08             ` Kent Overstreet
2023-06-20 18:15               ` Andy Lutomirski
2023-06-20 18:48                 ` Dave Hansen
2023-06-20 20:18                   ` Kent Overstreet
2023-06-20 20:42                   ` Andy Lutomirski
2023-06-20 22:32                     ` Andy Lutomirski
2023-06-20 22:43                       ` Nadav Amit
2023-06-21  1:27                         ` Andy Lutomirski
2023-05-09 16:56 ` [PATCH 08/32] fs: factor out d_mark_tmpfile() Kent Overstreet
2023-05-09 16:56 ` [PATCH 09/32] block: Add some exports for bcachefs Kent Overstreet
2023-05-09 16:56 ` [PATCH 10/32] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset Kent Overstreet
2023-05-09 16:56 ` [PATCH 11/32] block: Bring back zero_fill_bio_iter Kent Overstreet
2023-05-09 16:56 ` [PATCH 12/32] block: Rework bio_for_each_segment_all() Kent Overstreet
2023-05-09 16:56 ` [PATCH 13/32] block: Rework bio_for_each_folio_all() Kent Overstreet
2023-05-09 16:56 ` [PATCH 14/32] block: Don't block on s_umount from __invalidate_super() Kent Overstreet
2023-05-09 16:56 ` [PATCH 15/32] bcache: move closures to lib/ Kent Overstreet
2023-05-10  1:10   ` Randy Dunlap
2023-05-09 16:56 ` [PATCH 16/32] MAINTAINERS: Add entry for closures Kent Overstreet
2023-05-09 17:05   ` Coly Li
2023-05-09 21:03   ` Randy Dunlap
2023-05-09 16:56 ` [PATCH 17/32] closures: closure_wait_event() Kent Overstreet
2023-05-09 16:56 ` [PATCH 18/32] closures: closure_nr_remaining() Kent Overstreet
2023-05-09 16:56 ` [PATCH 19/32] closures: Add a missing include Kent Overstreet
2023-05-09 16:56 ` [PATCH 20/32] vfs: factor out inode hash head calculation Kent Overstreet
2023-05-23  9:27   ` (subset) " Christian Brauner
2023-05-23 22:53     ` Dave Chinner
2023-05-24  6:44       ` Christoph Hellwig
2023-05-24  7:35         ` Dave Chinner
2023-05-24  8:31           ` Christian Brauner
2023-05-24  8:41             ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 21/32] hlist-bl: add hlist_bl_fake() Kent Overstreet
2023-05-10  4:48   ` Dave Chinner
2023-05-23  9:27   ` (subset) " Christian Brauner
2023-05-09 16:56 ` [PATCH 22/32] vfs: inode cache conversion to hash-bl Kent Overstreet
2023-05-10  4:45   ` Dave Chinner
2023-05-16 15:45     ` Christian Brauner
2023-05-16 16:17       ` Kent Overstreet
2023-05-16 23:15         ` Dave Chinner
2023-05-22 13:04           ` Christian Brauner
2023-05-23  9:28   ` (subset) " Christian Brauner
2023-10-19 15:30     ` Mateusz Guzik
2023-10-19 15:59       ` Mateusz Guzik
2023-10-20 11:38         ` Dave Chinner
2023-10-20 17:49           ` Mateusz Guzik
2023-10-21 12:13             ` Mateusz Guzik
2023-10-23  5:10             ` Dave Chinner
2023-10-27 17:13               ` Mateusz Guzik
2023-10-27 18:36                 ` Darrick J. Wong
2023-10-31 11:02                 ` Christian Brauner
2023-10-31 11:31                   ` Mateusz Guzik
2023-11-02  2:36                   ` Kent Overstreet
2023-11-04 20:51                     ` Dave Chinner
2023-05-09 16:56 ` [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic() Kent Overstreet
2023-05-10  2:20   ` kernel test robot
2023-05-11  2:08   ` kernel test robot
2023-05-09 16:56 ` [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree Kent Overstreet
2023-05-09 21:03   ` Randy Dunlap
2023-05-09 16:56 ` [PATCH 25/32] lib/generic-radix-tree.c: Don't overflow in peek() Kent Overstreet
2023-05-09 16:56 ` [PATCH 26/32] lib/generic-radix-tree.c: Add a missing include Kent Overstreet
2023-05-09 16:56 ` [PATCH 27/32] lib/generic-radix-tree.c: Add peek_prev() Kent Overstreet
2023-05-09 16:56 ` [PATCH 28/32] stacktrace: Export stack_trace_save_tsk Kent Overstreet
2023-06-19  9:10   ` Mark Rutland
2023-06-19 11:16     ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote Kent Overstreet
2023-07-12 19:58   ` Kees Cook
2023-07-12 20:19     ` Kent Overstreet
2023-07-12 22:38       ` Kees Cook
2023-07-12 23:53         ` Kent Overstreet
2023-07-12 20:23     ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 30/32] lib: Export errname Kent Overstreet
2023-05-09 16:56 ` [PATCH 31/32] lib: add mean and variance module Kent Overstreet
2023-05-09 16:56 ` [PATCH 32/32] MAINTAINERS: Add entry for bcachefs Kent Overstreet
2023-05-09 21:04   ` Randy Dunlap
2023-05-09 21:07     ` Kent Overstreet
2023-06-15 20:41 ` [PATCH 00/32] bcachefs - a new COW filesystem Pavel Machek
2023-06-15 21:26   ` Kent Overstreet
