linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/10] RFC: assorted bcachefs patches
@ 2018-05-18  7:48 Kent Overstreet
  2018-05-18  7:49 ` [PATCH 01/10] mm: pagecache add lock Kent Overstreet
                   ` (11 more replies)
  0 siblings, 12 replies; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18  7:48 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Kent Overstreet, Andrew Morton, Dave Chinner, darrick.wong,
	tytso, linux-btrfs, clm, jbacik, viro, willy, peterz

These are all the remaining patches in my bcachefs tree that touch stuff outside
fs/bcachefs. Not all of them are suitable for inclusion as is, I wanted to get
some discussion first.

 * pagecache add lock

This is the only one that touches existing code in nontrivial ways.  The problem
it's solving is that there is no existing general mechanism for shooting down
pages in the page and keeping them removed, which is a real problem if you're
doing anything that modifies file data and isn't buffered writes.

Historically, the only problematic case has been direct IO, and people have been
willing to say "well, if you mix buffered and direct IO you get what you
deserve", and that's probably not unreasonable. But now we have fallocate insert
range and collapse range, and those are broken in ways I frankly don't want to
think about if they can't ensure consistency with the page cache.

Also, the mechanism truncate uses (i_size and sacrificing a goat) has
historically been rather fragile, IMO it might be a good think if we switched it
to a more general rigorous mechanism.

I need this solved for bcachefs because without this mechanism, the page cache
inconsistencies lead to various assertions popping (primarily when we didn't
think we need to get a disk reservation going by page cache state, but then do
the actual write and disk space accounting says oops, we did need one). And
having to reason about what can happen without a locking mechanism for this is
not something I care to spend brain cycles on.

That said, my patch is kind of ugly, and it requires filesystem changes for
other filesystems to take advantage of it. And unfortunately, since one of the
code paths that needs locking is readahead, I don't see any realistic way of
implementing the locking within just bcachefs code.

So I'm hoping someone has an idea for something cleaner (I think I recall
Matthew Wilcox saying he had an idea for how to use xarray to solve this), but
if not I'll polish up my pagecache add lock patch and see what I can do to make
it less ugly, and hopefully other people find it palatable or at least useful.

 * lglocks

They were removed by Peter Zijlstra when the last in kernel user was removed,
but I've found them useful. His commit message seems to imply he doesn't think
people should be using them, but I'm not sure why. They are a bit niche though,
I can move them to fs/bcachefs if people would prefer. 

 * Generic radix trees

This is a very simple radix tree implementation that can store types of
arbitrary size, not just pointers/unsigned long. It could probably replace
flex arrays.

 * Dynamic fault injection

I've actually had this code sitting in my tree since forever... I know we have
an existing fault injection framework, but I think this one is quite a bit nicer
to actually use.

It works very much like the dynamic debug infrastructure - for those who aren't
familiar, dynamic debug makes it so you can list and individually enable/disable
every pr_debug() callsite in debugfs.

So to add a fault injection site with this, you just stick a call to
dynamic_fault("foobar") somewhere in your code - dynamic_fault() returns true if
you should fail whatever it is you're testing. And then it'll show up in
debugfs, where you can enable/disable faults by file/linenumber, module, name,
etc.

The patch then also adds macros that wrap all the various memory allocation
functions and fail if dynamic_fault("memory") returns true - which means you can
see in debugfs every place you're allocating memory and fail all of them or just
individually (I have tests that iterate over all the faults and flip them on one
by one). I also use it in bcachefs to add fault injection points for uncommon
error paths in the filesystem startup/recovery path, and for various hard to
test slowpaths that only happen if we race in weird ways (race_fault()).

Kent Overstreet (10):
  mm: pagecache add lock
  mm: export find_get_pages()
  locking: bring back lglocks
  locking: export osq_lock()/osq_unlock()
  don't use spin_lock_irqsave() unnecessarily
  Generic radix trees
  bcache: optimize continue_at_nobarrier()
  bcache: move closures to lib/
  closures: closure_wait_event()
  Dynamic fault injection

 drivers/md/bcache/Kconfig                     |  10 +-
 drivers/md/bcache/Makefile                    |   6 +-
 drivers/md/bcache/bcache.h                    |   2 +-
 drivers/md/bcache/super.c                     |   1 -
 drivers/md/bcache/util.h                      |   3 +-
 fs/inode.c                                    |   1 +
 include/asm-generic/vmlinux.lds.h             |   4 +
 .../md/bcache => include/linux}/closure.h     |  50 +-
 include/linux/dynamic_fault.h                 | 117 +++
 include/linux/fs.h                            |  23 +
 include/linux/generic-radix-tree.h            | 131 +++
 include/linux/lglock.h                        |  97 +++
 include/linux/sched.h                         |   4 +
 init/init_task.c                              |   1 +
 kernel/locking/Makefile                       |   1 +
 kernel/locking/lglock.c                       | 105 +++
 kernel/locking/osq_lock.c                     |   2 +
 lib/Kconfig                                   |   3 +
 lib/Kconfig.debug                             |  14 +
 lib/Makefile                                  |   7 +-
 {drivers/md/bcache => lib}/closure.c          |  17 +-
 lib/dynamic_fault.c                           | 760 ++++++++++++++++++
 lib/generic-radix-tree.c                      | 167 ++++
 mm/filemap.c                                  |  92 ++-
 mm/page-writeback.c                           |   5 +-
 25 files changed, 1577 insertions(+), 46 deletions(-)
 rename {drivers/md/bcache => include/linux}/closure.h (92%)
 create mode 100644 include/linux/dynamic_fault.h
 create mode 100644 include/linux/generic-radix-tree.h
 create mode 100644 include/linux/lglock.h
 create mode 100644 kernel/locking/lglock.c
 rename {drivers/md/bcache => lib}/closure.c (95%)
 create mode 100644 lib/dynamic_fault.c
 create mode 100644 lib/generic-radix-tree.c

-- 
2.17.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 01/10] mm: pagecache add lock
  2018-05-18  7:48 [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
@ 2018-05-18  7:49 ` Kent Overstreet
  2018-05-18 13:13   ` Matthew Wilcox
  2018-05-18  7:49 ` [PATCH 02/10] mm: export find_get_pages() Kent Overstreet
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18  7:49 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Kent Overstreet, Andrew Morton, Dave Chinner, darrick.wong,
	tytso, linux-btrfs, clm, jbacik, viro, willy, peterz

Add a per address space lock around adding pages to the pagecache - making it
possible for fallocate INSERT_RANGE/COLLAPSE_RANGE to work correctly, and also
hopefully making truncate and dio a bit saner.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 fs/inode.c            |  1 +
 include/linux/fs.h    | 23 +++++++++++
 include/linux/sched.h |  4 ++
 init/init_task.c      |  1 +
 mm/filemap.c          | 91 ++++++++++++++++++++++++++++++++++++++++---
 5 files changed, 115 insertions(+), 5 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index ef362364d3..e7aaa39adb 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -350,6 +350,7 @@ void address_space_init_once(struct address_space *mapping)
 {
 	memset(mapping, 0, sizeof(*mapping));
 	INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC | __GFP_ACCOUNT);
+	pagecache_lock_init(&mapping->add_lock);
 	spin_lock_init(&mapping->tree_lock);
 	init_rwsem(&mapping->i_mmap_rwsem);
 	INIT_LIST_HEAD(&mapping->private_list);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c6baf76761..18d2886a44 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -388,9 +388,32 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned copied,
 				struct page *page, void *fsdata);
 
+/*
+ * Two-state lock - can be taken for add or block - both states are shared,
+ * like read side of rwsem, but conflict with other state:
+ */
+struct pagecache_lock {
+	atomic_long_t		v;
+	wait_queue_head_t	wait;
+};
+
+static inline void pagecache_lock_init(struct pagecache_lock *lock)
+{
+	atomic_long_set(&lock->v, 0);
+	init_waitqueue_head(&lock->wait);
+}
+
+void pagecache_add_put(struct pagecache_lock *);
+void pagecache_add_get(struct pagecache_lock *);
+void __pagecache_block_put(struct pagecache_lock *);
+void __pagecache_block_get(struct pagecache_lock *);
+void pagecache_block_put(struct pagecache_lock *);
+void pagecache_block_get(struct pagecache_lock *);
+
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
+	struct pagecache_lock	add_lock;	/* protects adding new pages */
 	spinlock_t		tree_lock;	/* and lock protecting it */
 	atomic_t		i_mmap_writable;/* count VM_SHARED mappings */
 	struct rb_root_cached	i_mmap;		/* tree of private and shared mappings */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b161ef8a90..e58465f61a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -40,6 +40,7 @@ struct io_context;
 struct mempolicy;
 struct nameidata;
 struct nsproxy;
+struct pagecache_lock;
 struct perf_event_context;
 struct pid_namespace;
 struct pipe_inode_info;
@@ -865,6 +866,9 @@ struct task_struct {
 	unsigned int			in_ubsan;
 #endif
 
+	/* currently held lock, for avoiding recursing in fault path: */
+	struct pagecache_lock *pagecache_lock;
+
 	/* Journalling filesystem info: */
 	void				*journal_info;
 
diff --git a/init/init_task.c b/init/init_task.c
index 3ac6e754cf..308d46eef9 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -106,6 +106,7 @@ struct task_struct init_task
 	},
 	.blocked	= {{0}},
 	.alloc_lock	= __SPIN_LOCK_UNLOCKED(init_task.alloc_lock),
+	.pagecache_lock = NULL,
 	.journal_info	= NULL,
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
diff --git a/mm/filemap.c b/mm/filemap.c
index 693f62212a..31dd888785 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -111,6 +111,73 @@
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
+static void __pagecache_lock_put(struct pagecache_lock *lock, long i)
+{
+	BUG_ON(atomic_long_read(&lock->v) == 0);
+
+	if (atomic_long_sub_return_release(i, &lock->v) == 0)
+		wake_up_all(&lock->wait);
+}
+
+static bool __pagecache_lock_tryget(struct pagecache_lock *lock, long i)
+{
+	long v = atomic_long_read(&lock->v), old;
+
+	do {
+		old = v;
+
+		if (i > 0 ? v < 0 : v > 0)
+			return false;
+	} while ((v = atomic_long_cmpxchg_acquire(&lock->v,
+					old, old + i)) != old);
+	return true;
+}
+
+static void __pagecache_lock_get(struct pagecache_lock *lock, long i)
+{
+	wait_event(lock->wait, __pagecache_lock_tryget(lock, i));
+}
+
+void pagecache_add_put(struct pagecache_lock *lock)
+{
+	__pagecache_lock_put(lock, 1);
+}
+EXPORT_SYMBOL(pagecache_add_put);
+
+void pagecache_add_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, 1);
+}
+EXPORT_SYMBOL(pagecache_add_get);
+
+void __pagecache_block_put(struct pagecache_lock *lock)
+{
+	__pagecache_lock_put(lock, -1);
+}
+EXPORT_SYMBOL(__pagecache_block_put);
+
+void __pagecache_block_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, -1);
+}
+EXPORT_SYMBOL(__pagecache_block_get);
+
+void pagecache_block_put(struct pagecache_lock *lock)
+{
+	BUG_ON(current->pagecache_lock != lock);
+	current->pagecache_lock = NULL;
+	__pagecache_lock_put(lock, -1);
+}
+EXPORT_SYMBOL(pagecache_block_put);
+
+void pagecache_block_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, -1);
+	BUG_ON(current->pagecache_lock);
+	current->pagecache_lock = lock;
+}
+EXPORT_SYMBOL(pagecache_block_get);
+
 static int page_cache_tree_insert(struct address_space *mapping,
 				  struct page *page, void **shadowp)
 {
@@ -834,18 +901,21 @@ static int __add_to_page_cache_locked(struct page *page,
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
 
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_get(&mapping->add_lock);
+
 	if (!huge) {
 		error = mem_cgroup_try_charge(page, current->mm,
 					      gfp_mask, &memcg, false);
 		if (error)
-			return error;
+			goto err;
 	}
 
 	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
 		if (!huge)
 			mem_cgroup_cancel_charge(page, memcg, false);
-		return error;
+		goto err;
 	}
 
 	get_page(page);
@@ -865,7 +935,11 @@ static int __add_to_page_cache_locked(struct page *page,
 	if (!huge)
 		mem_cgroup_commit_charge(page, memcg, false, false);
 	trace_mm_filemap_add_to_page_cache(page);
-	return 0;
+err:
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_put(&mapping->add_lock);
+
+	return error;
 err_insert:
 	page->mapping = NULL;
 	/* Leave page->index set: truncation relies upon it */
@@ -873,7 +947,7 @@ static int __add_to_page_cache_locked(struct page *page,
 	if (!huge)
 		mem_cgroup_cancel_charge(page, memcg, false);
 	put_page(page);
-	return error;
+	goto err;
 }
 
 /**
@@ -2511,7 +2585,14 @@ int filemap_fault(struct vm_fault *vmf)
 	 * Do we have something in the page cache already?
 	 */
 	page = find_get_page(mapping, offset);
-	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
+	if (unlikely(current->pagecache_lock == &mapping->add_lock)) {
+		/*
+		 * fault from e.g. dio -> get_user_pages() - _don't_ want to do
+		 * readahead, only read in page we need:
+		 */
+		if (!page)
+			goto no_cached_page;
+	} else if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
 		/*
 		 * We found the page, so try async readahead before
 		 * waiting for the lock.
-- 
2.17.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 02/10] mm: export find_get_pages()
  2018-05-18  7:48 [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
  2018-05-18  7:49 ` [PATCH 01/10] mm: pagecache add lock Kent Overstreet
@ 2018-05-18  7:49 ` Kent Overstreet
  2018-05-18 16:00   ` Christoph Hellwig
  2018-05-18  7:49 ` [PATCH 03/10] locking: bring back lglocks Kent Overstreet
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18  7:49 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Kent Overstreet, Andrew Morton, Dave Chinner, darrick.wong,
	tytso, linux-btrfs, clm, jbacik, viro, willy, peterz

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 mm/filemap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 31dd888785..78b99019bf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1845,6 +1845,7 @@ unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
 
 	return ret;
 }
+EXPORT_SYMBOL(find_get_pages_range);
 
 /**
  * find_get_pages_contig - gang contiguous pagecache lookup
-- 
2.17.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 03/10] locking: bring back lglocks
  2018-05-18  7:48 [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
  2018-05-18  7:49 ` [PATCH 01/10] mm: pagecache add lock Kent Overstreet
  2018-05-18  7:49 ` [PATCH 02/10] mm: export find_get_pages() Kent Overstreet
@ 2018-05-18  7:49 ` Kent Overstreet
  2018-05-18  9:51   ` Peter Zijlstra
  2018-05-18  7:49 ` [PATCH 04/10] locking: export osq_lock()/osq_unlock() Kent Overstreet
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18  7:49 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Kent Overstreet, Andrew Morton, Dave Chinner, darrick.wong,
	tytso, linux-btrfs, clm, jbacik, viro, willy, peterz

bcachefs makes use of them - also, add a proper lg_lock_init()

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 include/linux/lglock.h  |  97 +++++++++++++++++++++++++++++++++++++
 kernel/locking/Makefile |   1 +
 kernel/locking/lglock.c | 105 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 203 insertions(+)
 create mode 100644 include/linux/lglock.h
 create mode 100644 kernel/locking/lglock.c

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
new file mode 100644
index 0000000000..c1fbe64bd2
--- /dev/null
+++ b/include/linux/lglock.h
@@ -0,0 +1,97 @@
+/*
+ * Specialised local-global spinlock. Can only be declared as global variables
+ * to avoid overhead and keep things simple (and we don't want to start using
+ * these inside dynamically allocated structures).
+ *
+ * "local/global locks" (lglocks) can be used to:
+ *
+ * - Provide fast exclusive access to per-CPU data, with exclusive access to
+ *   another CPU's data allowed but possibly subject to contention, and to
+ *   provide very slow exclusive access to all per-CPU data.
+ * - Or to provide very fast and scalable read serialisation, and to provide
+ *   very slow exclusive serialisation of data (not necessarily per-CPU data).
+ *
+ * Brlocks are also implemented as a short-hand notation for the latter use
+ * case.
+ *
+ * Copyright 2009, 2010, Nick Piggin, Novell Inc.
+ */
+#ifndef __LINUX_LGLOCK_H
+#define __LINUX_LGLOCK_H
+
+#include <linux/spinlock.h>
+#include <linux/lockdep.h>
+#include <linux/percpu.h>
+#include <linux/cpu.h>
+#include <linux/notifier.h>
+
+#ifdef CONFIG_SMP
+
+struct lglock {
+	arch_spinlock_t __percpu *lock;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map    dep_map;
+#endif
+};
+
+#define DEFINE_LGLOCK(name)						\
+	static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)		\
+	= __ARCH_SPIN_LOCK_UNLOCKED;					\
+	struct lglock name = { .lock = &name ## _lock }
+
+#define DEFINE_STATIC_LGLOCK(name)					\
+	static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)		\
+	= __ARCH_SPIN_LOCK_UNLOCKED;					\
+	static struct lglock name = { .lock = &name ## _lock }
+
+static inline void lg_lock_free(struct lglock *lg)
+{
+	free_percpu(lg->lock);
+}
+
+#define lg_lock_lockdep_init(lock)					\
+do {									\
+	static struct lock_class_key __key;				\
+									\
+	lockdep_init_map(&(lock)->dep_map, #lock, &__key, 0);		\
+} while (0)
+
+static inline int __lg_lock_init(struct lglock *lg)
+{
+	lg->lock = alloc_percpu(*lg->lock);
+	return lg->lock ? 0 : -ENOMEM;
+}
+
+#define lg_lock_init(lock)						\
+({									\
+	lg_lock_lockdep_init(lock);					\
+	__lg_lock_init(lock);						\
+})
+
+void lg_local_lock(struct lglock *lg);
+void lg_local_unlock(struct lglock *lg);
+void lg_local_lock_cpu(struct lglock *lg, int cpu);
+void lg_local_unlock_cpu(struct lglock *lg, int cpu);
+
+void lg_double_lock(struct lglock *lg, int cpu1, int cpu2);
+void lg_double_unlock(struct lglock *lg, int cpu1, int cpu2);
+
+void lg_global_lock(struct lglock *lg);
+void lg_global_unlock(struct lglock *lg);
+
+#else
+/* When !CONFIG_SMP, map lglock to spinlock */
+#define lglock spinlock
+#define DEFINE_LGLOCK(name) DEFINE_SPINLOCK(name)
+#define DEFINE_STATIC_LGLOCK(name) static DEFINE_SPINLOCK(name)
+#define lg_lock_init(lg)	({ spin_lock_init(lg); 0; })
+#define lg_lock_free(lg)	do {} while (0)
+#define lg_local_lock spin_lock
+#define lg_local_unlock spin_unlock
+#define lg_local_lock_cpu(lg, cpu) spin_lock(lg)
+#define lg_local_unlock_cpu(lg, cpu) spin_unlock(lg)
+#define lg_global_lock spin_lock
+#define lg_global_unlock spin_unlock
+#endif
+
+#endif
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 392c7f23af..e5bb62823d 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
 endif
 obj-$(CONFIG_SMP) += spinlock.o
 obj-$(CONFIG_LOCK_SPIN_ON_OWNER) += osq_lock.o
+obj-$(CONFIG_SMP) += lglock.o
 obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
 obj-$(CONFIG_QUEUED_SPINLOCKS) += qspinlock.o
 obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
diff --git a/kernel/locking/lglock.c b/kernel/locking/lglock.c
new file mode 100644
index 0000000000..051feaccc4
--- /dev/null
+++ b/kernel/locking/lglock.c
@@ -0,0 +1,105 @@
+/* See include/linux/lglock.h for description */
+#include <linux/module.h>
+#include <linux/lglock.h>
+#include <linux/cpu.h>
+#include <linux/string.h>
+
+/*
+ * Note there is no uninit, so lglocks cannot be defined in
+ * modules (but it's fine to use them from there)
+ * Could be added though, just undo lg_lock_init
+ */
+
+void lg_local_lock(struct lglock *lg)
+{
+	arch_spinlock_t *lock;
+
+	preempt_disable();
+	lock_acquire_shared(&lg->dep_map, 0, 0, NULL, _RET_IP_);
+	lock = this_cpu_ptr(lg->lock);
+	arch_spin_lock(lock);
+}
+EXPORT_SYMBOL(lg_local_lock);
+
+void lg_local_unlock(struct lglock *lg)
+{
+	arch_spinlock_t *lock;
+
+	lock_release(&lg->dep_map, 1, _RET_IP_);
+	lock = this_cpu_ptr(lg->lock);
+	arch_spin_unlock(lock);
+	preempt_enable();
+}
+EXPORT_SYMBOL(lg_local_unlock);
+
+void lg_local_lock_cpu(struct lglock *lg, int cpu)
+{
+	arch_spinlock_t *lock;
+
+	preempt_disable();
+	lock_acquire_shared(&lg->dep_map, 0, 0, NULL, _RET_IP_);
+	lock = per_cpu_ptr(lg->lock, cpu);
+	arch_spin_lock(lock);
+}
+EXPORT_SYMBOL(lg_local_lock_cpu);
+
+void lg_local_unlock_cpu(struct lglock *lg, int cpu)
+{
+	arch_spinlock_t *lock;
+
+	lock_release(&lg->dep_map, 1, _RET_IP_);
+	lock = per_cpu_ptr(lg->lock, cpu);
+	arch_spin_unlock(lock);
+	preempt_enable();
+}
+EXPORT_SYMBOL(lg_local_unlock_cpu);
+
+void lg_double_lock(struct lglock *lg, int cpu1, int cpu2)
+{
+	BUG_ON(cpu1 == cpu2);
+
+	/* lock in cpu order, just like lg_global_lock */
+	if (cpu2 < cpu1)
+		swap(cpu1, cpu2);
+
+	preempt_disable();
+	lock_acquire_shared(&lg->dep_map, 0, 0, NULL, _RET_IP_);
+	arch_spin_lock(per_cpu_ptr(lg->lock, cpu1));
+	arch_spin_lock(per_cpu_ptr(lg->lock, cpu2));
+}
+
+void lg_double_unlock(struct lglock *lg, int cpu1, int cpu2)
+{
+	lock_release(&lg->dep_map, 1, _RET_IP_);
+	arch_spin_unlock(per_cpu_ptr(lg->lock, cpu1));
+	arch_spin_unlock(per_cpu_ptr(lg->lock, cpu2));
+	preempt_enable();
+}
+
+void lg_global_lock(struct lglock *lg)
+{
+	int i;
+
+	preempt_disable();
+	lock_acquire_exclusive(&lg->dep_map, 0, 0, NULL, _RET_IP_);
+	for_each_possible_cpu(i) {
+		arch_spinlock_t *lock;
+		lock = per_cpu_ptr(lg->lock, i);
+		arch_spin_lock(lock);
+	}
+}
+EXPORT_SYMBOL(lg_global_lock);
+
+void lg_global_unlock(struct lglock *lg)
+{
+	int i;
+
+	lock_release(&lg->dep_map, 1, _RET_IP_);
+	for_each_possible_cpu(i) {
+		arch_spinlock_t *lock;
+		lock = per_cpu_ptr(lg->lock, i);
+		arch_spin_unlock(lock);
+	}
+	preempt_enable();
+}
+EXPORT_SYMBOL(lg_global_unlock);
-- 
2.17.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 04/10] locking: export osq_lock()/osq_unlock()
  2018-05-18  7:48 [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
                   ` (2 preceding siblings ...)
  2018-05-18  7:49 ` [PATCH 03/10] locking: bring back lglocks Kent Overstreet
@ 2018-05-18  7:49 ` Kent Overstreet
  2018-05-18  9:52   ` Peter Zijlstra
  2018-05-18  7:49 ` [PATCH 05/10] don't use spin_lock_irqsave() unnecessarily Kent Overstreet
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18  7:49 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Kent Overstreet, Andrew Morton, Dave Chinner, darrick.wong,
	tytso, linux-btrfs, clm, jbacik, viro, willy, peterz

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 kernel/locking/osq_lock.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 6ef600aa0f..dfa71347b0 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -202,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 
 	return false;
 }
+EXPORT_SYMBOL_GPL(osq_lock);
 
 void osq_unlock(struct optimistic_spin_queue *lock)
 {
@@ -229,3 +230,4 @@ void osq_unlock(struct optimistic_spin_queue *lock)
 	if (next)
 		WRITE_ONCE(next->locked, 1);
 }
+EXPORT_SYMBOL_GPL(osq_unlock);
-- 
2.17.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 05/10] don't use spin_lock_irqsave() unnecessarily
  2018-05-18  7:48 [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
                   ` (3 preceding siblings ...)
  2018-05-18  7:49 ` [PATCH 04/10] locking: export osq_lock()/osq_unlock() Kent Overstreet
@ 2018-05-18  7:49 ` Kent Overstreet
  2018-05-18 16:01   ` Christoph Hellwig
  2018-05-18  7:49 ` [PATCH 06/10] Generic radix trees Kent Overstreet
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18  7:49 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Kent Overstreet, Andrew Morton, Dave Chinner, darrick.wong,
	tytso, linux-btrfs, clm, jbacik, viro, willy, peterz

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 mm/page-writeback.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 586f31261c..17ccc294c9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2460,20 +2460,19 @@ int __set_page_dirty_nobuffers(struct page *page)
 	lock_page_memcg(page);
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
-		unsigned long flags;
 
 		if (!mapping) {
 			unlock_page_memcg(page);
 			return 1;
 		}
 
-		spin_lock_irqsave(&mapping->tree_lock, flags);
+		spin_lock_irq(&mapping->tree_lock);
 		BUG_ON(page_mapping(page) != mapping);
 		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
 		radix_tree_tag_set(&mapping->page_tree, page_index(page),
 				   PAGECACHE_TAG_DIRTY);
-		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		spin_unlock_irq(&mapping->tree_lock);
 		unlock_page_memcg(page);
 
 		if (mapping->host) {
-- 
2.17.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 06/10] Generic radix trees
  2018-05-18  7:48 [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
                   ` (4 preceding siblings ...)
  2018-05-18  7:49 ` [PATCH 05/10] don't use spin_lock_irqsave() unnecessarily Kent Overstreet
@ 2018-05-18  7:49 ` Kent Overstreet
  2018-05-18 16:02   ` Christoph Hellwig
  2018-05-18  7:49 ` [PATCH 07/10] bcache: optimize continue_at_nobarrier() Kent Overstreet
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18  7:49 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Kent Overstreet, Andrew Morton, Dave Chinner, darrick.wong,
	tytso, linux-btrfs, clm, jbacik, viro, willy, peterz

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 include/linux/generic-radix-tree.h | 131 ++++++++++++++++++++++
 lib/Makefile                       |   3 +-
 lib/generic-radix-tree.c           | 167 +++++++++++++++++++++++++++++
 3 files changed, 300 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/generic-radix-tree.h
 create mode 100644 lib/generic-radix-tree.c

diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
new file mode 100644
index 0000000000..4ede23e55f
--- /dev/null
+++ b/include/linux/generic-radix-tree.h
@@ -0,0 +1,131 @@
+#ifndef _LINUX_GENERIC_RADIX_TREE_H
+#define _LINUX_GENERIC_RADIX_TREE_H
+
+/*
+ * Generic radix trees/sparse arrays:
+ *
+ * A generic radix tree has all nodes of size PAGE_SIZE - both leaves and
+ * interior nodes.
+ */
+
+#include <asm/page.h>
+#include <linux/bug.h>
+#include <linux/kernel.h>
+#include <linux/log2.h>
+
+struct genradix_node;
+
+struct __genradix {
+	struct genradix_node		*root;
+	size_t				depth;
+};
+
+/*
+ * NOTE: currently, sizeof(_type) must be a power of two and not larger than
+ * PAGE_SIZE:
+ */
+
+#define __GENRADIX_INITIALIZER					\
+	{							\
+		.tree = {					\
+			.root = NULL,				\
+			.depth = 0,				\
+		}						\
+	}
+
+/*
+ * We use a 0 size array to stash the type we're storing without taking any
+ * space at runtime - then the various accessor macros can use typeof() to get
+ * to it for casts/sizeof - we also force the alignment so that storing a type
+ * with a ridiculous alignment doesn't blow up the alignment or size of the
+ * genradix.
+ */
+
+#define GENRADIX(_type)						\
+struct {							\
+	struct __genradix	tree;				\
+	_type			type[0] __aligned(1);		\
+}
+
+#define DEFINE_GENRADIX(_name, _type)				\
+	GENRADIX(_type) _name = __GENRADIX_INITIALIZER
+
+#define genradix_init(_radix)					\
+do {								\
+	*(_radix) = (typeof(*_radix)) __GENRADIX_INITIALIZER;	\
+} while (0)
+
+void __genradix_free(struct __genradix *);
+
+#define genradix_free(_radix)	__genradix_free(&(_radix)->tree)
+
+static inline size_t __idx_to_offset(size_t idx, size_t obj_size)
+{
+	BUILD_BUG_ON(obj_size > PAGE_SIZE);
+
+	if (!is_power_of_2(obj_size)) {
+		size_t objs_per_page = PAGE_SIZE / obj_size;
+
+		return (idx / objs_per_page) * PAGE_SIZE +
+			(idx % objs_per_page) * obj_size;
+	} else {
+		return idx * obj_size;
+	}
+}
+
+#define __genradix_cast(_radix)		(typeof((_radix)->type[0]) *)
+#define __genradix_obj_size(_radix)	sizeof((_radix)->type[0])
+#define __genradix_idx_to_offset(_radix, _idx)			\
+	__idx_to_offset(_idx, __genradix_obj_size(_radix))
+
+void *__genradix_ptr(struct __genradix *, size_t);
+
+/* Returns a pointer to element at @_idx */
+#define genradix_ptr(_radix, _idx)				\
+	(__genradix_cast(_radix)				\
+	 __genradix_ptr(&(_radix)->tree,			\
+			__genradix_idx_to_offset(_radix, _idx)))
+
+void *__genradix_ptr_alloc(struct __genradix *, size_t, gfp_t);
+
+/* Returns a pointer to element at @_idx, allocating it if necessary */
+#define genradix_ptr_alloc(_radix, _idx, _gfp)			\
+	(__genradix_cast(_radix)				\
+	 __genradix_ptr_alloc(&(_radix)->tree,			\
+			__genradix_idx_to_offset(_radix, _idx),	\
+			_gfp))
+
+struct genradix_iter {
+	size_t			offset;
+	size_t			pos;
+};
+
+#define genradix_iter_init(_radix, _idx)			\
+	((struct genradix_iter) {				\
+		.pos	= (_idx),				\
+		.offset	= __genradix_idx_to_offset((_radix), (_idx)),\
+	})
+
+void *__genradix_iter_peek(struct genradix_iter *, struct __genradix *, size_t);
+
+#define genradix_iter_peek(_iter, _radix)			\
+	(__genradix_cast(_radix)				\
+	 __genradix_iter_peek(_iter, &(_radix)->tree,		\
+			      PAGE_SIZE / __genradix_obj_size(_radix)))
+
+static inline void __genradix_iter_advance(struct genradix_iter *iter,
+					   size_t obj_size)
+{
+	iter->offset += obj_size;
+
+	if (!is_power_of_2(obj_size) &&
+	    (iter->offset & (PAGE_SIZE - 1)) + obj_size > PAGE_SIZE)
+		iter->offset = round_up(iter->offset, PAGE_SIZE);
+
+	iter->pos++;
+}
+
+#define genradix_iter_advance(_iter, _radix)			\
+	__genradix_iter_advance(_iter, __genradix_obj_size(_radix))
+
+#endif /* _LINUX_GENERIC_RADIX_TREE_H */
diff --git a/lib/Makefile b/lib/Makefile
index a90d4fcd74..5db5a7fb1e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,8 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o refcount.o usercopy.o errseq.o bucket_locks.o
+	 once.o refcount.o usercopy.o errseq.o bucket_locks.o \
+	 generic-radix-tree.o
 obj-$(CONFIG_STRING_SELFTEST) += test_string.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
diff --git a/lib/generic-radix-tree.c b/lib/generic-radix-tree.c
new file mode 100644
index 0000000000..5c4a275ea3
--- /dev/null
+++ b/lib/generic-radix-tree.c
@@ -0,0 +1,167 @@
+
+#include <linux/export.h>
+#include <linux/generic-radix-tree.h>
+#include <linux/gfp.h>
+
+#define GENRADIX_ARY		(PAGE_SIZE / sizeof(struct genradix_node *))
+#define GENRADIX_ARY_SHIFT	ilog2(GENRADIX_ARY)
+
+struct genradix_node {
+	union {
+		/* Interior node: */
+		struct genradix_node	*children[GENRADIX_ARY];
+
+		/* Leaf: */
+		u8			data[PAGE_SIZE];
+	};
+};
+
+static inline unsigned genradix_depth_shift(unsigned depth)
+{
+	return PAGE_SHIFT + GENRADIX_ARY_SHIFT * depth;
+}
+
+/*
+ * Returns size (of data, in bytes) that a tree of a given depth holds:
+ */
+static inline size_t genradix_depth_size(unsigned depth)
+{
+	return 1UL << genradix_depth_shift(depth);
+}
+
+/*
+ * Returns pointer to the specified byte @offset within @radix, or NULL if not
+ * allocated
+ */
+void *__genradix_ptr(struct __genradix *radix, size_t offset)
+{
+	size_t level = radix->depth;
+	struct genradix_node *n = radix->root;
+
+	if (offset >= genradix_depth_size(radix->depth))
+		return NULL;
+
+	while (1) {
+		if (!n)
+			return NULL;
+		if (!level)
+			break;
+
+		level--;
+
+		n = n->children[offset >> genradix_depth_shift(level)];
+		offset &= genradix_depth_size(level) - 1;
+	}
+
+	return &n->data[offset];
+}
+EXPORT_SYMBOL(__genradix_ptr);
+
+/*
+ * Returns pointer to the specified byte @offset within @radix, allocating it if
+ * necessary - newly allocated slots are always zeroed out:
+ */
+void *__genradix_ptr_alloc(struct __genradix *radix, size_t offset,
+			   gfp_t gfp_mask)
+{
+	struct genradix_node **n;
+	size_t level;
+
+	/* Increase tree depth if necessary: */
+
+	while (offset >= genradix_depth_size(radix->depth)) {
+		struct genradix_node *new_root =
+			(void *) __get_free_page(gfp_mask|__GFP_ZERO);
+
+		if (!new_root)
+			return NULL;
+
+		new_root->children[0] = radix->root;
+		radix->root = new_root;
+		radix->depth++;
+	}
+
+	n = &radix->root;
+	level = radix->depth;
+
+	while (1) {
+		if (!*n) {
+			*n = (void *) __get_free_page(gfp_mask|__GFP_ZERO);
+			if (!*n)
+				return NULL;
+		}
+
+		if (!level)
+			break;
+
+		level--;
+
+		n = &(*n)->children[offset >> genradix_depth_shift(level)];
+		offset &= genradix_depth_size(level) - 1;
+	}
+
+	return &(*n)->data[offset];
+}
+EXPORT_SYMBOL(__genradix_ptr_alloc);
+
+void *__genradix_iter_peek(struct genradix_iter *iter,
+			   struct __genradix *radix,
+			   size_t objs_per_page)
+{
+	struct genradix_node *n;
+	size_t level, i;
+
+	if (!radix->root)
+		return NULL;
+restart:
+	if (iter->offset >= genradix_depth_size(radix->depth))
+		return NULL;
+
+	n	= radix->root;
+	level	= radix->depth;
+
+	while (level) {
+		level--;
+
+		i = (iter->offset >> genradix_depth_shift(level)) &
+			(GENRADIX_ARY - 1);
+
+		while (!n->children[i]) {
+			i++;
+			iter->offset = round_down(iter->offset +
+					   genradix_depth_size(level),
+					   genradix_depth_size(level));
+			iter->pos = (iter->offset >> PAGE_SHIFT) *
+				objs_per_page;
+			if (i == GENRADIX_ARY)
+				goto restart;
+		}
+
+		n = n->children[i];
+	}
+
+	return &n->data[iter->offset & (PAGE_SIZE - 1)];
+}
+EXPORT_SYMBOL(__genradix_iter_peek);
+
+static void genradix_free_recurse(struct genradix_node *n, unsigned level)
+{
+	if (level) {
+		unsigned i;
+
+		for (i = 0; i < GENRADIX_ARY; i++)
+			if (n->children[i])
+				genradix_free_recurse(n->children[i], level - 1);
+	}
+
+	free_page((unsigned long) n);
+}
+
+void __genradix_free(struct __genradix *radix)
+{
+	genradix_free_recurse(radix->root, radix->depth);
+
+	radix->root = NULL;
+	radix->depth = 0;
+}
+EXPORT_SYMBOL(__genradix_free);
-- 
2.17.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 07/10] bcache: optimize continue_at_nobarrier()
  2018-05-18  7:48 [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
                   ` (5 preceding siblings ...)
  2018-05-18  7:49 ` [PATCH 06/10] Generic radix trees Kent Overstreet
@ 2018-05-18  7:49 ` Kent Overstreet
  2018-05-18  7:49 ` [PATCH 08/10] bcache: move closures to lib/ Kent Overstreet
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18  7:49 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Kent Overstreet, Andrew Morton, Dave Chinner, darrick.wong,
	tytso, linux-btrfs, clm, jbacik, viro, willy, peterz

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 drivers/md/bcache/closure.h | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index 3b9dfc9962..2392a46bcd 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -244,7 +244,7 @@ static inline void closure_queue(struct closure *cl)
 		     != offsetof(struct work_struct, func));
 	if (wq) {
 		INIT_WORK(&cl->work, cl->work.func);
-		BUG_ON(!queue_work(wq, &cl->work));
+		queue_work(wq, &cl->work);
 	} else
 		cl->fn(cl);
 }
@@ -337,8 +337,13 @@ do {									\
  */
 #define continue_at_nobarrier(_cl, _fn, _wq)				\
 do {									\
-	set_closure_fn(_cl, _fn, _wq);					\
-	closure_queue(_cl);						\
+	closure_set_ip(_cl);						\
+	if (_wq) {							\
+		INIT_WORK(&(_cl)->work, (void *) _fn);			\
+		queue_work((_wq), &(_cl)->work);			\
+	} else {							\
+		(_fn)(_cl);						\
+	}								\
 } while (0)
 
 /**
-- 
2.17.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 08/10] bcache: move closures to lib/
  2018-05-18  7:48 [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
                   ` (6 preceding siblings ...)
  2018-05-18  7:49 ` [PATCH 07/10] bcache: optimize continue_at_nobarrier() Kent Overstreet
@ 2018-05-18  7:49 ` Kent Overstreet
  2018-05-18 16:02   ` Christoph Hellwig
  2018-05-18  7:49 ` [PATCH 09/10] closures: closure_wait_event() Kent Overstreet
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18  7:49 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Kent Overstreet, Andrew Morton, Dave Chinner, darrick.wong,
	tytso, linux-btrfs, clm, jbacik, viro, willy, peterz

Prep work for bcachefs - being a fork of bcache it also uses closures

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 drivers/md/bcache/Kconfig                      | 10 +---------
 drivers/md/bcache/Makefile                     |  6 +++---
 drivers/md/bcache/bcache.h                     |  2 +-
 drivers/md/bcache/super.c                      |  1 -
 drivers/md/bcache/util.h                       |  3 +--
 {drivers/md/bcache => include/linux}/closure.h | 17 ++++++++---------
 lib/Kconfig                                    |  3 +++
 lib/Kconfig.debug                              |  9 +++++++++
 lib/Makefile                                   |  2 ++
 {drivers/md/bcache => lib}/closure.c           | 17 ++++++++---------
 10 files changed, 36 insertions(+), 34 deletions(-)
 rename {drivers/md/bcache => include/linux}/closure.h (97%)
 rename {drivers/md/bcache => lib}/closure.c (95%)

diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
index 4d200883c5..45f1094c08 100644
--- a/drivers/md/bcache/Kconfig
+++ b/drivers/md/bcache/Kconfig
@@ -1,6 +1,7 @@
 
 config BCACHE
 	tristate "Block device as cache"
+	select CLOSURES
 	---help---
 	Allows a block device to be used as cache for other devices; uses
 	a btree for indexing and the layout is optimized for SSDs.
@@ -15,12 +16,3 @@ config BCACHE_DEBUG
 
 	Enables extra debugging tools, allows expensive runtime checks to be
 	turned on.
-
-config BCACHE_CLOSURES_DEBUG
-	bool "Debug closures"
-	depends on BCACHE
-	select DEBUG_FS
-	---help---
-	Keeps all active closures in a linked list and provides a debugfs
-	interface to list them, which makes it possible to see asynchronous
-	operations that get stuck.
diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
index d26b351958..2b790fb813 100644
--- a/drivers/md/bcache/Makefile
+++ b/drivers/md/bcache/Makefile
@@ -2,8 +2,8 @@
 
 obj-$(CONFIG_BCACHE)	+= bcache.o
 
-bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
-	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
-	util.o writeback.o
+bcache-y		:= alloc.o bset.o btree.o debug.o extents.o io.o\
+	journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o util.o\
+	writeback.o
 
 CFLAGS_request.o	+= -Iblock
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 12e5197f18..d954dc44dd 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -180,6 +180,7 @@
 
 #include <linux/bcache.h>
 #include <linux/bio.h>
+#include <linux/closure.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
 #include <linux/mutex.h>
@@ -191,7 +192,6 @@
 
 #include "bset.h"
 #include "util.h"
-#include "closure.h"
 
 struct bucket {
 	atomic_t	pin;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index f2273143b3..5f1ac8e0a3 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2148,7 +2148,6 @@ static int __init bcache_init(void)
 	mutex_init(&bch_register_lock);
 	init_waitqueue_head(&unregister_wait);
 	register_reboot_notifier(&reboot);
-	closure_debug_init();
 
 	bcache_major = register_blkdev(0, "bcache");
 	if (bcache_major < 0) {
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index a6763db7f0..a75523ed0d 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@
 #define _BCACHE_UTIL_H
 
 #include <linux/blkdev.h>
+#include <linux/closure.h>
 #include <linux/errno.h>
 #include <linux/kernel.h>
 #include <linux/sched/clock.h>
@@ -12,8 +13,6 @@
 #include <linux/vmalloc.h>
 #include <linux/workqueue.h>
 
-#include "closure.h"
-
 #define PAGE_SECTORS		(PAGE_SIZE / 512)
 
 struct closure;
diff --git a/drivers/md/bcache/closure.h b/include/linux/closure.h
similarity index 97%
rename from drivers/md/bcache/closure.h
rename to include/linux/closure.h
index 2392a46bcd..1072bf2c13 100644
--- a/drivers/md/bcache/closure.h
+++ b/include/linux/closure.h
@@ -154,7 +154,7 @@ struct closure {
 
 	atomic_t		remaining;
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 #define CLOSURE_MAGIC_DEAD	0xc054dead
 #define CLOSURE_MAGIC_ALIVE	0xc054a11e
 
@@ -183,15 +183,13 @@ static inline void closure_sync(struct closure *cl)
 		__closure_sync(cl);
 }
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 
-void closure_debug_init(void);
 void closure_debug_create(struct closure *cl);
 void closure_debug_destroy(struct closure *cl);
 
 #else
 
-static inline void closure_debug_init(void) {}
 static inline void closure_debug_create(struct closure *cl) {}
 static inline void closure_debug_destroy(struct closure *cl) {}
 
@@ -199,21 +197,21 @@ static inline void closure_debug_destroy(struct closure *cl) {}
 
 static inline void closure_set_ip(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->ip = _THIS_IP_;
 #endif
 }
 
 static inline void closure_set_ret_ip(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->ip = _RET_IP_;
 #endif
 }
 
 static inline void closure_set_waiting(struct closure *cl, unsigned long f)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->waiting_on = f;
 #endif
 }
@@ -242,6 +240,7 @@ static inline void closure_queue(struct closure *cl)
 	 */
 	BUILD_BUG_ON(offsetof(struct closure, fn)
 		     != offsetof(struct work_struct, func));
+
 	if (wq) {
 		INIT_WORK(&cl->work, cl->work.func);
 		queue_work(wq, &cl->work);
@@ -254,7 +253,7 @@ static inline void closure_queue(struct closure *cl)
  */
 static inline void closure_get(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	BUG_ON((atomic_inc_return(&cl->remaining) &
 		CLOSURE_REMAINING_MASK) <= 1);
 #else
@@ -270,7 +269,7 @@ static inline void closure_get(struct closure *cl)
  */
 static inline void closure_init(struct closure *cl, struct closure *parent)
 {
-	memset(cl, 0, sizeof(struct closure));
+	cl->fn = NULL;
 	cl->parent = parent;
 	if (parent)
 		closure_get(parent);
diff --git a/lib/Kconfig b/lib/Kconfig
index e960894993..eff939519e 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -393,6 +393,9 @@ config ASSOCIATIVE_ARRAY
 
 	  for more information.
 
+config CLOSURES
+	bool
+
 config HAS_IOMEM
 	bool
 	depends on !NO_IOMEM
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 64155e310a..427292a759 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1362,6 +1362,15 @@ config DEBUG_CREDENTIALS
 
 source "kernel/rcu/Kconfig.debug"
 
+config DEBUG_CLOSURES
+	bool "Debug closures (bcache async widgits)"
+	depends on CLOSURES
+	select DEBUG_FS
+	---help---
+	Keeps all active closures in a linked list and provides a debugfs
+	interface to list them, which makes it possible to see asynchronous
+	operations that get stuck.
+
 config DEBUG_WQ_FORCE_RR_CPU
 	bool "Force round-robin CPU selection for unbound work items"
 	depends on DEBUG_KERNEL
diff --git a/lib/Makefile b/lib/Makefile
index 5db5a7fb1e..66d2231d70 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -172,6 +172,8 @@ obj-$(CONFIG_ATOMIC64_SELFTEST) += atomic64_test.o
 
 obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o
 
+obj-$(CONFIG_CLOSURES) += closure.o
+
 obj-$(CONFIG_CORDIC) += cordic.o
 
 obj-$(CONFIG_DQL) += dynamic_queue_limits.o
diff --git a/drivers/md/bcache/closure.c b/lib/closure.c
similarity index 95%
rename from drivers/md/bcache/closure.c
rename to lib/closure.c
index 7f12920c14..025f8d8f48 100644
--- a/drivers/md/bcache/closure.c
+++ b/lib/closure.c
@@ -5,13 +5,12 @@
  * Copyright 2012 Google, Inc.
  */
 
+#include <linux/closure.h>
 #include <linux/debugfs.h>
-#include <linux/module.h>
+#include <linux/export.h>
 #include <linux/seq_file.h>
 #include <linux/sched/debug.h>
 
-#include "closure.h"
-
 static inline void closure_put_after_sub(struct closure *cl, int flags)
 {
 	int r = flags & CLOSURE_REMAINING_MASK;
@@ -126,7 +125,7 @@ void __sched __closure_sync(struct closure *cl)
 }
 EXPORT_SYMBOL(__closure_sync);
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 
 static LIST_HEAD(closure_list);
 static DEFINE_SPINLOCK(closure_list_lock);
@@ -162,6 +161,7 @@ static struct dentry *debug;
 static int debug_seq_show(struct seq_file *f, void *data)
 {
 	struct closure *cl;
+
 	spin_lock_irq(&closure_list_lock);
 
 	list_for_each_entry(cl, &closure_list, all) {
@@ -180,7 +180,7 @@ static int debug_seq_show(struct seq_file *f, void *data)
 			seq_printf(f, " W %pF\n",
 				   (void *) cl->waiting_on);
 
-		seq_printf(f, "\n");
+		seq_puts(f, "\n");
 	}
 
 	spin_unlock_irq(&closure_list_lock);
@@ -199,12 +199,11 @@ static const struct file_operations debug_ops = {
 	.release	= single_release
 };
 
-void __init closure_debug_init(void)
+static int __init closure_debug_init(void)
 {
 	debug = debugfs_create_file("closures", 0400, NULL, NULL, &debug_ops);
+	return 0;
 }
+late_initcall(closure_debug_init)
 
 #endif
-
-MODULE_AUTHOR("Kent Overstreet <koverstreet@google.com>");
-MODULE_LICENSE("GPL");
-- 
2.17.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 09/10] closures: closure_wait_event()
  2018-05-18  7:48 [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
                   ` (7 preceding siblings ...)
  2018-05-18  7:49 ` [PATCH 08/10] bcache: move closures to lib/ Kent Overstreet
@ 2018-05-18  7:49 ` Kent Overstreet
  2018-05-18  7:49 ` [PATCH 10/10] Dynamic fault injection Kent Overstreet
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18  7:49 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Kent Overstreet, Andrew Morton, Dave Chinner, darrick.wong,
	tytso, linux-btrfs, clm, jbacik, viro, willy, peterz

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 include/linux/closure.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/closure.h b/include/linux/closure.h
index 1072bf2c13..ef22d18f7c 100644
--- a/include/linux/closure.h
+++ b/include/linux/closure.h
@@ -375,4 +375,26 @@ static inline void closure_call(struct closure *cl, closure_fn fn,
 	continue_at_nobarrier(cl, fn, wq);
 }
 
+#define __closure_wait_event(waitlist, _cond)				\
+do {									\
+	struct closure cl;						\
+									\
+	closure_init_stack(&cl);					\
+									\
+	while (1) {							\
+		closure_wait(waitlist, &cl);				\
+		if (_cond)						\
+			break;						\
+		closure_sync(&cl);					\
+	}								\
+	closure_wake_up(waitlist);					\
+	closure_sync(&cl);						\
+} while (0)
+
+#define closure_wait_event(waitlist, _cond)				\
+do {									\
+	if (!(_cond))							\
+		__closure_wait_event(waitlist, _cond);			\
+} while (0)
+
 #endif /* _LINUX_CLOSURE_H */
-- 
2.17.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 10/10] Dynamic fault injection
  2018-05-18  7:48 [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
                   ` (8 preceding siblings ...)
  2018-05-18  7:49 ` [PATCH 09/10] closures: closure_wait_event() Kent Overstreet
@ 2018-05-18  7:49 ` Kent Overstreet
  2018-05-18 16:02   ` Christoph Hellwig
  2018-05-18 19:05   ` Andreas Dilger
  2018-05-18  7:55 ` [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
  2018-05-18 17:45 ` Josef Bacik
  11 siblings, 2 replies; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18  7:49 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Kent Overstreet, Andrew Morton, Dave Chinner, darrick.wong,
	tytso, linux-btrfs, clm, jbacik, viro, willy, peterz

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 include/asm-generic/vmlinux.lds.h |   4 +
 include/linux/dynamic_fault.h     | 117 +++++
 lib/Kconfig.debug                 |   5 +
 lib/Makefile                      |   2 +
 lib/dynamic_fault.c               | 760 ++++++++++++++++++++++++++++++
 5 files changed, 888 insertions(+)
 create mode 100644 include/linux/dynamic_fault.h
 create mode 100644 lib/dynamic_fault.c

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 1ab0e520d6..a4c9dfcbbd 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -246,6 +246,10 @@
 	VMLINUX_SYMBOL(__start___verbose) = .;                          \
 	KEEP(*(__verbose))                                              \
 	VMLINUX_SYMBOL(__stop___verbose) = .;				\
+	. = ALIGN(8);							\
+	VMLINUX_SYMBOL(__start___faults) = .;                           \
+	*(__faults)                                                     \
+	VMLINUX_SYMBOL(__stop___faults) = .;				\
 	LIKELY_PROFILE()		       				\
 	BRANCH_PROFILE()						\
 	TRACE_PRINTKS()							\
diff --git a/include/linux/dynamic_fault.h b/include/linux/dynamic_fault.h
new file mode 100644
index 0000000000..6e7bb56ae8
--- /dev/null
+++ b/include/linux/dynamic_fault.h
@@ -0,0 +1,117 @@
+#ifndef _DYNAMIC_FAULT_H
+#define _DYNAMIC_FAULT_H
+
+#include <linux/bio.h>
+#include <linux/jump_label.h>
+#include <linux/slab.h>
+
+enum dfault_enabled {
+	DFAULT_DISABLED,
+	DFAULT_ENABLED,
+	DFAULT_ONESHOT,
+};
+
+union dfault_state {
+	struct {
+		unsigned		enabled:2;
+		unsigned		count:30;
+	};
+
+	struct {
+		unsigned		v;
+	};
+};
+
+/*
+ * An instance of this structure is created in a special
+ * ELF section at every dynamic fault callsite.  At runtime,
+ * the special section is treated as an array of these.
+ */
+struct _dfault {
+	const char		*modname;
+	const char		*function;
+	const char		*filename;
+	const char		*class;
+
+	const u16		line;
+
+	unsigned		frequency;
+	union dfault_state	state;
+
+	struct static_key	enabled;
+} __aligned(8);
+
+
+#ifdef CONFIG_DYNAMIC_FAULT
+
+int dfault_add_module(struct _dfault *tab, unsigned int n, const char *mod);
+int dfault_remove_module(char *mod_name);
+bool __dynamic_fault_enabled(struct _dfault *);
+
+#define dynamic_fault(_class)						\
+({									\
+	static struct _dfault descriptor				\
+	__used __aligned(8) __attribute__((section("__faults"))) = {	\
+		.modname	= KBUILD_MODNAME,			\
+		.function	= __func__,				\
+		.filename	= __FILE__,				\
+		.line		= __LINE__,				\
+		.class		= _class,				\
+	};								\
+									\
+	static_key_false(&descriptor.enabled) &&			\
+		__dynamic_fault_enabled(&descriptor);			\
+})
+
+#define memory_fault()		dynamic_fault("memory")
+#define race_fault()		dynamic_fault("race")
+
+#define kmalloc(...)							\
+	(memory_fault() ? NULL	: kmalloc(__VA_ARGS__))
+#define kzalloc(...)							\
+	(memory_fault() ? NULL	: kzalloc(__VA_ARGS__))
+#define krealloc(...)							\
+	(memory_fault() ? NULL	: krealloc(__VA_ARGS__))
+
+#define mempool_alloc(pool, gfp_mask)					\
+	((!gfpflags_allow_blocking(gfp_mask) && memory_fault())		\
+		? NULL : mempool_alloc(pool, gfp_mask))
+
+#define __get_free_pages(...)						\
+	(memory_fault() ? 0	: __get_free_pages(__VA_ARGS__))
+#define alloc_pages_node(...)						\
+	(memory_fault() ? NULL	: alloc_pages_node(__VA_ARGS__))
+#define alloc_pages_nodemask(...)					\
+	(memory_fault() ? NULL	: alloc_pages_nodemask(__VA_ARGS__))
+
+#define bio_alloc_bioset(gfp_mask, ...)					\
+	((!gfpflags_allow_blocking(gfp_mask) && memory_fault())		\
+	 ? NULL	: bio_alloc_bioset(gfp_mask, __VA_ARGS__))
+
+#define bio_clone(bio, gfp_mask)					\
+	((!gfpflags_allow_blocking(gfp_mask) && memory_fault())		\
+	 ? NULL	: bio_clone(bio, gfp_mask))
+
+#define bio_clone_bioset(bio, gfp_mask, bs)				\
+	((!gfpflags_allow_blocking(gfp_mask) && memory_fault())		\
+	 ? NULL	: bio_clone_bioset(bio, gfp_mask, bs))
+
+#define bio_kmalloc(...)						\
+	(memory_fault() ? NULL		: bio_kmalloc(__VA_ARGS__))
+#define bio_clone_kmalloc(...)						\
+	(memory_fault() ? NULL		: bio_clone_kmalloc(__VA_ARGS__))
+
+#define bio_iov_iter_get_pages(...)					\
+	(memory_fault() ? -ENOMEM	: bio_iov_iter_get_pages(__VA_ARGS__))
+
+#else /* CONFIG_DYNAMIC_FAULT */
+
+#define dfault_add_module(tab, n, modname)	0
+#define dfault_remove_module(mod)		0
+#define dynamic_fault(_class)			0
+#define memory_fault()				0
+#define race_fault()				0
+
+#endif /* CONFIG_DYNAMIC_FAULT */
+
+#endif
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 427292a759..7b7ca0d813 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1599,6 +1599,11 @@ config LATENCYTOP
 	  Enable this option if you want to use the LatencyTOP tool
 	  to find out which userspace is blocking on what kernel operations.
 
+config DYNAMIC_FAULT
+	bool "Enable dynamic fault support"
+	default n
+	depends on DEBUG_FS
+
 source kernel/trace/Kconfig
 
 config PROVIDE_OHCI1394_DMA_INIT
diff --git a/lib/Makefile b/lib/Makefile
index 66d2231d70..f6f70f4771 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -158,6 +158,8 @@ obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
 
 obj-$(CONFIG_DYNAMIC_DEBUG) += dynamic_debug.o
 
+obj-$(CONFIG_DYNAMIC_FAULT) += dynamic_fault.o
+
 obj-$(CONFIG_NLATTR) += nlattr.o
 
 obj-$(CONFIG_LRU_CACHE) += lru_cache.o
diff --git a/lib/dynamic_fault.c b/lib/dynamic_fault.c
new file mode 100644
index 0000000000..75fc9a1b4b
--- /dev/null
+++ b/lib/dynamic_fault.c
@@ -0,0 +1,760 @@
+/*
+ * lib/dynamic_fault.c
+ *
+ * make dynamic_fault() calls runtime configurable based upon their
+ * source module.
+ *
+ * Copyright (C) 2011 Adam Berkan <aberkan@google.com>
+ * Based on dynamic_debug.c:
+ * Copyright (C) 2008 Jason Baron <jbaron@redhat.com>
+ * By Greg Banks <gnb@melbourne.sgi.com>
+ * Copyright (c) 2008 Silicon Graphics Inc.  All Rights Reserved.
+ *
+ */
+
+#define pr_fmt(fmt) "dfault: " fmt "\n"
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/kallsyms.h>
+#include <linux/version.h>
+#include <linux/types.h>
+#include <linux/mutex.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/list.h>
+#include <linux/sysctl.h>
+#include <linux/ctype.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/dynamic_fault.h>
+#include <linux/debugfs.h>
+#include <linux/slab.h>
+
+#undef kzalloc
+
+extern struct _dfault __start___faults[];
+extern struct _dfault __stop___faults[];
+
+struct dfault_table {
+	struct list_head link;
+	char *mod_name;
+	unsigned int num_dfaults;
+	struct _dfault *dfaults;
+};
+
+struct dfault_query {
+	const char	*filename;
+	const char	*module;
+	const char	*function;
+	const char	*class;
+	unsigned int	first_line, last_line;
+	unsigned int	first_index, last_index;
+
+	unsigned	match_line:1;
+	unsigned	match_index:1;
+
+	unsigned	set_enabled:1;
+	unsigned	enabled:2;
+
+	unsigned	set_frequency:1;
+	unsigned	frequency;
+};
+
+struct dfault_iter {
+	struct dfault_table *table;
+	unsigned int idx;
+};
+
+static DEFINE_MUTEX(dfault_lock);
+static LIST_HEAD(dfault_tables);
+
+bool __dynamic_fault_enabled(struct _dfault *df)
+{
+	union dfault_state old, new;
+	unsigned v = df->state.v;
+	bool ret;
+
+	do {
+		old.v = new.v = v;
+
+		if (new.enabled == DFAULT_DISABLED)
+			return false;
+
+		ret = df->frequency
+			? ++new.count >= df->frequency
+			: true;
+		if (ret)
+			new.count = 0;
+		if (ret && new.enabled == DFAULT_ONESHOT)
+			new.enabled = DFAULT_DISABLED;
+	} while ((v = cmpxchg(&df->state.v, old.v, new.v)) != old.v);
+
+	if (ret)
+		pr_debug("returned true for %s:%u", df->filename, df->line);
+
+	return ret;
+}
+EXPORT_SYMBOL(__dynamic_fault_enabled);
+
+/* Return the last part of a pathname */
+static inline const char *basename(const char *path)
+{
+	const char *tail = strrchr(path, '/');
+
+	return tail ? tail + 1 : path;
+}
+
+/* format a string into buf[] which describes the _dfault's flags */
+static char *dfault_describe_flags(struct _dfault *df, char *buf, size_t buflen)
+{
+	switch (df->state.enabled) {
+	case DFAULT_DISABLED:
+		strlcpy(buf, "disabled", buflen);
+		break;
+	case DFAULT_ENABLED:
+		strlcpy(buf, "enabled", buflen);
+		break;
+	case DFAULT_ONESHOT:
+		strlcpy(buf, "oneshot", buflen);
+		break;
+	default:
+		BUG();
+	}
+
+	return buf;
+}
+
+/*
+ * must be called with dfault_lock held
+ */
+
+/*
+ * Search the tables for _dfault's which match the given
+ * `query' and apply the `flags' and `mask' to them.  Tells
+ * the user which dfault's were changed, or whether none
+ * were matched.
+ */
+static int dfault_change(const struct dfault_query *query)
+{
+	struct dfault_table *dt;
+	unsigned int nfound = 0;
+	unsigned i, index = 0;
+	char flagbuf[16];
+
+	/* search for matching dfaults */
+	mutex_lock(&dfault_lock);
+	list_for_each_entry(dt, &dfault_tables, link) {
+
+		/* match against the module name */
+		if (query->module != NULL &&
+		    strcmp(query->module, dt->mod_name))
+			continue;
+
+		for (i = 0 ; i < dt->num_dfaults ; i++) {
+			struct _dfault *df = &dt->dfaults[i];
+
+			/* match against the source filename */
+			if (query->filename != NULL &&
+			    strcmp(query->filename, df->filename) &&
+			    strcmp(query->filename, basename(df->filename)))
+				continue;
+
+			/* match against the function */
+			if (query->function != NULL &&
+			    strcmp(query->function, df->function))
+				continue;
+
+			/* match against the class */
+			if (query->class) {
+				size_t len = strlen(query->class);
+
+				if (strncmp(query->class, df->class, len))
+					continue;
+
+				if (df->class[len] && df->class[len] != ':')
+					continue;
+			}
+
+			/* match against the line number range */
+			if (query->match_line &&
+			    (df->line < query->first_line ||
+			     df->line > query->last_line))
+				continue;
+
+			/* match against the fault index */
+			if (query->match_index &&
+			    (index < query->first_index ||
+			     index > query->last_index)) {
+				index++;
+				continue;
+			}
+
+			if (query->set_enabled &&
+			    query->enabled != df->state.enabled) {
+				if (query->enabled != DFAULT_DISABLED)
+					static_key_slow_inc(&df->enabled);
+				else if (df->state.enabled != DFAULT_DISABLED)
+					static_key_slow_dec(&df->enabled);
+
+				df->state.enabled = query->enabled;
+			}
+
+			if (query->set_frequency)
+				df->frequency = query->frequency;
+
+			pr_debug("changed %s:%d [%s]%s #%d %s",
+				 df->filename, df->line, dt->mod_name,
+				 df->function, index,
+				 dfault_describe_flags(df, flagbuf,
+						       sizeof(flagbuf)));
+
+			index++;
+			nfound++;
+		}
+	}
+	mutex_unlock(&dfault_lock);
+
+	pr_debug("dfault: %u matches", nfound);
+
+	return nfound ? 0 : -ENOENT;
+}
+
+/*
+ * Split the buffer `buf' into space-separated words.
+ * Handles simple " and ' quoting, i.e. without nested,
+ * embedded or escaped \".  Return the number of words
+ * or <0 on error.
+ */
+static int dfault_tokenize(char *buf, char *words[], int maxwords)
+{
+	int nwords = 0;
+
+	while (*buf) {
+		char *end;
+
+		/* Skip leading whitespace */
+		buf = skip_spaces(buf);
+		if (!*buf)
+			break;	/* oh, it was trailing whitespace */
+
+		/* Run `end' over a word, either whitespace separated or quoted
+		 */
+		if (*buf == '"' || *buf == '\'') {
+			int quote = *buf++;
+
+			for (end = buf ; *end && *end != quote ; end++)
+				;
+			if (!*end)
+				return -EINVAL;	/* unclosed quote */
+		} else {
+			for (end = buf ; *end && !isspace(*end) ; end++)
+				;
+			BUG_ON(end == buf);
+		}
+		/* Here `buf' is the start of the word, `end' is one past the
+		 * end
+		 */
+
+		if (nwords == maxwords)
+			return -EINVAL;	/* ran out of words[] before bytes */
+		if (*end)
+			*end++ = '\0';	/* terminate the word */
+		words[nwords++] = buf;
+		buf = end;
+	}
+
+	return nwords;
+}
+
+/*
+ * Parse a range.
+ */
+static inline int parse_range(char *str,
+			      unsigned int *first,
+			      unsigned int *last)
+{
+	char *first_str = str;
+	char *last_str = strchr(first_str, '-');
+
+	if (last_str)
+		*last_str++ = '\0';
+
+	if (kstrtouint(first_str, 10, first))
+		return -EINVAL;
+
+	if (!last_str)
+		*last = *first;
+	else if (kstrtouint(last_str, 10, last))
+		return -EINVAL;
+
+	return 0;
+}
+
+enum dfault_token {
+	TOK_INVALID,
+
+	/* Queries */
+	TOK_FUNC,
+	TOK_FILE,
+	TOK_LINE,
+	TOK_MODULE,
+	TOK_CLASS,
+	TOK_INDEX,
+
+	/* Commands */
+	TOK_DISABLE,
+	TOK_ENABLE,
+	TOK_ONESHOT,
+	TOK_FREQUENCY,
+};
+
+static const struct {
+	const char		*str;
+	enum dfault_token	tok;
+	unsigned		args_required;
+} dfault_token_strs[] = {
+	{ "func",	TOK_FUNC,	1,	},
+	{ "file",	TOK_FILE,	1,	},
+	{ "line",	TOK_LINE,	1,	},
+	{ "module",	TOK_MODULE,	1,	},
+	{ "class",	TOK_CLASS,	1,	},
+	{ "index",	TOK_INDEX,	1,	},
+	{ "disable",	TOK_DISABLE,	0,	},
+	{ "enable",	TOK_ENABLE,	0,	},
+	{ "oneshot",	TOK_ONESHOT,	0,	},
+	{ "frequency",	TOK_FREQUENCY,	1,	},
+};
+
+static enum dfault_token str_to_token(const char *word, unsigned nr_words)
+{
+	unsigned i;
+
+	for (i = 0; i < ARRAY_SIZE(dfault_token_strs); i++)
+		if (!strcmp(word, dfault_token_strs[i].str)) {
+			if (nr_words < dfault_token_strs[i].args_required) {
+				pr_debug("insufficient arguments to \"%s\"",
+					 word);
+				return TOK_INVALID;
+			}
+
+			return dfault_token_strs[i].tok;
+		}
+
+	pr_debug("unknown keyword \"%s\"", word);
+
+	return TOK_INVALID;
+}
+
+static int dfault_parse_command(struct dfault_query *query,
+				enum dfault_token tok,
+				char *words[], size_t nr_words)
+{
+	unsigned i = 0;
+	int ret;
+
+	switch (tok) {
+	case TOK_INVALID:
+		return -EINVAL;
+	case TOK_FUNC:
+		query->function = words[i++];
+	case TOK_FILE:
+		query->filename = words[i++];
+		return 1;
+	case TOK_LINE:
+		ret = parse_range(words[i++],
+				  &query->first_line,
+				  &query->last_line);
+		if (ret)
+			return ret;
+		query->match_line = true;
+		break;
+	case TOK_MODULE:
+		query->module = words[i++];
+		break;
+	case TOK_CLASS:
+		query->class = words[i++];
+		break;
+	case TOK_INDEX:
+		ret = parse_range(words[i++],
+				  &query->first_index,
+				  &query->last_index);
+		if (ret)
+			return ret;
+		query->match_index = true;
+		break;
+	case TOK_DISABLE:
+		query->set_enabled = true;
+		query->enabled = DFAULT_DISABLED;
+		break;
+	case TOK_ENABLE:
+		query->set_enabled = true;
+		query->enabled = DFAULT_ENABLED;
+		break;
+	case TOK_ONESHOT:
+		query->set_enabled = true;
+		query->enabled = DFAULT_ONESHOT;
+		break;
+	case TOK_FREQUENCY:
+		query->set_frequency = 1;
+		ret = kstrtouint(words[i++], 10, &query->frequency);
+		if (ret)
+			return ret;
+
+		if (!query->set_enabled) {
+			query->set_enabled = 1;
+			query->enabled = DFAULT_ENABLED;
+		}
+		break;
+	}
+
+	return i;
+}
+
+/*
+ * Parse words[] as a dfault query specification, which is a series
+ * of (keyword, value) pairs chosen from these possibilities:
+ *
+ * func <function-name>
+ * file <full-pathname>
+ * file <base-filename>
+ * module <module-name>
+ * line <lineno>
+ * line <first-lineno>-<last-lineno> // where either may be empty
+ * index <m>-<n>                     // dynamic faults numbered from <m>
+ *                                   // to <n> inside each matching function
+ */
+static int dfault_parse_query(struct dfault_query *query,
+			      char *words[], size_t nr_words)
+{
+	unsigned i = 0;
+
+	while (i < nr_words) {
+		const char *tok_str = words[i++];
+		enum dfault_token tok = str_to_token(tok_str, nr_words - i);
+		int ret = dfault_parse_command(query, tok, words + i,
+					       nr_words - i);
+
+		if (ret < 0)
+			return ret;
+		i += ret;
+		BUG_ON(i > nr_words);
+	}
+
+	return 0;
+}
+
+/*
+ * File_ops->write method for <debugfs>/dynamic_fault/conrol.  Gathers the
+ * command text from userspace, parses and executes it.
+ */
+static ssize_t dfault_proc_write(struct file *file, const char __user *ubuf,
+				  size_t len, loff_t *offp)
+{
+	struct dfault_query query;
+#define MAXWORDS 9
+	int nwords;
+	char *words[MAXWORDS];
+	char tmpbuf[256];
+	int ret;
+
+	memset(&query, 0, sizeof(query));
+
+	if (len == 0)
+		return 0;
+	/* we don't check *offp -- multiple writes() are allowed */
+	if (len > sizeof(tmpbuf)-1)
+		return -E2BIG;
+	if (copy_from_user(tmpbuf, ubuf, len))
+		return -EFAULT;
+	tmpbuf[len] = '\0';
+
+	pr_debug("read %zu bytes from userspace", len);
+
+	nwords = dfault_tokenize(tmpbuf, words, MAXWORDS);
+	if (nwords < 0)
+		return -EINVAL;
+	if (dfault_parse_query(&query, words, nwords))
+		return -EINVAL;
+
+	/* actually go and implement the change */
+	ret = dfault_change(&query);
+	if (ret < 0)
+		return ret;
+
+	*offp += len;
+	return len;
+}
+
+/* Control file read code */
+
+/*
+ * Set the iterator to point to the first _dfault object
+ * and return a pointer to that first object.  Returns
+ * NULL if there are no _dfaults at all.
+ */
+static struct _dfault *dfault_iter_first(struct dfault_iter *iter)
+{
+	if (list_empty(&dfault_tables)) {
+		iter->table = NULL;
+		iter->idx = 0;
+		return NULL;
+	}
+	iter->table = list_entry(dfault_tables.next,
+				 struct dfault_table, link);
+	iter->idx = 0;
+	return &iter->table->dfaults[iter->idx];
+}
+
+/*
+ * Advance the iterator to point to the next _dfault
+ * object from the one the iterator currently points at,
+ * and returns a pointer to the new _dfault.  Returns
+ * NULL if the iterator has seen all the _dfaults.
+ */
+static struct _dfault *dfault_iter_next(struct dfault_iter *iter)
+{
+	if (iter->table == NULL)
+		return NULL;
+	if (++iter->idx == iter->table->num_dfaults) {
+		/* iterate to next table */
+		iter->idx = 0;
+		if (list_is_last(&iter->table->link, &dfault_tables)) {
+			iter->table = NULL;
+			return NULL;
+		}
+		iter->table = list_entry(iter->table->link.next,
+					 struct dfault_table, link);
+	}
+	return &iter->table->dfaults[iter->idx];
+}
+
+/*
+ * Seq_ops start method.  Called at the start of every
+ * read() call from userspace.  Takes the dfault_lock and
+ * seeks the seq_file's iterator to the given position.
+ */
+static void *dfault_proc_start(struct seq_file *m, loff_t *pos)
+{
+	struct dfault_iter *iter = m->private;
+	struct _dfault *dp;
+	int n = *pos;
+
+	mutex_lock(&dfault_lock);
+
+	if (n < 0)
+		return NULL;
+	dp = dfault_iter_first(iter);
+	while (dp != NULL && --n >= 0)
+		dp = dfault_iter_next(iter);
+	return dp;
+}
+
+/*
+ * Seq_ops next method.  Called several times within a read()
+ * call from userspace, with dfault_lock held.  Walks to the
+ * next _dfault object with a special case for the header line.
+ */
+static void *dfault_proc_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	struct dfault_iter *iter = m->private;
+	struct _dfault *dp;
+
+	if (p == SEQ_START_TOKEN)
+		dp = dfault_iter_first(iter);
+	else
+		dp = dfault_iter_next(iter);
+	++*pos;
+	return dp;
+}
+
+/*
+ * Seq_ops show method.  Called several times within a read()
+ * call from userspace, with dfault_lock held.  Formats the
+ * current _dfault as a single human-readable line, with a
+ * special case for the header line.
+ */
+static int dfault_proc_show(struct seq_file *m, void *p)
+{
+	struct dfault_iter *iter = m->private;
+	struct _dfault *df = p;
+	char flagsbuf[8];
+
+	seq_printf(m, "%s:%u class:%s module:%s func:%s %s \"\"\n",
+		   df->filename, df->line, df->class,
+		   iter->table->mod_name, df->function,
+		   dfault_describe_flags(df, flagsbuf, sizeof(flagsbuf)));
+
+	return 0;
+}
+
+/*
+ * Seq_ops stop method.  Called at the end of each read()
+ * call from userspace.  Drops dfault_lock.
+ */
+static void dfault_proc_stop(struct seq_file *m, void *p)
+{
+	mutex_unlock(&dfault_lock);
+}
+
+static const struct seq_operations dfault_proc_seqops = {
+	.start = dfault_proc_start,
+	.next = dfault_proc_next,
+	.show = dfault_proc_show,
+	.stop = dfault_proc_stop
+};
+
+/*
+ * File_ops->open method for <debugfs>/dynamic_fault/control.  Does the seq_file
+ * setup dance, and also creates an iterator to walk the _dfaults.
+ * Note that we create a seq_file always, even for O_WRONLY files
+ * where it's not needed, as doing so simplifies the ->release method.
+ */
+static int dfault_proc_open(struct inode *inode, struct file *file)
+{
+	struct dfault_iter *iter;
+	int err;
+
+	iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+	if (iter == NULL)
+		return -ENOMEM;
+
+	err = seq_open(file, &dfault_proc_seqops);
+	if (err) {
+		kfree(iter);
+		return err;
+	}
+	((struct seq_file *) file->private_data)->private = iter;
+	return 0;
+}
+
+static const struct file_operations dfault_proc_fops = {
+	.owner = THIS_MODULE,
+	.open = dfault_proc_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release_private,
+	.write = dfault_proc_write
+};
+
+/*
+ * Allocate a new dfault_table for the given module
+ * and add it to the global list.
+ */
+int dfault_add_module(struct _dfault *tab, unsigned int n,
+		      const char *name)
+{
+	struct dfault_table *dt;
+	char *new_name;
+	const char *func = NULL;
+	int i;
+
+	dt = kzalloc(sizeof(*dt), GFP_KERNEL);
+	if (dt == NULL)
+		return -ENOMEM;
+	new_name = kstrdup(name, GFP_KERNEL);
+	if (new_name == NULL) {
+		kfree(dt);
+		return -ENOMEM;
+	}
+	dt->mod_name = new_name;
+	dt->num_dfaults = n;
+	dt->dfaults = tab;
+
+	mutex_lock(&dfault_lock);
+	list_add_tail(&dt->link, &dfault_tables);
+	mutex_unlock(&dfault_lock);
+
+	/* __attribute__(("section")) emits things in reverse order */
+	for (i = n - 1; i >= 0; i--)
+		if (!func || strcmp(tab[i].function, func))
+			func = tab[i].function;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dfault_add_module);
+
+static void dfault_table_free(struct dfault_table *dt)
+{
+	list_del_init(&dt->link);
+	kfree(dt->mod_name);
+	kfree(dt);
+}
+
+/*
+ * Called in response to a module being unloaded.  Removes
+ * any dfault_table's which point at the module.
+ */
+int dfault_remove_module(char *mod_name)
+{
+	struct dfault_table *dt, *nextdt;
+	int ret = -ENOENT;
+
+	mutex_lock(&dfault_lock);
+	list_for_each_entry_safe(dt, nextdt, &dfault_tables, link) {
+		if (!strcmp(dt->mod_name, mod_name)) {
+			dfault_table_free(dt);
+			ret = 0;
+		}
+	}
+	mutex_unlock(&dfault_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dfault_remove_module);
+
+static void dfault_remove_all_tables(void)
+{
+	mutex_lock(&dfault_lock);
+	while (!list_empty(&dfault_tables)) {
+		struct dfault_table *dt = list_entry(dfault_tables.next,
+						      struct dfault_table,
+						      link);
+		dfault_table_free(dt);
+	}
+	mutex_unlock(&dfault_lock);
+}
+
+static int __init dynamic_fault_init(void)
+{
+	struct dentry *dir, *file;
+	struct _dfault *iter, *iter_start;
+	const char *modname = NULL;
+	int ret = 0;
+	int n = 0;
+
+	dir = debugfs_create_dir("dynamic_fault", NULL);
+	if (!dir)
+		return -ENOMEM;
+	file = debugfs_create_file("control", 0644, dir, NULL,
+					&dfault_proc_fops);
+	if (!file) {
+		debugfs_remove(dir);
+		return -ENOMEM;
+	}
+	if (__start___faults != __stop___faults) {
+		iter = __start___faults;
+		modname = iter->modname;
+		iter_start = iter;
+		for (; iter < __stop___faults; iter++) {
+			if (strcmp(modname, iter->modname)) {
+				ret = dfault_add_module(iter_start, n, modname);
+				if (ret)
+					goto out_free;
+				n = 0;
+				modname = iter->modname;
+				iter_start = iter;
+			}
+			n++;
+		}
+		ret = dfault_add_module(iter_start, n, modname);
+	}
+out_free:
+	if (ret) {
+		dfault_remove_all_tables();
+		debugfs_remove(dir);
+		debugfs_remove(file);
+	}
+	return 0;
+}
+module_init(dynamic_fault_init);
-- 
2.17.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/10] RFC: assorted bcachefs patches
  2018-05-18  7:48 [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
                   ` (9 preceding siblings ...)
  2018-05-18  7:49 ` [PATCH 10/10] Dynamic fault injection Kent Overstreet
@ 2018-05-18  7:55 ` Kent Overstreet
  2018-05-18 17:45 ` Josef Bacik
  11 siblings, 0 replies; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18  7:55 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, darrick.wong, tytso, linux-btrfs,
	clm, jbacik, viro, willy, peterz

shit, git send-email pebkac error - disregard all the block patches in the
thread

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 03/10] locking: bring back lglocks
  2018-05-18  7:49 ` [PATCH 03/10] locking: bring back lglocks Kent Overstreet
@ 2018-05-18  9:51   ` Peter Zijlstra
  2018-05-18 10:13     ` Kent Overstreet
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2018-05-18  9:51 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy

On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> bcachefs makes use of them - also, add a proper lg_lock_init()

Why?! lglocks are horrid things, we got rid of them for a reason. They
have terrifying worst case preemption off latencies.

Why can't you use something like per-cpu rwsems?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()
  2018-05-18  7:49 ` [PATCH 04/10] locking: export osq_lock()/osq_unlock() Kent Overstreet
@ 2018-05-18  9:52   ` Peter Zijlstra
  2018-05-18 10:18     ` Kent Overstreet
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2018-05-18  9:52 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy

On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:

No.. and most certainly not without a _very_ good reason.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 03/10] locking: bring back lglocks
  2018-05-18  9:51   ` Peter Zijlstra
@ 2018-05-18 10:13     ` Kent Overstreet
  2018-05-18 11:03       ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18 10:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy

On Fri, May 18, 2018 at 11:51:02AM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> > bcachefs makes use of them - also, add a proper lg_lock_init()
> 
> Why?! lglocks are horrid things, we got rid of them for a reason. They
> have terrifying worst case preemption off latencies.

Ah. That was missing from your commit message.

> Why can't you use something like per-cpu rwsems?

Well,

 a) in my use case, lg_global_lock() pretty much isn't used in normal operation,
    it's only called when starting mark and sweep gc (which is not needed
    anymore and disabled by default, it'll eventually get rolled into online
    fsck) and for device resize

 b) I'm using it in conjection with percpu counters, and technically yes I
    certainly _could_ use per-cpu sleepable locks (mutexes would make more sense
    for me than rwsems), there's a bit of a clash there and it's going to be a
    bit ugly and messy and it's more work for me. (this_cpu_ptr() no longer
    makes any sense in that case, so it'd mean auditing and converting all the
    code that touches the relevant data structures).

If you're really that dead set against lglocks I might just come up with a new
lock with similar semantics, that doesn't break this_cpu_ptr() but sleeps if the
global lock is held...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()
  2018-05-18  9:52   ` Peter Zijlstra
@ 2018-05-18 10:18     ` Kent Overstreet
  2018-05-18 11:08       ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18 10:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy

On Fri, May 18, 2018 at 11:52:04AM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:
> 
> No.. and most certainly not without a _very_ good reason.

Ok, can I ask why?

Here's what it's for:

commit 61782bf71eef83919af100a9747d8d86dfdf3605
Author: Kent Overstreet <kent.overstreet@gmail.com>
Date:   Fri May 18 06:14:56 2018 -0400

    bcachefs: SIX locks (shared/intent/exclusive)
    
    New lock for bcachefs, like read/write locks but with a third state,
    intent.
    
    Intent locks conflict with each other, but not with read locks; taking a
    write lock requires first holding an intent lock.

diff --git a/fs/bcachefs/six.c b/fs/bcachefs/six.c
new file mode 100644
index 0000000000..afa59a476a
--- /dev/null
+++ b/fs/bcachefs/six.c
@@ -0,0 +1,516 @@
+
+#include <linux/log2.h>
+#include <linux/preempt.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/sched/rt.h>
+
+#include "six.h"
+
+#define six_acquire(l, t)	lock_acquire(l, 0, t, 0, 0, NULL, _RET_IP_)
+#define six_release(l)		lock_release(l, 0, _RET_IP_)
+
+struct six_lock_vals {
+	/* Value we add to the lock in order to take the lock: */
+	u64			lock_val;
+
+	/* If the lock has this value (used as a mask), taking the lock fails: */
+	u64			lock_fail;
+
+	/* Value we add to the lock in order to release the lock: */
+	u64			unlock_val;
+
+	/* Mask that indicates lock is held for this type: */
+	u64			held_mask;
+
+	/* Waitlist we wakeup when releasing the lock: */
+	enum six_lock_type	unlock_wakeup;
+};
+
+#define __SIX_LOCK_HELD_read	__SIX_VAL(read_lock, ~0)
+#define __SIX_LOCK_HELD_intent	__SIX_VAL(intent_lock, ~0)
+#define __SIX_LOCK_HELD_write	__SIX_VAL(seq, 1)
+
+#define LOCK_VALS {							\
+	[SIX_LOCK_read] = {						\
+		.lock_val	= __SIX_VAL(read_lock, 1),		\
+		.lock_fail	= __SIX_LOCK_HELD_write,		\
+		.unlock_val	= -__SIX_VAL(read_lock, 1),		\
+		.held_mask	= __SIX_LOCK_HELD_read,			\
+		.unlock_wakeup	= SIX_LOCK_write,			\
+	},								\
+	[SIX_LOCK_intent] = {						\
+		.lock_val	= __SIX_VAL(intent_lock, 1),		\
+		.lock_fail	= __SIX_LOCK_HELD_intent,		\
+		.unlock_val	= -__SIX_VAL(intent_lock, 1),		\
+		.held_mask	= __SIX_LOCK_HELD_intent,		\
+		.unlock_wakeup	= SIX_LOCK_intent,			\
+	},								\
+	[SIX_LOCK_write] = {						\
+		.lock_val	= __SIX_VAL(seq, 1),			\
+		.lock_fail	= __SIX_LOCK_HELD_read,			\
+		.unlock_val	= __SIX_VAL(seq, 1),			\
+		.held_mask	= __SIX_LOCK_HELD_write,		\
+		.unlock_wakeup	= SIX_LOCK_read,			\
+	},								\
+}
+
+static inline void six_set_owner(struct six_lock *lock, enum six_lock_type type,
+				 union six_lock_state old)
+{
+	if (type != SIX_LOCK_intent)
+		return;
+
+	if (!old.intent_lock) {
+		EBUG_ON(lock->owner);
+		lock->owner = current;
+	} else {
+		EBUG_ON(lock->owner != current);
+	}
+}
+
+static inline void six_clear_owner(struct six_lock *lock, enum six_lock_type type)
+{
+	if (type != SIX_LOCK_intent)
+		return;
+
+	EBUG_ON(lock->owner != current);
+
+	if (lock->state.intent_lock == 1)
+		lock->owner = NULL;
+}
+
+static __always_inline bool do_six_trylock_type(struct six_lock *lock,
+						enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old;
+	u64 v = READ_ONCE(lock->state.v);
+
+	EBUG_ON(type == SIX_LOCK_write && lock->owner != current);
+
+	do {
+		old.v = v;
+
+		EBUG_ON(type == SIX_LOCK_write &&
+			((old.v & __SIX_LOCK_HELD_write) ||
+			 !(old.v & __SIX_LOCK_HELD_intent)));
+
+		if (old.v & l[type].lock_fail)
+			return false;
+	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+				old.v,
+				old.v + l[type].lock_val)) != old.v);
+
+	six_set_owner(lock, type, old);
+	return true;
+}
+
+__always_inline __flatten
+static bool __six_trylock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	if (!do_six_trylock_type(lock, type))
+		return false;
+
+	six_acquire(&lock->dep_map, 1);
+	return true;
+}
+
+__always_inline __flatten
+static bool __six_relock_type(struct six_lock *lock, enum six_lock_type type,
+			      unsigned seq)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old;
+	u64 v = READ_ONCE(lock->state.v);
+
+	do {
+		old.v = v;
+
+		if (old.seq != seq || old.v & l[type].lock_fail)
+			return false;
+	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+				old.v,
+				old.v + l[type].lock_val)) != old.v);
+
+	six_set_owner(lock, type, old);
+	six_acquire(&lock->dep_map, 1);
+	return true;
+}
+
+struct six_lock_waiter {
+	struct list_head	list;
+	struct task_struct	*task;
+};
+
+/* This is probably up there with the more evil things I've done */
+#define waitlist_bitnr(id) ilog2((((union six_lock_state) { .waiters = 1 << (id) }).l))
+
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER
+
+static inline int six_can_spin_on_owner(struct six_lock *lock)
+{
+	struct task_struct *owner;
+	int retval = 1;
+
+	if (need_resched())
+		return 0;
+
+	rcu_read_lock();
+	owner = READ_ONCE(lock->owner);
+	if (owner)
+		retval = owner->on_cpu;
+	rcu_read_unlock();
+	/*
+	 * if lock->owner is not set, the mutex owner may have just acquired
+	 * it and not set the owner yet or the mutex has been released.
+	 */
+	return retval;
+}
+
+static inline bool six_spin_on_owner(struct six_lock *lock,
+				     struct task_struct *owner)
+{
+	bool ret = true;
+
+	rcu_read_lock();
+	while (lock->owner == owner) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking lock->owner still matches owner. If that fails,
+		 * owner might point to freed memory. If it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+
+		if (!owner->on_cpu || need_resched()) {
+			ret = false;
+			break;
+		}
+
+		cpu_relax();
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+{
+	struct task_struct *task = current;
+
+	if (type == SIX_LOCK_write)
+		return false;
+
+	preempt_disable();
+	if (!six_can_spin_on_owner(lock))
+		goto fail;
+
+	if (!osq_lock(&lock->osq))
+		goto fail;
+
+	while (1) {
+		struct task_struct *owner;
+
+		/*
+		 * If there's an owner, wait for it to either
+		 * release the lock or go to sleep.
+		 */
+		owner = READ_ONCE(lock->owner);
+		if (owner && !six_spin_on_owner(lock, owner))
+			break;
+
+		if (do_six_trylock_type(lock, type)) {
+			osq_unlock(&lock->osq);
+			preempt_enable();
+			return true;
+		}
+
+		/*
+		 * When there's no owner, we might have preempted between the
+		 * owner acquiring the lock and setting the owner field. If
+		 * we're an RT task that will live-lock because we won't let
+		 * the owner complete.
+		 */
+		if (!owner && (need_resched() || rt_task(task)))
+			break;
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		cpu_relax();
+	}
+
+	osq_unlock(&lock->osq);
+fail:
+	preempt_enable();
+
+	/*
+	 * If we fell out of the spin path because of need_resched(),
+	 * reschedule now, before we try-lock again. This avoids getting
+	 * scheduled out right after we obtained the lock.
+	 */
+	if (need_resched())
+		schedule();
+
+	return false;
+}
+
+#else /* CONFIG_LOCK_SPIN_ON_OWNER */
+
+static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+{
+	return false;
+}
+
+#endif
+
+noinline
+static void __six_lock_type_slowpath(struct six_lock *lock, enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old, new;
+	struct six_lock_waiter wait;
+	u64 v;
+
+	if (six_optimistic_spin(lock, type))
+		return;
+
+	lock_contended(&lock->dep_map, _RET_IP_);
+
+	INIT_LIST_HEAD(&wait.list);
+	wait.task = current;
+
+	while (1) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (type == SIX_LOCK_write)
+			EBUG_ON(lock->owner != current);
+		else if (list_empty_careful(&wait.list)) {
+			raw_spin_lock(&lock->wait_lock);
+			list_add_tail(&wait.list, &lock->wait_list[type]);
+			raw_spin_unlock(&lock->wait_lock);
+		}
+
+		v = READ_ONCE(lock->state.v);
+		do {
+			new.v = old.v = v;
+
+			if (!(old.v & l[type].lock_fail))
+				new.v += l[type].lock_val;
+			else if (!(new.waiters & (1 << type)))
+				new.waiters |= 1 << type;
+			else
+				break; /* waiting bit already set */
+		} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+					old.v, new.v)) != old.v);
+
+		if (!(old.v & l[type].lock_fail))
+			break;
+
+		schedule();
+	}
+
+	six_set_owner(lock, type, old);
+
+	__set_current_state(TASK_RUNNING);
+
+	if (!list_empty_careful(&wait.list)) {
+		raw_spin_lock(&lock->wait_lock);
+		list_del_init(&wait.list);
+		raw_spin_unlock(&lock->wait_lock);
+	}
+}
+
+__always_inline
+static void __six_lock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	six_acquire(&lock->dep_map, 0);
+
+	if (!do_six_trylock_type(lock, type))
+		__six_lock_type_slowpath(lock, type);
+
+	lock_acquired(&lock->dep_map, _RET_IP_);
+}
+
+static inline void six_lock_wakeup(struct six_lock *lock,
+				   union six_lock_state state,
+				   unsigned waitlist_id)
+{
+	struct list_head *wait_list = &lock->wait_list[waitlist_id];
+	struct six_lock_waiter *w, *next;
+
+	if (waitlist_id == SIX_LOCK_write && state.read_lock)
+		return;
+
+	if (!(state.waiters & (1 << waitlist_id)))
+		return;
+
+	clear_bit(waitlist_bitnr(waitlist_id),
+		  (unsigned long *) &lock->state.v);
+
+	if (waitlist_id == SIX_LOCK_write) {
+		struct task_struct *p = READ_ONCE(lock->owner);
+
+		if (p)
+			wake_up_process(p);
+		return;
+	}
+
+	raw_spin_lock(&lock->wait_lock);
+
+	list_for_each_entry_safe(w, next, wait_list, list) {
+		list_del_init(&w->list);
+
+		if (wake_up_process(w->task) &&
+		    waitlist_id != SIX_LOCK_read) {
+			if (!list_empty(wait_list))
+				set_bit(waitlist_bitnr(waitlist_id),
+					(unsigned long *) &lock->state.v);
+			break;
+		}
+	}
+
+	raw_spin_unlock(&lock->wait_lock);
+}
+
+__always_inline __flatten
+static void __six_unlock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state state;
+
+	EBUG_ON(!(lock->state.v & l[type].held_mask));
+	EBUG_ON(type == SIX_LOCK_write &&
+		!(lock->state.v & __SIX_LOCK_HELD_intent));
+
+	six_clear_owner(lock, type);
+
+	state.v = atomic64_add_return_release(l[type].unlock_val,
+					      &lock->state.counter);
+	six_release(&lock->dep_map);
+	six_lock_wakeup(lock, state, l[type].unlock_wakeup);
+}
+
+#ifdef SIX_LOCK_SEPARATE_LOCKFNS
+
+#define __SIX_LOCK(type)						\
+bool six_trylock_##type(struct six_lock *lock)				\
+{									\
+	return __six_trylock_type(lock, SIX_LOCK_##type);		\
+}									\
+									\
+bool six_relock_##type(struct six_lock *lock, u32 seq)			\
+{									\
+	return __six_relock_type(lock, SIX_LOCK_##type, seq);		\
+}									\
+									\
+void six_lock_##type(struct six_lock *lock)				\
+{									\
+	__six_lock_type(lock, SIX_LOCK_##type);				\
+}									\
+									\
+void six_unlock_##type(struct six_lock *lock)				\
+{									\
+	__six_unlock_type(lock, SIX_LOCK_##type);			\
+}
+
+__SIX_LOCK(read)
+__SIX_LOCK(intent)
+__SIX_LOCK(write)
+
+#undef __SIX_LOCK
+
+#else
+
+bool six_trylock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	return __six_trylock_type(lock, type);
+}
+
+bool six_relock_type(struct six_lock *lock, enum six_lock_type type,
+		     unsigned seq)
+{
+	return __six_relock_type(lock, type, seq);
+
+}
+
+void six_lock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	__six_lock_type(lock, type);
+}
+
+void six_unlock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	__six_unlock_type(lock, type);
+}
+
+#endif
+
+/* Convert from intent to read: */
+void six_lock_downgrade(struct six_lock *lock)
+{
+	six_lock_increment(lock, SIX_LOCK_read);
+	six_unlock_intent(lock);
+}
+
+bool six_lock_tryupgrade(struct six_lock *lock)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old, new;
+	u64 v = READ_ONCE(lock->state.v);
+
+	do {
+		new.v = old.v = v;
+
+		EBUG_ON(!(old.v & l[SIX_LOCK_read].held_mask));
+
+		new.v += l[SIX_LOCK_read].unlock_val;
+
+		if (new.v & l[SIX_LOCK_intent].lock_fail)
+			return false;
+
+		new.v += l[SIX_LOCK_intent].lock_val;
+	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+				old.v, new.v)) != old.v);
+
+	six_set_owner(lock, SIX_LOCK_intent, old);
+	six_lock_wakeup(lock, new, l[SIX_LOCK_read].unlock_wakeup);
+
+	return true;
+}
+
+bool six_trylock_convert(struct six_lock *lock,
+			 enum six_lock_type from,
+			 enum six_lock_type to)
+{
+	EBUG_ON(to == SIX_LOCK_write || from == SIX_LOCK_write);
+
+	if (to == from)
+		return true;
+
+	if (to == SIX_LOCK_read) {
+		six_lock_downgrade(lock);
+		return true;
+	} else {
+		return six_lock_tryupgrade(lock);
+	}
+}
+
+/*
+ * Increment read/intent lock count, assuming we already have it read or intent
+ * locked:
+ */
+void six_lock_increment(struct six_lock *lock, enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+
+	EBUG_ON(type == SIX_LOCK_write);
+	six_acquire(&lock->dep_map, 0);
+
+	/* XXX: assert already locked, and that we don't overflow: */
+
+	atomic64_add(l[type].lock_val, &lock->state.counter);
+}
diff --git a/fs/bcachefs/six.h b/fs/bcachefs/six.h
new file mode 100644
index 0000000000..f518c64c40
--- /dev/null
+++ b/fs/bcachefs/six.h
@@ -0,0 +1,190 @@
+#ifndef _BCACHEFS_SIX_H
+#define _BCACHEFS_SIX_H
+
+#include <linux/lockdep.h>
+#include <linux/osq_lock.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+
+#include "util.h"
+
+#define SIX_LOCK_SEPARATE_LOCKFNS
+
+/*
+ * LOCK STATES:
+ *
+ * read, intent, write (i.e. shared/intent/exclusive, hence the name)
+ *
+ * read and write work as with normal read/write locks - a lock can have
+ * multiple readers, but write excludes reads and other write locks.
+ *
+ * Intent does not block read, but it does block other intent locks. The idea is
+ * by taking an intent lock, you can then later upgrade to a write lock without
+ * dropping your read lock and without deadlocking - because no other thread has
+ * the intent lock and thus no other thread could be trying to take the write
+ * lock.
+ */
+
+union six_lock_state {
+	struct {
+		atomic64_t	counter;
+	};
+
+	struct {
+		u64		v;
+	};
+
+	struct {
+		/* for waitlist_bitnr() */
+		unsigned long	l;
+	};
+
+	struct {
+		unsigned	read_lock:26;
+		unsigned	intent_lock:3;
+		unsigned	waiters:3;
+		/*
+		 * seq works much like in seqlocks: it's incremented every time
+		 * we lock and unlock for write.
+		 *
+		 * If it's odd write lock is held, even unlocked.
+		 *
+		 * Thus readers can unlock, and then lock again later iff it
+		 * hasn't been modified in the meantime.
+		 */
+		u32		seq;
+	};
+};
+
+#define SIX_LOCK_MAX_RECURSE	((1 << 3) - 1)
+
+enum six_lock_type {
+	SIX_LOCK_read,
+	SIX_LOCK_intent,
+	SIX_LOCK_write,
+};
+
+struct six_lock {
+	union six_lock_state	state;
+	struct task_struct	*owner;
+	struct optimistic_spin_queue osq;
+
+	raw_spinlock_t		wait_lock;
+	struct list_head	wait_list[2];
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+#endif
+};
+
+static __always_inline void __six_lock_init(struct six_lock *lock,
+					    const char *name,
+					    struct lock_class_key *key)
+{
+	atomic64_set(&lock->state.counter, 0);
+	raw_spin_lock_init(&lock->wait_lock);
+	INIT_LIST_HEAD(&lock->wait_list[SIX_LOCK_read]);
+	INIT_LIST_HEAD(&lock->wait_list[SIX_LOCK_intent]);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	debug_check_no_locks_freed((void *) lock, sizeof(*lock));
+	lockdep_init_map(&lock->dep_map, name, key, 0);
+#endif
+}
+
+#define six_lock_init(lock)						\
+do {									\
+	static struct lock_class_key __key;				\
+									\
+	__six_lock_init((lock), #lock, &__key);				\
+} while (0)
+
+#define __SIX_VAL(field, _v)	(((union six_lock_state) { .field = _v }).v)
+
+#ifdef SIX_LOCK_SEPARATE_LOCKFNS
+
+#define __SIX_LOCK(type)						\
+bool six_trylock_##type(struct six_lock *);				\
+bool six_relock_##type(struct six_lock *, u32);				\
+void six_lock_##type(struct six_lock *);				\
+void six_unlock_##type(struct six_lock *);
+
+__SIX_LOCK(read)
+__SIX_LOCK(intent)
+__SIX_LOCK(write)
+#undef __SIX_LOCK
+
+#define SIX_LOCK_DISPATCH(type, fn, ...)			\
+	switch (type) {						\
+	case SIX_LOCK_read:					\
+		return fn##_read(__VA_ARGS__);			\
+	case SIX_LOCK_intent:					\
+		return fn##_intent(__VA_ARGS__);		\
+	case SIX_LOCK_write:					\
+		return fn##_write(__VA_ARGS__);			\
+	default:						\
+		BUG();						\
+	}
+
+static inline bool six_trylock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	SIX_LOCK_DISPATCH(type, six_trylock, lock);
+}
+
+static inline bool six_relock_type(struct six_lock *lock, enum six_lock_type type,
+		     unsigned seq)
+{
+	SIX_LOCK_DISPATCH(type, six_relock, lock, seq);
+}
+
+static inline void six_lock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	SIX_LOCK_DISPATCH(type, six_lock, lock);
+}
+
+static inline void six_unlock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	SIX_LOCK_DISPATCH(type, six_unlock, lock);
+}
+
+#else
+
+bool six_trylock_type(struct six_lock *, enum six_lock_type);
+bool six_relock_type(struct six_lock *, enum six_lock_type, unsigned);
+void six_lock_type(struct six_lock *, enum six_lock_type);
+void six_unlock_type(struct six_lock *, enum six_lock_type);
+
+#define __SIX_LOCK(type)						\
+static __always_inline bool six_trylock_##type(struct six_lock *lock)	\
+{									\
+	return six_trylock_type(lock, SIX_LOCK_##type);			\
+}									\
+									\
+static __always_inline bool six_relock_##type(struct six_lock *lock, u32 seq)\
+{									\
+	return six_relock_type(lock, SIX_LOCK_##type, seq);		\
+}									\
+									\
+static __always_inline void six_lock_##type(struct six_lock *lock)	\
+{									\
+	six_lock_type(lock, SIX_LOCK_##type);				\
+}									\
+									\
+static __always_inline void six_unlock_##type(struct six_lock *lock)	\
+{									\
+	six_unlock_type(lock, SIX_LOCK_##type);				\
+}
+
+__SIX_LOCK(read)
+__SIX_LOCK(intent)
+__SIX_LOCK(write)
+#undef __SIX_LOCK
+
+#endif
+
+void six_lock_downgrade(struct six_lock *);
+bool six_lock_tryupgrade(struct six_lock *);
+bool six_trylock_convert(struct six_lock *, enum six_lock_type,
+			 enum six_lock_type);
+
+void six_lock_increment(struct six_lock *, enum six_lock_type);
+
+#endif /* _BCACHEFS_SIX_H */

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 03/10] locking: bring back lglocks
  2018-05-18 10:13     ` Kent Overstreet
@ 2018-05-18 11:03       ` Peter Zijlstra
  2018-05-18 11:39         ` Kent Overstreet
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2018-05-18 11:03 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy

On Fri, May 18, 2018 at 06:13:53AM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 11:51:02AM +0200, Peter Zijlstra wrote:
> > On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> > > bcachefs makes use of them - also, add a proper lg_lock_init()
> > 
> > Why?! lglocks are horrid things, we got rid of them for a reason. They
> > have terrifying worst case preemption off latencies.
> 
> Ah. That was missing from your commit message.

Yeah, sorry, sometimes it's hard to state what is obvious to oneself :/

> > Why can't you use something like per-cpu rwsems?
> 
> Well,
> 
>  a) in my use case, lg_global_lock() pretty much isn't used in normal operation,
>     it's only called when starting mark and sweep gc (which is not needed
>     anymore and disabled by default, it'll eventually get rolled into online
>     fsck) and for device resize
> 
>  b) I'm using it in conjection with percpu counters, and technically yes I
>     certainly _could_ use per-cpu sleepable locks (mutexes would make more sense
>     for me than rwsems), there's a bit of a clash there and it's going to be a
>     bit ugly and messy and it's more work for me. (this_cpu_ptr() no longer
>     makes any sense in that case, so it'd mean auditing and converting all the
>     code that touches the relevant data structures).

Well, lg is a reader-writer style lock per definition, as you want
concurrency on the local and full exclusion against the global, so I'm
not sure how mutexes fit into this.

In any case, have a look at percpu_down_read_preempt_disable() and
percpu_up_read_preempt_enable(); they're a bit of a hack but they should
work for you I think.

They will sleep at down_read, but the entire actual critical section
will be with preemption disabled -- therefore it had better be short and
bounded, and the RT guys will thank you for not using spinlock_t under
it (use raw_spinlock_t if you have to).

The (global) writer side will block and be preemptible like normal.

> If you're really that dead set against lglocks I might just come up with a new
> lock with similar semantics, that doesn't break this_cpu_ptr() but sleeps if the
> global lock is held...

See above, we already have this ;-)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()
  2018-05-18 10:18     ` Kent Overstreet
@ 2018-05-18 11:08       ` Peter Zijlstra
  2018-05-18 11:32         ` Kent Overstreet
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2018-05-18 11:08 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy

On Fri, May 18, 2018 at 06:18:04AM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 11:52:04AM +0200, Peter Zijlstra wrote:
> > On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:
> > 
> > No.. and most certainly not without a _very_ good reason.
> 
> Ok, can I ask why?

Because it is an internal helper for lock implementations that want to
do optimistic spinning, it isn't a lock on its own and lacks several
things you would expect.

Using it is tricky and I don't trust random module authors to get 1+1
right, let alone use this thing correctly (no judgement on your code,
just in general).

> Here's what it's for:

I'll try and have a look soon :-) But does that really _have_ to live in
a module?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()
  2018-05-18 11:08       ` Peter Zijlstra
@ 2018-05-18 11:32         ` Kent Overstreet
  2018-05-18 11:40           ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18 11:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy

On Fri, May 18, 2018 at 01:08:08PM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 06:18:04AM -0400, Kent Overstreet wrote:
> > On Fri, May 18, 2018 at 11:52:04AM +0200, Peter Zijlstra wrote:
> > > On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:
> > > 
> > > No.. and most certainly not without a _very_ good reason.
> > 
> > Ok, can I ask why?
> 
> Because it is an internal helper for lock implementations that want to
> do optimistic spinning, it isn't a lock on its own and lacks several
> things you would expect.
> 
> Using it is tricky and I don't trust random module authors to get 1+1
> right, let alone use this thing correctly (no judgement on your code,
> just in general).

Yeah, that's true. I just modelled my usage on the rwsem code.

It does strike me that the whole optimistic spin algorithm
(mutex_optimistic_spin() and rwsem_optimistic_spin()) are ripe for factoring
out. They've been growing more optimizations I see, and the optimizations mostly
aren't specific to either locks.

> > Here's what it's for:
> 
> I'll try and have a look soon :-) But does that really _have_ to live in
> a module?

No, I'd be completely fine with moving six locks out of bcachefs, just don't
know that there'd be any other users. But I suppose we do have other filesystems
that use btrees, and that's what they're for.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 03/10] locking: bring back lglocks
  2018-05-18 11:03       ` Peter Zijlstra
@ 2018-05-18 11:39         ` Kent Overstreet
  0 siblings, 0 replies; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18 11:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy

On Fri, May 18, 2018 at 01:03:39PM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 06:13:53AM -0400, Kent Overstreet wrote:
> > On Fri, May 18, 2018 at 11:51:02AM +0200, Peter Zijlstra wrote:
> > > On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> > > > bcachefs makes use of them - also, add a proper lg_lock_init()
> > > 
> > > Why?! lglocks are horrid things, we got rid of them for a reason. They
> > > have terrifying worst case preemption off latencies.
> > 
> > Ah. That was missing from your commit message.
> 
> Yeah, sorry, sometimes it's hard to state what is obvious to oneself :/
> 
> > > Why can't you use something like per-cpu rwsems?
> > 
> > Well,
> > 
> >  a) in my use case, lg_global_lock() pretty much isn't used in normal operation,
> >     it's only called when starting mark and sweep gc (which is not needed
> >     anymore and disabled by default, it'll eventually get rolled into online
> >     fsck) and for device resize
> > 
> >  b) I'm using it in conjection with percpu counters, and technically yes I
> >     certainly _could_ use per-cpu sleepable locks (mutexes would make more sense
> >     for me than rwsems), there's a bit of a clash there and it's going to be a
> >     bit ugly and messy and it's more work for me. (this_cpu_ptr() no longer
> >     makes any sense in that case, so it'd mean auditing and converting all the
> >     code that touches the relevant data structures).
> 
> Well, lg is a reader-writer style lock per definition, as you want
> concurrency on the local and full exclusion against the global, so I'm
> not sure how mutexes fit into this.
> 
> In any case, have a look at percpu_down_read_preempt_disable() and
> percpu_up_read_preempt_enable(); they're a bit of a hack but they should
> work for you I think.
> 
> They will sleep at down_read, but the entire actual critical section
> will be with preemption disabled -- therefore it had better be short and
> bounded, and the RT guys will thank you for not using spinlock_t under
> it (use raw_spinlock_t if you have to).
> 
> The (global) writer side will block and be preemptible like normal.
> 
> > If you're really that dead set against lglocks I might just come up with a new
> > lock with similar semantics, that doesn't break this_cpu_ptr() but sleeps if the
> > global lock is held...
> 
> See above, we already have this ;-)

Ok, I think this might work. I'll have to stare awhile and make sure I remember
everything I'm currently depending on the lglock for...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()
  2018-05-18 11:32         ` Kent Overstreet
@ 2018-05-18 11:40           ` Peter Zijlstra
  2018-05-18 12:40             ` Kent Overstreet
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2018-05-18 11:40 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy

On Fri, May 18, 2018 at 07:32:05AM -0400, Kent Overstreet wrote:
> It does strike me that the whole optimistic spin algorithm
> (mutex_optimistic_spin() and rwsem_optimistic_spin()) are ripe for factoring
> out. They've been growing more optimizations I see, and the optimizations mostly
> aren't specific to either locks.

There's a few unfortunate differences between the two; but yes it would
be good if we could reduce some of the duplication found there.

One of the distinct differences is that mutex (now) has the lock state
and owner in a single atomic word, while rwsem still tracks the owner in
a separate word and thus needs to account for 'inconsistent' owner
state.

And then there's warts such as ww_mutex and rwsem_owner_is_reader and
similar.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()
  2018-05-18 11:40           ` Peter Zijlstra
@ 2018-05-18 12:40             ` Kent Overstreet
  0 siblings, 0 replies; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18 12:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy

On Fri, May 18, 2018 at 01:40:54PM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 07:32:05AM -0400, Kent Overstreet wrote:
> > It does strike me that the whole optimistic spin algorithm
> > (mutex_optimistic_spin() and rwsem_optimistic_spin()) are ripe for factoring
> > out. They've been growing more optimizations I see, and the optimizations mostly
> > aren't specific to either locks.
> 
> There's a few unfortunate differences between the two; but yes it would
> be good if we could reduce some of the duplication found there.
> 
> One of the distinct differences is that mutex (now) has the lock state
> and owner in a single atomic word, while rwsem still tracks the owner in
> a separate word and thus needs to account for 'inconsistent' owner
> state.
> 
> And then there's warts such as ww_mutex and rwsem_owner_is_reader and
> similar.

The rwsem code seems workable, osq_optimistic_spin() takes callback for trylock
and get_owner - I just added a OWNER_NO_SPIN value that get_owner() can return.

The mutex code though... wtf...


commit 5c7defe81ee722f5087cbeda9be6e7ee715515d7
Author: Kent Overstreet <kent.overstreet@gmail.com>
Date:   Fri May 18 08:35:55 2018 -0400

    Factor out osq_optimistic_spin()

diff --git a/fs/bcachefs/six.c b/fs/bcachefs/six.c
index afa59a476a..dbaf52abef 100644
--- a/fs/bcachefs/six.c
+++ b/fs/bcachefs/six.c
@@ -1,5 +1,6 @@
 
 #include <linux/log2.h>
+#include <linux/osq_optimistic_spin.h>
 #include <linux/preempt.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
@@ -146,127 +147,26 @@ struct six_lock_waiter {
 /* This is probably up there with the more evil things I've done */
 #define waitlist_bitnr(id) ilog2((((union six_lock_state) { .waiters = 1 << (id) }).l))
 
-#ifdef CONFIG_LOCK_SPIN_ON_OWNER
-
-static inline int six_can_spin_on_owner(struct six_lock *lock)
-{
-	struct task_struct *owner;
-	int retval = 1;
-
-	if (need_resched())
-		return 0;
-
-	rcu_read_lock();
-	owner = READ_ONCE(lock->owner);
-	if (owner)
-		retval = owner->on_cpu;
-	rcu_read_unlock();
-	/*
-	 * if lock->owner is not set, the mutex owner may have just acquired
-	 * it and not set the owner yet or the mutex has been released.
-	 */
-	return retval;
-}
-
-static inline bool six_spin_on_owner(struct six_lock *lock,
-				     struct task_struct *owner)
+static inline struct task_struct *six_osq_get_owner(struct optimistic_spin_queue *osq)
 {
-	bool ret = true;
-
-	rcu_read_lock();
-	while (lock->owner == owner) {
-		/*
-		 * Ensure we emit the owner->on_cpu, dereference _after_
-		 * checking lock->owner still matches owner. If that fails,
-		 * owner might point to freed memory. If it still matches,
-		 * the rcu_read_lock() ensures the memory stays valid.
-		 */
-		barrier();
-
-		if (!owner->on_cpu || need_resched()) {
-			ret = false;
-			break;
-		}
-
-		cpu_relax();
-	}
-	rcu_read_unlock();
+	struct six_lock *lock = container_of(osq, struct six_lock, osq);
 
-	return ret;
+	return lock->owner;
 }
 
-static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+static inline bool six_osq_trylock_read(struct optimistic_spin_queue *osq)
 {
-	struct task_struct *task = current;
-
-	if (type == SIX_LOCK_write)
-		return false;
-
-	preempt_disable();
-	if (!six_can_spin_on_owner(lock))
-		goto fail;
-
-	if (!osq_lock(&lock->osq))
-		goto fail;
-
-	while (1) {
-		struct task_struct *owner;
-
-		/*
-		 * If there's an owner, wait for it to either
-		 * release the lock or go to sleep.
-		 */
-		owner = READ_ONCE(lock->owner);
-		if (owner && !six_spin_on_owner(lock, owner))
-			break;
+	struct six_lock *lock = container_of(osq, struct six_lock, osq);
 
-		if (do_six_trylock_type(lock, type)) {
-			osq_unlock(&lock->osq);
-			preempt_enable();
-			return true;
-		}
-
-		/*
-		 * When there's no owner, we might have preempted between the
-		 * owner acquiring the lock and setting the owner field. If
-		 * we're an RT task that will live-lock because we won't let
-		 * the owner complete.
-		 */
-		if (!owner && (need_resched() || rt_task(task)))
-			break;
-
-		/*
-		 * The cpu_relax() call is a compiler barrier which forces
-		 * everything in this loop to be re-loaded. We don't need
-		 * memory barriers as we'll eventually observe the right
-		 * values at the cost of a few extra spins.
-		 */
-		cpu_relax();
-	}
-
-	osq_unlock(&lock->osq);
-fail:
-	preempt_enable();
-
-	/*
-	 * If we fell out of the spin path because of need_resched(),
-	 * reschedule now, before we try-lock again. This avoids getting
-	 * scheduled out right after we obtained the lock.
-	 */
-	if (need_resched())
-		schedule();
-
-	return false;
+	return do_six_trylock_type(lock, SIX_LOCK_read);
 }
 
-#else /* CONFIG_LOCK_SPIN_ON_OWNER */
-
-static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+static inline bool six_osq_trylock_intent(struct optimistic_spin_queue *osq)
 {
-	return false;
-}
+	struct six_lock *lock = container_of(osq, struct six_lock, osq);
 
-#endif
+	return do_six_trylock_type(lock, SIX_LOCK_intent);
+}
 
 noinline
 static void __six_lock_type_slowpath(struct six_lock *lock, enum six_lock_type type)
@@ -276,8 +176,20 @@ static void __six_lock_type_slowpath(struct six_lock *lock, enum six_lock_type t
 	struct six_lock_waiter wait;
 	u64 v;
 
-	if (six_optimistic_spin(lock, type))
-		return;
+	switch (type) {
+	case SIX_LOCK_read:
+		if (osq_optimistic_spin(&lock->osq, six_osq_get_owner,
+					six_osq_trylock_read))
+			return;
+		break;
+	case SIX_LOCK_intent:
+		if (osq_optimistic_spin(&lock->osq, six_osq_get_owner,
+					six_osq_trylock_intent))
+			return;
+		break;
+	case SIX_LOCK_write:
+		break;
+	}
 
 	lock_contended(&lock->dep_map, _RET_IP_);
 
diff --git a/include/linux/osq_optimistic_spin.h b/include/linux/osq_optimistic_spin.h
new file mode 100644
index 0000000000..1f5fd1f85b
--- /dev/null
+++ b/include/linux/osq_optimistic_spin.h
@@ -0,0 +1,155 @@
+#ifndef __LINUX_OSQ_OPTIMISTIC_SPIN_H
+#define __LINUX_OSQ_OPTIMISTIC_SPIN_H
+
+#include <linux/sched.h>
+#include <linux/sched/rt.h>
+
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER
+
+typedef struct task_struct *(*osq_get_owner_fn)(struct optimistic_spin_queue *osq);
+typedef bool (*osq_trylock_fn)(struct optimistic_spin_queue *osq);
+
+#define OWNER_NO_SPIN		((struct task_struct *) 1UL)
+
+static inline bool osq_can_spin_on_owner(struct optimistic_spin_queue *lock,
+					 osq_get_owner_fn get_owner)
+{
+	struct task_struct *owner;
+	bool ret;
+
+	if (need_resched())
+		return false;
+
+	rcu_read_lock();
+	owner = get_owner(lock);
+	/*
+	 * if lock->owner is not set, the lock owner may have just acquired
+	 * it and not set the owner yet, or it may have just been unlocked
+	 */
+	if (!owner)
+		ret = true;
+	else if (owner == OWNER_NO_SPIN)
+		ret = false;
+	else
+		ret = owner->on_cpu != 0;
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool osq_spin_on_owner(struct optimistic_spin_queue *lock,
+				     struct task_struct *owner,
+				     osq_get_owner_fn get_owner)
+{
+	if (!owner)
+		return true;
+
+	if (owner == OWNER_NO_SPIN)
+		return false;
+
+	while (1) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking lock->owner still matches owner. If that fails,
+		 * owner might point to freed memory. If it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+		if (get_owner(lock) != owner)
+			return true;
+
+		if (!owner->on_cpu || need_resched() ||
+		    vcpu_is_preempted(task_cpu(owner)))
+			return false;
+
+		cpu_relax();
+	}
+}
+
+static inline bool osq_optimistic_spin(struct optimistic_spin_queue *lock,
+				       osq_get_owner_fn get_owner,
+				       osq_trylock_fn trylock)
+{
+	struct task_struct *task = current;
+
+	preempt_disable();
+	if (!osq_can_spin_on_owner(lock, get_owner))
+		goto fail;
+
+	if (!osq_lock(lock))
+		goto fail;
+
+	while (1) {
+		struct task_struct *owner;
+
+		/*
+		 * If there's an owner, wait for it to either
+		 * release the lock or go to sleep.
+		 */
+		rcu_read_lock();
+		owner = get_owner(lock);
+		if (!osq_spin_on_owner(lock, owner, get_owner)) {
+			rcu_read_unlock();
+			break;
+		}
+		rcu_read_unlock();
+
+		if (trylock(lock)) {
+			osq_unlock(lock);
+			preempt_enable();
+			return true;
+		}
+
+		/*
+		 * When there's no owner, we might have preempted between the
+		 * owner acquiring the lock and setting the owner field. If
+		 * we're an RT task that will live-lock because we won't let
+		 * the owner complete.
+		 */
+		if (!owner && (need_resched() || rt_task(task)))
+			break;
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		cpu_relax();
+	}
+
+	osq_unlock(lock);
+fail:
+	preempt_enable();
+
+	/*
+	 * If we fell out of the spin path because of need_resched(),
+	 * reschedule now, before we try-lock again. This avoids getting
+	 * scheduled out right after we obtained the lock.
+	 */
+	if (need_resched())
+		schedule();
+
+	return false;
+}
+
+static inline bool osq_has_spinner(struct optimistic_spin_queue *lock)
+{
+	return osq_is_locked(lock);
+}
+
+#else /* CONFIG_LOCK_SPIN_ON_OWNER */
+
+static inline bool osq_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+{
+	return false;
+}
+
+static inline bool osq_has_spinner(struct optimistic_spin_queue *lock)
+{
+	return false;
+}
+
+#endif
+
+#endif /* __LINUX_OSQ_OPTIMISTIC_SPIN_H */
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index e795908f36..a25ee6668f 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -13,6 +13,7 @@
 #include <linux/rwsem.h>
 #include <linux/init.h>
 #include <linux/export.h>
+#include <linux/osq_optimistic_spin.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/rt.h>
 #include <linux/sched/wake_q.h>
@@ -325,11 +326,21 @@ static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
 }
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+
+static inline struct task_struct *rwsem_osq_get_owner(struct optimistic_spin_queue *osq)
+{
+	struct rw_semaphore *sem = container_of(osq, struct rw_semaphore, osq);
+	struct task_struct *owner = READ_ONCE(sem->owner);
+
+	return rwsem_owner_is_reader(owner) ? OWNER_NO_SPIN : owner;
+}
+
 /*
  * Try to acquire write lock before the writer has been put on wait queue.
  */
-static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+static inline bool rwsem_osq_trylock(struct optimistic_spin_queue *osq)
 {
+	struct rw_semaphore *sem = container_of(osq, struct rw_semaphore, osq);
 	long old, count = atomic_long_read(&sem->count);
 
 	while (true) {
@@ -347,125 +358,10 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 	}
 }
 
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
-{
-	struct task_struct *owner;
-	bool ret = true;
-
-	if (need_resched())
-		return false;
-
-	rcu_read_lock();
-	owner = READ_ONCE(sem->owner);
-	if (!rwsem_owner_is_writer(owner)) {
-		/*
-		 * Don't spin if the rwsem is readers owned.
-		 */
-		ret = !rwsem_owner_is_reader(owner);
-		goto done;
-	}
-
-	/*
-	 * As lock holder preemption issue, we both skip spinning if task is not
-	 * on cpu or its cpu is preempted
-	 */
-	ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
-done:
-	rcu_read_unlock();
-	return ret;
-}
-
-/*
- * Return true only if we can still spin on the owner field of the rwsem.
- */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
-{
-	struct task_struct *owner = READ_ONCE(sem->owner);
-
-	if (!rwsem_owner_is_writer(owner))
-		goto out;
-
-	rcu_read_lock();
-	while (sem->owner == owner) {
-		/*
-		 * Ensure we emit the owner->on_cpu, dereference _after_
-		 * checking sem->owner still matches owner, if that fails,
-		 * owner might point to free()d memory, if it still matches,
-		 * the rcu_read_lock() ensures the memory stays valid.
-		 */
-		barrier();
-
-		/*
-		 * abort spinning when need_resched or owner is not running or
-		 * owner's cpu is preempted.
-		 */
-		if (!owner->on_cpu || need_resched() ||
-				vcpu_is_preempted(task_cpu(owner))) {
-			rcu_read_unlock();
-			return false;
-		}
-
-		cpu_relax();
-	}
-	rcu_read_unlock();
-out:
-	/*
-	 * If there is a new owner or the owner is not set, we continue
-	 * spinning.
-	 */
-	return !rwsem_owner_is_reader(READ_ONCE(sem->owner));
-}
-
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
-	bool taken = false;
-
-	preempt_disable();
-
-	/* sem->wait_lock should not be held when doing optimistic spinning */
-	if (!rwsem_can_spin_on_owner(sem))
-		goto done;
-
-	if (!osq_lock(&sem->osq))
-		goto done;
-
-	/*
-	 * Optimistically spin on the owner field and attempt to acquire the
-	 * lock whenever the owner changes. Spinning will be stopped when:
-	 *  1) the owning writer isn't running; or
-	 *  2) readers own the lock as we can't determine if they are
-	 *     actively running or not.
-	 */
-	while (rwsem_spin_on_owner(sem)) {
-		/*
-		 * Try to acquire the lock
-		 */
-		if (rwsem_try_write_lock_unqueued(sem)) {
-			taken = true;
-			break;
-		}
-
-		/*
-		 * When there's no owner, we might have preempted between the
-		 * owner acquiring the lock and setting the owner field. If
-		 * we're an RT task that will live-lock because we won't let
-		 * the owner complete.
-		 */
-		if (!sem->owner && (need_resched() || rt_task(current)))
-			break;
-
-		/*
-		 * The cpu_relax() call is a compiler barrier which forces
-		 * everything in this loop to be re-loaded. We don't need
-		 * memory barriers as we'll eventually observe the right
-		 * values at the cost of a few extra spins.
-		 */
-		cpu_relax();
-	}
-	osq_unlock(&sem->osq);
-done:
-	preempt_enable();
-	return taken;
+	return osq_optimistic_spin(&sem->osq, rwsem_osq_get_owner,
+				   rwsem_osq_trylock);
 }
 
 /*

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/10] mm: pagecache add lock
  2018-05-18  7:49 ` [PATCH 01/10] mm: pagecache add lock Kent Overstreet
@ 2018-05-18 13:13   ` Matthew Wilcox
  2018-05-18 15:53     ` Christoph Hellwig
  0 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2018-05-18 13:13 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, peterz

On Fri, May 18, 2018 at 03:49:00AM -0400, Kent Overstreet wrote:
> Add a per address space lock around adding pages to the pagecache - making it
> possible for fallocate INSERT_RANGE/COLLAPSE_RANGE to work correctly, and also
> hopefully making truncate and dio a bit saner.

(moving this section here from the overall description so I can reply
to it in one place)

>  * pagecache add lock
> 
> This is the only one that touches existing code in nontrivial ways.
> The problem it's solving is that there is no existing general mechanism
> for shooting down pages in the page and keeping them removed, which is a
> real problem if you're doing anything that modifies file data and isn't
> buffered writes.
> 
> Historically, the only problematic case has been direct IO, and people
> have been willing to say "well, if you mix buffered and direct IO you
> get what you deserve", and that's probably not unreasonable. But now we
> have fallocate insert range and collapse range, and those are broken in
> ways I frankly don't want to think about if they can't ensure consistency
> with the page cache.

ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
You may get pushback on the grounds that this ought to be a
filesystem-specific lock rather than one embedded in the generic inode.

> Also, the mechanism truncate uses (i_size and sacrificing a goat) has
> historically been rather fragile, IMO it might be a good think if we
> switched it to a more general rigorous mechanism.
> 
> I need this solved for bcachefs because without this mechanism, the page
> cache inconsistencies lead to various assertions popping (primarily when
> we didn't think we need to get a disk reservation going by page cache
> state, but then do the actual write and disk space accounting says oops,
> we did need one). And having to reason about what can happen without
> a locking mechanism for this is not something I care to spend brain
> cycles on.
> 
> That said, my patch is kind of ugly, and it requires filesystem changes
> for other filesystems to take advantage of it. And unfortunately, since
> one of the code paths that needs locking is readahead, I don't see any
> realistic way of implementing the locking within just bcachefs code.
> 
> So I'm hoping someone has an idea for something cleaner (I think I recall
> Matthew Wilcox saying he had an idea for how to use xarray to solve this),
> but if not I'll polish up my pagecache add lock patch and see what I can
> do to make it less ugly, and hopefully other people find it palatable
> or at least useful.

My idea with the XArray is that we have a number of reserved entries which
we can use as blocking entries.  I was originally planning on making this
an XArray feature, but I now believe it's a page-cache-special feature.
We can always revisit that decision if it turns out to be useful to
another user.

API:

int filemap_block_range(struct address_space *mapping, loff_t start,
		loff_t end);
void filemap_remove_block(struct address_space *mapping, loff_t start,
		loff_t end);

 - After removing a block, the pagecache is empty between [start, end].
 - You have to treat the block as a single entity; don't unblock only
   a subrange of the range you originally blocked.
 - Lookups of a page within a blocked range return NULL.
 - Attempts to add a page to a blocked range sleep on one of the
   page_wait_table queues.
 - Attempts to block a blocked range will also sleep on one of the
   page_wait_table queues.  Is this restriction acceptable for your use
   case?  It's clearly not a problem for fallocate insert/collapse.  It
   would only be a problem for Direct I/O if people are doing subpage
   directio from within the same page.  I think that's rare enough to
   not be a problem (but please tell me if I'm wrong!)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/10] mm: pagecache add lock
  2018-05-18 13:13   ` Matthew Wilcox
@ 2018-05-18 15:53     ` Christoph Hellwig
  2018-05-18 17:45       ` Kent Overstreet
  2018-05-20 22:45       ` [PATCH 01/10] mm: pagecache add lock Kent Overstreet
  0 siblings, 2 replies; 43+ messages in thread
From: Christoph Hellwig @ 2018-05-18 15:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kent Overstreet, linux-kernel, linux-fsdevel, Andrew Morton,
	Dave Chinner, darrick.wong, tytso, linux-btrfs, clm, jbacik,
	viro, peterz

On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > Historically, the only problematic case has been direct IO, and people
> > have been willing to say "well, if you mix buffered and direct IO you
> > get what you deserve", and that's probably not unreasonable. But now we
> > have fallocate insert range and collapse range, and those are broken in
> > ways I frankly don't want to think about if they can't ensure consistency
> > with the page cache.
> 
> ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> You may get pushback on the grounds that this ought to be a
> filesystem-specific lock rather than one embedded in the generic inode.

Honestly I think this probably should be in the core.  But IFF we move
it to the core the existing users of per-fs locks need to be moved
over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
that copied the approach, and probably more if you audit deep enough.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 02/10] mm: export find_get_pages()
  2018-05-18  7:49 ` [PATCH 02/10] mm: export find_get_pages() Kent Overstreet
@ 2018-05-18 16:00   ` Christoph Hellwig
  0 siblings, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2018-05-18 16:00 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy,
	peterz

On Fri, May 18, 2018 at 03:49:02AM -0400, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> ---
>  mm/filemap.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 31dd888785..78b99019bf 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1845,6 +1845,7 @@ unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
>  
>  	return ret;
>  }
> +EXPORT_SYMBOL(find_get_pages_range);

EXPORT_SYMBOL_GPL and only together with the actual submission of
a user of the interface.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 05/10] don't use spin_lock_irqsave() unnecessarily
  2018-05-18  7:49 ` [PATCH 05/10] don't use spin_lock_irqsave() unnecessarily Kent Overstreet
@ 2018-05-18 16:01   ` Christoph Hellwig
  0 siblings, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2018-05-18 16:01 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy,
	peterz

On Fri, May 18, 2018 at 03:49:08AM -0400, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

Looks generally fine.  A little changelog with an explanation of
how we obviously never could get here with irqs disabled would
be nice, though.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 06/10] Generic radix trees
  2018-05-18  7:49 ` [PATCH 06/10] Generic radix trees Kent Overstreet
@ 2018-05-18 16:02   ` Christoph Hellwig
  2018-05-18 17:38     ` Kent Overstreet
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2018-05-18 16:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy,
	peterz

Completely lacks any explanation, including why this should be
in lib/.  Also should come in the same series with actual users
of the infrastructure.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 08/10] bcache: move closures to lib/
  2018-05-18  7:49 ` [PATCH 08/10] bcache: move closures to lib/ Kent Overstreet
@ 2018-05-18 16:02   ` Christoph Hellwig
  0 siblings, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2018-05-18 16:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy,
	peterz

On Fri, May 18, 2018 at 03:49:13AM -0400, Kent Overstreet wrote:
> Prep work for bcachefs - being a fork of bcache it also uses closures

Hell no.  This code needs to go away and not actually be promoted to
lib/.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] Dynamic fault injection
  2018-05-18  7:49 ` [PATCH 10/10] Dynamic fault injection Kent Overstreet
@ 2018-05-18 16:02   ` Christoph Hellwig
  2018-05-18 17:37     ` Kent Overstreet
  2018-05-18 19:05   ` Andreas Dilger
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2018-05-18 16:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy,
	peterz

On Fri, May 18, 2018 at 03:49:18AM -0400, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

Completely lacks any explanation or argument why it would be useful.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] Dynamic fault injection
  2018-05-18 16:02   ` Christoph Hellwig
@ 2018-05-18 17:37     ` Kent Overstreet
  0 siblings, 0 replies; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18 17:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy,
	peterz

On Fri, May 18, 2018 at 09:02:45AM -0700, Christoph Hellwig wrote:
> On Fri, May 18, 2018 at 03:49:18AM -0400, Kent Overstreet wrote:
> > Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> 
> Completely lacks any explanation or argument why it would be useful.

It's in the cover letter...


 * Dynamic fault injection

 I've actually had this code sitting in my tree since forever... I know we have
 an existing fault injection framework, but I think this one is quite a bit
 nicer
 to actually use.

 It works very much like the dynamic debug infrastructure - for those who aren't
 familiar, dynamic debug makes it so you can list and individually
 enable/disable
 every pr_debug() callsite in debugfs.

 So to add a fault injection site with this, you just stick a call to
 dynamic_fault("foobar") somewhere in your code - dynamic_fault() returns true
 if
 you should fail whatever it is you're testing. And then it'll show up in
 debugfs, where you can enable/disable faults by file/linenumber, module, name,
 etc.

 The patch then also adds macros that wrap all the various memory allocation
 functions and fail if dynamic_fault("memory") returns true - which means you
 can
 see in debugfs every place you're allocating memory and fail all of them or
 just
 individually (I have tests that iterate over all the faults and flip them on
 one
 by one). I also use it in bcachefs to add fault injection points for uncommon
 error paths in the filesystem startup/recovery path, and for various hard to
 test slowpaths that only happen if we race in weird ways (race_fault()).

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 06/10] Generic radix trees
  2018-05-18 16:02   ` Christoph Hellwig
@ 2018-05-18 17:38     ` Kent Overstreet
  0 siblings, 0 replies; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18 17:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy,
	peterz

On Fri, May 18, 2018 at 09:02:03AM -0700, Christoph Hellwig wrote:
> Completely lacks any explanation, including why this should be
> in lib/.  Also should come in the same series with actual users
> of the infrastructure.

Also in the cover letter...

 * Generic radix trees

 This is a very simple radix tree implementation that can store types of
 arbitrary size, not just pointers/unsigned long. It could probably replace
 flex arrays.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/10] RFC: assorted bcachefs patches
  2018-05-18  7:48 [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
                   ` (10 preceding siblings ...)
  2018-05-18  7:55 ` [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
@ 2018-05-18 17:45 ` Josef Bacik
  2018-05-18 17:49   ` Kent Overstreet
  11 siblings, 1 reply; 43+ messages in thread
From: Josef Bacik @ 2018-05-18 17:45 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy,
	peterz

On Fri, May 18, 2018 at 03:48:58AM -0400, Kent Overstreet wrote:
> These are all the remaining patches in my bcachefs tree that touch stuff outside
> fs/bcachefs. Not all of them are suitable for inclusion as is, I wanted to get
> some discussion first.
> 
>  * pagecache add lock
> 
> This is the only one that touches existing code in nontrivial ways.  The problem
> it's solving is that there is no existing general mechanism for shooting down
> pages in the page and keeping them removed, which is a real problem if you're
> doing anything that modifies file data and isn't buffered writes.
> 
> Historically, the only problematic case has been direct IO, and people have been
> willing to say "well, if you mix buffered and direct IO you get what you
> deserve", and that's probably not unreasonable. But now we have fallocate insert
> range and collapse range, and those are broken in ways I frankly don't want to
> think about if they can't ensure consistency with the page cache.
> 
> Also, the mechanism truncate uses (i_size and sacrificing a goat) has
> historically been rather fragile, IMO it might be a good think if we switched it
> to a more general rigorous mechanism.
> 
> I need this solved for bcachefs because without this mechanism, the page cache
> inconsistencies lead to various assertions popping (primarily when we didn't
> think we need to get a disk reservation going by page cache state, but then do
> the actual write and disk space accounting says oops, we did need one). And
> having to reason about what can happen without a locking mechanism for this is
> not something I care to spend brain cycles on.
> 
> That said, my patch is kind of ugly, and it requires filesystem changes for
> other filesystems to take advantage of it. And unfortunately, since one of the
> code paths that needs locking is readahead, I don't see any realistic way of
> implementing the locking within just bcachefs code.
> 
> So I'm hoping someone has an idea for something cleaner (I think I recall
> Matthew Wilcox saying he had an idea for how to use xarray to solve this), but
> if not I'll polish up my pagecache add lock patch and see what I can do to make
> it less ugly, and hopefully other people find it palatable or at least useful.
> 
>  * lglocks
> 
> They were removed by Peter Zijlstra when the last in kernel user was removed,
> but I've found them useful. His commit message seems to imply he doesn't think
> people should be using them, but I'm not sure why. They are a bit niche though,
> I can move them to fs/bcachefs if people would prefer. 
> 
>  * Generic radix trees
> 
> This is a very simple radix tree implementation that can store types of
> arbitrary size, not just pointers/unsigned long. It could probably replace
> flex arrays.
> 
>  * Dynamic fault injection
> 

I've not looked at this at all so this may not cover your usecase, but I
implemeted a bpf_override_return() to do focused error injection a year ago.  I
have this script

https://github.com/josefbacik/debug-scripts/blob/master/inject-error.py

that does it generically, all you have to do is tag the function you want to be
error injectable with ALLOW_ERROR_INJECTION() and then you get all these nice
things like a debugfs interface to trigger them or use the above script to
trigger specific errors and such.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/10] mm: pagecache add lock
  2018-05-18 15:53     ` Christoph Hellwig
@ 2018-05-18 17:45       ` Kent Overstreet
  2018-05-23 23:55         ` Notes on locking for pagacache consistency (was: [PATCH 01/10] mm: pagecache add lock) Kent Overstreet
  2018-05-20 22:45       ` [PATCH 01/10] mm: pagecache add lock Kent Overstreet
  1 sibling, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18 17:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, linux-kernel, linux-fsdevel, Andrew Morton,
	Dave Chinner, darrick.wong, tytso, linux-btrfs, clm, jbacik,
	viro, peterz

On Fri, May 18, 2018 at 08:53:30AM -0700, Christoph Hellwig wrote:
> On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > > Historically, the only problematic case has been direct IO, and people
> > > have been willing to say "well, if you mix buffered and direct IO you
> > > get what you deserve", and that's probably not unreasonable. But now we
> > > have fallocate insert range and collapse range, and those are broken in
> > > ways I frankly don't want to think about if they can't ensure consistency
> > > with the page cache.
> > 
> > ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> > You may get pushback on the grounds that this ought to be a
> > filesystem-specific lock rather than one embedded in the generic inode.
> 
> Honestly I think this probably should be in the core.  But IFF we move
> it to the core the existing users of per-fs locks need to be moved
> over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
> that copied the approach, and probably more if you audit deep enough.

I didn't know about i_mmap_sem, thanks

But, using a rw semaphore wouldn't work for dio writes, and I do need dio writes
to block pagecache adds in bcachefs since the dio write could overwrite
uncompressed data or a reservation with compressed data, which means new writes
need a disk reservation.

Also I'm guessing ext4 takes the lock at the top of the read path? That sucks
for reads that are already cached, the generic buffered read path can do cached
reads without taking any per inode locks - that's why I pushed the lock down to
only be taken when we have to add a page to the pagecache.

Definitely less ugly doing it that way though...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/10] RFC: assorted bcachefs patches
  2018-05-18 17:45 ` Josef Bacik
@ 2018-05-18 17:49   ` Kent Overstreet
  2018-05-18 18:03     ` Josef Bacik
  0 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18 17:49 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy,
	peterz

On Fri, May 18, 2018 at 01:45:36PM -0400, Josef Bacik wrote:
> On Fri, May 18, 2018 at 03:48:58AM -0400, Kent Overstreet wrote:
> > These are all the remaining patches in my bcachefs tree that touch stuff outside
> > fs/bcachefs. Not all of them are suitable for inclusion as is, I wanted to get
> > some discussion first.
> > 
> >  * pagecache add lock
> > 
> > This is the only one that touches existing code in nontrivial ways.  The problem
> > it's solving is that there is no existing general mechanism for shooting down
> > pages in the page and keeping them removed, which is a real problem if you're
> > doing anything that modifies file data and isn't buffered writes.
> > 
> > Historically, the only problematic case has been direct IO, and people have been
> > willing to say "well, if you mix buffered and direct IO you get what you
> > deserve", and that's probably not unreasonable. But now we have fallocate insert
> > range and collapse range, and those are broken in ways I frankly don't want to
> > think about if they can't ensure consistency with the page cache.
> > 
> > Also, the mechanism truncate uses (i_size and sacrificing a goat) has
> > historically been rather fragile, IMO it might be a good think if we switched it
> > to a more general rigorous mechanism.
> > 
> > I need this solved for bcachefs because without this mechanism, the page cache
> > inconsistencies lead to various assertions popping (primarily when we didn't
> > think we need to get a disk reservation going by page cache state, but then do
> > the actual write and disk space accounting says oops, we did need one). And
> > having to reason about what can happen without a locking mechanism for this is
> > not something I care to spend brain cycles on.
> > 
> > That said, my patch is kind of ugly, and it requires filesystem changes for
> > other filesystems to take advantage of it. And unfortunately, since one of the
> > code paths that needs locking is readahead, I don't see any realistic way of
> > implementing the locking within just bcachefs code.
> > 
> > So I'm hoping someone has an idea for something cleaner (I think I recall
> > Matthew Wilcox saying he had an idea for how to use xarray to solve this), but
> > if not I'll polish up my pagecache add lock patch and see what I can do to make
> > it less ugly, and hopefully other people find it palatable or at least useful.
> > 
> >  * lglocks
> > 
> > They were removed by Peter Zijlstra when the last in kernel user was removed,
> > but I've found them useful. His commit message seems to imply he doesn't think
> > people should be using them, but I'm not sure why. They are a bit niche though,
> > I can move them to fs/bcachefs if people would prefer. 
> > 
> >  * Generic radix trees
> > 
> > This is a very simple radix tree implementation that can store types of
> > arbitrary size, not just pointers/unsigned long. It could probably replace
> > flex arrays.
> > 
> >  * Dynamic fault injection
> > 
> 
> I've not looked at this at all so this may not cover your usecase, but I
> implemeted a bpf_override_return() to do focused error injection a year ago.  I
> have this script
> 
> https://github.com/josefbacik/debug-scripts/blob/master/inject-error.py
> 
> that does it generically, all you have to do is tag the function you want to be
> error injectable with ALLOW_ERROR_INJECTION() and then you get all these nice
> things like a debugfs interface to trigger them or use the above script to
> trigger specific errors and such.  Thanks,

That sounds pretty cool...

What about being able to add a random fault injection point in the middle of an
existing function? Being able to stick race_fault() in random places was a
pretty big win in terms of getting good code coverage out of realistic tests.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/10] RFC: assorted bcachefs patches
  2018-05-18 17:49   ` Kent Overstreet
@ 2018-05-18 18:03     ` Josef Bacik
  2018-05-18 18:28       ` Kent Overstreet
  0 siblings, 1 reply; 43+ messages in thread
From: Josef Bacik @ 2018-05-18 18:03 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Josef Bacik, linux-kernel, linux-fsdevel, Andrew Morton,
	Dave Chinner, darrick.wong, tytso, linux-btrfs, clm, jbacik,
	viro, willy, peterz

On Fri, May 18, 2018 at 01:49:12PM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 01:45:36PM -0400, Josef Bacik wrote:
> > On Fri, May 18, 2018 at 03:48:58AM -0400, Kent Overstreet wrote:
> > > These are all the remaining patches in my bcachefs tree that touch stuff outside
> > > fs/bcachefs. Not all of them are suitable for inclusion as is, I wanted to get
> > > some discussion first.
> > > 
> > >  * pagecache add lock
> > > 
> > > This is the only one that touches existing code in nontrivial ways.  The problem
> > > it's solving is that there is no existing general mechanism for shooting down
> > > pages in the page and keeping them removed, which is a real problem if you're
> > > doing anything that modifies file data and isn't buffered writes.
> > > 
> > > Historically, the only problematic case has been direct IO, and people have been
> > > willing to say "well, if you mix buffered and direct IO you get what you
> > > deserve", and that's probably not unreasonable. But now we have fallocate insert
> > > range and collapse range, and those are broken in ways I frankly don't want to
> > > think about if they can't ensure consistency with the page cache.
> > > 
> > > Also, the mechanism truncate uses (i_size and sacrificing a goat) has
> > > historically been rather fragile, IMO it might be a good think if we switched it
> > > to a more general rigorous mechanism.
> > > 
> > > I need this solved for bcachefs because without this mechanism, the page cache
> > > inconsistencies lead to various assertions popping (primarily when we didn't
> > > think we need to get a disk reservation going by page cache state, but then do
> > > the actual write and disk space accounting says oops, we did need one). And
> > > having to reason about what can happen without a locking mechanism for this is
> > > not something I care to spend brain cycles on.
> > > 
> > > That said, my patch is kind of ugly, and it requires filesystem changes for
> > > other filesystems to take advantage of it. And unfortunately, since one of the
> > > code paths that needs locking is readahead, I don't see any realistic way of
> > > implementing the locking within just bcachefs code.
> > > 
> > > So I'm hoping someone has an idea for something cleaner (I think I recall
> > > Matthew Wilcox saying he had an idea for how to use xarray to solve this), but
> > > if not I'll polish up my pagecache add lock patch and see what I can do to make
> > > it less ugly, and hopefully other people find it palatable or at least useful.
> > > 
> > >  * lglocks
> > > 
> > > They were removed by Peter Zijlstra when the last in kernel user was removed,
> > > but I've found them useful. His commit message seems to imply he doesn't think
> > > people should be using them, but I'm not sure why. They are a bit niche though,
> > > I can move them to fs/bcachefs if people would prefer. 
> > > 
> > >  * Generic radix trees
> > > 
> > > This is a very simple radix tree implementation that can store types of
> > > arbitrary size, not just pointers/unsigned long. It could probably replace
> > > flex arrays.
> > > 
> > >  * Dynamic fault injection
> > > 
> > 
> > I've not looked at this at all so this may not cover your usecase, but I
> > implemeted a bpf_override_return() to do focused error injection a year ago.  I
> > have this script
> > 
> > https://github.com/josefbacik/debug-scripts/blob/master/inject-error.py
> > 
> > that does it generically, all you have to do is tag the function you want to be
> > error injectable with ALLOW_ERROR_INJECTION() and then you get all these nice
> > things like a debugfs interface to trigger them or use the above script to
> > trigger specific errors and such.  Thanks,
> 
> That sounds pretty cool...
> 
> What about being able to add a random fault injection point in the middle of an
> existing function? Being able to stick race_fault() in random places was a
> pretty big win in terms of getting good code coverage out of realistic tests.

There's nothing stopping us from doing that, it just uses a kprobe to override
the function with our helper, so we could conceivably put it anywhere in the
function.  The reason I limited it to individual functions was because it was
easier than trying to figure out the side-effects of stopping mid-function.  If
I needed to fail mid-function I just added a helper where I needed it and failed
that instead.  I imagine safety is going to be of larger concern if we allow bpf
scripts to randomly return anywhere inside a function, even if the function is
marked as allowing error injection.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/10] RFC: assorted bcachefs patches
  2018-05-18 18:03     ` Josef Bacik
@ 2018-05-18 18:28       ` Kent Overstreet
  0 siblings, 0 replies; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18 18:28 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy,
	peterz

On Fri, May 18, 2018 at 02:03:25PM -0400, Josef Bacik wrote:
> There's nothing stopping us from doing that, it just uses a kprobe to override
> the function with our helper, so we could conceivably put it anywhere in the
> function.  The reason I limited it to individual functions was because it was
> easier than trying to figure out the side-effects of stopping mid-function.  If
> I needed to fail mid-function I just added a helper where I needed it and failed
> that instead.  I imagine safety is going to be of larger concern if we allow bpf
> scripts to randomly return anywhere inside a function, even if the function is
> marked as allowing error injection.  Thanks,

Ahh no, that's not what I want... here's an example:

https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/btree_cache.c#n674

Here we've got to do this thing which can race - which is fine, we just need to
check for and handle the race, on line 709 - but actually exercising that with a
test is difficult since it requires a heavily multithreaded workload with btree
nodes getting evicted to see it happen, so - it pretends the race happened if
race_fault() returns true. The race_fault() invocation shows up in debugfs,
where userspace can tell it to fire.

the way it works is dynamic_fault() is a macro that expands to a static struct
dfault_descriptor, stuck in a particular linker section so the dynamic fault
code can find them and stick them in debugfs (which is also the way dynamic
debug works).

#define dynamic_fault(_class)						\
({									\
	static struct _dfault descriptor				\
	__used __aligned(8) __attribute__((section("__faults"))) = {	\
		.modname	= KBUILD_MODNAME,			\
		.function	= __func__,				\
		.filename	= __FILE__,				\
		.line		= __LINE__,				\
		.class		= _class,				\
	};								\
									\
	static_key_false(&descriptor.enabled) &&			\
		__dynamic_fault_enabled(&descriptor);			\
})

Honestly it still seems like the cleanest and safest way of doing it to me...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] Dynamic fault injection
  2018-05-18  7:49 ` [PATCH 10/10] Dynamic fault injection Kent Overstreet
  2018-05-18 16:02   ` Christoph Hellwig
@ 2018-05-18 19:05   ` Andreas Dilger
  2018-05-18 19:10     ` Kent Overstreet
  1 sibling, 1 reply; 43+ messages in thread
From: Andreas Dilger @ 2018-05-18 19:05 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy,
	peterz

[-- Attachment #1: Type: text/plain, Size: 27674 bytes --]

On May 18, 2018, at 1:49 AM, Kent Overstreet <kent.overstreet@gmail.com> wrote:
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

I agree with Christoph that even if there was some explanation in the cover
letter, there should be something at least as good in the patch itself.  The
cover letter is not saved, but the commit stays around forever, and should
explain how this should be added to code, and how to use it from userspace.


That said, I think this is a useful functionality.  We have something similar
in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
test a distributed filesystem, which is just a CPP macro with an unlikely()
branch, while this looks more sophisticated.  This looks like it has some
added functionality like having more than one fault enabled at a time.
If this lands we could likely switch our code over to using this.


Some things that are missing from this patch that is in our code:

- in addition to the basic "enabled" and "oneshot" mechanisms, we have:
  - timeout: sleep for N msec to simulate network/disk/locking delays
  - race: wait with one thread until a second thread hits matching check

We also have a "fail_val" that allows making the check conditional (e.g.
only operation on server "N" should fail, only RPC opcode "N", etc).

Cheers, Andreas

> ---
> include/asm-generic/vmlinux.lds.h |   4 +
> include/linux/dynamic_fault.h     | 117 +++++
> lib/Kconfig.debug                 |   5 +
> lib/Makefile                      |   2 +
> lib/dynamic_fault.c               | 760 ++++++++++++++++++++++++++++++
> 5 files changed, 888 insertions(+)
> create mode 100644 include/linux/dynamic_fault.h
> create mode 100644 lib/dynamic_fault.c
> 
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index 1ab0e520d6..a4c9dfcbbd 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -246,6 +246,10 @@
> 	VMLINUX_SYMBOL(__start___verbose) = .;                          \
> 	KEEP(*(__verbose))                                              \
> 	VMLINUX_SYMBOL(__stop___verbose) = .;				\
> +	. = ALIGN(8);							\
> +	VMLINUX_SYMBOL(__start___faults) = .;                           \
> +	*(__faults)                                                     \
> +	VMLINUX_SYMBOL(__stop___faults) = .;				\
> 	LIKELY_PROFILE()		       				\
> 	BRANCH_PROFILE()						\
> 	TRACE_PRINTKS()							\
> diff --git a/include/linux/dynamic_fault.h b/include/linux/dynamic_fault.h
> new file mode 100644
> index 0000000000..6e7bb56ae8
> --- /dev/null
> +++ b/include/linux/dynamic_fault.h
> @@ -0,0 +1,117 @@
> +#ifndef _DYNAMIC_FAULT_H
> +#define _DYNAMIC_FAULT_H
> +
> +#include <linux/bio.h>
> +#include <linux/jump_label.h>
> +#include <linux/slab.h>
> +
> +enum dfault_enabled {
> +	DFAULT_DISABLED,
> +	DFAULT_ENABLED,
> +	DFAULT_ONESHOT,
> +};
> +
> +union dfault_state {
> +	struct {
> +		unsigned		enabled:2;
> +		unsigned		count:30;
> +	};
> +
> +	struct {
> +		unsigned		v;
> +	};
> +};
> +
> +/*
> + * An instance of this structure is created in a special
> + * ELF section at every dynamic fault callsite.  At runtime,
> + * the special section is treated as an array of these.
> + */
> +struct _dfault {
> +	const char		*modname;
> +	const char		*function;
> +	const char		*filename;
> +	const char		*class;
> +
> +	const u16		line;
> +
> +	unsigned		frequency;
> +	union dfault_state	state;
> +
> +	struct static_key	enabled;
> +} __aligned(8);
> +
> +
> +#ifdef CONFIG_DYNAMIC_FAULT
> +
> +int dfault_add_module(struct _dfault *tab, unsigned int n, const char *mod);
> +int dfault_remove_module(char *mod_name);
> +bool __dynamic_fault_enabled(struct _dfault *);
> +
> +#define dynamic_fault(_class)						\
> +({									\
> +	static struct _dfault descriptor				\
> +	__used __aligned(8) __attribute__((section("__faults"))) = {	\
> +		.modname	= KBUILD_MODNAME,			\
> +		.function	= __func__,				\
> +		.filename	= __FILE__,				\
> +		.line		= __LINE__,				\
> +		.class		= _class,				\
> +	};								\
> +									\
> +	static_key_false(&descriptor.enabled) &&			\
> +		__dynamic_fault_enabled(&descriptor);			\
> +})
> +
> +#define memory_fault()		dynamic_fault("memory")
> +#define race_fault()		dynamic_fault("race")
> +
> +#define kmalloc(...)							\
> +	(memory_fault() ? NULL	: kmalloc(__VA_ARGS__))
> +#define kzalloc(...)							\
> +	(memory_fault() ? NULL	: kzalloc(__VA_ARGS__))
> +#define krealloc(...)							\
> +	(memory_fault() ? NULL	: krealloc(__VA_ARGS__))
> +
> +#define mempool_alloc(pool, gfp_mask)					\
> +	((!gfpflags_allow_blocking(gfp_mask) && memory_fault())		\
> +		? NULL : mempool_alloc(pool, gfp_mask))
> +
> +#define __get_free_pages(...)						\
> +	(memory_fault() ? 0	: __get_free_pages(__VA_ARGS__))
> +#define alloc_pages_node(...)						\
> +	(memory_fault() ? NULL	: alloc_pages_node(__VA_ARGS__))
> +#define alloc_pages_nodemask(...)					\
> +	(memory_fault() ? NULL	: alloc_pages_nodemask(__VA_ARGS__))
> +
> +#define bio_alloc_bioset(gfp_mask, ...)					\
> +	((!gfpflags_allow_blocking(gfp_mask) && memory_fault())		\
> +	 ? NULL	: bio_alloc_bioset(gfp_mask, __VA_ARGS__))
> +
> +#define bio_clone(bio, gfp_mask)					\
> +	((!gfpflags_allow_blocking(gfp_mask) && memory_fault())		\
> +	 ? NULL	: bio_clone(bio, gfp_mask))
> +
> +#define bio_clone_bioset(bio, gfp_mask, bs)				\
> +	((!gfpflags_allow_blocking(gfp_mask) && memory_fault())		\
> +	 ? NULL	: bio_clone_bioset(bio, gfp_mask, bs))
> +
> +#define bio_kmalloc(...)						\
> +	(memory_fault() ? NULL		: bio_kmalloc(__VA_ARGS__))
> +#define bio_clone_kmalloc(...)						\
> +	(memory_fault() ? NULL		: bio_clone_kmalloc(__VA_ARGS__))
> +
> +#define bio_iov_iter_get_pages(...)					\
> +	(memory_fault() ? -ENOMEM	: bio_iov_iter_get_pages(__VA_ARGS__))
> +
> +#else /* CONFIG_DYNAMIC_FAULT */
> +
> +#define dfault_add_module(tab, n, modname)	0
> +#define dfault_remove_module(mod)		0
> +#define dynamic_fault(_class)			0
> +#define memory_fault()				0
> +#define race_fault()				0
> +
> +#endif /* CONFIG_DYNAMIC_FAULT */
> +
> +#endif
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 427292a759..7b7ca0d813 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1599,6 +1599,11 @@ config LATENCYTOP
> 	  Enable this option if you want to use the LatencyTOP tool
> 	  to find out which userspace is blocking on what kernel operations.
> 
> +config DYNAMIC_FAULT
> +	bool "Enable dynamic fault support"
> +	default n
> +	depends on DEBUG_FS
> +
> source kernel/trace/Kconfig
> 
> config PROVIDE_OHCI1394_DMA_INIT
> diff --git a/lib/Makefile b/lib/Makefile
> index 66d2231d70..f6f70f4771 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -158,6 +158,8 @@ obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
> 
> obj-$(CONFIG_DYNAMIC_DEBUG) += dynamic_debug.o
> 
> +obj-$(CONFIG_DYNAMIC_FAULT) += dynamic_fault.o
> +
> obj-$(CONFIG_NLATTR) += nlattr.o
> 
> obj-$(CONFIG_LRU_CACHE) += lru_cache.o
> diff --git a/lib/dynamic_fault.c b/lib/dynamic_fault.c
> new file mode 100644
> index 0000000000..75fc9a1b4b
> --- /dev/null
> +++ b/lib/dynamic_fault.c
> @@ -0,0 +1,760 @@
> +/*
> + * lib/dynamic_fault.c
> + *
> + * make dynamic_fault() calls runtime configurable based upon their
> + * source module.
> + *
> + * Copyright (C) 2011 Adam Berkan <aberkan@google.com>
> + * Based on dynamic_debug.c:
> + * Copyright (C) 2008 Jason Baron <jbaron@redhat.com>
> + * By Greg Banks <gnb@melbourne.sgi.com>
> + * Copyright (c) 2008 Silicon Graphics Inc.  All Rights Reserved.
> + *
> + */
> +
> +#define pr_fmt(fmt) "dfault: " fmt "\n"
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/moduleparam.h>
> +#include <linux/kallsyms.h>
> +#include <linux/version.h>
> +#include <linux/types.h>
> +#include <linux/mutex.h>
> +#include <linux/proc_fs.h>
> +#include <linux/seq_file.h>
> +#include <linux/list.h>
> +#include <linux/sysctl.h>
> +#include <linux/ctype.h>
> +#include <linux/string.h>
> +#include <linux/uaccess.h>
> +#include <linux/dynamic_fault.h>
> +#include <linux/debugfs.h>
> +#include <linux/slab.h>
> +
> +#undef kzalloc
> +
> +extern struct _dfault __start___faults[];
> +extern struct _dfault __stop___faults[];
> +
> +struct dfault_table {
> +	struct list_head link;
> +	char *mod_name;
> +	unsigned int num_dfaults;
> +	struct _dfault *dfaults;
> +};
> +
> +struct dfault_query {
> +	const char	*filename;
> +	const char	*module;
> +	const char	*function;
> +	const char	*class;
> +	unsigned int	first_line, last_line;
> +	unsigned int	first_index, last_index;
> +
> +	unsigned	match_line:1;
> +	unsigned	match_index:1;
> +
> +	unsigned	set_enabled:1;
> +	unsigned	enabled:2;
> +
> +	unsigned	set_frequency:1;
> +	unsigned	frequency;
> +};
> +
> +struct dfault_iter {
> +	struct dfault_table *table;
> +	unsigned int idx;
> +};
> +
> +static DEFINE_MUTEX(dfault_lock);
> +static LIST_HEAD(dfault_tables);
> +
> +bool __dynamic_fault_enabled(struct _dfault *df)
> +{
> +	union dfault_state old, new;
> +	unsigned v = df->state.v;
> +	bool ret;
> +
> +	do {
> +		old.v = new.v = v;
> +
> +		if (new.enabled == DFAULT_DISABLED)
> +			return false;
> +
> +		ret = df->frequency
> +			? ++new.count >= df->frequency
> +			: true;
> +		if (ret)
> +			new.count = 0;
> +		if (ret && new.enabled == DFAULT_ONESHOT)
> +			new.enabled = DFAULT_DISABLED;
> +	} while ((v = cmpxchg(&df->state.v, old.v, new.v)) != old.v);
> +
> +	if (ret)
> +		pr_debug("returned true for %s:%u", df->filename, df->line);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(__dynamic_fault_enabled);
> +
> +/* Return the last part of a pathname */
> +static inline const char *basename(const char *path)
> +{
> +	const char *tail = strrchr(path, '/');
> +
> +	return tail ? tail + 1 : path;
> +}
> +
> +/* format a string into buf[] which describes the _dfault's flags */
> +static char *dfault_describe_flags(struct _dfault *df, char *buf, size_t buflen)
> +{
> +	switch (df->state.enabled) {
> +	case DFAULT_DISABLED:
> +		strlcpy(buf, "disabled", buflen);
> +		break;
> +	case DFAULT_ENABLED:
> +		strlcpy(buf, "enabled", buflen);
> +		break;
> +	case DFAULT_ONESHOT:
> +		strlcpy(buf, "oneshot", buflen);
> +		break;
> +	default:
> +		BUG();
> +	}
> +
> +	return buf;
> +}
> +
> +/*
> + * must be called with dfault_lock held
> + */
> +
> +/*
> + * Search the tables for _dfault's which match the given
> + * `query' and apply the `flags' and `mask' to them.  Tells
> + * the user which dfault's were changed, or whether none
> + * were matched.
> + */
> +static int dfault_change(const struct dfault_query *query)
> +{
> +	struct dfault_table *dt;
> +	unsigned int nfound = 0;
> +	unsigned i, index = 0;
> +	char flagbuf[16];
> +
> +	/* search for matching dfaults */
> +	mutex_lock(&dfault_lock);
> +	list_for_each_entry(dt, &dfault_tables, link) {
> +
> +		/* match against the module name */
> +		if (query->module != NULL &&
> +		    strcmp(query->module, dt->mod_name))
> +			continue;
> +
> +		for (i = 0 ; i < dt->num_dfaults ; i++) {
> +			struct _dfault *df = &dt->dfaults[i];
> +
> +			/* match against the source filename */
> +			if (query->filename != NULL &&
> +			    strcmp(query->filename, df->filename) &&
> +			    strcmp(query->filename, basename(df->filename)))
> +				continue;
> +
> +			/* match against the function */
> +			if (query->function != NULL &&
> +			    strcmp(query->function, df->function))
> +				continue;
> +
> +			/* match against the class */
> +			if (query->class) {
> +				size_t len = strlen(query->class);
> +
> +				if (strncmp(query->class, df->class, len))
> +					continue;
> +
> +				if (df->class[len] && df->class[len] != ':')
> +					continue;
> +			}
> +
> +			/* match against the line number range */
> +			if (query->match_line &&
> +			    (df->line < query->first_line ||
> +			     df->line > query->last_line))
> +				continue;
> +
> +			/* match against the fault index */
> +			if (query->match_index &&
> +			    (index < query->first_index ||
> +			     index > query->last_index)) {
> +				index++;
> +				continue;
> +			}
> +
> +			if (query->set_enabled &&
> +			    query->enabled != df->state.enabled) {
> +				if (query->enabled != DFAULT_DISABLED)
> +					static_key_slow_inc(&df->enabled);
> +				else if (df->state.enabled != DFAULT_DISABLED)
> +					static_key_slow_dec(&df->enabled);
> +
> +				df->state.enabled = query->enabled;
> +			}
> +
> +			if (query->set_frequency)
> +				df->frequency = query->frequency;
> +
> +			pr_debug("changed %s:%d [%s]%s #%d %s",
> +				 df->filename, df->line, dt->mod_name,
> +				 df->function, index,
> +				 dfault_describe_flags(df, flagbuf,
> +						       sizeof(flagbuf)));
> +
> +			index++;
> +			nfound++;
> +		}
> +	}
> +	mutex_unlock(&dfault_lock);
> +
> +	pr_debug("dfault: %u matches", nfound);
> +
> +	return nfound ? 0 : -ENOENT;
> +}
> +
> +/*
> + * Split the buffer `buf' into space-separated words.
> + * Handles simple " and ' quoting, i.e. without nested,
> + * embedded or escaped \".  Return the number of words
> + * or <0 on error.
> + */
> +static int dfault_tokenize(char *buf, char *words[], int maxwords)
> +{
> +	int nwords = 0;
> +
> +	while (*buf) {
> +		char *end;
> +
> +		/* Skip leading whitespace */
> +		buf = skip_spaces(buf);
> +		if (!*buf)
> +			break;	/* oh, it was trailing whitespace */
> +
> +		/* Run `end' over a word, either whitespace separated or quoted
> +		 */
> +		if (*buf == '"' || *buf == '\'') {
> +			int quote = *buf++;
> +
> +			for (end = buf ; *end && *end != quote ; end++)
> +				;
> +			if (!*end)
> +				return -EINVAL;	/* unclosed quote */
> +		} else {
> +			for (end = buf ; *end && !isspace(*end) ; end++)
> +				;
> +			BUG_ON(end == buf);
> +		}
> +		/* Here `buf' is the start of the word, `end' is one past the
> +		 * end
> +		 */
> +
> +		if (nwords == maxwords)
> +			return -EINVAL;	/* ran out of words[] before bytes */
> +		if (*end)
> +			*end++ = '\0';	/* terminate the word */
> +		words[nwords++] = buf;
> +		buf = end;
> +	}
> +
> +	return nwords;
> +}
> +
> +/*
> + * Parse a range.
> + */
> +static inline int parse_range(char *str,
> +			      unsigned int *first,
> +			      unsigned int *last)
> +{
> +	char *first_str = str;
> +	char *last_str = strchr(first_str, '-');
> +
> +	if (last_str)
> +		*last_str++ = '\0';
> +
> +	if (kstrtouint(first_str, 10, first))
> +		return -EINVAL;
> +
> +	if (!last_str)
> +		*last = *first;
> +	else if (kstrtouint(last_str, 10, last))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +enum dfault_token {
> +	TOK_INVALID,
> +
> +	/* Queries */
> +	TOK_FUNC,
> +	TOK_FILE,
> +	TOK_LINE,
> +	TOK_MODULE,
> +	TOK_CLASS,
> +	TOK_INDEX,
> +
> +	/* Commands */
> +	TOK_DISABLE,
> +	TOK_ENABLE,
> +	TOK_ONESHOT,
> +	TOK_FREQUENCY,
> +};
> +
> +static const struct {
> +	const char		*str;
> +	enum dfault_token	tok;
> +	unsigned		args_required;
> +} dfault_token_strs[] = {
> +	{ "func",	TOK_FUNC,	1,	},
> +	{ "file",	TOK_FILE,	1,	},
> +	{ "line",	TOK_LINE,	1,	},
> +	{ "module",	TOK_MODULE,	1,	},
> +	{ "class",	TOK_CLASS,	1,	},
> +	{ "index",	TOK_INDEX,	1,	},
> +	{ "disable",	TOK_DISABLE,	0,	},
> +	{ "enable",	TOK_ENABLE,	0,	},
> +	{ "oneshot",	TOK_ONESHOT,	0,	},
> +	{ "frequency",	TOK_FREQUENCY,	1,	},
> +};
> +
> +static enum dfault_token str_to_token(const char *word, unsigned nr_words)
> +{
> +	unsigned i;
> +
> +	for (i = 0; i < ARRAY_SIZE(dfault_token_strs); i++)
> +		if (!strcmp(word, dfault_token_strs[i].str)) {
> +			if (nr_words < dfault_token_strs[i].args_required) {
> +				pr_debug("insufficient arguments to \"%s\"",
> +					 word);
> +				return TOK_INVALID;
> +			}
> +
> +			return dfault_token_strs[i].tok;
> +		}
> +
> +	pr_debug("unknown keyword \"%s\"", word);
> +
> +	return TOK_INVALID;
> +}
> +
> +static int dfault_parse_command(struct dfault_query *query,
> +				enum dfault_token tok,
> +				char *words[], size_t nr_words)
> +{
> +	unsigned i = 0;
> +	int ret;
> +
> +	switch (tok) {
> +	case TOK_INVALID:
> +		return -EINVAL;
> +	case TOK_FUNC:
> +		query->function = words[i++];
> +	case TOK_FILE:
> +		query->filename = words[i++];
> +		return 1;
> +	case TOK_LINE:
> +		ret = parse_range(words[i++],
> +				  &query->first_line,
> +				  &query->last_line);
> +		if (ret)
> +			return ret;
> +		query->match_line = true;
> +		break;
> +	case TOK_MODULE:
> +		query->module = words[i++];
> +		break;
> +	case TOK_CLASS:
> +		query->class = words[i++];
> +		break;
> +	case TOK_INDEX:
> +		ret = parse_range(words[i++],
> +				  &query->first_index,
> +				  &query->last_index);
> +		if (ret)
> +			return ret;
> +		query->match_index = true;
> +		break;
> +	case TOK_DISABLE:
> +		query->set_enabled = true;
> +		query->enabled = DFAULT_DISABLED;
> +		break;
> +	case TOK_ENABLE:
> +		query->set_enabled = true;
> +		query->enabled = DFAULT_ENABLED;
> +		break;
> +	case TOK_ONESHOT:
> +		query->set_enabled = true;
> +		query->enabled = DFAULT_ONESHOT;
> +		break;
> +	case TOK_FREQUENCY:
> +		query->set_frequency = 1;
> +		ret = kstrtouint(words[i++], 10, &query->frequency);
> +		if (ret)
> +			return ret;
> +
> +		if (!query->set_enabled) {
> +			query->set_enabled = 1;
> +			query->enabled = DFAULT_ENABLED;
> +		}
> +		break;
> +	}
> +
> +	return i;
> +}
> +
> +/*
> + * Parse words[] as a dfault query specification, which is a series
> + * of (keyword, value) pairs chosen from these possibilities:
> + *
> + * func <function-name>
> + * file <full-pathname>
> + * file <base-filename>
> + * module <module-name>
> + * line <lineno>
> + * line <first-lineno>-<last-lineno> // where either may be empty
> + * index <m>-<n>                     // dynamic faults numbered from <m>
> + *                                   // to <n> inside each matching function
> + */
> +static int dfault_parse_query(struct dfault_query *query,
> +			      char *words[], size_t nr_words)
> +{
> +	unsigned i = 0;
> +
> +	while (i < nr_words) {
> +		const char *tok_str = words[i++];
> +		enum dfault_token tok = str_to_token(tok_str, nr_words - i);
> +		int ret = dfault_parse_command(query, tok, words + i,
> +					       nr_words - i);
> +
> +		if (ret < 0)
> +			return ret;
> +		i += ret;
> +		BUG_ON(i > nr_words);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * File_ops->write method for <debugfs>/dynamic_fault/conrol.  Gathers the
> + * command text from userspace, parses and executes it.
> + */
> +static ssize_t dfault_proc_write(struct file *file, const char __user *ubuf,
> +				  size_t len, loff_t *offp)
> +{
> +	struct dfault_query query;
> +#define MAXWORDS 9
> +	int nwords;
> +	char *words[MAXWORDS];
> +	char tmpbuf[256];
> +	int ret;
> +
> +	memset(&query, 0, sizeof(query));
> +
> +	if (len == 0)
> +		return 0;
> +	/* we don't check *offp -- multiple writes() are allowed */
> +	if (len > sizeof(tmpbuf)-1)
> +		return -E2BIG;
> +	if (copy_from_user(tmpbuf, ubuf, len))
> +		return -EFAULT;
> +	tmpbuf[len] = '\0';
> +
> +	pr_debug("read %zu bytes from userspace", len);
> +
> +	nwords = dfault_tokenize(tmpbuf, words, MAXWORDS);
> +	if (nwords < 0)
> +		return -EINVAL;
> +	if (dfault_parse_query(&query, words, nwords))
> +		return -EINVAL;
> +
> +	/* actually go and implement the change */
> +	ret = dfault_change(&query);
> +	if (ret < 0)
> +		return ret;
> +
> +	*offp += len;
> +	return len;
> +}
> +
> +/* Control file read code */
> +
> +/*
> + * Set the iterator to point to the first _dfault object
> + * and return a pointer to that first object.  Returns
> + * NULL if there are no _dfaults at all.
> + */
> +static struct _dfault *dfault_iter_first(struct dfault_iter *iter)
> +{
> +	if (list_empty(&dfault_tables)) {
> +		iter->table = NULL;
> +		iter->idx = 0;
> +		return NULL;
> +	}
> +	iter->table = list_entry(dfault_tables.next,
> +				 struct dfault_table, link);
> +	iter->idx = 0;
> +	return &iter->table->dfaults[iter->idx];
> +}
> +
> +/*
> + * Advance the iterator to point to the next _dfault
> + * object from the one the iterator currently points at,
> + * and returns a pointer to the new _dfault.  Returns
> + * NULL if the iterator has seen all the _dfaults.
> + */
> +static struct _dfault *dfault_iter_next(struct dfault_iter *iter)
> +{
> +	if (iter->table == NULL)
> +		return NULL;
> +	if (++iter->idx == iter->table->num_dfaults) {
> +		/* iterate to next table */
> +		iter->idx = 0;
> +		if (list_is_last(&iter->table->link, &dfault_tables)) {
> +			iter->table = NULL;
> +			return NULL;
> +		}
> +		iter->table = list_entry(iter->table->link.next,
> +					 struct dfault_table, link);
> +	}
> +	return &iter->table->dfaults[iter->idx];
> +}
> +
> +/*
> + * Seq_ops start method.  Called at the start of every
> + * read() call from userspace.  Takes the dfault_lock and
> + * seeks the seq_file's iterator to the given position.
> + */
> +static void *dfault_proc_start(struct seq_file *m, loff_t *pos)
> +{
> +	struct dfault_iter *iter = m->private;
> +	struct _dfault *dp;
> +	int n = *pos;
> +
> +	mutex_lock(&dfault_lock);
> +
> +	if (n < 0)
> +		return NULL;
> +	dp = dfault_iter_first(iter);
> +	while (dp != NULL && --n >= 0)
> +		dp = dfault_iter_next(iter);
> +	return dp;
> +}
> +
> +/*
> + * Seq_ops next method.  Called several times within a read()
> + * call from userspace, with dfault_lock held.  Walks to the
> + * next _dfault object with a special case for the header line.
> + */
> +static void *dfault_proc_next(struct seq_file *m, void *p, loff_t *pos)
> +{
> +	struct dfault_iter *iter = m->private;
> +	struct _dfault *dp;
> +
> +	if (p == SEQ_START_TOKEN)
> +		dp = dfault_iter_first(iter);
> +	else
> +		dp = dfault_iter_next(iter);
> +	++*pos;
> +	return dp;
> +}
> +
> +/*
> + * Seq_ops show method.  Called several times within a read()
> + * call from userspace, with dfault_lock held.  Formats the
> + * current _dfault as a single human-readable line, with a
> + * special case for the header line.
> + */
> +static int dfault_proc_show(struct seq_file *m, void *p)
> +{
> +	struct dfault_iter *iter = m->private;
> +	struct _dfault *df = p;
> +	char flagsbuf[8];
> +
> +	seq_printf(m, "%s:%u class:%s module:%s func:%s %s \"\"\n",
> +		   df->filename, df->line, df->class,
> +		   iter->table->mod_name, df->function,
> +		   dfault_describe_flags(df, flagsbuf, sizeof(flagsbuf)));
> +
> +	return 0;
> +}
> +
> +/*
> + * Seq_ops stop method.  Called at the end of each read()
> + * call from userspace.  Drops dfault_lock.
> + */
> +static void dfault_proc_stop(struct seq_file *m, void *p)
> +{
> +	mutex_unlock(&dfault_lock);
> +}
> +
> +static const struct seq_operations dfault_proc_seqops = {
> +	.start = dfault_proc_start,
> +	.next = dfault_proc_next,
> +	.show = dfault_proc_show,
> +	.stop = dfault_proc_stop
> +};
> +
> +/*
> + * File_ops->open method for <debugfs>/dynamic_fault/control.  Does the seq_file
> + * setup dance, and also creates an iterator to walk the _dfaults.
> + * Note that we create a seq_file always, even for O_WRONLY files
> + * where it's not needed, as doing so simplifies the ->release method.
> + */
> +static int dfault_proc_open(struct inode *inode, struct file *file)
> +{
> +	struct dfault_iter *iter;
> +	int err;
> +
> +	iter = kzalloc(sizeof(*iter), GFP_KERNEL);
> +	if (iter == NULL)
> +		return -ENOMEM;
> +
> +	err = seq_open(file, &dfault_proc_seqops);
> +	if (err) {
> +		kfree(iter);
> +		return err;
> +	}
> +	((struct seq_file *) file->private_data)->private = iter;
> +	return 0;
> +}
> +
> +static const struct file_operations dfault_proc_fops = {
> +	.owner = THIS_MODULE,
> +	.open = dfault_proc_open,
> +	.read = seq_read,
> +	.llseek = seq_lseek,
> +	.release = seq_release_private,
> +	.write = dfault_proc_write
> +};
> +
> +/*
> + * Allocate a new dfault_table for the given module
> + * and add it to the global list.
> + */
> +int dfault_add_module(struct _dfault *tab, unsigned int n,
> +		      const char *name)
> +{
> +	struct dfault_table *dt;
> +	char *new_name;
> +	const char *func = NULL;
> +	int i;
> +
> +	dt = kzalloc(sizeof(*dt), GFP_KERNEL);
> +	if (dt == NULL)
> +		return -ENOMEM;
> +	new_name = kstrdup(name, GFP_KERNEL);
> +	if (new_name == NULL) {
> +		kfree(dt);
> +		return -ENOMEM;
> +	}
> +	dt->mod_name = new_name;
> +	dt->num_dfaults = n;
> +	dt->dfaults = tab;
> +
> +	mutex_lock(&dfault_lock);
> +	list_add_tail(&dt->link, &dfault_tables);
> +	mutex_unlock(&dfault_lock);
> +
> +	/* __attribute__(("section")) emits things in reverse order */
> +	for (i = n - 1; i >= 0; i--)
> +		if (!func || strcmp(tab[i].function, func))
> +			func = tab[i].function;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(dfault_add_module);
> +
> +static void dfault_table_free(struct dfault_table *dt)
> +{
> +	list_del_init(&dt->link);
> +	kfree(dt->mod_name);
> +	kfree(dt);
> +}
> +
> +/*
> + * Called in response to a module being unloaded.  Removes
> + * any dfault_table's which point at the module.
> + */
> +int dfault_remove_module(char *mod_name)
> +{
> +	struct dfault_table *dt, *nextdt;
> +	int ret = -ENOENT;
> +
> +	mutex_lock(&dfault_lock);
> +	list_for_each_entry_safe(dt, nextdt, &dfault_tables, link) {
> +		if (!strcmp(dt->mod_name, mod_name)) {
> +			dfault_table_free(dt);
> +			ret = 0;
> +		}
> +	}
> +	mutex_unlock(&dfault_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(dfault_remove_module);
> +
> +static void dfault_remove_all_tables(void)
> +{
> +	mutex_lock(&dfault_lock);
> +	while (!list_empty(&dfault_tables)) {
> +		struct dfault_table *dt = list_entry(dfault_tables.next,
> +						      struct dfault_table,
> +						      link);
> +		dfault_table_free(dt);
> +	}
> +	mutex_unlock(&dfault_lock);
> +}
> +
> +static int __init dynamic_fault_init(void)
> +{
> +	struct dentry *dir, *file;
> +	struct _dfault *iter, *iter_start;
> +	const char *modname = NULL;
> +	int ret = 0;
> +	int n = 0;
> +
> +	dir = debugfs_create_dir("dynamic_fault", NULL);
> +	if (!dir)
> +		return -ENOMEM;
> +	file = debugfs_create_file("control", 0644, dir, NULL,
> +					&dfault_proc_fops);
> +	if (!file) {
> +		debugfs_remove(dir);
> +		return -ENOMEM;
> +	}
> +	if (__start___faults != __stop___faults) {
> +		iter = __start___faults;
> +		modname = iter->modname;
> +		iter_start = iter;
> +		for (; iter < __stop___faults; iter++) {
> +			if (strcmp(modname, iter->modname)) {
> +				ret = dfault_add_module(iter_start, n, modname);
> +				if (ret)
> +					goto out_free;
> +				n = 0;
> +				modname = iter->modname;
> +				iter_start = iter;
> +			}
> +			n++;
> +		}
> +		ret = dfault_add_module(iter_start, n, modname);
> +	}
> +out_free:
> +	if (ret) {
> +		dfault_remove_all_tables();
> +		debugfs_remove(dir);
> +		debugfs_remove(file);
> +	}
> +	return 0;
> +}
> +module_init(dynamic_fault_init);
> --
> 2.17.0
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] Dynamic fault injection
  2018-05-18 19:05   ` Andreas Dilger
@ 2018-05-18 19:10     ` Kent Overstreet
  2018-05-18 20:54       ` Andreas Dilger
  0 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2018-05-18 19:10 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Dave Chinner,
	darrick.wong, tytso, linux-btrfs, clm, jbacik, viro, willy,
	peterz

On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
> On May 18, 2018, at 1:49 AM, Kent Overstreet <kent.overstreet@gmail.com> wrote:
> > 
> > Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> 
> I agree with Christoph that even if there was some explanation in the cover
> letter, there should be something at least as good in the patch itself.  The
> cover letter is not saved, but the commit stays around forever, and should
> explain how this should be added to code, and how to use it from userspace.
> 
> 
> That said, I think this is a useful functionality.  We have something similar
> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
> test a distributed filesystem, which is just a CPP macro with an unlikely()
> branch, while this looks more sophisticated.  This looks like it has some
> added functionality like having more than one fault enabled at a time.
> If this lands we could likely switch our code over to using this.

This is pretty much what I was looking for, I just wanted to know if this patch
was interesting enough to anyone that I should spend more time on it or just
drop it :) Agreed on documentation. I think it's also worth factoring out the
functionality for the elf section trick that dynamic debug uses too.

> Some things that are missing from this patch that is in our code:
> 
> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
>   - timeout: sleep for N msec to simulate network/disk/locking delays
>   - race: wait with one thread until a second thread hits matching check
> 
> We also have a "fail_val" that allows making the check conditional (e.g.
> only operation on server "N" should fail, only RPC opcode "N", etc).

Those all sound like good ideas... fail_val especially, I think with that we'd
have all the functionality the existing fault injection framework has (which is
way to heavyweight to actually get used, imo)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] Dynamic fault injection
  2018-05-18 19:10     ` Kent Overstreet
@ 2018-05-18 20:54       ` Andreas Dilger
  0 siblings, 0 replies; 43+ messages in thread
From: Andreas Dilger @ 2018-05-18 20:54 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: LKML, fsdevel, Andrew Morton, Dave Chinner, darrick.wong, tytso,
	linux-btrfs, clm, jbacik, viro, willy, peterz

[-- Attachment #1: Type: text/plain, Size: 2908 bytes --]

On May 18, 2018, at 1:10 PM, Kent Overstreet <kent.overstreet@gmail.com> wrote:
> 
> On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
>> On May 18, 2018, at 1:49 AM, Kent Overstreet <kent.overstreet@gmail.com> wrote:
>>> 
>>> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
>> 
>> I agree with Christoph that even if there was some explanation in the cover
>> letter, there should be something at least as good in the patch itself.  The
>> cover letter is not saved, but the commit stays around forever, and should
>> explain how this should be added to code, and how to use it from userspace.
>> 
>> 
>> That said, I think this is a useful functionality.  We have something similar
>> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
>> test a distributed filesystem, which is just a CPP macro with an unlikely()
>> branch, while this looks more sophisticated.  This looks like it has some
>> added functionality like having more than one fault enabled at a time.
>> If this lands we could likely switch our code over to using this.
> 
> This is pretty much what I was looking for, I just wanted to know if this
> patch was interesting enough to anyone that I should spend more time on it
> or just drop it :) Agreed on documentation. I think it's also worth
> factoring out the functionality for the elf section trick that dynamic
> debug uses too.
> 
>> Some things that are missing from this patch that is in our code:
>> 
>> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
>>  - timeout: sleep for N msec to simulate network/disk/locking delays
>>  - race: wait with one thread until a second thread hits matching check
>> 
>> We also have a "fail_val" that allows making the check conditional (e.g.
>> only operation on server "N" should fail, only RPC opcode "N", etc).
> 
> Those all sound like good ideas... fail_val especially, I think with that
> we'd have all the functionality the existing fault injection framework has
> (which is way too heavyweight to actually get used, imo)

The other thing that we have that is slightly orthogonal to your modes,
which is possible because we just have a __u32 for the fault location,
is that the "oneshot" mode is just a mask added to the fault location
together with "fail_val" is that we can add other masks "fail N times",
"fail randomly 1/N times", or "pass N times before failure".  The other
mask is set in the kernel when the fault was actually hit, so that test
scripts can poll until that happens, and then continue running.

The "fail randomly 1/N times" was useful for detecting memory allocation
failure handling under load, but that has been superseded by the same
functionality in kmalloc(), and it sounds like your fault injection can
do this deterministically for every allocation?

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/10] mm: pagecache add lock
  2018-05-18 15:53     ` Christoph Hellwig
  2018-05-18 17:45       ` Kent Overstreet
@ 2018-05-20 22:45       ` Kent Overstreet
  2018-05-23 15:22         ` Christoph Hellwig
  1 sibling, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2018-05-20 22:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, linux-kernel, linux-fsdevel, Andrew Morton,
	Dave Chinner, darrick.wong, tytso, linux-btrfs, clm, jbacik,
	viro, peterz

On Fri, May 18, 2018 at 08:53:30AM -0700, Christoph Hellwig wrote:
> On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > > Historically, the only problematic case has been direct IO, and people
> > > have been willing to say "well, if you mix buffered and direct IO you
> > > get what you deserve", and that's probably not unreasonable. But now we
> > > have fallocate insert range and collapse range, and those are broken in
> > > ways I frankly don't want to think about if they can't ensure consistency
> > > with the page cache.
> > 
> > ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> > You may get pushback on the grounds that this ought to be a
> > filesystem-specific lock rather than one embedded in the generic inode.
> 
> Honestly I think this probably should be in the core.  But IFF we move
> it to the core the existing users of per-fs locks need to be moved
> over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
> that copied the approach, and probably more if you audit deep enough.

I'm not going to go and redo locking in XFS and ext4 as a prerequisite to
merging bcachefs. Sorry, but that's a bit crazy.

I am more than happy to work on the locking itself if we can agree on what
semantics we want out of it. We have two possible approaches, and we're going to
have to pick one first: the locking can be done at the top of the IO stack (like
ext4 and I'm guessing xfs), but then we're adding locking overhead to buffered
reads and writes that don't need it because they're only touching pages that are
already in cache.

Or we can go with my approach, pushing down the locking to only when we need to
add pages to the page cache. I think if we started out by merging my approach,
it would be pretty easy to have it make use of Mathew's fancy xarray based range
locking when that goes in, the semantics should be similar enough.

If people are ok with and willing to use my approach, I can polish it up - add
lockdep support and whatever else I can think of, and attempt to get rid of the
stupid recursive part.

But that's got to be decided first, where in the call stack the locking should
be done.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/10] mm: pagecache add lock
  2018-05-20 22:45       ` [PATCH 01/10] mm: pagecache add lock Kent Overstreet
@ 2018-05-23 15:22         ` Christoph Hellwig
  2018-05-23 17:12           ` Kent Overstreet
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2018-05-23 15:22 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, Matthew Wilcox, linux-kernel, linux-fsdevel,
	Andrew Morton, Dave Chinner, darrick.wong, tytso, linux-btrfs,
	clm, jbacik, viro, peterz

On Sun, May 20, 2018 at 06:45:24PM -0400, Kent Overstreet wrote:
> > 
> > Honestly I think this probably should be in the core.  But IFF we move
> > it to the core the existing users of per-fs locks need to be moved
> > over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
> > that copied the approach, and probably more if you audit deep enough.
> 
> I'm not going to go and redo locking in XFS and ext4 as a prerequisite to
> merging bcachefs. Sorry, but that's a bit crazy.

It isn't crazy at all.  In general we expect people to do their fair
share of core work to get their pet feature in.  How much is acceptable
is a difficult question and not black and white.

But if you want to grow a critical core structure you better take a stab
at converting existing users.  Without that the tradeoff of growing
that core structure is simply not given.

Or to put it in words for this exact feature:  unless your new field
is also used by mainstream file systems it is not going to be added.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/10] mm: pagecache add lock
  2018-05-23 15:22         ` Christoph Hellwig
@ 2018-05-23 17:12           ` Kent Overstreet
  0 siblings, 0 replies; 43+ messages in thread
From: Kent Overstreet @ 2018-05-23 17:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, linux-kernel, linux-fsdevel, Andrew Morton,
	Dave Chinner, darrick.wong, tytso, linux-btrfs, clm, jbacik,
	viro, peterz

On Wed, May 23, 2018 at 08:22:39AM -0700, Christoph Hellwig wrote:
> On Sun, May 20, 2018 at 06:45:24PM -0400, Kent Overstreet wrote:
> > > 
> > > Honestly I think this probably should be in the core.  But IFF we move
> > > it to the core the existing users of per-fs locks need to be moved
> > > over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
> > > that copied the approach, and probably more if you audit deep enough.
> > 
> > I'm not going to go and redo locking in XFS and ext4 as a prerequisite to
> > merging bcachefs. Sorry, but that's a bit crazy.
> 
> It isn't crazy at all.  In general we expect people to do their fair
> share of core work to get their pet feature in.  How much is acceptable
> is a difficult question and not black and white.
> 
> But if you want to grow a critical core structure you better take a stab
> at converting existing users.  Without that the tradeoff of growing
> that core structure is simply not given.
> 
> Or to put it in words for this exact feature:  unless your new field
> is also used by mainstream file systems it is not going to be added.

Christoph, I'm really not someone you can accuse of avoiding my share of core
work and refactoring and you know it.

When are you going to get around to converting existing users of fs/direct-io.c
to iomap so it can be deleted? The kernel is carrying around two dio
implementations right now thanks to you. Not a good situation, is it?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Notes on locking for pagacache consistency (was: [PATCH 01/10] mm: pagecache add lock)
  2018-05-18 17:45       ` Kent Overstreet
@ 2018-05-23 23:55         ` Kent Overstreet
  0 siblings, 0 replies; 43+ messages in thread
From: Kent Overstreet @ 2018-05-23 23:55 UTC (permalink / raw)
  To: Matthew Wilcox, linux-kernel, linux-fsdevel, Andrew Morton,
	Dave Chinner, darrick.wong, tytso, linux-btrfs, clm, jbacik,
	viro, peterz, bcrl, Christoph Hellwig, zach.brown

On Fri, May 18, 2018 at 01:45:49PM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 08:53:30AM -0700, Christoph Hellwig wrote:
> > On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > > > Historically, the only problematic case has been direct IO, and people
> > > > have been willing to say "well, if you mix buffered and direct IO you
> > > > get what you deserve", and that's probably not unreasonable. But now we
> > > > have fallocate insert range and collapse range, and those are broken in
> > > > ways I frankly don't want to think about if they can't ensure consistency
> > > > with the page cache.
> > > 
> > > ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> > > You may get pushback on the grounds that this ought to be a
> > > filesystem-specific lock rather than one embedded in the generic inode.
> > 
> > Honestly I think this probably should be in the core.  But IFF we move
> > it to the core the existing users of per-fs locks need to be moved
> > over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
> > that copied the approach, and probably more if you audit deep enough.
> 
> I didn't know about i_mmap_sem, thanks
> 
> But, using a rw semaphore wouldn't work for dio writes, and I do need dio writes
> to block pagecache adds in bcachefs since the dio write could overwrite
> uncompressed data or a reservation with compressed data, which means new writes
> need a disk reservation.
> 
> Also I'm guessing ext4 takes the lock at the top of the read path? That sucks
> for reads that are already cached, the generic buffered read path can do cached
> reads without taking any per inode locks - that's why I pushed the lock down to
> only be taken when we have to add a page to the pagecache.
> 
> Definitely less ugly doing it that way though...

More notes on locking design: Mathew, this is very relevant to any sort of range
locking, too.

There's some really tricky issues related to dio writes + page faults. If your
particular filesystem doesn't care about minor page cache consistency issues
caused by dio writes most of this may not be relevant to you, but I honestly
would find it a little hard to believe this isn't an issue for _any_ other
filesystems.

Current situation today, for most filesystems: top of the dio write path shoots
down the region of the pagecache for the file it's writing to, with
filemap_write_and_wait_range() then invalidate_inode_pages2_range().

This is racy though and does _not_ gurarantee page cache consistency, i.e. we
can end up in a situation where the write completes, but we have stale data -
and worse, potentially stale metadata, in the buffer heads or whatever your
filesystem uses.

Ok, solving that is easy enough, we need a lock dio writes can take that
prevents pages from being added to a mapping for their duration (or alternately,
range locks, but range locks can be viewed as just an optimization).

This explains my locking design - if you have a lock that can be taken for "add"
or "block", where multiple threads can take it for add or block, but it can't be
in both states simultaneously then it does what we want, you can have multiple
outstanding dio writes or multiple buffered IO operations bringing pages in and
it doesn't add much overhead.

This isn't enough, though.

 - calling filemap_write_and_wait_range then invalidate_inode_pages2_range can
   race - the call to invalidate_inode_pages2_range will fail if any of the
   pages have been redirtied, and we'll have stale pages in the page cache.

   The ideal solution for this would be a function to do both, that removes
   pages from the page cache atomically with clearing PageDirty and kicking off
   writeback. Alternately, you can have .page_mkwrite and the buffered write
   path take the pagecache add lock when they have to dirty a page, but that
   kind of sucks.

 - pagefaults, via gup()
   
   this is the really annoying one, if userspace does a dio write where the buf
   they're writing is mmapped and overlaps with the part of the file they're
   writing to, yay fun

   we call gup() after shooting down the part of the pagecache we're writing
   to, so gup() just faults it back in again and we're left with stale data in
   the page cache again.

   the generic dio code tries to deal with this these days by calling
   invalidate_inode_pages2_range() again after the dio write completes. Again
   though, invalidate_inode_pages2_range() will fail if the pages are dirty, and
   we're left with stale data in the page cache.

   I _think_ (haven't tried this yet) it ought to be possible to move the second
   call to invalidate_inode_pages2_range() to immediately after the call to
   gup() - this change wouldn't make sense in the current code without locking,
   but it would make sense if the dio write path is holding a pagecache add lock
   to prevent anything but its own faults via gup() from bringing pages back in.

   Also need to prevent the pages gup() faulted in from being dirtied until they
   can be removed again... 

   Also, one of the things my patch did was change filemap_fault() to not kick
   off readahead and only read in single pages if we're being called with
   pagecache block lock held (i.e. from dio write to the same file). I'm pretty
   sure this is unnecessary though, if I add the second
   invalidate_inode_pages2_range() call to my own dio code after gup() (which I
   believe is the correct solution anyways).

 - lock recursion

   my locking approach pushes down the locking to only when we're adding pages
   to the radix tree, unlike, I belive, the ext4/xfs approach.

   this means that dio write takes page_cache_block lock, and page faults take
   page_cache_add lock, so dio write -> gup -> fault is a recursive locking
   deadlock unless we do something about it.

   my solution was to add a pointer to a pagecache_lock to task_struct, so we
   can detect the recursion when we go to take pagecache_add lock. It's ugly,
   but I haven't thought of a better way to do it.

   I'm pretty sure that the xarray based range locking Matthew wants to do will
   hit the same issue, it's just inherent to pushing the locking down to where
   we manipulate the radix tree - which IMO we want to do, so that buffered
   reads/writes to cached data aren't taking these locks unnecessarily; those
   are the fast paths.

If anyone has any better ideas than what I've come up with, or sees any gaping
holes in my design, please speak up...

Even if you don't care about dio write consistency, having this locking in the
core VFS could mean converting truncate to use it, which I think would be a
major benefit too.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2018-05-23 23:55 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-18  7:48 [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
2018-05-18  7:49 ` [PATCH 01/10] mm: pagecache add lock Kent Overstreet
2018-05-18 13:13   ` Matthew Wilcox
2018-05-18 15:53     ` Christoph Hellwig
2018-05-18 17:45       ` Kent Overstreet
2018-05-23 23:55         ` Notes on locking for pagacache consistency (was: [PATCH 01/10] mm: pagecache add lock) Kent Overstreet
2018-05-20 22:45       ` [PATCH 01/10] mm: pagecache add lock Kent Overstreet
2018-05-23 15:22         ` Christoph Hellwig
2018-05-23 17:12           ` Kent Overstreet
2018-05-18  7:49 ` [PATCH 02/10] mm: export find_get_pages() Kent Overstreet
2018-05-18 16:00   ` Christoph Hellwig
2018-05-18  7:49 ` [PATCH 03/10] locking: bring back lglocks Kent Overstreet
2018-05-18  9:51   ` Peter Zijlstra
2018-05-18 10:13     ` Kent Overstreet
2018-05-18 11:03       ` Peter Zijlstra
2018-05-18 11:39         ` Kent Overstreet
2018-05-18  7:49 ` [PATCH 04/10] locking: export osq_lock()/osq_unlock() Kent Overstreet
2018-05-18  9:52   ` Peter Zijlstra
2018-05-18 10:18     ` Kent Overstreet
2018-05-18 11:08       ` Peter Zijlstra
2018-05-18 11:32         ` Kent Overstreet
2018-05-18 11:40           ` Peter Zijlstra
2018-05-18 12:40             ` Kent Overstreet
2018-05-18  7:49 ` [PATCH 05/10] don't use spin_lock_irqsave() unnecessarily Kent Overstreet
2018-05-18 16:01   ` Christoph Hellwig
2018-05-18  7:49 ` [PATCH 06/10] Generic radix trees Kent Overstreet
2018-05-18 16:02   ` Christoph Hellwig
2018-05-18 17:38     ` Kent Overstreet
2018-05-18  7:49 ` [PATCH 07/10] bcache: optimize continue_at_nobarrier() Kent Overstreet
2018-05-18  7:49 ` [PATCH 08/10] bcache: move closures to lib/ Kent Overstreet
2018-05-18 16:02   ` Christoph Hellwig
2018-05-18  7:49 ` [PATCH 09/10] closures: closure_wait_event() Kent Overstreet
2018-05-18  7:49 ` [PATCH 10/10] Dynamic fault injection Kent Overstreet
2018-05-18 16:02   ` Christoph Hellwig
2018-05-18 17:37     ` Kent Overstreet
2018-05-18 19:05   ` Andreas Dilger
2018-05-18 19:10     ` Kent Overstreet
2018-05-18 20:54       ` Andreas Dilger
2018-05-18  7:55 ` [PATCH 00/10] RFC: assorted bcachefs patches Kent Overstreet
2018-05-18 17:45 ` Josef Bacik
2018-05-18 17:49   ` Kent Overstreet
2018-05-18 18:03     ` Josef Bacik
2018-05-18 18:28       ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).