All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/20] bcachefs prereqs patch series
@ 2023-07-12 21:10 Kent Overstreet
  2023-07-12 21:10 ` [PATCH 01/20] sched: Add task_struct->faults_disabled_mapping Kent Overstreet
                   ` (19 more replies)
  0 siblings, 20 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:10 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel; +Cc: Kent Overstreet

Here's the latest iteration of the bcachefs prereqs patch series, now
hoping to go in for 6.6.

https://lore.kernel.org/linux-bcachefs/20230712195719.y4msidsr7suu55gl@moria.home.lan/T/

This is now slimmed down as much has I can. I've droped the lockdep
patches, at the cost of dropping lockdep support for btree node locks.
Six locks and mean_and_variance are now in fs/bcachefs/. My
copy_folio_from_iter_atomic() patch has been dropped in favor of the
iov_iter patch from Matthew, which he said he'd likely send for -rc4, so
will already be in by the time of the actual bcachefs pull request.

Christopher James Halse Rogers (1):
  stacktrace: Export stack_trace_save_tsk

Kent Overstreet (18):
  sched: Add task_struct->faults_disabled_mapping
  fs: factor out d_mark_tmpfile()
  block: Add some exports for bcachefs
  block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
  block: Bring back zero_fill_bio_iter
  block: Don't block on s_umount from __invalidate_super()
  lib/string_helpers: string_get_size() now returns characters wrote
  lib: Export errname
  locking/osq: Export osq_(lock|unlock)
  bcache: move closures to lib/
  MAINTAINERS: Add entry for closures
  closures: closure_wait_event()
  closures: closure_nr_remaining()
  closures: Add a missing include
  MAINTAINERS: Add entry for generic-radix-tree
  lib/generic-radix-tree.c: Don't overflow in peek()
  lib/generic-radix-tree.c: Add a missing include
  lib/generic-radix-tree.c: Add peek_prev()

Matthew Wilcox (Oracle) (1):
  iov_iter: Handle compound highmem pages in
    copy_page_from_iter_atomic()

 MAINTAINERS                                   | 15 ++++
 block/bdev.c                                  |  2 +-
 block/bio.c                                   | 18 +++--
 block/blk-core.c                              |  1 +
 block/blk.h                                   |  1 -
 drivers/md/bcache/Kconfig                     | 10 +--
 drivers/md/bcache/Makefile                    |  4 +-
 drivers/md/bcache/bcache.h                    |  2 +-
 drivers/md/bcache/super.c                     |  1 -
 drivers/md/bcache/util.h                      |  3 +-
 fs/dcache.c                                   | 12 ++-
 fs/super.c                                    | 40 +++++++---
 include/linux/bio.h                           |  7 +-
 include/linux/blkdev.h                        |  1 +
 .../md/bcache => include/linux}/closure.h     | 46 ++++++++---
 include/linux/dcache.h                        |  1 +
 include/linux/fs.h                            |  1 +
 include/linux/generic-radix-tree.h            | 68 ++++++++++++++++-
 include/linux/sched.h                         |  1 +
 include/linux/string_helpers.h                |  4 +-
 init/init_task.c                              |  1 +
 kernel/locking/osq_lock.c                     |  2 +
 kernel/stacktrace.c                           |  2 +
 lib/Kconfig                                   |  3 +
 lib/Kconfig.debug                             |  9 +++
 lib/Makefile                                  |  2 +
 {drivers/md/bcache => lib}/closure.c          | 36 +++++----
 lib/errname.c                                 |  1 +
 lib/generic-radix-tree.c                      | 76 ++++++++++++++++++-
 lib/iov_iter.c                                | 43 +++++++----
 lib/string_helpers.c                          | 10 ++-
 31 files changed, 333 insertions(+), 90 deletions(-)
 rename {drivers/md/bcache => include/linux}/closure.h (93%)
 rename {drivers/md/bcache => lib}/closure.c (88%)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 01/20] sched: Add task_struct->faults_disabled_mapping
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
@ 2023-07-12 21:10 ` Kent Overstreet
  2023-07-12 21:10 ` [PATCH 02/20] fs: factor out d_mark_tmpfile() Kent Overstreet
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:10 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Kent Overstreet, Jan Kara, Darrick J . Wong,
	Andreas Grünbacher

From: Kent Overstreet <kent.overstreet@gmail.com>

There has been a long standing page cache coherence bug with direct IO.
This provides part of a mechanism to fix it, currently just used by
bcachefs but potentially worth promoting to the VFS.

Direct IO evicts the range of the pagecache being read or written to.

For reads, we need dirty pages to be written to disk, so that the read
doesn't return stale data. For writes, we need to evict that range of
the pagecache so that it's not stale after the write completes.

However, without a locking mechanism to prevent those pages from being
re-added to the pagecache - by a buffered read or page fault - page
cache inconsistency is still possible.

This isn't necessarily just an issue for userspace when they're playing
games; filesystems may hang arbitrary state off the pagecache, and so
page cache inconsistency may cause real filesystem bugs, depending on
the filesystem. This is less of an issue for iomap based filesystems,
but e.g. buffer heads caches disk block mappings (!) and attaches them
to the pagecache, and bcachefs attaches disk reservations to pagecache
pages.

This issue has been hard to fix, because
 - we need to add a lock (henceforth calld pagecache_add_lock), which
   would be held for the duration of the direct IO
 - page faults add pages to the page cache, thus need to take the same
   lock
 - dio -> gup -> page fault thus can deadlock

And we cannot enforce a lock ordering with this lock, since userspace
will be controlling the lock ordering (via the fd and buffer arguments
to direct IOs), so we need a different method of deadlock avoidance.

We need to tell the page fault handler that we're already holding a
pagecache_add_lock, and since plumbing it through the entire gup() path
would be highly impractical this adds a field to task_struct.

Then the full method is:
 - in the dio path, when we take first pagecache_add_lock, note the
   mapping in task_struct
 - in the page fault handler, if faults_disabled_mapping is set, we
   check if it's the same mapping as the one taking a page fault for,
   and if so return an error.

   Then we check lock ordering: if there's a lock ordering violation and
   trylock fails, we'll have to cycle the locks and return an error that
   tells the DIO path to retry: faults_disabled_mapping is also used for
   signalling "locks were dropped, please retry".

Also relevant to this patch: mapping->invalidate_lock.
mapping->invalidate_lock provides most of the required semantics - it's
used by truncate/fallocate to block pages being added to the pagecache.
However, since it's a rwsem, direct IOs would need to take the write
side in order to block page cache adds, and would then be exclusive with
each other - we'll need a new type of lock to pair with this approach.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jan Kara <jack@suse.cz>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Andreas Grünbacher <andreas.gruenbacher@gmail.com>
---
 include/linux/sched.h | 1 +
 init/init_task.c      | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eed5d65b8d..bc7b61305c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -871,6 +871,7 @@ struct task_struct {
 
 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;
+	struct address_space		*faults_disabled_mapping;
 
 	int				exit_state;
 	int				exit_code;
diff --git a/init/init_task.c b/init/init_task.c
index ff6c4b9bfe..f703116e05 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -85,6 +85,7 @@ struct task_struct init_task
 	.nr_cpus_allowed= NR_CPUS,
 	.mm		= NULL,
 	.active_mm	= &init_mm,
+	.faults_disabled_mapping = NULL,
 	.restart_block	= {
 		.fn = do_no_restart_syscall,
 	},
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 02/20] fs: factor out d_mark_tmpfile()
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
  2023-07-12 21:10 ` [PATCH 01/20] sched: Add task_struct->faults_disabled_mapping Kent Overstreet
@ 2023-07-12 21:10 ` Kent Overstreet
  2023-07-12 21:10 ` [PATCH 03/20] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Kent Overstreet
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:10 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Kent Overstreet, Alexander Viro, Christian Brauner

From: Kent Overstreet <kent.overstreet@gmail.com>

New helper for bcachefs - bcachefs doesn't want the
inode_dec_link_count() call that d_tmpfile does, it handles i_nlink on
its own atomically with other btree updates

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/dcache.c            | 12 ++++++++++--
 include/linux/dcache.h |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 52e6d5fdab..dbdafa2617 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3249,11 +3249,10 @@ void d_genocide(struct dentry *parent)
 
 EXPORT_SYMBOL(d_genocide);
 
-void d_tmpfile(struct file *file, struct inode *inode)
+void d_mark_tmpfile(struct file *file, struct inode *inode)
 {
 	struct dentry *dentry = file->f_path.dentry;
 
-	inode_dec_link_count(inode);
 	BUG_ON(dentry->d_name.name != dentry->d_iname ||
 		!hlist_unhashed(&dentry->d_u.d_alias) ||
 		!d_unlinked(dentry));
@@ -3263,6 +3262,15 @@ void d_tmpfile(struct file *file, struct inode *inode)
 				(unsigned long long)inode->i_ino);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dentry->d_parent->d_lock);
+}
+EXPORT_SYMBOL(d_mark_tmpfile);
+
+void d_tmpfile(struct file *file, struct inode *inode)
+{
+	struct dentry *dentry = file->f_path.dentry;
+
+	inode_dec_link_count(inode);
+	d_mark_tmpfile(file, inode);
 	d_instantiate(dentry, inode);
 }
 EXPORT_SYMBOL(d_tmpfile);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 6b351e009f..3da2f0545d 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -251,6 +251,7 @@ extern struct dentry * d_make_root(struct inode *);
 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
 
+extern void d_mark_tmpfile(struct file *, struct inode *);
 extern void d_tmpfile(struct file *, struct inode *);
 
 extern struct dentry *d_find_alias(struct inode *);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 03/20] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
  2023-07-12 21:10 ` [PATCH 01/20] sched: Add task_struct->faults_disabled_mapping Kent Overstreet
  2023-07-12 21:10 ` [PATCH 02/20] fs: factor out d_mark_tmpfile() Kent Overstreet
@ 2023-07-12 21:10 ` Kent Overstreet
  2023-07-12 21:10 ` [PATCH 04/20] block: Add some exports for bcachefs Kent Overstreet
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:10 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Matthew Wilcox (Oracle), Kent Overstreet

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

copy_page_from_iter_atomic() already handles !highmem compound
pages correctly, but if we are passed a highmem compound page,
each base page needs to be mapped & unmapped individually.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 lib/iov_iter.c | 43 ++++++++++++++++++++++++++++---------------
 1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 960223ed91..f9c4bba272 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -857,24 +857,37 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_zero);
 
-size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
-				  struct iov_iter *i)
+size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
+		size_t bytes, struct iov_iter *i)
 {
-	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
-	if (!page_copy_sane(page, offset, bytes)) {
-		kunmap_atomic(kaddr);
+	size_t n, copied = 0;
+
+	if (!page_copy_sane(page, offset, bytes))
 		return 0;
-	}
-	if (WARN_ON_ONCE(!i->data_source)) {
-		kunmap_atomic(kaddr);
+	if (WARN_ON_ONCE(!i->data_source))
 		return 0;
-	}
-	iterate_and_advance(i, bytes, base, len, off,
-		copyin(p + off, base, len),
-		memcpy_from_iter(i, p + off, base, len)
-	)
-	kunmap_atomic(kaddr);
-	return bytes;
+
+	do {
+		char *p;
+
+		n = bytes - copied;
+		if (PageHighMem(page)) {
+			page += offset / PAGE_SIZE;
+			offset %= PAGE_SIZE;
+			n = min_t(size_t, n, PAGE_SIZE - offset);
+		}
+
+		p = kmap_atomic(page) + offset;
+		iterate_and_advance(i, n, base, len, off,
+			copyin(p + off, base, len),
+			memcpy_from_iter(i, p + off, base, len)
+		)
+		kunmap_atomic(p);
+		copied += n;
+		offset += n;
+	} while (PageHighMem(page) && copied != bytes && n > 0);
+
+	return copied;
 }
 EXPORT_SYMBOL(copy_page_from_iter_atomic);
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 04/20] block: Add some exports for bcachefs
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (2 preceding siblings ...)
  2023-07-12 21:10 ` [PATCH 03/20] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Kent Overstreet
@ 2023-07-12 21:10 ` Kent Overstreet
  2023-07-24 17:31   ` Christoph Hellwig
  2023-07-25  2:59   ` Matthew Wilcox
  2023-07-12 21:11 ` [PATCH 05/20] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset Kent Overstreet
                   ` (15 subsequent siblings)
  19 siblings, 2 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:10 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, linux-block, Jens Axboe, Kent Overstreet

From: Kent Overstreet <kent.overstreet@gmail.com>

 - bio_set_pages_dirty(), bio_check_pages_dirty() - dio path
 - blk_status_to_str() - error messages
 - bio_add_folio() - this should definitely be exported for everyone,
   it's the modern version of bio_add_page()

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: linux-block@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 block/bio.c            | 2 ++
 block/blk-core.c       | 1 +
 block/blk.h            | 1 -
 include/linux/blkdev.h | 1 +
 4 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 043944fd46..1e75840d17 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1481,6 +1481,7 @@ void bio_set_pages_dirty(struct bio *bio)
 			set_page_dirty_lock(bvec->bv_page);
 	}
 }
+EXPORT_SYMBOL_GPL(bio_set_pages_dirty);
 
 /*
  * bio_check_pages_dirty() will check that all the BIO's pages are still dirty.
@@ -1540,6 +1541,7 @@ void bio_check_pages_dirty(struct bio *bio)
 	spin_unlock_irqrestore(&bio_dirty_lock, flags);
 	schedule_work(&bio_dirty_work);
 }
+EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
 
 static inline bool bio_remaining_done(struct bio *bio)
 {
diff --git a/block/blk-core.c b/block/blk-core.c
index 1da77e7d62..b7b0237c36 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -205,6 +205,7 @@ const char *blk_status_to_str(blk_status_t status)
 		return "<null>";
 	return blk_errors[idx].name;
 }
+EXPORT_SYMBOL_GPL(blk_status_to_str);
 
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
diff --git a/block/blk.h b/block/blk.h
index 45547bcf11..f20f9ca03e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -251,7 +251,6 @@ static inline void bio_integrity_free(struct bio *bio)
 
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
-const char *blk_status_to_str(blk_status_t status);
 
 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 		unsigned int nr_segs);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c0ffe203a6..7a32dc98e1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -854,6 +854,7 @@ extern const char *blk_op_str(enum req_op op);
 
 int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
+const char *blk_status_to_str(blk_status_t status);
 
 /* only poll the hardware once, don't continue until a completion was found */
 #define BLK_POLL_ONESHOT		(1 << 0)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 05/20] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (3 preceding siblings ...)
  2023-07-12 21:10 ` [PATCH 04/20] block: Add some exports for bcachefs Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-07-24 17:34   ` Christoph Hellwig
  2023-07-12 21:11 ` [PATCH 06/20] block: Bring back zero_fill_bio_iter Kent Overstreet
                   ` (14 subsequent siblings)
  19 siblings, 1 reply; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Jens Axboe, linux-block

bio_iov_iter_get_pages() trims the IO based on the block size of the
block device the IO will be issued to.

However, bcachefs is a multi device filesystem; when we're creating the
bio we don't yet know which block device the bio will be submitted to -
we have to handle the alignment checks elsewhere.

Thus this is needed to avoid a null ptr deref.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
---
 block/bio.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 1e75840d17..e74a04ea14 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1245,7 +1245,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct page **pages = (struct page **)bv;
 	ssize_t size, left;
 	unsigned len, i = 0;
-	size_t offset, trim;
+	size_t offset;
 	int ret = 0;
 
 	/*
@@ -1274,10 +1274,12 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
-	trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
-	iov_iter_revert(iter, trim);
+	if (bio->bi_bdev) {
+		size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
+		iov_iter_revert(iter, trim);
+		size -= trim;
+	}
 
-	size -= trim;
 	if (unlikely(!size)) {
 		ret = -EFAULT;
 		goto out;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 06/20] block: Bring back zero_fill_bio_iter
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (4 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 05/20] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-07-24 17:35   ` Christoph Hellwig
  2023-07-12 21:11 ` [PATCH 07/20] block: Don't block on s_umount from __invalidate_super() Kent Overstreet
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Kent Overstreet, Jens Axboe, linux-block

From: Kent Overstreet <kent.overstreet@gmail.com>

This reverts 6f822e1b5d9dda3d20e87365de138046e3baa03a - this helper is
used by bcachefs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
---
 block/bio.c         | 6 +++---
 include/linux/bio.h | 7 ++++++-
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index e74a04ea14..70b5c987bc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -606,15 +606,15 @@ struct bio *bio_kmalloc(unsigned short nr_vecs, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(bio_kmalloc);
 
-void zero_fill_bio(struct bio *bio)
+void zero_fill_bio_iter(struct bio *bio, struct bvec_iter start)
 {
 	struct bio_vec bv;
 	struct bvec_iter iter;
 
-	bio_for_each_segment(bv, bio, iter)
+	__bio_for_each_segment(bv, bio, iter, start)
 		memzero_bvec(&bv);
 }
-EXPORT_SYMBOL(zero_fill_bio);
+EXPORT_SYMBOL(zero_fill_bio_iter);
 
 /**
  * bio_truncate - truncate the bio to small size of @new_size
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b3e7529ff5..f2620f8d18 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -484,7 +484,12 @@ extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
 extern void bio_copy_data(struct bio *dst, struct bio *src);
 extern void bio_free_pages(struct bio *bio);
 void guard_bio_eod(struct bio *bio);
-void zero_fill_bio(struct bio *bio);
+void zero_fill_bio_iter(struct bio *bio, struct bvec_iter iter);
+
+static inline void zero_fill_bio(struct bio *bio)
+{
+	zero_fill_bio_iter(bio, bio->bi_iter);
+}
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 07/20] block: Don't block on s_umount from __invalidate_super()
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (5 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 06/20] block: Bring back zero_fill_bio_iter Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-07-12 21:11 ` [PATCH 08/20] stacktrace: Export stack_trace_save_tsk Kent Overstreet
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Christian Brauner, Jens Axboe

__invalidate_super() is used to flush any filesystem mounted on a
device, generally on some sort of media change event.

However, when unmounting a filesystem and closing the underlying block
devices, we can deadlock if the block driver then calls
__invalidate_device() (e.g. because the block device goes away when it
is no longer in use).

This happens with bcachefs on top of loopback, and can be triggered by
fstests generic/042:

  put_super
    -> blkdev_put
    -> lo_release
    -> disk_force_media_change
    -> __invalidate_device
    -> get_super

This isn't inherently specific to bcachefs - it hasn't shown up with
other filesystems before because most other filesystems use the sget()
mechanism for opening/closing block devices (and enforcing exclusion),
however sget() has its own downsides and weird/sketchy behaviour w.r.t.
block device open lifetime - if that ever gets fixed more code will run
into this issue.

The __invalidate_device() call here is really a best effort "I just
yanked the device for a mounted filesystem, please try not to lose my
data" - if it's ever actually needed the user has already done something
crazy, and we probably shouldn't make things worse by deadlocking.
Switching to a trylock seems in keeping with what the code is trying to
do.

If we ever get revoke() at the block layer, perhaps we would look at
rearchitecting to use that instead.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
---
 block/bdev.c       |  2 +-
 fs/super.c         | 40 +++++++++++++++++++++++++++++++---------
 include/linux/fs.h |  1 +
 3 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 21c63bfef3..a4d7e8732c 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -934,7 +934,7 @@ EXPORT_SYMBOL(lookup_bdev);
 
 int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 {
-	struct super_block *sb = get_super(bdev);
+	struct super_block *sb = try_get_super(bdev);
 	int res = 0;
 
 	if (sb) {
diff --git a/fs/super.c b/fs/super.c
index 04bc62ab7d..a2decce02f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -791,14 +791,7 @@ void iterate_supers_type(struct file_system_type *type,
 
 EXPORT_SYMBOL(iterate_supers_type);
 
-/**
- * get_super - get the superblock of a device
- * @bdev: device to get the superblock for
- *
- * Scans the superblock list and finds the superblock of the file system
- * mounted on the device given. %NULL is returned if no match is found.
- */
-struct super_block *get_super(struct block_device *bdev)
+static struct super_block *__get_super(struct block_device *bdev, bool try)
 {
 	struct super_block *sb;
 
@@ -813,7 +806,12 @@ struct super_block *get_super(struct block_device *bdev)
 		if (sb->s_bdev == bdev) {
 			sb->s_count++;
 			spin_unlock(&sb_lock);
-			down_read(&sb->s_umount);
+
+			if (!try)
+				down_read(&sb->s_umount);
+			else if (!down_read_trylock(&sb->s_umount))
+				return NULL;
+
 			/* still alive? */
 			if (sb->s_root && (sb->s_flags & SB_BORN))
 				return sb;
@@ -828,6 +826,30 @@ struct super_block *get_super(struct block_device *bdev)
 	return NULL;
 }
 
+/**
+ * get_super - get the superblock of a device
+ * @bdev: device to get the superblock for
+ *
+ * Scans the superblock list and finds the superblock of the file system
+ * mounted on the device given. %NULL is returned if no match is found.
+ */
+struct super_block *get_super(struct block_device *bdev)
+{
+	return __get_super(bdev, false);
+}
+
+/**
+ * try_get_super - get the superblock of a device, using trylock on sb->s_umount
+ * @bdev: device to get the superblock for
+ *
+ * Scans the superblock list and finds the superblock of the file system
+ * mounted on the device given. %NULL is returned if no match is found.
+ */
+struct super_block *try_get_super(struct block_device *bdev)
+{
+	return __get_super(bdev, true);
+}
+
 /**
  * get_active_super - get an active reference to the superblock of a device
  * @bdev: device to get the superblock for
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 133f0640fb..bd5105bc92 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2897,6 +2897,7 @@ extern struct file_system_type *get_filesystem(struct file_system_type *fs);
 extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
 extern struct super_block *get_super(struct block_device *);
+extern struct super_block *try_get_super(struct block_device *);
 extern struct super_block *get_active_super(struct block_device *bdev);
 extern void drop_super(struct super_block *sb);
 extern void drop_super_exclusive(struct super_block *sb);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 08/20] stacktrace: Export stack_trace_save_tsk
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (6 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 07/20] block: Don't block on s_umount from __invalidate_super() Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-07-12 21:11 ` [PATCH 09/20] lib/string_helpers: string_get_size() now returns characters wrote Kent Overstreet
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Christopher James Halse Rogers, Kent Overstreet

From: Christopher James Halse Rogers <raof@ubuntu.com>

The bcachefs module wants it, and there doesn't seem to be any
reason it shouldn't be exported like the other functions.

Signed-off-by: Christopher James Halse Rogers <raof@ubuntu.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 kernel/stacktrace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index 9ed5ce9894..4f65824879 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -151,6 +151,7 @@ unsigned int stack_trace_save_tsk(struct task_struct *tsk, unsigned long *store,
 	put_task_stack(tsk);
 	return c.len;
 }
+EXPORT_SYMBOL_GPL(stack_trace_save_tsk);
 
 /**
  * stack_trace_save_regs - Save a stack trace based on pt_regs into a storage array
@@ -301,6 +302,7 @@ unsigned int stack_trace_save_tsk(struct task_struct *task,
 	save_stack_trace_tsk(task, &trace);
 	return trace.nr_entries;
 }
+EXPORT_SYMBOL_GPL(stack_trace_save_tsk);
 
 /**
  * stack_trace_save_regs - Save a stack trace based on pt_regs into a storage array
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 09/20] lib/string_helpers: string_get_size() now returns characters wrote
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (7 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 08/20] stacktrace: Export stack_trace_save_tsk Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-07-12 21:11 ` [PATCH 10/20] lib: Export errname Kent Overstreet
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Kent Overstreet

From: Kent Overstreet <kent.overstreet@gmail.com>

printbuf now needs to know the number of characters that would have been
written if the buffer was too small, like snprintf(); this changes
string_get_size() to return the the return value of snprintf().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/string_helpers.h |  4 ++--
 lib/string_helpers.c           | 10 ++++++----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h
index fae6beaaa2..44148f8feb 100644
--- a/include/linux/string_helpers.h
+++ b/include/linux/string_helpers.h
@@ -23,8 +23,8 @@ enum string_size_units {
 	STRING_UNITS_2,		/* use binary powers of 2^10 */
 };
 
-void string_get_size(u64 size, u64 blk_size, enum string_size_units units,
-		     char *buf, int len);
+int string_get_size(u64 size, u64 blk_size, enum string_size_units units,
+		    char *buf, int len);
 
 int parse_int_array_user(const char __user *from, size_t count, int **array);
 
diff --git a/lib/string_helpers.c b/lib/string_helpers.c
index 230020a2e0..b9a34eb386 100644
--- a/lib/string_helpers.c
+++ b/lib/string_helpers.c
@@ -31,9 +31,11 @@
  * giving the size in the required units.  @buf should have room for
  * at least 9 bytes and will always be zero terminated.
  *
+ * Return value: number of characters of output that would have been written
+ * (which may be greater than len, if output was truncated).
  */
-void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
-		     char *buf, int len)
+int string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
+		    char *buf, int len)
 {
 	static const char *const units_10[] = {
 		"B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"
@@ -126,8 +128,8 @@ void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
 	else
 		unit = units_str[units][i];
 
-	snprintf(buf, len, "%u%s %s", (u32)size,
-		 tmp, unit);
+	return snprintf(buf, len, "%u%s %s", (u32)size,
+			tmp, unit);
 }
 EXPORT_SYMBOL(string_get_size);
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 10/20] lib: Export errname
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (8 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 09/20] lib/string_helpers: string_get_size() now returns characters wrote Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-07-13  7:10   ` Eric Biggers
  2023-07-12 21:11 ` [PATCH 11/20] locking/osq: Export osq_(lock|unlock) Kent Overstreet
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Christopher James Halse Rogers

errname() returns the name of an errcode; this functionality is
otherwise only available for error pointers via %pE - bcachefs uses this
for better error messages.

Signed-off-by: Christopher James Halse Rogers <raof@ubuntu.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 lib/errname.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/errname.c b/lib/errname.c
index 67739b174a..dd1b998552 100644
--- a/lib/errname.c
+++ b/lib/errname.c
@@ -228,3 +228,4 @@ const char *errname(int err)
 
 	return err > 0 ? name + 1 : name;
 }
+EXPORT_SYMBOL(errname);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 11/20] locking/osq: Export osq_(lock|unlock)
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (9 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 10/20] lib: Export errname Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-08-02 20:16   ` Waiman Long
  2023-07-12 21:11 ` [PATCH 12/20] bcache: move closures to lib/ Kent Overstreet
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Peter Zijlstra, Ingo Molnar, Waiman Long, Boqun Feng

These are used by bcachefs's six locks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 kernel/locking/osq_lock.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b..b752ec5cc6 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -203,6 +203,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 
 	return false;
 }
+EXPORT_SYMBOL_GPL(osq_lock);
 
 void osq_unlock(struct optimistic_spin_queue *lock)
 {
@@ -230,3 +231,4 @@ void osq_unlock(struct optimistic_spin_queue *lock)
 	if (next)
 		WRITE_ONCE(next->locked, 1);
 }
+EXPORT_SYMBOL_GPL(osq_unlock);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 12/20] bcache: move closures to lib/
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (10 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 11/20] locking/osq: Export osq_(lock|unlock) Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-07-13  3:21   ` Randy Dunlap
  2023-07-12 21:11 ` [PATCH 13/20] MAINTAINERS: Add entry for closures Kent Overstreet
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Kent Overstreet, Coly Li

From: Kent Overstreet <kent.overstreet@gmail.com>

Prep work for bcachefs - being a fork of bcache it also uses closures

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/Kconfig                     | 10 +-----
 drivers/md/bcache/Makefile                    |  4 +--
 drivers/md/bcache/bcache.h                    |  2 +-
 drivers/md/bcache/super.c                     |  1 -
 drivers/md/bcache/util.h                      |  3 +-
 .../md/bcache => include/linux}/closure.h     | 17 +++++----
 lib/Kconfig                                   |  3 ++
 lib/Kconfig.debug                             |  9 +++++
 lib/Makefile                                  |  2 ++
 {drivers/md/bcache => lib}/closure.c          | 35 +++++++++----------
 10 files changed, 43 insertions(+), 43 deletions(-)
 rename {drivers/md/bcache => include/linux}/closure.h (97%)
 rename {drivers/md/bcache => lib}/closure.c (88%)

diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
index 529c9d04e9..b2d10063d3 100644
--- a/drivers/md/bcache/Kconfig
+++ b/drivers/md/bcache/Kconfig
@@ -4,6 +4,7 @@ config BCACHE
 	tristate "Block device as cache"
 	select BLOCK_HOLDER_DEPRECATED if SYSFS
 	select CRC64
+	select CLOSURES
 	help
 	Allows a block device to be used as cache for other devices; uses
 	a btree for indexing and the layout is optimized for SSDs.
@@ -19,15 +20,6 @@ config BCACHE_DEBUG
 	Enables extra debugging tools, allows expensive runtime checks to be
 	turned on.
 
-config BCACHE_CLOSURES_DEBUG
-	bool "Debug closures"
-	depends on BCACHE
-	select DEBUG_FS
-	help
-	Keeps all active closures in a linked list and provides a debugfs
-	interface to list them, which makes it possible to see asynchronous
-	operations that get stuck.
-
 config BCACHE_ASYNC_REGISTRATION
 	bool "Asynchronous device registration"
 	depends on BCACHE
diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
index 5b87e59676..054e8a33a7 100644
--- a/drivers/md/bcache/Makefile
+++ b/drivers/md/bcache/Makefile
@@ -2,6 +2,6 @@
 
 obj-$(CONFIG_BCACHE)	+= bcache.o
 
-bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
-	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
+bcache-y		:= alloc.o bset.o btree.o debug.o extents.o io.o\
+	journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
 	util.o writeback.o features.o
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index aebb7ef10e..c8b4914ad8 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -179,6 +179,7 @@
 #define pr_fmt(fmt) "bcache: %s() " fmt, __func__
 
 #include <linux/bio.h>
+#include <linux/closure.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
 #include <linux/mutex.h>
@@ -192,7 +193,6 @@
 #include "bcache_ondisk.h"
 #include "bset.h"
 #include "util.h"
-#include "closure.h"
 
 struct bucket {
 	atomic_t	pin;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 7e9d19fd21..35c701d542 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2911,7 +2911,6 @@ static int __init bcache_init(void)
 		goto err;
 
 	bch_debug_init();
-	closure_debug_init();
 
 	bcache_is_reboot = false;
 
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index 6f3cb7c921..f61ab1bada 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@
 #define _BCACHE_UTIL_H
 
 #include <linux/blkdev.h>
+#include <linux/closure.h>
 #include <linux/errno.h>
 #include <linux/kernel.h>
 #include <linux/sched/clock.h>
@@ -13,8 +14,6 @@
 #include <linux/workqueue.h>
 #include <linux/crc64.h>
 
-#include "closure.h"
-
 struct closure;
 
 #ifdef CONFIG_BCACHE_DEBUG
diff --git a/drivers/md/bcache/closure.h b/include/linux/closure.h
similarity index 97%
rename from drivers/md/bcache/closure.h
rename to include/linux/closure.h
index c88cdc4ae4..0ec9e7bc8d 100644
--- a/drivers/md/bcache/closure.h
+++ b/include/linux/closure.h
@@ -155,7 +155,7 @@ struct closure {
 
 	atomic_t		remaining;
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 #define CLOSURE_MAGIC_DEAD	0xc054dead
 #define CLOSURE_MAGIC_ALIVE	0xc054a11e
 
@@ -184,15 +184,13 @@ static inline void closure_sync(struct closure *cl)
 		__closure_sync(cl);
 }
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 
-void closure_debug_init(void);
 void closure_debug_create(struct closure *cl);
 void closure_debug_destroy(struct closure *cl);
 
 #else
 
-static inline void closure_debug_init(void) {}
 static inline void closure_debug_create(struct closure *cl) {}
 static inline void closure_debug_destroy(struct closure *cl) {}
 
@@ -200,21 +198,21 @@ static inline void closure_debug_destroy(struct closure *cl) {}
 
 static inline void closure_set_ip(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->ip = _THIS_IP_;
 #endif
 }
 
 static inline void closure_set_ret_ip(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->ip = _RET_IP_;
 #endif
 }
 
 static inline void closure_set_waiting(struct closure *cl, unsigned long f)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->waiting_on = f;
 #endif
 }
@@ -243,6 +241,7 @@ static inline void closure_queue(struct closure *cl)
 	 */
 	BUILD_BUG_ON(offsetof(struct closure, fn)
 		     != offsetof(struct work_struct, func));
+
 	if (wq) {
 		INIT_WORK(&cl->work, cl->work.func);
 		BUG_ON(!queue_work(wq, &cl->work));
@@ -255,7 +254,7 @@ static inline void closure_queue(struct closure *cl)
  */
 static inline void closure_get(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	BUG_ON((atomic_inc_return(&cl->remaining) &
 		CLOSURE_REMAINING_MASK) <= 1);
 #else
@@ -271,7 +270,7 @@ static inline void closure_get(struct closure *cl)
  */
 static inline void closure_init(struct closure *cl, struct closure *parent)
 {
-	memset(cl, 0, sizeof(struct closure));
+	cl->fn = NULL;
 	cl->parent = parent;
 	if (parent)
 		closure_get(parent);
diff --git a/lib/Kconfig b/lib/Kconfig
index 5c2da561c5..f78bc8b425 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -505,6 +505,9 @@ config ASSOCIATIVE_ARRAY
 
 	  for more information.
 
+config CLOSURES
+	bool
+
 config HAS_IOMEM
 	bool
 	depends on !NO_IOMEM
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ce51d4dc68..3ee25d5dae 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1637,6 +1637,15 @@ config DEBUG_NOTIFIERS
 	  This is a relatively cheap check but if you care about maximum
 	  performance, say N.
 
+config DEBUG_CLOSURES
+	bool "Debug closures (bcache async widgits)"
+	depends on CLOSURES
+	select DEBUG_FS
+	help
+	Keeps all active closures in a linked list and provides a debugfs
+	interface to list them, which makes it possible to see asynchronous
+	operations that get stuck.
+
 config BUG_ON_DATA_CORRUPTION
 	bool "Trigger a BUG when data corruption is detected"
 	select DEBUG_LIST
diff --git a/lib/Makefile b/lib/Makefile
index 876fcdeae3..7798910135 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -248,6 +248,8 @@ obj-$(CONFIG_ATOMIC64_SELFTEST) += atomic64_test.o
 
 obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o
 
+obj-$(CONFIG_CLOSURES) += closure.o
+
 obj-$(CONFIG_DQL) += dynamic_queue_limits.o
 
 obj-$(CONFIG_GLOB) += glob.o
diff --git a/drivers/md/bcache/closure.c b/lib/closure.c
similarity index 88%
rename from drivers/md/bcache/closure.c
rename to lib/closure.c
index d8d9394a6b..b38ded00b9 100644
--- a/drivers/md/bcache/closure.c
+++ b/lib/closure.c
@@ -6,13 +6,12 @@
  * Copyright 2012 Google, Inc.
  */
 
+#include <linux/closure.h>
 #include <linux/debugfs.h>
-#include <linux/module.h>
+#include <linux/export.h>
 #include <linux/seq_file.h>
 #include <linux/sched/debug.h>
 
-#include "closure.h"
-
 static inline void closure_put_after_sub(struct closure *cl, int flags)
 {
 	int r = flags & CLOSURE_REMAINING_MASK;
@@ -45,6 +44,7 @@ void closure_sub(struct closure *cl, int v)
 {
 	closure_put_after_sub(cl, atomic_sub_return(v, &cl->remaining));
 }
+EXPORT_SYMBOL(closure_sub);
 
 /*
  * closure_put - decrement a closure's refcount
@@ -53,6 +53,7 @@ void closure_put(struct closure *cl)
 {
 	closure_put_after_sub(cl, atomic_dec_return(&cl->remaining));
 }
+EXPORT_SYMBOL(closure_put);
 
 /*
  * closure_wake_up - wake up all closures on a wait list, without memory barrier
@@ -74,6 +75,7 @@ void __closure_wake_up(struct closure_waitlist *wait_list)
 		closure_sub(cl, CLOSURE_WAITING + 1);
 	}
 }
+EXPORT_SYMBOL(__closure_wake_up);
 
 /**
  * closure_wait - add a closure to a waitlist
@@ -93,6 +95,7 @@ bool closure_wait(struct closure_waitlist *waitlist, struct closure *cl)
 
 	return true;
 }
+EXPORT_SYMBOL(closure_wait);
 
 struct closure_syncer {
 	struct task_struct	*task;
@@ -127,8 +130,9 @@ void __sched __closure_sync(struct closure *cl)
 
 	__set_current_state(TASK_RUNNING);
 }
+EXPORT_SYMBOL(__closure_sync);
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 
 static LIST_HEAD(closure_list);
 static DEFINE_SPINLOCK(closure_list_lock);
@@ -144,6 +148,7 @@ void closure_debug_create(struct closure *cl)
 	list_add(&cl->all, &closure_list);
 	spin_unlock_irqrestore(&closure_list_lock, flags);
 }
+EXPORT_SYMBOL(closure_debug_create);
 
 void closure_debug_destroy(struct closure *cl)
 {
@@ -156,8 +161,7 @@ void closure_debug_destroy(struct closure *cl)
 	list_del(&cl->all);
 	spin_unlock_irqrestore(&closure_list_lock, flags);
 }
-
-static struct dentry *closure_debug;
+EXPORT_SYMBOL(closure_debug_destroy);
 
 static int debug_show(struct seq_file *f, void *data)
 {
@@ -181,7 +185,7 @@ static int debug_show(struct seq_file *f, void *data)
 			seq_printf(f, " W %pS\n",
 				   (void *) cl->waiting_on);
 
-		seq_printf(f, "\n");
+		seq_puts(f, "\n");
 	}
 
 	spin_unlock_irq(&closure_list_lock);
@@ -190,18 +194,11 @@ static int debug_show(struct seq_file *f, void *data)
 
 DEFINE_SHOW_ATTRIBUTE(debug);
 
-void  __init closure_debug_init(void)
+static int __init closure_debug_init(void)
 {
-	if (!IS_ERR_OR_NULL(bcache_debug))
-		/*
-		 * it is unnecessary to check return value of
-		 * debugfs_create_file(), we should not care
-		 * about this.
-		 */
-		closure_debug = debugfs_create_file(
-			"closures", 0400, bcache_debug, NULL, &debug_fops);
+	debugfs_create_file("closures", 0400, NULL, NULL, &debug_fops);
+	return 0;
 }
-#endif
+late_initcall(closure_debug_init)
 
-MODULE_AUTHOR("Kent Overstreet <koverstreet@google.com>");
-MODULE_LICENSE("GPL");
+#endif
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 13/20] MAINTAINERS: Add entry for closures
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (11 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 12/20] bcache: move closures to lib/ Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-07-12 21:11 ` [PATCH 14/20] closures: closure_wait_event() Kent Overstreet
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel; +Cc: Kent Overstreet, Coly Li

closures, from bcache, are async widgets with a variety of uses.
bcachefs also uses them, so they're being moved to lib/; mark them as
maintained.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 MAINTAINERS | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 35e1959464..314b55ecd1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5064,6 +5064,14 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
 F:	Documentation/devicetree/bindings/timer/
 F:	drivers/clocksource/
 
+CLOSURES
+M:	Kent Overstreet <kent.overstreet@linux.dev>
+L:	linux-bcachefs@vger.kernel.org
+S:	Supported
+C:	irc://irc.oftc.net/bcache
+F:	include/linux/closure.h
+F:	lib/closure.c
+
 CMPC ACPI DRIVER
 M:	Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
 M:	Daniel Oliveira Nascimento <don@syst.com.br>
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 14/20] closures: closure_wait_event()
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (12 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 13/20] MAINTAINERS: Add entry for closures Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-07-12 21:11 ` [PATCH 15/20] closures: closure_nr_remaining() Kent Overstreet
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Coly Li, Kent Overstreet

From: Kent Overstreet <kent.overstreet@gmail.com>

Like wait_event() - except, because it uses closures and closure
waitlists it doesn't have the restriction on modifying task state inside
the condition check, like wait_event() does.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/closure.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/closure.h b/include/linux/closure.h
index 0ec9e7bc8d..36b4a83f9b 100644
--- a/include/linux/closure.h
+++ b/include/linux/closure.h
@@ -374,4 +374,26 @@ static inline void closure_call(struct closure *cl, closure_fn fn,
 	continue_at_nobarrier(cl, fn, wq);
 }
 
+#define __closure_wait_event(waitlist, _cond)				\
+do {									\
+	struct closure cl;						\
+									\
+	closure_init_stack(&cl);					\
+									\
+	while (1) {							\
+		closure_wait(waitlist, &cl);				\
+		if (_cond)						\
+			break;						\
+		closure_sync(&cl);					\
+	}								\
+	closure_wake_up(waitlist);					\
+	closure_sync(&cl);						\
+} while (0)
+
+#define closure_wait_event(waitlist, _cond)				\
+do {									\
+	if (!(_cond))							\
+		__closure_wait_event(waitlist, _cond);			\
+} while (0)
+
 #endif /* _LINUX_CLOSURE_H */
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 15/20] closures: closure_nr_remaining()
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (13 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 14/20] closures: closure_wait_event() Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-07-12 21:11 ` [PATCH 16/20] closures: Add a missing include Kent Overstreet
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel; +Cc: Kent Overstreet

Factor out a new helper, which returns the number of events outstanding.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/closure.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/closure.h b/include/linux/closure.h
index 36b4a83f9b..722a586bb2 100644
--- a/include/linux/closure.h
+++ b/include/linux/closure.h
@@ -172,6 +172,11 @@ void __closure_wake_up(struct closure_waitlist *list);
 bool closure_wait(struct closure_waitlist *list, struct closure *cl);
 void __closure_sync(struct closure *cl);
 
+static inline unsigned closure_nr_remaining(struct closure *cl)
+{
+	return atomic_read(&cl->remaining) & CLOSURE_REMAINING_MASK;
+}
+
 /**
  * closure_sync - sleep until a closure a closure has nothing left to wait on
  *
@@ -180,7 +185,7 @@ void __closure_sync(struct closure *cl);
  */
 static inline void closure_sync(struct closure *cl)
 {
-	if ((atomic_read(&cl->remaining) & CLOSURE_REMAINING_MASK) != 1)
+	if (closure_nr_remaining(cl) != 1)
 		__closure_sync(cl);
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 16/20] closures: Add a missing include
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (14 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 15/20] closures: closure_nr_remaining() Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-07-12 21:11 ` [PATCH 17/20] MAINTAINERS: Add entry for generic-radix-tree Kent Overstreet
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel; +Cc: Kent Overstreet

Fixes building in userspace.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 lib/closure.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/closure.c b/lib/closure.c
index b38ded00b9..0855e698ce 100644
--- a/lib/closure.c
+++ b/lib/closure.c
@@ -9,6 +9,7 @@
 #include <linux/closure.h>
 #include <linux/debugfs.h>
 #include <linux/export.h>
+#include <linux/rcupdate.h>
 #include <linux/seq_file.h>
 #include <linux/sched/debug.h>
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 17/20] MAINTAINERS: Add entry for generic-radix-tree
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (15 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 16/20] closures: Add a missing include Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-07-12 21:11 ` [PATCH 18/20] lib/generic-radix-tree.c: Don't overflow in peek() Kent Overstreet
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel; +Cc: Kent Overstreet

lib/generic-radix-tree.c is a simple radix tree that supports storing
arbitrary types. Add a maintainers entry for it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 MAINTAINERS | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 314b55ecd1..c3fd29247b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8670,6 +8670,13 @@ F:	Documentation/devicetree/bindings/power/power?domain*
 F:	drivers/base/power/domain*.c
 F:	include/linux/pm_domain.h
 
+GENERIC RADIX TREE
+M:	Kent Overstreet <kent.overstreet@linux.dev>
+S:	Supported
+C:	irc://irc.oftc.net/bcache
+F:	include/linux/generic-radix-tree.h
+F:	lib/generic-radix-tree.c
+
 GENERIC RESISTIVE TOUCHSCREEN ADC DRIVER
 M:	Eugen Hristev <eugen.hristev@microchip.com>
 L:	linux-input@vger.kernel.org
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 18/20] lib/generic-radix-tree.c: Don't overflow in peek()
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (16 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 17/20] MAINTAINERS: Add entry for generic-radix-tree Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-07-12 21:11 ` [PATCH 19/20] lib/generic-radix-tree.c: Add a missing include Kent Overstreet
  2023-07-12 21:11 ` [PATCH 20/20] lib/generic-radix-tree.c: Add peek_prev() Kent Overstreet
  19 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Kent Overstreet

From: Kent Overstreet <kent.overstreet@gmail.com>

When we started spreading new inode numbers throughout most of the 64
bit inode space, that triggered some corner case bugs, in particular
some integer overflows related to the radix tree code. Oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/generic-radix-tree.h |  6 ++++++
 lib/generic-radix-tree.c           | 17 ++++++++++++++---
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
index 107613f7d7..63080822dc 100644
--- a/include/linux/generic-radix-tree.h
+++ b/include/linux/generic-radix-tree.h
@@ -184,6 +184,12 @@ void *__genradix_iter_peek(struct genradix_iter *, struct __genradix *, size_t);
 static inline void __genradix_iter_advance(struct genradix_iter *iter,
 					   size_t obj_size)
 {
+	if (iter->offset + obj_size < iter->offset) {
+		iter->offset	= SIZE_MAX;
+		iter->pos	= SIZE_MAX;
+		return;
+	}
+
 	iter->offset += obj_size;
 
 	if (!is_power_of_2(obj_size) &&
diff --git a/lib/generic-radix-tree.c b/lib/generic-radix-tree.c
index f25eb111c0..7dfa88282b 100644
--- a/lib/generic-radix-tree.c
+++ b/lib/generic-radix-tree.c
@@ -166,6 +166,10 @@ void *__genradix_iter_peek(struct genradix_iter *iter,
 	struct genradix_root *r;
 	struct genradix_node *n;
 	unsigned level, i;
+
+	if (iter->offset == SIZE_MAX)
+		return NULL;
+
 restart:
 	r = READ_ONCE(radix->root);
 	if (!r)
@@ -184,10 +188,17 @@ void *__genradix_iter_peek(struct genradix_iter *iter,
 			(GENRADIX_ARY - 1);
 
 		while (!n->children[i]) {
+			size_t objs_per_ptr = genradix_depth_size(level);
+
+			if (iter->offset + objs_per_ptr < iter->offset) {
+				iter->offset	= SIZE_MAX;
+				iter->pos	= SIZE_MAX;
+				return NULL;
+			}
+
 			i++;
-			iter->offset = round_down(iter->offset +
-					   genradix_depth_size(level),
-					   genradix_depth_size(level));
+			iter->offset = round_down(iter->offset + objs_per_ptr,
+						  objs_per_ptr);
 			iter->pos = (iter->offset >> PAGE_SHIFT) *
 				objs_per_page;
 			if (i == GENRADIX_ARY)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 19/20] lib/generic-radix-tree.c: Add a missing include
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (17 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 18/20] lib/generic-radix-tree.c: Don't overflow in peek() Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  2023-07-25  3:04   ` Matthew Wilcox
  2023-07-12 21:11 ` [PATCH 20/20] lib/generic-radix-tree.c: Add peek_prev() Kent Overstreet
  19 siblings, 1 reply; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Kent Overstreet

From: Kent Overstreet <kent.overstreet@gmail.com>

We now need linux/limits.h for SIZE_MAX.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/generic-radix-tree.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
index 63080822dc..f6cd0f909d 100644
--- a/include/linux/generic-radix-tree.h
+++ b/include/linux/generic-radix-tree.h
@@ -38,6 +38,7 @@
 
 #include <asm/page.h>
 #include <linux/bug.h>
+#include <linux/limits.h>
 #include <linux/log2.h>
 #include <linux/math.h>
 #include <linux/types.h>
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 20/20] lib/generic-radix-tree.c: Add peek_prev()
  2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
                   ` (18 preceding siblings ...)
  2023-07-12 21:11 ` [PATCH 19/20] lib/generic-radix-tree.c: Add a missing include Kent Overstreet
@ 2023-07-12 21:11 ` Kent Overstreet
  19 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-12 21:11 UTC (permalink / raw)
  To: linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Kent Overstreet

From: Kent Overstreet <kent.overstreet@gmail.com>

This patch adds genradix_peek_prev(), genradix_iter_rewind(), and
genradix_for_each_reverse(), for iterating backwards over a generic
radix tree.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/generic-radix-tree.h | 61 +++++++++++++++++++++++++++++-
 lib/generic-radix-tree.c           | 59 +++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+), 1 deletion(-)

diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
index f6cd0f909d..c74b737699 100644
--- a/include/linux/generic-radix-tree.h
+++ b/include/linux/generic-radix-tree.h
@@ -117,6 +117,11 @@ static inline size_t __idx_to_offset(size_t idx, size_t obj_size)
 
 #define __genradix_cast(_radix)		(typeof((_radix)->type[0]) *)
 #define __genradix_obj_size(_radix)	sizeof((_radix)->type[0])
+#define __genradix_objs_per_page(_radix)			\
+	(PAGE_SIZE / sizeof((_radix)->type[0]))
+#define __genradix_page_remainder(_radix)			\
+	(PAGE_SIZE % sizeof((_radix)->type[0]))
+
 #define __genradix_idx_to_offset(_radix, _idx)			\
 	__idx_to_offset(_idx, __genradix_obj_size(_radix))
 
@@ -180,7 +185,25 @@ void *__genradix_iter_peek(struct genradix_iter *, struct __genradix *, size_t);
 #define genradix_iter_peek(_iter, _radix)			\
 	(__genradix_cast(_radix)				\
 	 __genradix_iter_peek(_iter, &(_radix)->tree,		\
-			      PAGE_SIZE / __genradix_obj_size(_radix)))
+			__genradix_objs_per_page(_radix)))
+
+void *__genradix_iter_peek_prev(struct genradix_iter *, struct __genradix *,
+				size_t, size_t);
+
+/**
+ * genradix_iter_peek - get first entry at or below iterator's current
+ *			position
+ * @_iter:	a genradix_iter
+ * @_radix:	genradix being iterated over
+ *
+ * If no more entries exist at or below @_iter's current position, returns NULL
+ */
+#define genradix_iter_peek_prev(_iter, _radix)			\
+	(__genradix_cast(_radix)				\
+	 __genradix_iter_peek_prev(_iter, &(_radix)->tree,	\
+			__genradix_objs_per_page(_radix),	\
+			__genradix_obj_size(_radix) +		\
+			__genradix_page_remainder(_radix)))
 
 static inline void __genradix_iter_advance(struct genradix_iter *iter,
 					   size_t obj_size)
@@ -203,6 +226,25 @@ static inline void __genradix_iter_advance(struct genradix_iter *iter,
 #define genradix_iter_advance(_iter, _radix)			\
 	__genradix_iter_advance(_iter, __genradix_obj_size(_radix))
 
+static inline void __genradix_iter_rewind(struct genradix_iter *iter,
+					  size_t obj_size)
+{
+	if (iter->offset == 0 ||
+	    iter->offset == SIZE_MAX) {
+		iter->offset = SIZE_MAX;
+		return;
+	}
+
+	if ((iter->offset & (PAGE_SIZE - 1)) == 0)
+		iter->offset -= PAGE_SIZE % obj_size;
+
+	iter->offset -= obj_size;
+	iter->pos--;
+}
+
+#define genradix_iter_rewind(_iter, _radix)			\
+	__genradix_iter_rewind(_iter, __genradix_obj_size(_radix))
+
 #define genradix_for_each_from(_radix, _iter, _p, _start)	\
 	for (_iter = genradix_iter_init(_radix, _start);	\
 	     (_p = genradix_iter_peek(&_iter, _radix)) != NULL;	\
@@ -220,6 +262,23 @@ static inline void __genradix_iter_advance(struct genradix_iter *iter,
 #define genradix_for_each(_radix, _iter, _p)			\
 	genradix_for_each_from(_radix, _iter, _p, 0)
 
+#define genradix_last_pos(_radix)				\
+	(SIZE_MAX / PAGE_SIZE * __genradix_objs_per_page(_radix) - 1)
+
+/**
+ * genradix_for_each_reverse - iterate over entry in a genradix, reverse order
+ * @_radix:	genradix to iterate over
+ * @_iter:	a genradix_iter to track current position
+ * @_p:		pointer to genradix entry type
+ *
+ * On every iteration, @_p will point to the current entry, and @_iter.pos
+ * will be the current entry's index.
+ */
+#define genradix_for_each_reverse(_radix, _iter, _p)		\
+	for (_iter = genradix_iter_init(_radix,	genradix_last_pos(_radix));\
+	     (_p = genradix_iter_peek_prev(&_iter, _radix)) != NULL;\
+	     genradix_iter_rewind(&_iter, _radix))
+
 int __genradix_prealloc(struct __genradix *, size_t, gfp_t);
 
 /**
diff --git a/lib/generic-radix-tree.c b/lib/generic-radix-tree.c
index 7dfa88282b..41f1bcdc44 100644
--- a/lib/generic-radix-tree.c
+++ b/lib/generic-radix-tree.c
@@ -1,4 +1,5 @@
 
+#include <linux/atomic.h>
 #include <linux/export.h>
 #include <linux/generic-radix-tree.h>
 #include <linux/gfp.h>
@@ -212,6 +213,64 @@ void *__genradix_iter_peek(struct genradix_iter *iter,
 }
 EXPORT_SYMBOL(__genradix_iter_peek);
 
+void *__genradix_iter_peek_prev(struct genradix_iter *iter,
+				struct __genradix *radix,
+				size_t objs_per_page,
+				size_t obj_size_plus_page_remainder)
+{
+	struct genradix_root *r;
+	struct genradix_node *n;
+	unsigned level, i;
+
+	if (iter->offset == SIZE_MAX)
+		return NULL;
+
+restart:
+	r = READ_ONCE(radix->root);
+	if (!r)
+		return NULL;
+
+	n	= genradix_root_to_node(r);
+	level	= genradix_root_to_depth(r);
+
+	if (ilog2(iter->offset) >= genradix_depth_shift(level)) {
+		iter->offset = genradix_depth_size(level);
+		iter->pos = (iter->offset >> PAGE_SHIFT) * objs_per_page;
+
+		iter->offset -= obj_size_plus_page_remainder;
+		iter->pos--;
+	}
+
+	while (level) {
+		level--;
+
+		i = (iter->offset >> genradix_depth_shift(level)) &
+			(GENRADIX_ARY - 1);
+
+		while (!n->children[i]) {
+			size_t objs_per_ptr = genradix_depth_size(level);
+
+			iter->offset = round_down(iter->offset, objs_per_ptr);
+			iter->pos = (iter->offset >> PAGE_SHIFT) * objs_per_page;
+
+			if (!iter->offset)
+				return NULL;
+
+			iter->offset -= obj_size_plus_page_remainder;
+			iter->pos--;
+
+			if (!i)
+				goto restart;
+			--i;
+		}
+
+		n = n->children[i];
+	}
+
+	return &n->data[iter->offset & (PAGE_SIZE - 1)];
+}
+EXPORT_SYMBOL(__genradix_iter_peek_prev);
+
 static void genradix_free_recurse(struct genradix_node *n, unsigned level)
 {
 	if (level) {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/20] bcache: move closures to lib/
  2023-07-12 21:11 ` [PATCH 12/20] bcache: move closures to lib/ Kent Overstreet
@ 2023-07-13  3:21   ` Randy Dunlap
  2023-07-13  3:52     ` Kent Overstreet
  0 siblings, 1 reply; 47+ messages in thread
From: Randy Dunlap @ 2023-07-13  3:21 UTC (permalink / raw)
  To: Kent Overstreet, linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Coly Li

Hi,

LGTM.
I have a couple of small nits below...

On 7/12/23 14:11, Kent Overstreet wrote:
> From: Kent Overstreet <kent.overstreet@gmail.com>
> 
> Prep work for bcachefs - being a fork of bcache it also uses closures
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> Acked-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/Kconfig                     | 10 +-----
>  drivers/md/bcache/Makefile                    |  4 +--
>  drivers/md/bcache/bcache.h                    |  2 +-
>  drivers/md/bcache/super.c                     |  1 -
>  drivers/md/bcache/util.h                      |  3 +-
>  .../md/bcache => include/linux}/closure.h     | 17 +++++----
>  lib/Kconfig                                   |  3 ++
>  lib/Kconfig.debug                             |  9 +++++
>  lib/Makefile                                  |  2 ++
>  {drivers/md/bcache => lib}/closure.c          | 35 +++++++++----------
>  10 files changed, 43 insertions(+), 43 deletions(-)
>  rename {drivers/md/bcache => include/linux}/closure.h (97%)
>  rename {drivers/md/bcache => lib}/closure.c (88%)
> 

> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index ce51d4dc68..3ee25d5dae 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1637,6 +1637,15 @@ config DEBUG_NOTIFIERS
>  	  This is a relatively cheap check but if you care about maximum
>  	  performance, say N.
>  
> +config DEBUG_CLOSURES
> +	bool "Debug closures (bcache async widgits)"

	                                   widgets

> +	depends on CLOSURES
> +	select DEBUG_FS
> +	help
> +	Keeps all active closures in a linked list and provides a debugfs
> +	interface to list them, which makes it possible to see asynchronous
> +	operations that get stuck.

Indent those 3 help text lines with 2 additional spaces, please,
as documented and as is done in (most of) the rest of this file.

With those fixed:

Reviewed-by: Randy Dunlap <rdunlap@infradead.org>

thanks.
-- 
~Randy

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/20] bcache: move closures to lib/
  2023-07-13  3:21   ` Randy Dunlap
@ 2023-07-13  3:52     ` Kent Overstreet
  0 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-13  3:52 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Kent Overstreet, Coly Li

On Wed, Jul 12, 2023 at 08:21:05PM -0700, Randy Dunlap wrote:
> > From: Kent Overstreet <kent.overstreet@gmail.com>
> > +	depends on CLOSURES
> > +	select DEBUG_FS
> > +	help
> > +	Keeps all active closures in a linked list and provides a debugfs
> > +	interface to list them, which makes it possible to see asynchronous
> > +	operations that get stuck.
> 
> Indent those 3 help text lines with 2 additional spaces, please,
> as documented and as is done in (most of) the rest of this file.
> 
> With those fixed:
> 
> Reviewed-by: Randy Dunlap <rdunlap@infradead.org>

Ack

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/20] lib: Export errname
  2023-07-12 21:11 ` [PATCH 10/20] lib: Export errname Kent Overstreet
@ 2023-07-13  7:10   ` Eric Biggers
  0 siblings, 0 replies; 47+ messages in thread
From: Eric Biggers @ 2023-07-13  7:10 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel,
	Christopher James Halse Rogers

On Wed, Jul 12, 2023 at 05:11:05PM -0400, Kent Overstreet wrote:
> errname() returns the name of an errcode; this functionality is
> otherwise only available for error pointers via %pE - bcachefs uses this
> for better error messages.

Interesting that this exists!  It seems you meant %pe, not %pE, though.

- Eric

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 04/20] block: Add some exports for bcachefs
  2023-07-12 21:10 ` [PATCH 04/20] block: Add some exports for bcachefs Kent Overstreet
@ 2023-07-24 17:31   ` Christoph Hellwig
  2023-07-25  3:00     ` Kent Overstreet
  2023-07-25  2:59   ` Matthew Wilcox
  1 sibling, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2023-07-24 17:31 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Kent Overstreet,
	linux-block, Jens Axboe

On Wed, Jul 12, 2023 at 05:10:59PM -0400, Kent Overstreet wrote:
> From: Kent Overstreet <kent.overstreet@gmail.com>
> 
>  - bio_set_pages_dirty(), bio_check_pages_dirty() - dio path

Why?  We've so far have been able to get away without file systems
reinventing their own DIO path.  I'd really like to keep it that way,
so if you think the iomap dio code should be improved please explain
why.  And please also cycle the fsdevel list in.

>  - blk_status_to_str() - error messages
>  - bio_add_folio() - this should definitely be exported for everyone,
>    it's the modern version of bio_add_page()

These look ok to me to go in when the actual user shows up.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 05/20] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
  2023-07-12 21:11 ` [PATCH 05/20] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset Kent Overstreet
@ 2023-07-24 17:34   ` Christoph Hellwig
  2023-07-25  2:43     ` Kent Overstreet
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2023-07-24 17:34 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Jens Axboe, linux-block

On Wed, Jul 12, 2023 at 05:11:00PM -0400, Kent Overstreet wrote:
> bio_iov_iter_get_pages() trims the IO based on the block size of the
> block device the IO will be issued to.
> 
> However, bcachefs is a multi device filesystem; when we're creating the
> bio we don't yet know which block device the bio will be submitted to -
> we have to handle the alignment checks elsewhere.

So, we've been trying really hard to always make sure to pass a bdev
to anything that allocates a bio, mostly due due the fact that we
actually derive information like the blk-cgroup associations from it.

The whole blk-cgroup stuff is actually a problem for non-trivial
multi-device setups.  XFS gets away fine because each file just
sits on either the main or RT device and no user I/O goes to the
log device, and btrfs papers over it in a weird way by always
associating with the last added device, which is in many ways gross
and wrong, but at least satisfies the assumptions made in blk-cgroup.

How do you plan to deal with this?  Because I really don't want folks
just to go ahead and ignore the issues, we need to actually sort this
out.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/20] block: Bring back zero_fill_bio_iter
  2023-07-12 21:11 ` [PATCH 06/20] block: Bring back zero_fill_bio_iter Kent Overstreet
@ 2023-07-24 17:35   ` Christoph Hellwig
  2023-07-25  2:45     ` Kent Overstreet
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2023-07-24 17:35 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Kent Overstreet,
	Jens Axboe, linux-block

On Wed, Jul 12, 2023 at 05:11:01PM -0400, Kent Overstreet wrote:
> From: Kent Overstreet <kent.overstreet@gmail.com>
> 
> This reverts 6f822e1b5d9dda3d20e87365de138046e3baa03a - this helper is
> used by bcachefs.
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: linux-block@vger.kernel.org
> ---
>  block/bio.c         | 6 +++---
>  include/linux/bio.h | 7 ++++++-
>  2 files changed, 9 insertions(+), 4 deletions(-)

I really don't see any point in offering this in the block layer.  By
the lack of any other caller it obviously isn't such a generic and
always used helper, but more importantly it literally is three lines
of code to implement it.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 05/20] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
  2023-07-24 17:34   ` Christoph Hellwig
@ 2023-07-25  2:43     ` Kent Overstreet
  2023-07-26 13:23       ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Kent Overstreet @ 2023-07-25  2:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Jens Axboe, linux-block

On Mon, Jul 24, 2023 at 10:34:20AM -0700, Christoph Hellwig wrote:
> On Wed, Jul 12, 2023 at 05:11:00PM -0400, Kent Overstreet wrote:
> > bio_iov_iter_get_pages() trims the IO based on the block size of the
> > block device the IO will be issued to.
> > 
> > However, bcachefs is a multi device filesystem; when we're creating the
> > bio we don't yet know which block device the bio will be submitted to -
> > we have to handle the alignment checks elsewhere.
> 
> So, we've been trying really hard to always make sure to pass a bdev
> to anything that allocates a bio, mostly due due the fact that we
> actually derive information like the blk-cgroup associations from it.
> 
> The whole blk-cgroup stuff is actually a problem for non-trivial
> multi-device setups.  XFS gets away fine because each file just
> sits on either the main or RT device and no user I/O goes to the
> log device, and btrfs papers over it in a weird way by always
> associating with the last added device, which is in many ways gross
> and wrong, but at least satisfies the assumptions made in blk-cgroup.
> 
> How do you plan to deal with this?  Because I really don't want folks
> just to go ahead and ignore the issues, we need to actually sort this
> out.

Doing the blk-cgroup association at bio alloc time sounds broken to me,
because of stacking block devices - why was the association not done at
generic_make_request() time?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/20] block: Bring back zero_fill_bio_iter
  2023-07-24 17:35   ` Christoph Hellwig
@ 2023-07-25  2:45     ` Kent Overstreet
  2023-07-26 13:21       ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Kent Overstreet @ 2023-07-25  2:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Kent Overstreet,
	Jens Axboe, linux-block

On Mon, Jul 24, 2023 at 10:35:45AM -0700, Christoph Hellwig wrote:
> On Wed, Jul 12, 2023 at 05:11:01PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <kent.overstreet@gmail.com>
> > 
> > This reverts 6f822e1b5d9dda3d20e87365de138046e3baa03a - this helper is
> > used by bcachefs.
> > 
> > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > Cc: Jens Axboe <axboe@kernel.dk>
> > Cc: linux-block@vger.kernel.org
> > ---
> >  block/bio.c         | 6 +++---
> >  include/linux/bio.h | 7 ++++++-
> >  2 files changed, 9 insertions(+), 4 deletions(-)
> 
> I really don't see any point in offering this in the block layer.  By
> the lack of any other caller it obviously isn't such a generic and
> always used helper, but more importantly it literally is three lines
> of code to implement it.

And yet, we've had a subtle bug introduced in that code that took quite
awhile to be fixed - I'm not pro code duplication in general and I don't
think this is a good place to start.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 04/20] block: Add some exports for bcachefs
  2023-07-12 21:10 ` [PATCH 04/20] block: Add some exports for bcachefs Kent Overstreet
  2023-07-24 17:31   ` Christoph Hellwig
@ 2023-07-25  2:59   ` Matthew Wilcox
  1 sibling, 0 replies; 47+ messages in thread
From: Matthew Wilcox @ 2023-07-25  2:59 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Kent Overstreet,
	linux-block, Jens Axboe

On Wed, Jul 12, 2023 at 05:10:59PM -0400, Kent Overstreet wrote:
>  - bio_add_folio() - this should definitely be exported for everyone,
>    it's the modern version of bio_add_page()

Looks like this one got added in cd57b77197a4, so it just needs to be
dropped from the changelog.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 04/20] block: Add some exports for bcachefs
  2023-07-24 17:31   ` Christoph Hellwig
@ 2023-07-25  3:00     ` Kent Overstreet
  2023-07-26 13:20       ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Kent Overstreet @ 2023-07-25  3:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Kent Overstreet,
	linux-block, Jens Axboe

On Mon, Jul 24, 2023 at 10:31:04AM -0700, Christoph Hellwig wrote:
> On Wed, Jul 12, 2023 at 05:10:59PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <kent.overstreet@gmail.com>
> > 
> >  - bio_set_pages_dirty(), bio_check_pages_dirty() - dio path
> 
> Why?  We've so far have been able to get away without file systems
> reinventing their own DIO path.  I'd really like to keep it that way,
> so if you think the iomap dio code should be improved please explain
> why.  And please also cycle the fsdevel list in.

It's been discussed at length why bcachefs doesn't use iomap.

In short, iomap is heavily callback based, the bcachefs io paths are
not - we pass around data structures instead. I discussed this with
people when iomap was first being written, but iomap ended up being a
much more conservative approach, more in line with the old buffer heads
code where the generic code calls into the filesystem to obtain
mappings.

I'm gradually convincing people of the merits of the bcachefs approach -
in particular reducing indirect function calls is getting more attention
these days.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 19/20] lib/generic-radix-tree.c: Add a missing include
  2023-07-12 21:11 ` [PATCH 19/20] lib/generic-radix-tree.c: Add a missing include Kent Overstreet
@ 2023-07-25  3:04   ` Matthew Wilcox
  2023-07-25  3:36     ` Kent Overstreet
  0 siblings, 1 reply; 47+ messages in thread
From: Matthew Wilcox @ 2023-07-25  3:04 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Kent Overstreet

On Wed, Jul 12, 2023 at 05:11:14PM -0400, Kent Overstreet wrote:
> From: Kent Overstreet <kent.overstreet@gmail.com>
> 
> We now need linux/limits.h for SIZE_MAX.

I think this should be squashed into the previous commit

> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> ---
>  include/linux/generic-radix-tree.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
> index 63080822dc..f6cd0f909d 100644
> --- a/include/linux/generic-radix-tree.h
> +++ b/include/linux/generic-radix-tree.h
> @@ -38,6 +38,7 @@
>  
>  #include <asm/page.h>
>  #include <linux/bug.h>
> +#include <linux/limits.h>
>  #include <linux/log2.h>
>  #include <linux/math.h>
>  #include <linux/types.h>
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 19/20] lib/generic-radix-tree.c: Add a missing include
  2023-07-25  3:04   ` Matthew Wilcox
@ 2023-07-25  3:36     ` Kent Overstreet
  0 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-07-25  3:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Kent Overstreet

On Tue, Jul 25, 2023 at 04:04:06AM +0100, Matthew Wilcox wrote:
> On Wed, Jul 12, 2023 at 05:11:14PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <kent.overstreet@gmail.com>
> > 
> > We now need linux/limits.h for SIZE_MAX.
> 
> I think this should be squashed into the previous commit

No objection...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 04/20] block: Add some exports for bcachefs
  2023-07-25  3:00     ` Kent Overstreet
@ 2023-07-26 13:20       ` Christoph Hellwig
  2023-08-01 18:59         ` Kent Overstreet
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2023-07-26 13:20 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, linux-bcachefs, linux-fsdevel, linux-kernel,
	Kent Overstreet, linux-block, Jens Axboe

On Mon, Jul 24, 2023 at 11:00:37PM -0400, Kent Overstreet wrote:
> In short, iomap is heavily callback based, the bcachefs io paths are
> not - we pass around data structures instead. I discussed this with
> people when iomap was first being written, but iomap ended up being a
> much more conservative approach, more in line with the old buffer heads
> code where the generic code calls into the filesystem to obtain
> mappings.
> 
> I'm gradually convincing people of the merits of the bcachefs approach -
> in particular reducing indirect function calls is getting more attention
> these days.

FYI, Matthew has had patches that convert iomap to be an iterator,
and I've massage the first half of them and actuall got them in
before.  I'd much rather finish off that work (even if only for
direct I/O initially) than adding another direct I/O code.  But
even with out that we should be able to easily pass more private
data, in fact btrfs makes pretty heavy use of that.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/20] block: Bring back zero_fill_bio_iter
  2023-07-25  2:45     ` Kent Overstreet
@ 2023-07-26 13:21       ` Christoph Hellwig
  2023-08-01 19:05         ` Kent Overstreet
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2023-07-26 13:21 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, linux-bcachefs, linux-fsdevel, linux-kernel,
	Kent Overstreet, Jens Axboe, linux-block

On Mon, Jul 24, 2023 at 10:45:53PM -0400, Kent Overstreet wrote:
> And yet, we've had a subtle bug introduced in that code that took quite
> awhile to be fixed - I'm not pro code duplication in general and I don't
> think this is a good place to start.

I'm not sure arguing for adding a helper your can triviall implement
yourself really helps to streamline your upstreaming process.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 05/20] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
  2023-07-25  2:43     ` Kent Overstreet
@ 2023-07-26 13:23       ` Christoph Hellwig
  2023-08-01 19:04         ` Kent Overstreet
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2023-07-26 13:23 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, linux-bcachefs, linux-fsdevel, linux-kernel,
	Jens Axboe, linux-block

On Mon, Jul 24, 2023 at 10:43:12PM -0400, Kent Overstreet wrote:
> Doing the blk-cgroup association at bio alloc time sounds broken to me,
> because of stacking block devices - why was the association not done at
> generic_make_request() time?

Because blk-cgroup not only works at the lowest level in the stack,
but also for stackable block devices.  It's not a design decision I
particularly agree with, but it's been there forever.

So we need to assign it when creating the bio (we used to do it at
submission time, but the way it was done was horribly ineffcient,
that's why I moved it to allocation time), and then when hitting a
stacked device it get reassinged (which still is horribly inefficient).

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 04/20] block: Add some exports for bcachefs
  2023-07-26 13:20       ` Christoph Hellwig
@ 2023-08-01 18:59         ` Kent Overstreet
  0 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-08-01 18:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Kent Overstreet,
	linux-block, Jens Axboe

On Wed, Jul 26, 2023 at 06:20:42AM -0700, Christoph Hellwig wrote:
> On Mon, Jul 24, 2023 at 11:00:37PM -0400, Kent Overstreet wrote:
> > In short, iomap is heavily callback based, the bcachefs io paths are
> > not - we pass around data structures instead. I discussed this with
> > people when iomap was first being written, but iomap ended up being a
> > much more conservative approach, more in line with the old buffer heads
> > code where the generic code calls into the filesystem to obtain
> > mappings.
> > 
> > I'm gradually convincing people of the merits of the bcachefs approach -
> > in particular reducing indirect function calls is getting more attention
> > these days.
> 
> FYI, Matthew has had patches that convert iomap to be an iterator,
> and I've massage the first half of them and actuall got them in
> before.  I'd much rather finish off that work (even if only for
> direct I/O initially) than adding another direct I/O code.  But
> even with out that we should be able to easily pass more private
> data, in fact btrfs makes pretty heavy use of that.

That's wonderful, but getting iomap up to the level of what bcachefs
needs is still going to be a pretty big project and it's not going to be
my highest priority.

bcachefs also hangs more state off of the pagecache, in bcachefs's
equivvalent of iomap_page - we store reservations for dirty data there
and a few other things, which means the buffered IO paths don't have to
walk any other data structures.

I think that's another idea you guys will want to steal, but a higher
priority for me is getting a proper FUSE port done - and making bcachefs
more tightly weddded to VFS library code is not likely to make that
process any easier.

Once a proper fuse port is done and we know what that looks like will be
a better time for some consolidation.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 05/20] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
  2023-07-26 13:23       ` Christoph Hellwig
@ 2023-08-01 19:04         ` Kent Overstreet
  2023-08-02 11:47           ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Kent Overstreet @ 2023-08-01 19:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Jens Axboe, linux-block

On Wed, Jul 26, 2023 at 06:23:05AM -0700, Christoph Hellwig wrote:
> On Mon, Jul 24, 2023 at 10:43:12PM -0400, Kent Overstreet wrote:
> > Doing the blk-cgroup association at bio alloc time sounds broken to me,
> > because of stacking block devices - why was the association not done at
> > generic_make_request() time?
> 
> Because blk-cgroup not only works at the lowest level in the stack,
> but also for stackable block devices.  It's not a design decision I
> particularly agree with, but it's been there forever.

You're setting the association only to the highest block device in the
stack - how on earth is it supposed to work with anything lower?

And looking at bio_associate_blkg(), this code looks completely broken.
It's checking bio->bi_blkg, but that's just been set to NULL in
bio_init().

And this is your code, so I think you need to go over this again.

Anyways, bio_associate_blkg() is also called by bio_set_dev(), which
means passing the block device to bio_init() was a completely pointless
change. bcachefs uses bio_set_dev(), so everything is fine.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/20] block: Bring back zero_fill_bio_iter
  2023-07-26 13:21       ` Christoph Hellwig
@ 2023-08-01 19:05         ` Kent Overstreet
  0 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-08-01 19:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Kent Overstreet,
	Jens Axboe, linux-block

On Wed, Jul 26, 2023 at 06:21:23AM -0700, Christoph Hellwig wrote:
> On Mon, Jul 24, 2023 at 10:45:53PM -0400, Kent Overstreet wrote:
> > And yet, we've had a subtle bug introduced in that code that took quite
> > awhile to be fixed - I'm not pro code duplication in general and I don't
> > think this is a good place to start.
> 
> I'm not sure arguing for adding a helper your can triviall implement
> yourself really helps to streamline your upstreaming process.

I gave you my engineering reasons, you're the one who's arguing.

And to make everything perfectly clear: this is code that I originally
wrote, and then you started changing without CCing me - your patch that
deleted zero_fill_bio_iter() never should've gone in.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 05/20] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
  2023-08-01 19:04         ` Kent Overstreet
@ 2023-08-02 11:47           ` Christoph Hellwig
  2023-08-02 16:44             ` Kent Overstreet
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2023-08-02 11:47 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, linux-bcachefs, linux-fsdevel, linux-kernel,
	Jens Axboe, linux-block

On Tue, Aug 01, 2023 at 03:04:50PM -0400, Kent Overstreet wrote:
> > Because blk-cgroup not only works at the lowest level in the stack,
> > but also for stackable block devices.  It's not a design decision I
> > particularly agree with, but it's been there forever.
> 
> You're setting the association only to the highest block device in the
> stack - how on earth is it supposed to work with anything lower?

Hey, ask the cgroup folks as they come up with it.  I'm not going to
defend the logic here.

> And looking at bio_associate_blkg(), this code looks completely broken.
> It's checking bio->bi_blkg, but that's just been set to NULL in
> bio_init().

It's checking bi_blkg because it can also be called from bio_set_dev.

> And this is your code, so I think you need to go over this again.

It's "my code" in the sene of that I did one big round of unwinding
the even bigger mess that was there.  There is another few rounds needed
for the code to vaguely make sense.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 05/20] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
  2023-08-02 11:47           ` Christoph Hellwig
@ 2023-08-02 16:44             ` Kent Overstreet
  0 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-08-02 16:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Jens Axboe, linux-block

On Wed, Aug 02, 2023 at 04:47:18AM -0700, Christoph Hellwig wrote:
> On Tue, Aug 01, 2023 at 03:04:50PM -0400, Kent Overstreet wrote:
> > > Because blk-cgroup not only works at the lowest level in the stack,
> > > but also for stackable block devices.  It's not a design decision I
> > > particularly agree with, but it's been there forever.
> > 
> > You're setting the association only to the highest block device in the
> > stack - how on earth is it supposed to work with anything lower?
> 
> Hey, ask the cgroup folks as they come up with it.  I'm not going to
> defend the logic here.
> 
> > And looking at bio_associate_blkg(), this code looks completely broken.
> > It's checking bio->bi_blkg, but that's just been set to NULL in
> > bio_init().
> 
> It's checking bi_blkg because it can also be called from bio_set_dev.

So bio_set_dev() has subtly different behaviour than passing the block
device to bio_init()?

That's just broken.

> 
> > And this is your code, so I think you need to go over this again.
> 
> It's "my code" in the sene of that I did one big round of unwinding
> the even bigger mess that was there.  There is another few rounds needed
> for the code to vaguely make sense.

Well, I'll watch for those patches then...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 11/20] locking/osq: Export osq_(lock|unlock)
  2023-07-12 21:11 ` [PATCH 11/20] locking/osq: Export osq_(lock|unlock) Kent Overstreet
@ 2023-08-02 20:16   ` Waiman Long
  2023-08-02 20:44     ` Kent Overstreet
  0 siblings, 1 reply; 47+ messages in thread
From: Waiman Long @ 2023-08-02 20:16 UTC (permalink / raw)
  To: Kent Overstreet, linux-bcachefs, linux-fsdevel, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Boqun Feng

On 7/12/23 17:11, Kent Overstreet wrote:
> These are used by bcachefs's six locks.
>
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> ---
>   kernel/locking/osq_lock.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad52b..b752ec5cc6 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -203,6 +203,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>   
>   	return false;
>   }
> +EXPORT_SYMBOL_GPL(osq_lock);
>   
>   void osq_unlock(struct optimistic_spin_queue *lock)
>   {
> @@ -230,3 +231,4 @@ void osq_unlock(struct optimistic_spin_queue *lock)
>   	if (next)
>   		WRITE_ONCE(next->locked, 1);
>   }
> +EXPORT_SYMBOL_GPL(osq_unlock);

Have you considered extending the current rw_semaphore to support a SIX 
lock semantics? There are a number of instances in the kernel that a 
up_read() is followed by a down_write(). Basically, the code try to 
upgrade the lock from read to write. I have been thinking about adding a 
upgrade_read() API to do that. However, the concern that I had was that 
another writer may come in and make modification before the reader can 
be upgraded to have exclusive write access and will make the task to 
repeat what has been done in the read lock part. By adding a read with 
intent to upgrade to write, we can have that guarantee.

With that said, I would prefer to keep osq_{lock/unlock} for internal 
use by some higher level locking primitives - mutex, rwsem and rt_mutex.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 11/20] locking/osq: Export osq_(lock|unlock)
  2023-08-02 20:16   ` Waiman Long
@ 2023-08-02 20:44     ` Kent Overstreet
  2023-08-02 21:09       ` Waiman Long
  0 siblings, 1 reply; 47+ messages in thread
From: Kent Overstreet @ 2023-08-02 20:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Boqun Feng

On Wed, Aug 02, 2023 at 04:16:12PM -0400, Waiman Long wrote:
> On 7/12/23 17:11, Kent Overstreet wrote:
> > These are used by bcachefs's six locks.
> > 
> > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Waiman Long <longman@redhat.com>
> > Cc: Boqun Feng <boqun.feng@gmail.com>
> > ---
> >   kernel/locking/osq_lock.c | 2 ++
> >   1 file changed, 2 insertions(+)
> > 
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index d5610ad52b..b752ec5cc6 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -203,6 +203,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> >   	return false;
> >   }
> > +EXPORT_SYMBOL_GPL(osq_lock);
> >   void osq_unlock(struct optimistic_spin_queue *lock)
> >   {
> > @@ -230,3 +231,4 @@ void osq_unlock(struct optimistic_spin_queue *lock)
> >   	if (next)
> >   		WRITE_ONCE(next->locked, 1);
> >   }
> > +EXPORT_SYMBOL_GPL(osq_unlock);
> 
> Have you considered extending the current rw_semaphore to support a SIX lock
> semantics? There are a number of instances in the kernel that a up_read() is
> followed by a down_write(). Basically, the code try to upgrade the lock from
> read to write. I have been thinking about adding a upgrade_read() API to do
> that. However, the concern that I had was that another writer may come in
> and make modification before the reader can be upgraded to have exclusive
> write access and will make the task to repeat what has been done in the read
> lock part. By adding a read with intent to upgrade to write, we can have
> that guarantee.

It's been discussed, Linus had the same thought.

But it'd be a massive change to the rw semaphore code; this "read with
intent" really is a third lock state which needs all the same
lock/trylock/unlock paths, and with the way rw semaphore has separate
entry points for read and write it'd be a _ton_ of new code. It really
touches everything - waitlist handling included.

And six locks have several other features that bcachefs needs, and other
users may also end up wanting, that rw semaphores don't have; the two
main features being a percpu read lock mode and support for an external
cycle detector (which requires exposing lock waitlists, with some
guarantees about how those waitlists are used).

> With that said, I would prefer to keep osq_{lock/unlock} for internal use by
> some higher level locking primitives - mutex, rwsem and rt_mutex.

Yeah, I'm aware, but it seems like exposing osq_(lock|unlock) is the
most palatable solution for now. Long term, I'd like to get six locks
promoted to kernel/locking.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 11/20] locking/osq: Export osq_(lock|unlock)
  2023-08-02 20:44     ` Kent Overstreet
@ 2023-08-02 21:09       ` Waiman Long
  2023-08-02 21:42         ` Kent Overstreet
  0 siblings, 1 reply; 47+ messages in thread
From: Waiman Long @ 2023-08-02 21:09 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Boqun Feng

On 8/2/23 16:44, Kent Overstreet wrote:
> On Wed, Aug 02, 2023 at 04:16:12PM -0400, Waiman Long wrote:
>> On 7/12/23 17:11, Kent Overstreet wrote:
>>> These are used by bcachefs's six locks.
>>>
>>> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: Waiman Long <longman@redhat.com>
>>> Cc: Boqun Feng <boqun.feng@gmail.com>
>>> ---
>>>    kernel/locking/osq_lock.c | 2 ++
>>>    1 file changed, 2 insertions(+)
>>>
>>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>>> index d5610ad52b..b752ec5cc6 100644
>>> --- a/kernel/locking/osq_lock.c
>>> +++ b/kernel/locking/osq_lock.c
>>> @@ -203,6 +203,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>>    	return false;
>>>    }
>>> +EXPORT_SYMBOL_GPL(osq_lock);
>>>    void osq_unlock(struct optimistic_spin_queue *lock)
>>>    {
>>> @@ -230,3 +231,4 @@ void osq_unlock(struct optimistic_spin_queue *lock)
>>>    	if (next)
>>>    		WRITE_ONCE(next->locked, 1);
>>>    }
>>> +EXPORT_SYMBOL_GPL(osq_unlock);
>> Have you considered extending the current rw_semaphore to support a SIX lock
>> semantics? There are a number of instances in the kernel that a up_read() is
>> followed by a down_write(). Basically, the code try to upgrade the lock from
>> read to write. I have been thinking about adding a upgrade_read() API to do
>> that. However, the concern that I had was that another writer may come in
>> and make modification before the reader can be upgraded to have exclusive
>> write access and will make the task to repeat what has been done in the read
>> lock part. By adding a read with intent to upgrade to write, we can have
>> that guarantee.
> It's been discussed, Linus had the same thought.
>
> But it'd be a massive change to the rw semaphore code; this "read with
> intent" really is a third lock state which needs all the same
> lock/trylock/unlock paths, and with the way rw semaphore has separate
> entry points for read and write it'd be a _ton_ of new code. It really
> touches everything - waitlist handling included.

Yes, it is a major change, but I had done that before and it is 
certainly doable. There are spare bits in the low byte of rwsem->count 
that can be used as an intent bit. We also need to add a new 
rwsem_wake_type for that for waitlist handling.


>
> And six locks have several other features that bcachefs needs, and other
> users may also end up wanting, that rw semaphores don't have; the two
> main features being a percpu read lock mode and support for an external
> cycle detector (which requires exposing lock waitlists, with some
> guarantees about how those waitlists are used).

Can you provide more information about those features?

>
>> With that said, I would prefer to keep osq_{lock/unlock} for internal use by
>> some higher level locking primitives - mutex, rwsem and rt_mutex.
> Yeah, I'm aware, but it seems like exposing osq_(lock|unlock) is the
> most palatable solution for now. Long term, I'd like to get six locks
> promoted to kernel/locking.

Your SIX overlaps with rwsem in term of features. So we will have to 
somehow merge them instead of having 2 APIs with somewhat similar 
functionality.

Cheers,
Longman

>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 11/20] locking/osq: Export osq_(lock|unlock)
  2023-08-02 21:09       ` Waiman Long
@ 2023-08-02 21:42         ` Kent Overstreet
  2023-10-10  8:09           ` [NAK] " Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Kent Overstreet @ 2023-08-02 21:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-bcachefs, linux-fsdevel, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Boqun Feng

On Wed, Aug 02, 2023 at 05:09:13PM -0400, Waiman Long wrote:
> On 8/2/23 16:44, Kent Overstreet wrote:
> > On Wed, Aug 02, 2023 at 04:16:12PM -0400, Waiman Long wrote:
> > > On 7/12/23 17:11, Kent Overstreet wrote:
> > > > These are used by bcachefs's six locks.
> > > > 
> > > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > Cc: Ingo Molnar <mingo@redhat.com>
> > > > Cc: Waiman Long <longman@redhat.com>
> > > > Cc: Boqun Feng <boqun.feng@gmail.com>
> > > > ---
> > > >    kernel/locking/osq_lock.c | 2 ++
> > > >    1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > > > index d5610ad52b..b752ec5cc6 100644
> > > > --- a/kernel/locking/osq_lock.c
> > > > +++ b/kernel/locking/osq_lock.c
> > > > @@ -203,6 +203,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> > > >    	return false;
> > > >    }
> > > > +EXPORT_SYMBOL_GPL(osq_lock);
> > > >    void osq_unlock(struct optimistic_spin_queue *lock)
> > > >    {
> > > > @@ -230,3 +231,4 @@ void osq_unlock(struct optimistic_spin_queue *lock)
> > > >    	if (next)
> > > >    		WRITE_ONCE(next->locked, 1);
> > > >    }
> > > > +EXPORT_SYMBOL_GPL(osq_unlock);
> > > Have you considered extending the current rw_semaphore to support a SIX lock
> > > semantics? There are a number of instances in the kernel that a up_read() is
> > > followed by a down_write(). Basically, the code try to upgrade the lock from
> > > read to write. I have been thinking about adding a upgrade_read() API to do
> > > that. However, the concern that I had was that another writer may come in
> > > and make modification before the reader can be upgraded to have exclusive
> > > write access and will make the task to repeat what has been done in the read
> > > lock part. By adding a read with intent to upgrade to write, we can have
> > > that guarantee.
> > It's been discussed, Linus had the same thought.
> > 
> > But it'd be a massive change to the rw semaphore code; this "read with
> > intent" really is a third lock state which needs all the same
> > lock/trylock/unlock paths, and with the way rw semaphore has separate
> > entry points for read and write it'd be a _ton_ of new code. It really
> > touches everything - waitlist handling included.
> 
> Yes, it is a major change, but I had done that before and it is certainly
> doable. There are spare bits in the low byte of rwsem->count that can be
> used as an intent bit. We also need to add a new rwsem_wake_type for that
> for waitlist handling.
> 
> 
> > 
> > And six locks have several other features that bcachefs needs, and other
> > users may also end up wanting, that rw semaphores don't have; the two
> > main features being a percpu read lock mode and support for an external
> > cycle detector (which requires exposing lock waitlists, with some
> > guarantees about how those waitlists are used).
> 
> Can you provide more information about those features?
> 
> > 
> > > With that said, I would prefer to keep osq_{lock/unlock} for internal use by
> > > some higher level locking primitives - mutex, rwsem and rt_mutex.
> > Yeah, I'm aware, but it seems like exposing osq_(lock|unlock) is the
> > most palatable solution for now. Long term, I'd like to get six locks
> > promoted to kernel/locking.
> 
> Your SIX overlaps with rwsem in term of features. So we will have to somehow
> merge them instead of having 2 APIs with somewhat similar functionality.

Waiman, if you think you can add all the features of six locks to rwsem,
knock yourself out - but right now this is a vaporware idea for you, not
something I can seriously entertain. I'm looking to merge bcachefs next
cycle, not sit around and bikeshed for the next six months.

If you start making a serious effort on adding those features to rwsem
I'll start walking you through everything six locks has, but right now
this is a major digression on a patch that just exports two symbols.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [NAK] Re: [PATCH 11/20] locking/osq: Export osq_(lock|unlock)
  2023-08-02 21:42         ` Kent Overstreet
@ 2023-10-10  8:09           ` Ingo Molnar
  2023-10-18 21:04             ` Kent Overstreet
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2023-10-10  8:09 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Waiman Long, linux-bcachefs, linux-fsdevel, linux-kernel,
	Peter Zijlstra, Ingo Molnar, Boqun Feng, Linus Torvalds,
	Thomas Gleixner, Stephen Rothwell


* Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Wed, Aug 02, 2023 at 05:09:13PM -0400, Waiman Long wrote:
> > On 8/2/23 16:44, Kent Overstreet wrote:
> > > On Wed, Aug 02, 2023 at 04:16:12PM -0400, Waiman Long wrote:
> > > > On 7/12/23 17:11, Kent Overstreet wrote:
> > > > > These are used by bcachefs's six locks.
> > > > > 
> > > > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > Cc: Ingo Molnar <mingo@redhat.com>
> > > > > Cc: Waiman Long <longman@redhat.com>
> > > > > Cc: Boqun Feng <boqun.feng@gmail.com>
> > > > > ---
> > > > >    kernel/locking/osq_lock.c | 2 ++
> > > > >    1 file changed, 2 insertions(+)
> > > > > 
> > > > > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > > > > index d5610ad52b..b752ec5cc6 100644
> > > > > --- a/kernel/locking/osq_lock.c
> > > > > +++ b/kernel/locking/osq_lock.c
> > > > > @@ -203,6 +203,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> > > > >    	return false;
> > > > >    }
> > > > > +EXPORT_SYMBOL_GPL(osq_lock);
> > > > >    void osq_unlock(struct optimistic_spin_queue *lock)
> > > > >    {
> > > > > @@ -230,3 +231,4 @@ void osq_unlock(struct optimistic_spin_queue *lock)
> > > > >    	if (next)
> > > > >    		WRITE_ONCE(next->locked, 1);
> > > > >    }
> > > > > +EXPORT_SYMBOL_GPL(osq_unlock);
> > > > Have you considered extending the current rw_semaphore to support a SIX lock
> > > > semantics? There are a number of instances in the kernel that a up_read() is
> > > > followed by a down_write(). Basically, the code try to upgrade the lock from
> > > > read to write. I have been thinking about adding a upgrade_read() API to do
> > > > that. However, the concern that I had was that another writer may come in
> > > > and make modification before the reader can be upgraded to have exclusive
> > > > write access and will make the task to repeat what has been done in the read
> > > > lock part. By adding a read with intent to upgrade to write, we can have
> > > > that guarantee.
> > > It's been discussed, Linus had the same thought.
> > > 
> > > But it'd be a massive change to the rw semaphore code; this "read with
> > > intent" really is a third lock state which needs all the same
> > > lock/trylock/unlock paths, and with the way rw semaphore has separate
> > > entry points for read and write it'd be a _ton_ of new code. It really
> > > touches everything - waitlist handling included.
> > 
> > Yes, it is a major change, but I had done that before and it is certainly
> > doable. There are spare bits in the low byte of rwsem->count that can be
> > used as an intent bit. We also need to add a new rwsem_wake_type for that
> > for waitlist handling.
> > 
> > 
> > > 
> > > And six locks have several other features that bcachefs needs, and other
> > > users may also end up wanting, that rw semaphores don't have; the two
> > > main features being a percpu read lock mode and support for an external
> > > cycle detector (which requires exposing lock waitlists, with some
> > > guarantees about how those waitlists are used).
> > 
> > Can you provide more information about those features?
> > 
> > > 
> > > > With that said, I would prefer to keep osq_{lock/unlock} for internal use by
> > > > some higher level locking primitives - mutex, rwsem and rt_mutex.
> > > Yeah, I'm aware, but it seems like exposing osq_(lock|unlock) is the
> > > most palatable solution for now. Long term, I'd like to get six locks
> > > promoted to kernel/locking.
> > 
> > Your SIX overlaps with rwsem in term of features. So we will have to somehow
> > merge them instead of having 2 APIs with somewhat similar functionality.
> 
> Waiman, if you think you can add all the features of six locks to rwsem,
> knock yourself out - but right now this is a vaporware idea for you, not
> something I can seriously entertain. I'm looking to merge bcachefs next
> cycle, not sit around and bikeshed for the next six months.

That's an entirely inappropriate response to valid review feedback.

Not having two overlapping locking facilities is not 'bikeshedding' at all ...

> If you start making a serious effort on adding those features to rwsem
> I'll start walking you through everything six locks has, but right now
> this is a major digression on a patch that just exports two symbols.

In Linux the burden of work is on people submitting new code, not on 
reviewers. The rule is that you should not reinvent the wheel in new
features - extend existing locking facilities please.

Waiman gave you some pointers as to how to extend rwsems.

Meanwhile, NAK on the export of osq_(lock|unlock):

    NAKed-by: Ingo Molnar <mingo@kernel.org>

Ie. NAK on this commit in linux-next:

    97da2065b7cb ("locking/osq: Export osq_(lock|unlock)")

This is an internal function of Linux locking subsystem we would not like to
expose or export.

This commit was applied without an Ack or Reviewed-by by a locking subsystem
maintainer or reviewer (which Waiman Long is).

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [NAK] Re: [PATCH 11/20] locking/osq: Export osq_(lock|unlock)
  2023-10-10  8:09           ` [NAK] " Ingo Molnar
@ 2023-10-18 21:04             ` Kent Overstreet
  0 siblings, 0 replies; 47+ messages in thread
From: Kent Overstreet @ 2023-10-18 21:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Waiman Long, linux-bcachefs, linux-fsdevel, linux-kernel,
	Peter Zijlstra, Ingo Molnar, Boqun Feng, Linus Torvalds,
	Thomas Gleixner, Stephen Rothwell

On Tue, Oct 10, 2023 at 10:09:38AM +0200, Ingo Molnar wrote:
> > Waiman, if you think you can add all the features of six locks to rwsem,
> > knock yourself out - but right now this is a vaporware idea for you, not
> > something I can seriously entertain. I'm looking to merge bcachefs next
> > cycle, not sit around and bikeshed for the next six months.
> 
> That's an entirely inappropriate response to valid review feedback.
> 
> Not having two overlapping locking facilities is not 'bikeshedding' at all ...

Well, there was already a long off-list discussion about adding six lock
features to rwsem.

Basically, it looks to me like a total redesign of rwsem in order to do
it correctly, and I don't think that would fly. The rwsem code has
separate entrypoints for every lock state, and adding a third lock state
would at a minimum add a lot of new - nearly duplicate - code.

There's also features and optimizations in six locks that rwsem doesn't
have, and it's not clear to me that it would be appropriate to add them
to rwsem - each of them would need real discussion. The big ones are:

 - percpu reader mode, used for locks for interior nodes and subvolume
   keys in bcachefs
 - exposing of waitlist entries (and this requires nontrivial guarantees
   to do correctly!), so that bcachefs can do cycle detection deadlock
   avoidance on top.

In short, this would _not_ be a small project, and I think the saner
approach if we really did want to condense down to a single locking
implementation would be to replace rwsem with six locks. But before even
contemplating that we'd want to see six locks getting wider usage and
testing first.

Hence why we're at leaving six locks in fs/bcachefs/ for now.

> > If you start making a serious effort on adding those features to rwsem
> > I'll start walking you through everything six locks has, but right now
> > this is a major digression on a patch that just exports two symbols.
> 
> In Linux the burden of work is on people submitting new code, not on 
> reviewers. The rule is that you should not reinvent the wheel in new
> features - extend existing locking facilities please.
> 
> Waiman gave you some pointers as to how to extend rwsems.
> 
> Meanwhile, NAK on the export of osq_(lock|unlock):

Perhaps we could get some justification for why you want osq locks to be
private?

My initial pull request had six locks in kernel/locking/, specifically
to keep osq locks private, as requested by locking people (some years
back). But since Linus shot that down, I need an alternative.

If you're really dead set against exporting osq locks (and again, why?),
my only alternative will be to either take optimistic spinning out of
six locks, or implement optimistic spinning another way (which is
something I was already looking at before; the way lock handoff works in
six locks now makes that an attractive idea anyways, but of course the
devil is in the details with locking code).

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2023-10-18 21:12 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-12 21:10 [PATCH 00/20] bcachefs prereqs patch series Kent Overstreet
2023-07-12 21:10 ` [PATCH 01/20] sched: Add task_struct->faults_disabled_mapping Kent Overstreet
2023-07-12 21:10 ` [PATCH 02/20] fs: factor out d_mark_tmpfile() Kent Overstreet
2023-07-12 21:10 ` [PATCH 03/20] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Kent Overstreet
2023-07-12 21:10 ` [PATCH 04/20] block: Add some exports for bcachefs Kent Overstreet
2023-07-24 17:31   ` Christoph Hellwig
2023-07-25  3:00     ` Kent Overstreet
2023-07-26 13:20       ` Christoph Hellwig
2023-08-01 18:59         ` Kent Overstreet
2023-07-25  2:59   ` Matthew Wilcox
2023-07-12 21:11 ` [PATCH 05/20] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset Kent Overstreet
2023-07-24 17:34   ` Christoph Hellwig
2023-07-25  2:43     ` Kent Overstreet
2023-07-26 13:23       ` Christoph Hellwig
2023-08-01 19:04         ` Kent Overstreet
2023-08-02 11:47           ` Christoph Hellwig
2023-08-02 16:44             ` Kent Overstreet
2023-07-12 21:11 ` [PATCH 06/20] block: Bring back zero_fill_bio_iter Kent Overstreet
2023-07-24 17:35   ` Christoph Hellwig
2023-07-25  2:45     ` Kent Overstreet
2023-07-26 13:21       ` Christoph Hellwig
2023-08-01 19:05         ` Kent Overstreet
2023-07-12 21:11 ` [PATCH 07/20] block: Don't block on s_umount from __invalidate_super() Kent Overstreet
2023-07-12 21:11 ` [PATCH 08/20] stacktrace: Export stack_trace_save_tsk Kent Overstreet
2023-07-12 21:11 ` [PATCH 09/20] lib/string_helpers: string_get_size() now returns characters wrote Kent Overstreet
2023-07-12 21:11 ` [PATCH 10/20] lib: Export errname Kent Overstreet
2023-07-13  7:10   ` Eric Biggers
2023-07-12 21:11 ` [PATCH 11/20] locking/osq: Export osq_(lock|unlock) Kent Overstreet
2023-08-02 20:16   ` Waiman Long
2023-08-02 20:44     ` Kent Overstreet
2023-08-02 21:09       ` Waiman Long
2023-08-02 21:42         ` Kent Overstreet
2023-10-10  8:09           ` [NAK] " Ingo Molnar
2023-10-18 21:04             ` Kent Overstreet
2023-07-12 21:11 ` [PATCH 12/20] bcache: move closures to lib/ Kent Overstreet
2023-07-13  3:21   ` Randy Dunlap
2023-07-13  3:52     ` Kent Overstreet
2023-07-12 21:11 ` [PATCH 13/20] MAINTAINERS: Add entry for closures Kent Overstreet
2023-07-12 21:11 ` [PATCH 14/20] closures: closure_wait_event() Kent Overstreet
2023-07-12 21:11 ` [PATCH 15/20] closures: closure_nr_remaining() Kent Overstreet
2023-07-12 21:11 ` [PATCH 16/20] closures: Add a missing include Kent Overstreet
2023-07-12 21:11 ` [PATCH 17/20] MAINTAINERS: Add entry for generic-radix-tree Kent Overstreet
2023-07-12 21:11 ` [PATCH 18/20] lib/generic-radix-tree.c: Don't overflow in peek() Kent Overstreet
2023-07-12 21:11 ` [PATCH 19/20] lib/generic-radix-tree.c: Add a missing include Kent Overstreet
2023-07-25  3:04   ` Matthew Wilcox
2023-07-25  3:36     ` Kent Overstreet
2023-07-12 21:11 ` [PATCH 20/20] lib/generic-radix-tree.c: Add peek_prev() Kent Overstreet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.