* [PATCH v3 0/4] bcachefs: fiemap delalloc support and cleanup
@ 2024-04-08 14:48 Brian Foster
  2024-04-08 14:48 ` [PATCH v3 1/4] bcachefs: drop duplicate fiemap sync flag Brian Foster
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Brian Foster @ 2024-04-08 14:48 UTC (permalink / raw)
  To: linux-bcachefs

Hi all,

Here's v3 of the fiemap delalloc support patches for bcachefs. The main
difference from v2 is that the pagecache seek calls now use nonblocking
mode to avoid deadlocks between fiemap and the write path. The write
path locks in folio -> extent btree order, while fiemap walks the
extent btree and scans for folios in transaction context. Therefore,
fiemap must restart the iterating transaction on folio trylock failure
and resume the scan from where it left off.
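
As a rough sketch of that locking scheme, here is a minimal userspace
analogue (not bcachefs code; two pthread mutexes stand in for the
folio lock and the btree node locks, and the goto-retry stands in for
a transaction restart):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_mutex_t folio = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t btree = PTHREAD_MUTEX_INITIALIZER;

/* write path: folio first, then the extent btree */
static void *write_path(void *arg)
{
	pthread_mutex_lock(&folio);
	pthread_mutex_lock(&btree);
	/* ... update extents ... */
	pthread_mutex_unlock(&btree);
	pthread_mutex_unlock(&folio);
	return NULL;
}

/* fiemap: already holds btree locks, so it may only trylock folios */
static void *fiemap_path(void *arg)
{
retry:
	pthread_mutex_lock(&btree);
	if (pthread_mutex_trylock(&folio)) {
		/* trylock failed: drop btree locks and restart the scan */
		pthread_mutex_unlock(&btree);
		sched_yield();
		goto retry;
	}
	/* ... scan pagecache for data ... */
	pthread_mutex_unlock(&folio);
	pthread_mutex_unlock(&btree);
	return NULL;
}

int main(void)
{
	pthread_t w, f;

	pthread_create(&w, NULL, write_path, NULL);
	pthread_create(&f, NULL, fiemap_path, NULL);
	pthread_join(w, NULL);
	pthread_join(f, NULL);
	printf("no deadlock\n");
	return 0;
}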

The series is pushed to CI via my test branch, as usual:

https://evilpiepirate.org/~testdashboard/ci?branch=bfoster

Brian

v3:
- Use nonblocking pagecache seek to address extent btree vs. folio
  deadlocks.
v2: https://lore.kernel.org/linux-bcachefs/20240117155611.248042-1-bfoster@redhat.com/
- Clean up bch2_fiemap() a bit to use processing helpers.
- Added patch for sync flag fixup.
v1: https://lore.kernel.org/linux-bcachefs/20231219140215.300753-1-bfoster@redhat.com/

Brian Foster (4):
  bcachefs: drop duplicate fiemap sync flag
  bcachefs: track current fiemap offset in start variable
  bcachefs: refactor fiemap processing into extent helper and struct
  bcachefs: add fiemap delalloc extent detection

 fs/bcachefs/fs.c | 175 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 133 insertions(+), 42 deletions(-)

-- 
2.44.0



* [PATCH v3 1/4] bcachefs: drop duplicate fiemap sync flag
  2024-04-08 14:48 [PATCH v3 0/4] bcachefs: fiemap delalloc support and cleanup Brian Foster
@ 2024-04-08 14:48 ` Brian Foster
  2024-04-08 14:48 ` [PATCH v3 2/4] bcachefs: track current fiemap offset in start variable Brian Foster
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Brian Foster @ 2024-04-08 14:48 UTC (permalink / raw)
  To: linux-bcachefs

FIEMAP_FLAG_SYNC handling was deliberately moved into core code in
commit 45dd052e67ad ("fs: handle FIEMAP_FLAG_SYNC in fiemap_prep"),
released in kernel v5.8. Update bcachefs accordingly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/bcachefs/fs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index b5ea9fa1259d..137ce028bc3c 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -967,7 +967,7 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 	u32 snapshot;
 	int ret = 0;
 
-	ret = fiemap_prep(&ei->v, info, start, &len, FIEMAP_FLAG_SYNC);
+	ret = fiemap_prep(&ei->v, info, start, &len, 0);
 	if (ret)
 		return ret;
 
-- 
2.44.0



* [PATCH v3 2/4] bcachefs: track current fiemap offset in start variable
  2024-04-08 14:48 [PATCH v3 0/4] bcachefs: fiemap delalloc support and cleanup Brian Foster
  2024-04-08 14:48 ` [PATCH v3 1/4] bcachefs: drop duplicate fiemap sync flag Brian Foster
@ 2024-04-08 14:48 ` Brian Foster
  2024-04-08 14:48 ` [PATCH v3 3/4] bcachefs: refactor fiemap processing into extent helper and struct Brian Foster
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Brian Foster @ 2024-04-08 14:48 UTC (permalink / raw)
  To: linux-bcachefs

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/bcachefs/fs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index 137ce028bc3c..9c86749b256e 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -1015,6 +1015,7 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 		bch2_bkey_buf_realloc(&prev, c, k.k->u64s);
 
 		sectors = min(sectors, k.k->size - offset_into_extent);
+		start = iter.pos.offset + sectors;
 
 		bch2_cut_front(POS(k.k->p.inode,
 				   bkey_start_offset(k.k) +
@@ -1035,8 +1036,7 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 		bkey_copy(prev.k, cur.k);
 		have_extent = true;
 
-		bch2_btree_iter_set_pos(&iter,
-			POS(iter.pos.inode, iter.pos.offset + sectors));
+		bch2_btree_iter_set_pos(&iter, POS(iter.pos.inode, start));
 	}
 	start = iter.pos.offset;
 	bch2_trans_iter_exit(trans, &iter);
-- 
2.44.0



* [PATCH v3 3/4] bcachefs: refactor fiemap processing into extent helper and struct
  2024-04-08 14:48 [PATCH v3 0/4] bcachefs: fiemap delalloc support and cleanup Brian Foster
  2024-04-08 14:48 ` [PATCH v3 1/4] bcachefs: drop duplicate fiemap sync flag Brian Foster
  2024-04-08 14:48 ` [PATCH v3 2/4] bcachefs: track current fiemap offset in start variable Brian Foster
@ 2024-04-08 14:48 ` Brian Foster
  2024-04-08 14:48 ` [PATCH v3 4/4] bcachefs: add fiemap delalloc extent detection Brian Foster
  2024-04-09  3:33 ` [PATCH v3 0/4] bcachefs: fiemap delalloc support and cleanup Kent Overstreet
  4 siblings, 0 replies; 9+ messages in thread
From: Brian Foster @ 2024-04-08 14:48 UTC (permalink / raw)
  To: linux-bcachefs

The bulk of the loop in bch2_fiemap() involves processing the
current extent key from the iter, including following indirections
and trimming the extent size and such. This patch makes a few
changes to reduce the size of the loop and facilitate future changes
to support delalloc extents.

Define a new bch_fiemap_extent structure to wrap the bkey buffer
that holds the extent key to report to userspace along with
associated fiemap flags. Update bch2_fill_extent() to take the
bch_fiemap_extent as a param instead of the individual fields.
Finally, lift the bulk of the extent processing into a
bch2_fiemap_extent() helper that takes the current key and formats
the bch_fiemap_extent appropriately for the fill function.

No functional changes intended by this patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/bcachefs/fs.c | 90 +++++++++++++++++++++++++++++-------------------
 1 file changed, 55 insertions(+), 35 deletions(-)

diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index 9c86749b256e..9f14eb256c09 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -896,10 +896,18 @@ static int bch2_tmpfile(struct mnt_idmap *idmap,
 	return finish_open_simple(file, 0);
 }
 
+struct bch_fiemap_extent {
+	struct bkey_buf	kbuf;
+	unsigned	flags;
+};
+
 static int bch2_fill_extent(struct bch_fs *c,
 			    struct fiemap_extent_info *info,
-			    struct bkey_s_c k, unsigned flags)
+			    struct bch_fiemap_extent *fe)
 {
+	struct bkey_s_c k = bkey_i_to_s_c(fe->kbuf.k);
+	unsigned flags = fe->flags;
+
 	if (bkey_extent_is_direct_data(k.k)) {
 		struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
 		const union bch_extent_entry *entry;
@@ -952,6 +960,39 @@ static int bch2_fill_extent(struct bch_fs *c,
 	}
 }
 
+static int bch2_fiemap_extent(struct btree_trans *trans,
+			      struct btree_iter *iter, struct bkey_s_c k,
+			      struct bch_fiemap_extent *cur)
+{
+	unsigned	offset_into_extent, sectors;
+	enum btree_id	data_btree = BTREE_ID_extents;
+	int		ret;
+
+	offset_into_extent	= iter->pos.offset - bkey_start_offset(k.k);
+	sectors			= k.k->size - offset_into_extent;
+
+	bch2_bkey_buf_reassemble(&cur->kbuf, trans->c, k);
+
+	ret = bch2_read_indirect_extent(trans, &data_btree, &offset_into_extent,
+					&cur->kbuf);
+	if (ret)
+		return ret;
+
+	k = bkey_i_to_s_c(cur->kbuf.k);
+	sectors = min(sectors, k.k->size - offset_into_extent);
+
+	bch2_cut_front(POS(k.k->p.inode,
+			   bkey_start_offset(k.k) + offset_into_extent),
+		       cur->kbuf.k);
+	bch2_key_resize(&cur->kbuf.k->k, sectors);
+	cur->kbuf.k->k.p = iter->pos;
+	cur->kbuf.k->k.p.offset += cur->kbuf.k->k.size;
+
+	cur->flags = 0;
+
+	return 0;
+}
+
 static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 		       u64 start, u64 len)
 {
@@ -960,9 +1001,8 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 	struct btree_trans *trans;
 	struct btree_iter iter;
 	struct bkey_s_c k;
-	struct bkey_buf cur, prev;
+	struct bch_fiemap_extent cur, prev;
 	struct bpos end = POS(ei->v.i_ino, (start + len) >> 9);
-	unsigned offset_into_extent, sectors;
 	bool have_extent = false;
 	u32 snapshot;
 	int ret = 0;
@@ -976,8 +1016,8 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 
 	start >>= 9;
 
-	bch2_bkey_buf_init(&cur);
-	bch2_bkey_buf_init(&prev);
+	bch2_bkey_buf_init(&cur.kbuf);
+	bch2_bkey_buf_init(&prev.kbuf);
 	trans = bch2_trans_get(c);
 retry:
 	bch2_trans_begin(trans);
@@ -992,7 +1032,6 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 	while (!(ret = btree_trans_too_many_iters(trans)) &&
 	       (k = bch2_btree_iter_peek_upto(&iter, end)).k &&
 	       !(ret = bkey_err(k))) {
-		enum btree_id data_btree = BTREE_ID_extents;
 
 		if (!bkey_extent_is_data(k.k) &&
 		    k.k->type != KEY_TYPE_reservation) {
@@ -1000,40 +1039,21 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 			continue;
 		}
 
-		offset_into_extent	= iter.pos.offset -
-			bkey_start_offset(k.k);
-		sectors			= k.k->size - offset_into_extent;
-
-		bch2_bkey_buf_reassemble(&cur, c, k);
-
-		ret = bch2_read_indirect_extent(trans, &data_btree,
-					&offset_into_extent, &cur);
+		ret = bch2_fiemap_extent(trans, &iter, k, &cur);
 		if (ret)
 			break;
-
-		k = bkey_i_to_s_c(cur.k);
-		bch2_bkey_buf_realloc(&prev, c, k.k->u64s);
-
-		sectors = min(sectors, k.k->size - offset_into_extent);
-		start = iter.pos.offset + sectors;
-
-		bch2_cut_front(POS(k.k->p.inode,
-				   bkey_start_offset(k.k) +
-				   offset_into_extent),
-			       cur.k);
-		bch2_key_resize(&cur.k->k, sectors);
-		cur.k->k.p = iter.pos;
-		cur.k->k.p.offset += cur.k->k.size;
+		bch2_bkey_buf_realloc(&prev.kbuf, c, cur.kbuf.k->k.u64s);
+		start = cur.kbuf.k->k.p.offset;
 
 		if (have_extent) {
 			bch2_trans_unlock(trans);
-			ret = bch2_fill_extent(c, info,
-					bkey_i_to_s_c(prev.k), 0);
+			ret = bch2_fill_extent(c, info, &prev);
 			if (ret)
 				break;
 		}
 
-		bkey_copy(prev.k, cur.k);
+		bkey_copy(prev.kbuf.k, cur.kbuf.k);
+		prev.flags = cur.flags;
 		have_extent = true;
 
 		bch2_btree_iter_set_pos(&iter, POS(iter.pos.inode, start));
@@ -1046,13 +1066,13 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 
 	if (!ret && have_extent) {
 		bch2_trans_unlock(trans);
-		ret = bch2_fill_extent(c, info, bkey_i_to_s_c(prev.k),
-				       FIEMAP_EXTENT_LAST);
+		prev.flags |= FIEMAP_EXTENT_LAST;
+		ret = bch2_fill_extent(c, info, &prev);
 	}
 
 	bch2_trans_put(trans);
-	bch2_bkey_buf_exit(&cur, c);
-	bch2_bkey_buf_exit(&prev, c);
+	bch2_bkey_buf_exit(&cur.kbuf, c);
+	bch2_bkey_buf_exit(&prev.kbuf, c);
 	return ret < 0 ? ret : 0;
 }
 
-- 
2.44.0



* [PATCH v3 4/4] bcachefs: add fiemap delalloc extent detection
  2024-04-08 14:48 [PATCH v3 0/4] bcachefs: fiemap delalloc support and cleanup Brian Foster
                   ` (2 preceding siblings ...)
  2024-04-08 14:48 ` [PATCH v3 3/4] bcachefs: refactor fiemap processing into extent helper and struct Brian Foster
@ 2024-04-08 14:48 ` Brian Foster
  2024-04-18 14:46   ` Brian Foster
  2024-04-09  3:33 ` [PATCH v3 0/4] bcachefs: fiemap delalloc support and cleanup Kent Overstreet
  4 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2024-04-08 14:48 UTC (permalink / raw)
  To: linux-bcachefs

bcachefs currently populates fiemap data from the extents btree.
This works correctly when the fiemap sync flag is provided, but if
not, it skips all delalloc extents that have not yet been flushed.
This is because delalloc extents from buffered writes are first
stored as reservation in the pagecache, and only become resident in
the extents btree after writeback completes.

Update the fiemap implementation to process holes between extents by
scanning pagecache for data, via seek data/hole. If a valid data
range is found over a hole in the extent btree, fake up an extent
key and flag the extent as delalloc for reporting to userspace.

Note that this does not necessarily change behavior for the case
where there is dirty pagecache over already-written extents: in COW
mode, writeback will allocate new blocks for the underlying ranges.
The existing behavior is consistent with btrfs, and the sync flag is
recommended for the most up-to-date extent state from fiemap.
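
Purely as an illustration (not part of this change), a minimal
userspace check along these lines shows the difference; the path
argument and the 4k write size are arbitrary:

#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[4096];
	struct fiemap *fm;
	int fd;

	if (argc < 2)
		return 1;

	fd = open(argv[1], O_CREAT | O_RDWR | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* buffered write, deliberately not flushed: delalloc until writeback */
	memset(buf, 'a', sizeof(buf));
	if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
		perror("write");
		return 1;
	}

	fm = calloc(1, sizeof(*fm) + 16 * sizeof(struct fiemap_extent));
	if (!fm)
		return 1;
	fm->fm_length = ~0ULL;
	fm->fm_extent_count = 16;
	/* fm_flags left zero: no FIEMAP_FLAG_SYNC, so no flush beforehand */
	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("fiemap");
		return 1;
	}

	for (unsigned i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("logical %llu len %llu flags 0x%x%s\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_length, fe->fe_flags,
		       fe->fe_flags & FIEMAP_EXTENT_DELALLOC ?
				" (delalloc)" : "");
	}

	free(fm);
	close(fd);
	return 0;
}

Without this patch (and without FIEMAP_FLAG_SYNC), the unflushed range
is simply absent from the output; with it, the range shows up flagged
as delalloc.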

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/bcachefs/fs.c | 87 +++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 79 insertions(+), 8 deletions(-)

diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index 9f14eb256c09..77a8c826b637 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -993,6 +993,51 @@ static int bch2_fiemap_extent(struct btree_trans *trans,
 	return 0;
 }
 
+/*
+ * Scan a range of pagecache that corresponds to a file mapping hole in the
+ * extent btree. If found, fake up an extent key so it looks like a delalloc
+ * extent to the rest of the fiemap processing code.
+ *
+ * Returns 0 if cached data was found, -ENOENT if not, or -EAGAIN for folio
+ * trylock failure.
+ */
+static int
+bch2_fiemap_hole(struct inode *vinode, u64 start, u64 end,
+		 struct bch_fiemap_extent *cur)
+{
+	struct bch_fs		*c = vinode->i_sb->s_fs_info;
+	struct bch_inode_info	*ei = to_bch_ei(vinode);
+	struct bkey_i_extent	*delextent;
+	struct bch_extent_ptr	ptr = {};
+	loff_t			dstart, dend;
+
+	/*
+	 * We hold btree node locks here so we cannot block on folio locks.
+	 * Propagate -EAGAIN errors so the caller can retry.
+	 */
+	dstart = bch2_seek_pagecache_data(vinode, start, end, 0, true);
+	if (dstart < 0)
+		return dstart;
+	if (dstart >= end)
+		return -ENOENT;
+	dend = bch2_seek_pagecache_hole(vinode, dstart, end, 0, true);
+
+	/*
+	 * Create a fake extent key in the buffer. We have to add a dummy extent
+	 * pointer for the fill code to add an extent entry. It's explicitly
+	 * zeroed to reflect delayed allocation (i.e. phys offset 0).
+	 */
+	bch2_bkey_buf_realloc(&cur->kbuf, c, sizeof(*delextent) / sizeof(u64));
+	delextent = bkey_extent_init(cur->kbuf.k);
+	delextent->k.p = POS(ei->v.i_ino, dstart >> 9);
+	bch2_key_resize(&delextent->k, (dend - dstart) >> 9);
+	bch2_bkey_append_ptr(&delextent->k_i, ptr);
+
+	cur->flags = FIEMAP_EXTENT_DELALLOC;
+
+	return 0;
+}
+
 static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 		       u64 start, u64 len)
 {
@@ -1032,16 +1077,41 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 	while (!(ret = btree_trans_too_many_iters(trans)) &&
 	       (k = bch2_btree_iter_peek_upto(&iter, end)).k &&
 	       !(ret = bkey_err(k))) {
+		bool have_delalloc = false;
 
-		if (!bkey_extent_is_data(k.k) &&
-		    k.k->type != KEY_TYPE_reservation) {
-			bch2_btree_iter_advance(&iter);
-			continue;
+		/*
+		 * If a hole exists before the start of the extent key, scan the
+		 * range for pagecache data that might be pending writeback and
+		 * thus not yet exist in the extent tree. Note that this can
+		 * return -EAGAIN to request to retry folio locks.
+		 */
+		if (iter.pos.offset > start) {
+			ret = bch2_fiemap_hole(vinode, start << 9,
+					       iter.pos.offset << 9, &cur);
+			if (!ret)
+				have_delalloc = true;
+			else if (ret != -ENOENT)
+				break;
 		}
 
-		ret = bch2_fiemap_extent(trans, &iter, k, &cur);
-		if (ret)
-			break;
+		 /* process the current key if there's no delalloc to report */
+		if (!have_delalloc) {
+			if (!bkey_extent_is_data(k.k) &&
+			    k.k->type != KEY_TYPE_reservation) {
+				start = bkey_start_offset(k.k) + k.k->size;
+				bch2_btree_iter_advance(&iter);
+				continue;
+			}
+
+			ret = bch2_fiemap_extent(trans, &iter, k, &cur);
+			if (ret)
+				break;
+		}
+
+		/*
+		 * Store the current extent in prev so we can flag the last
+		 * extent on the way out.
+		 */
 		bch2_bkey_buf_realloc(&prev.kbuf, c, cur.kbuf.k->k.u64s);
 		start = cur.kbuf.k->k.p.offset;
 
@@ -1061,7 +1131,8 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 	start = iter.pos.offset;
 	bch2_trans_iter_exit(trans, &iter);
 err:
-	if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
+	if (ret == -EAGAIN ||
+	    bch2_err_matches(ret, BCH_ERR_transaction_restart))
 		goto retry;
 
 	if (!ret && have_extent) {
-- 
2.44.0



* Re: [PATCH v3 0/4] bcachefs: fiemap delalloc support and cleanup
  2024-04-08 14:48 [PATCH v3 0/4] bcachefs: fiemap delalloc support and cleanup Brian Foster
                   ` (3 preceding siblings ...)
  2024-04-08 14:48 ` [PATCH v3 4/4] bcachefs: add fiemap delalloc extent detection Brian Foster
@ 2024-04-09  3:33 ` Kent Overstreet
  2024-04-09 12:21   ` Brian Foster
  4 siblings, 1 reply; 9+ messages in thread
From: Kent Overstreet @ 2024-04-09  3:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs, Eric Van Hensbergen, v9fs

On Mon, Apr 08, 2024 at 10:48:42AM -0400, Brian Foster wrote:
> Hi all,
> 
> Here's v3 of the fiemap delalloc support patches for bcachefs. The main
> difference from v2 is that the pagecache seek calls now use nonblocking
> mode to avoid deadlocks between fiemap and the write path. The write
> path locks in folio -> extent btree order, while fiemap walks the
> extent btree and scans for folios in transaction context. Therefore,
> fiemap must restart the iterating transaction on folio trylock failure
> and resume the scan from where it left off.
> 
> The series is pushed to CI via my test branch, as usual:
> 
> https://evilpiepirate.org/~testdashboard/ci?branch=bfoster

So our testing is currently busted because of a bug in 9pfs - I need you
to rebase onto my 9p-revert branch so we can get test results we can
look at:
https://evilpiepirate.org/git/bcachefs.git/log/?h=9p-revert

Eric, is this getting fixed? If it's not, and _soon_, we need to send
this revert to Linus - then put it through more testing, more asserts,
whatever you need and send it again next cycle, _after_ you've figured
out what went wrong.

Patch series looks good, though - if there's nothing you think needs
special attention I'll go ahead and merge it.

I wonder if at some point we could clean up the indirect extent lookup,
that's common between this code and the read paths, but that's low
priority.


* Re: [PATCH v3 0/4] bcachefs: fiemap delalloc support and cleanup
  2024-04-09  3:33 ` [PATCH v3 0/4] bcachefs: fiemap delalloc support and cleanup Kent Overstreet
@ 2024-04-09 12:21   ` Brian Foster
  0 siblings, 0 replies; 9+ messages in thread
From: Brian Foster @ 2024-04-09 12:21 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs, Eric Van Hensbergen, v9fs

On Mon, Apr 08, 2024 at 11:33:05PM -0400, Kent Overstreet wrote:
> On Mon, Apr 08, 2024 at 10:48:42AM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > Here's v3 of the fiemap delalloc support patches for bcachefs. The main
> > difference from v2 is that the pagecache seek calls now use nonblocking
> > mode to avoid deadlocks between fiemap and the write path. The write
> > path locks in folio -> extent btree order, while fiemap walks the
> > extent btree and scans for folios in transaction context. Therefore,
> > fiemap must restart the iterating transaction on folio trylock failure
> > and resume the scan from where it left off.
> > 
> > The series is pushed to CI via my test branch, as usual:
> > 
> > https://evilpiepirate.org/~testdashboard/ci?branch=bfoster
> 
> So our testing is currently busted because of a bug in 9pfs - I need you
> to rebase onto my 9p-revert branch so we can get test results we can
> look at:
> https://evilpiepirate.org/git/bcachefs.git/log/?h=9p-revert
> 

Huh, Ok. I also noticed an issue recently related to some sysfs
directory structure changes. That required me to rebase on
bcachefs-testing, though it appears that has since been resolved in
master (and 9p-revert).

I just rebased my test branch on 9p-revert and pushed. What failures are
being caused by 9p, anyways?

> Eric, is this getting fixed? If it's not, and _soon_, we need to send
> this revert to Linus - then put it through more testing, more asserts,
> whatever you need and send it again next cycle, _after_ you've figured
> out what wwent wrong.
> 
> Patch series looks good, though - if there's nothing you think needs
> special attention I'll go ahead and merge it.
> 

Not really, mainly just making sure the new locking and transaction
restart parts looked sane.

> I wonder if at some point we could clean up the indirect extent lookup,
> that's common between this code and the read paths, but that's low
> priority.
> 

I'll make a note to read more into that area. Thanks.

Brian



* Re: [PATCH v3 4/4] bcachefs: add fiemap delalloc extent detection
  2024-04-08 14:48 ` [PATCH v3 4/4] bcachefs: add fiemap delalloc extent detection Brian Foster
@ 2024-04-18 14:46   ` Brian Foster
  2024-04-20  2:02     ` Kent Overstreet
  0 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2024-04-18 14:46 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

On Mon, Apr 08, 2024 at 10:48:46AM -0400, Brian Foster wrote:
> bcachefs currently populates fiemap data from the extents btree.
> This works correctly when the fiemap sync flag is provided, but if
> not, it skips all delalloc extents that have not yet been flushed.
> This is because delalloc extents from buffered writes are first
> stored as reservation in the pagecache, and only become resident in
> the extents btree after writeback completes.
> 
> Update the fiemap implementation to process holes between extents by
> scanning pagecache for data, via seek data/hole. If a valid data
> range is found over a hole in the extent btree, fake up an extent
> key and flag the extent as delalloc for reporting to userspace.
> 
> Note that this does not necessarily change behavior for the case
> where there is dirty pagecache over already-written extents: in COW
> mode, writeback will allocate new blocks for the underlying ranges.
> The existing behavior is consistent with btrfs, and the sync flag is
> recommended for the most up-to-date extent state from fiemap.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/bcachefs/fs.c | 87 +++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 79 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
> index 9f14eb256c09..77a8c826b637 100644
> --- a/fs/bcachefs/fs.c
> +++ b/fs/bcachefs/fs.c
> @@ -993,6 +993,51 @@ static int bch2_fiemap_extent(struct btree_trans *trans,
>  	return 0;
>  }
>  
> +/*
> + * Scan a range of pagecache that corresponds to a file mapping hole in the
> + * extent btree. If found, fake up an extent key so it looks like a delalloc
> + * extent to the rest of the fiemap processing code.
> + *
> + * Returns 0 if cached data was found, -ENOENT if not, or -EAGAIN for folio
> + * trylock failure.
> + */
> +static int
> +bch2_fiemap_hole(struct inode *vinode, u64 start, u64 end,
> +		 struct bch_fiemap_extent *cur)
> +{
> +	struct bch_fs		*c = vinode->i_sb->s_fs_info;
> +	struct bch_inode_info	*ei = to_bch_ei(vinode);
> +	struct bkey_i_extent	*delextent;
> +	struct bch_extent_ptr	ptr = {};
> +	loff_t			dstart, dend;
> +
> +	/*
> +	 * We hold btree node locks here so we cannot block on folio locks.
> +	 * Propagate -EAGAIN errors so the caller can retry.
> +	 */
> +	dstart = bch2_seek_pagecache_data(vinode, start, end, 0, true);
> +	if (dstart < 0)
> +		return dstart;
> +	if (dstart >= end)
> +		return -ENOENT;
> +	dend = bch2_seek_pagecache_hole(vinode, dstart, end, 0, true);
> +
> +	/*
> +	 * Create a fake extent key in the buffer. We have to add a dummy extent
> +	 * pointer for the fill code to add an extent entry. It's explicitly
> +	 * zeroed to reflect delayed allocation (i.e. phys offset 0).
> +	 */
> +	bch2_bkey_buf_realloc(&cur->kbuf, c, sizeof(*delextent) / sizeof(u64));
> +	delextent = bkey_extent_init(cur->kbuf.k);
> +	delextent->k.p = POS(ei->v.i_ino, dstart >> 9);
> +	bch2_key_resize(&delextent->k, (dend - dstart) >> 9);
> +	bch2_bkey_append_ptr(&delextent->k_i, ptr);
> +
> +	cur->flags = FIEMAP_EXTENT_DELALLOC;
> +
> +	return 0;
> +}
> +
>  static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
>  		       u64 start, u64 len)
>  {
> @@ -1032,16 +1077,41 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
>  	while (!(ret = btree_trans_too_many_iters(trans)) &&
>  	       (k = bch2_btree_iter_peek_upto(&iter, end)).k &&
>  	       !(ret = bkey_err(k))) {
> +		bool have_delalloc = false;
>  
> -		if (!bkey_extent_is_data(k.k) &&
> -		    k.k->type != KEY_TYPE_reservation) {
> -			bch2_btree_iter_advance(&iter);
> -			continue;
> +		/*
> +		 * If a hole exists before the start of the extent key, scan the
> +		 * range for pagecache data that might be pending writeback and
> +		 * thus not yet exist in the extent tree. Note that this can
> +		 * return -EAGAIN to request to retry folio locks.
> +		 */
> +		if (iter.pos.offset > start) {
> +			ret = bch2_fiemap_hole(vinode, start << 9,
> +					       iter.pos.offset << 9, &cur);
> +			if (!ret)
> +				have_delalloc = true;
> +			else if (ret != -ENOENT)
> +				break;
>  		}

So Kent had pointed out in an earlier chat that this needs to be a
little more robust wrt transaction restart because there is the
potential for spinning under folio lock contention. I think there are a
couple ways to do this that don't necessarily require reworking a bunch
more of this code. One is to simply run a serialization sequence where
we drop trans locks, run a blocking pagecache scan, and then restart
this particular iteration of the loop to account for the fact that the
extent btree may have changed since the btree locks were dropped.

When thinking about that, it occurred to me that this is roughly
equivalent to the TOCTOU nature of fiemap in general, where we may or
may not have caught a delalloc extent before or after it was written
back. Therefore, it seems reasonable to just do the lock cycle, grab the
range of folios covering what was previously identified as a hole
(before dropping locks), and then just report it as delalloc as we would
have had we not dropped locks at all. The code for that seems less
kludgy and follows the allocate_dropping_locks() pattern.

So this is the relevant bit of code I'm working with currently:

                /*
                 * If a hole exists before the start of the extent key, scan the
                 * range for pagecache data that might be pending writeback and
                 * thus not yet exist in the extent tree.
                 *
                 * Since we hold extent btree locks here, we must trylock the
                 * folios to avoid deadlocks with the write path. On trylock
                 * failure, redo the scan with blocking locks in proper lock
                 * order.
                 *
                 * Technically this means that the extent btree could have been
                 * updated from folio writeback, but this is a TOCTOU race and
                 * so we just report the delalloc extent based on the state of
                 * the iter prior to the unlock.
                 */
                if (iter.pos.offset > start) {
                        ret = bch2_fiemap_hole(vinode, start << 9,
                                               iter.pos.offset << 9, &cur, true);
                        if (ret == -EAGAIN) {
                                ret = drop_locks_do(trans,
                                        bch2_fiemap_hole(vinode, start << 9,
                                                         iter.pos.offset << 9,
                                                         &cur, false));
                        }
                        if (!ret)
                                have_delalloc = true;
                        else if (ret != -ENOENT)
                                break;
                }

Note that the last param to bch2_fiemap_hole() is the nonblock param
lifted up from bch2_seek_pagecache_[data|hole](). I also haven't done
anything with -EAGAIN because it originates from the latter and is more
isolated than it was previously (actually I'm thinking about pushing
trans and the whole drop locks sequence down into bch2_fiemap_hole() if
this otherwise seems reasonable). Thoughts?

Brian

>  
> -		ret = bch2_fiemap_extent(trans, &iter, k, &cur);
> -		if (ret)
> -			break;
> +		 /* process the current key if there's no delalloc to report */
> +		if (!have_delalloc) {
> +			if (!bkey_extent_is_data(k.k) &&
> +			    k.k->type != KEY_TYPE_reservation) {
> +				start = bkey_start_offset(k.k) + k.k->size;
> +				bch2_btree_iter_advance(&iter);
> +				continue;
> +			}
> +
> +			ret = bch2_fiemap_extent(trans, &iter, k, &cur);
> +			if (ret)
> +				break;
> +		}
> +
> +		/*
> +		 * Store the current extent in prev so we can flag the last
> +		 * extent on the way out.
> +		 */
>  		bch2_bkey_buf_realloc(&prev.kbuf, c, cur.kbuf.k->k.u64s);
>  		start = cur.kbuf.k->k.p.offset;
>  
> @@ -1061,7 +1131,8 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
>  	start = iter.pos.offset;
>  	bch2_trans_iter_exit(trans, &iter);
>  err:
> -	if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
> +	if (ret == -EAGAIN ||
> +	    bch2_err_matches(ret, BCH_ERR_transaction_restart))
>  		goto retry;
>  
>  	if (!ret && have_extent) {
> -- 
> 2.44.0
> 
> 



* Re: [PATCH v3 4/4] bcachefs: add fiemap delalloc extent detection
  2024-04-18 14:46   ` Brian Foster
@ 2024-04-20  2:02     ` Kent Overstreet
  0 siblings, 0 replies; 9+ messages in thread
From: Kent Overstreet @ 2024-04-20  2:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Thu, Apr 18, 2024 at 10:46:21AM -0400, Brian Foster wrote:
> So Kent had pointed out in an earlier chat that this needs to be a
> little more robust wrt to transaction restart because there is the
> potential for spinning under folio lock contention. I think there are a
> couple ways to do this that don't necessarily require reworking a bunch
> more of this code. One is to simply run a serialization sequence where
> we drop trans locks, run a blocking pagecache scan, and then restart
> this particular iteration of the loop to account for the fact that the
> extent btree may have changed since the btree locks were dropped.

The problem there was using bch2_trans_unlock() and relock() (basically
you were doing an open coded drop_locks_do()) without first attempting a
nonblocking version of the thing in the do.

drop_locks_do() has to be a slowpath thing; if it's something you always
do in a given transaction, it causes livelocks.

the "try nonblocking; then drop_locks_do()" pattern works well because
for most use cases (allocating memory, taking a lock), if we end up
calling trans_unlock(), doing the blocking do, and then take a
transaction restart, the success of the blocking do means the
nonblocking do will likely succeed on the next iteration.
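
To make the shape concrete, here is a toy, self-contained version of
that pattern (the transaction machinery is stubbed out; only the
drop_locks_do() macro mirrors the real unlock/do/relock shape):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* toy stand-ins for the transaction machinery, illustration only */
struct btree_trans { bool locked; };

static void trans_unlock(struct btree_trans *t) { t->locked = false; }
static int  trans_relock(struct btree_trans *t) { t->locked = true; return 0; }

/* unlock, run the expression, then relock (as the bcachefs macro does) */
#define drop_locks_do(_trans, _do)		\
({						\
	trans_unlock(_trans);			\
	int _ret = (_do);			\
	_ret ?: trans_relock(_trans);		\
})

/* pretend the first nonblocking attempt hits folio lock contention */
static int scan(struct btree_trans *t, bool nonblock)
{
	static int calls;

	return nonblock && calls++ == 0 ? -EAGAIN : 0;
}

int main(void)
{
	struct btree_trans trans = { .locked = true };

	/* fast path: nonblocking attempt under btree locks */
	int ret = scan(&trans, true);

	/* slow path: drop trans locks and redo the scan blocking */
	if (ret == -EAGAIN)
		ret = drop_locks_do(&trans, scan(&trans, false));

	printf("ret=%d locked=%d\n", ret, trans.locked);
	return 0;
}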

> When thinking about that, it occurred to me that this is roughly
> equivalent to the TOCTOU nature of fiemap in general, where we may or
> may not have caught a delalloc extent before or after it was written
> back. Therefore, it seems reasonable to just do the lock cycle, grab the
> range of folios covering what was previously identified as a hole
> (before dropping locks), and then just report it as delalloc as we would
> have had we not dropped locks at all. The code for that seems less
> kludgy and follows the allocate_dropping_locks() pattern.
> 
> So this is the relevant bit of code I'm working with currently:
> 
>                 /*
>                  * If a hole exists before the start of the extent key, scan the
>                  * range for pagecache data that might be pending writeback and
>                  * thus not yet exist in the extent tree.
>                  *
>                  * Since we hold extent btree locks here, we must trylock the
>                  * folios to avoid deadlocks with the write path. On trylock
>                  * failure, redo the scan with blocking locks in proper lock
>                  * order.
>                  *
>                  * Technically this means that the extent btree could have been
>                  * updated from folio writeback, but this is a TOCTOU race and
>                  * so we just report the delalloc extent based on the state of
>                  * the iter prior to the unlock.
>                  */
>                 if (iter.pos.offset > start) {
>                         ret = bch2_fiemap_hole(vinode, start << 9,
>                                                iter.pos.offset << 9, &cur, true);
>                         if (ret == -EAGAIN) {
>                                 ret = drop_locks_do(trans,
>                                         bch2_fiemap_hole(vinode, start << 9,
>                                                          iter.pos.offset << 9,
>                                                          &cur, false));
>                         }
>                         if (!ret)
>                                 have_delalloc = true;
>                         else if (ret != -ENOENT)
>                                 break;
>                 }
> 
> Note that the last param to bch2_fiemap_hole() is the nonblock param
> lifted up from bch2_seek_pagecache_[data|hole](). I also haven't done
> anything with -EAGAIN because it originates from the latter and is more
> isolated than it was previously (actually I'm thinking about pushing
> trans and the whole drop locks sequence down into bch2_fiemap_hole() if
> this otherwise seems reasonable). Thoughts?

This will give us tight synchronization between the extents btree and
the pagecache - it's as if we're locking both at the same time, no need
to think about TOCTOU issues - nice.
