linux-bcachefs.vger.kernel.org archive mirror
* [PATCH v4 0/4] bcachefs: fiemap delalloc support and cleanup
@ 2024-04-19 18:15 Brian Foster
  2024-04-19 18:15 ` [PATCH v4 1/4] bcachefs: drop duplicate fiemap sync flag Brian Foster
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Brian Foster @ 2024-04-19 18:15 UTC (permalink / raw)
  To: linux-bcachefs

Hi all,

This is mostly equivalent to v3 except that bch2_fiemap_hole() now
implements a do_drop_locks()-style pattern to manage locking order
between folios and the extent tree. The pagecache scan is factored out
into a new helper to facilitate the retry.

Patches also available in my test branch:

  https://evilpiepirate.org/~testdashboard/ci?branch=bfoster

Brian

v4:
- Implement a do_drop_locks()-like pattern in bch2_fiemap_hole() to
  avoid deadlocks as well as livelocks or excessive spinning.
v3: https://lore.kernel.org/linux-bcachefs/20240408144846.1001243-1-bfoster@redhat.com/
- Use nonblocking pagecache seek to address extent btree vs. folio
  deadlocks.
v2: https://lore.kernel.org/linux-bcachefs/20240117155611.248042-1-bfoster@redhat.com/
- Clean up bch2_fiemap() a bit to use processing helpers.
- Added patch for sync flag fixup.
v1: https://lore.kernel.org/linux-bcachefs/20231219140215.300753-1-bfoster@redhat.com/

Brian Foster (4):
  bcachefs: drop duplicate fiemap sync flag
  bcachefs: track current fiemap offset in start variable
  bcachefs: refactor fiemap processing into extent helper and struct
  bcachefs: add fiemap delalloc extent detection

 fs/bcachefs/fs.c | 207 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 166 insertions(+), 41 deletions(-)

-- 
2.44.0



* [PATCH v4 1/4] bcachefs: drop duplicate fiemap sync flag
  2024-04-19 18:15 [PATCH v4 0/4] bcachefs: fiemap delalloc support and cleanup Brian Foster
@ 2024-04-19 18:15 ` Brian Foster
  2024-04-19 18:15 ` [PATCH v4 2/4] bcachefs: track current fiemap offset in start variable Brian Foster
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Brian Foster @ 2024-04-19 18:15 UTC (permalink / raw)
  To: linux-bcachefs

FIEMAP_FLAG_SYNC handling was deliberately moved into core code in
commit 45dd052e67ad ("fs: handle FIEMAP_FLAG_SYNC in fiemap_prep"),
released in kernel v5.8. Update bcachefs accordingly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/bcachefs/fs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index de307e2323bb..f6cd14c9b38c 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -963,7 +963,7 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 	u32 snapshot;
 	int ret = 0;
 
-	ret = fiemap_prep(&ei->v, info, start, &len, FIEMAP_FLAG_SYNC);
+	ret = fiemap_prep(&ei->v, info, start, &len, 0);
 	if (ret)
 		return ret;
 
-- 
2.44.0



* [PATCH v4 2/4] bcachefs: track current fiemap offset in start variable
  2024-04-19 18:15 [PATCH v4 0/4] bcachefs: fiemap delalloc support and cleanup Brian Foster
  2024-04-19 18:15 ` [PATCH v4 1/4] bcachefs: drop duplicate fiemap sync flag Brian Foster
@ 2024-04-19 18:15 ` Brian Foster
  2024-04-19 18:15 ` [PATCH v4 3/4] bcachefs: refactor fiemap processing into extent helper and struct Brian Foster
  2024-04-19 18:15 ` [PATCH v4 4/4] bcachefs: add fiemap delalloc extent detection Brian Foster
  3 siblings, 0 replies; 5+ messages in thread
From: Brian Foster @ 2024-04-19 18:15 UTC (permalink / raw)
  To: linux-bcachefs

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/bcachefs/fs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index f6cd14c9b38c..8b806e33c6c9 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -1011,6 +1011,7 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 		bch2_bkey_buf_realloc(&prev, c, k.k->u64s);
 
 		sectors = min(sectors, k.k->size - offset_into_extent);
+		start = iter.pos.offset + sectors;
 
 		bch2_cut_front(POS(k.k->p.inode,
 				   bkey_start_offset(k.k) +
@@ -1031,8 +1032,7 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 		bkey_copy(prev.k, cur.k);
 		have_extent = true;
 
-		bch2_btree_iter_set_pos(&iter,
-			POS(iter.pos.inode, iter.pos.offset + sectors));
+		bch2_btree_iter_set_pos(&iter, POS(iter.pos.inode, start));
 
 		ret = bch2_trans_relock(trans);
 		if (ret)
-- 
2.44.0



* [PATCH v4 3/4] bcachefs: refactor fiemap processing into extent helper and struct
  2024-04-19 18:15 [PATCH v4 0/4] bcachefs: fiemap delalloc support and cleanup Brian Foster
  2024-04-19 18:15 ` [PATCH v4 1/4] bcachefs: drop duplicate fiemap sync flag Brian Foster
  2024-04-19 18:15 ` [PATCH v4 2/4] bcachefs: track current fiemap offset in start variable Brian Foster
@ 2024-04-19 18:15 ` Brian Foster
  2024-04-19 18:15 ` [PATCH v4 4/4] bcachefs: add fiemap delalloc extent detection Brian Foster
  3 siblings, 0 replies; 5+ messages in thread
From: Brian Foster @ 2024-04-19 18:15 UTC (permalink / raw)
  To: linux-bcachefs

The bulk of the loop in bch2_fiemap() involves processing the
current extent key from the iter, including following indirections
and trimming the extent size. This patch makes a few changes to
reduce the size of the loop and facilitate future changes to
support delalloc extents.

Define a new bch_fiemap_extent structure to wrap the bkey buffer
that holds the extent key to report to userspace along with
associated fiemap flags. Update bch2_fill_extent() to take the
bch_fiemap_extent as a param instead of the individual fields.
Finally, lift the bulk of the extent processing into a
bch2_fiemap_extent() helper that takes the current key and formats
the bch_fiemap_extent appropriately for the fill function.

No functional changes intended by this patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/bcachefs/fs.c | 90 +++++++++++++++++++++++++++++-------------------
 1 file changed, 55 insertions(+), 35 deletions(-)

diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index 8b806e33c6c9..5c6af360a61f 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -892,10 +892,18 @@ static int bch2_tmpfile(struct mnt_idmap *idmap,
 	return finish_open_simple(file, 0);
 }
 
+struct bch_fiemap_extent {
+	struct bkey_buf	kbuf;
+	unsigned	flags;
+};
+
 static int bch2_fill_extent(struct bch_fs *c,
 			    struct fiemap_extent_info *info,
-			    struct bkey_s_c k, unsigned flags)
+			    struct bch_fiemap_extent *fe)
 {
+	struct bkey_s_c k = bkey_i_to_s_c(fe->kbuf.k);
+	unsigned flags = fe->flags;
+
 	if (bkey_extent_is_direct_data(k.k)) {
 		struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
 		const union bch_extent_entry *entry;
@@ -948,6 +956,39 @@ static int bch2_fill_extent(struct bch_fs *c,
 	}
 }
 
+static int bch2_fiemap_extent(struct btree_trans *trans,
+			      struct btree_iter *iter, struct bkey_s_c k,
+			      struct bch_fiemap_extent *cur)
+{
+	unsigned	offset_into_extent, sectors;
+	enum btree_id	data_btree = BTREE_ID_extents;
+	int		ret;
+
+	offset_into_extent	= iter->pos.offset - bkey_start_offset(k.k);
+	sectors			= k.k->size - offset_into_extent;
+
+	bch2_bkey_buf_reassemble(&cur->kbuf, trans->c, k);
+
+	ret = bch2_read_indirect_extent(trans, &data_btree, &offset_into_extent,
+					&cur->kbuf);
+	if (ret)
+		return ret;
+
+	k = bkey_i_to_s_c(cur->kbuf.k);
+	sectors = min(sectors, k.k->size - offset_into_extent);
+
+	bch2_cut_front(POS(k.k->p.inode,
+			   bkey_start_offset(k.k) + offset_into_extent),
+		       cur->kbuf.k);
+	bch2_key_resize(&cur->kbuf.k->k, sectors);
+	cur->kbuf.k->k.p = iter->pos;
+	cur->kbuf.k->k.p.offset += cur->kbuf.k->k.size;
+
+	cur->flags = 0;
+
+	return 0;
+}
+
 static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 		       u64 start, u64 len)
 {
@@ -956,9 +997,8 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 	struct btree_trans *trans;
 	struct btree_iter iter;
 	struct bkey_s_c k;
-	struct bkey_buf cur, prev;
+	struct bch_fiemap_extent cur, prev;
 	struct bpos end = POS(ei->v.i_ino, (start + len) >> 9);
-	unsigned offset_into_extent, sectors;
 	bool have_extent = false;
 	u32 snapshot;
 	int ret = 0;
@@ -972,8 +1012,8 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 
 	start >>= 9;
 
-	bch2_bkey_buf_init(&cur);
-	bch2_bkey_buf_init(&prev);
+	bch2_bkey_buf_init(&cur.kbuf);
+	bch2_bkey_buf_init(&prev.kbuf);
 	trans = bch2_trans_get(c);
 retry:
 	bch2_trans_begin(trans);
@@ -988,7 +1028,6 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 	while (!(ret = btree_trans_too_many_iters(trans)) &&
 	       (k = bch2_btree_iter_peek_upto(&iter, end)).k &&
 	       !(ret = bkey_err(k))) {
-		enum btree_id data_btree = BTREE_ID_extents;
 
 		if (!bkey_extent_is_data(k.k) &&
 		    k.k->type != KEY_TYPE_reservation) {
@@ -996,40 +1035,21 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 			continue;
 		}
 
-		offset_into_extent	= iter.pos.offset -
-			bkey_start_offset(k.k);
-		sectors			= k.k->size - offset_into_extent;
-
-		bch2_bkey_buf_reassemble(&cur, c, k);
-
-		ret = bch2_read_indirect_extent(trans, &data_btree,
-					&offset_into_extent, &cur);
+		ret = bch2_fiemap_extent(trans, &iter, k, &cur);
 		if (ret)
 			break;
-
-		k = bkey_i_to_s_c(cur.k);
-		bch2_bkey_buf_realloc(&prev, c, k.k->u64s);
-
-		sectors = min(sectors, k.k->size - offset_into_extent);
-		start = iter.pos.offset + sectors;
-
-		bch2_cut_front(POS(k.k->p.inode,
-				   bkey_start_offset(k.k) +
-				   offset_into_extent),
-			       cur.k);
-		bch2_key_resize(&cur.k->k, sectors);
-		cur.k->k.p = iter.pos;
-		cur.k->k.p.offset += cur.k->k.size;
+		bch2_bkey_buf_realloc(&prev.kbuf, c, cur.kbuf.k->k.u64s);
+		start = cur.kbuf.k->k.p.offset;
 
 		if (have_extent) {
 			bch2_trans_unlock(trans);
-			ret = bch2_fill_extent(c, info,
-					bkey_i_to_s_c(prev.k), 0);
+			ret = bch2_fill_extent(c, info, &prev);
 			if (ret)
 				break;
 		}
 
-		bkey_copy(prev.k, cur.k);
+		bkey_copy(prev.kbuf.k, cur.kbuf.k);
+		prev.flags = cur.flags;
 		have_extent = true;
 
 		bch2_btree_iter_set_pos(&iter, POS(iter.pos.inode, start));
@@ -1046,13 +1066,13 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 
 	if (!ret && have_extent) {
 		bch2_trans_unlock(trans);
-		ret = bch2_fill_extent(c, info, bkey_i_to_s_c(prev.k),
-				       FIEMAP_EXTENT_LAST);
+		prev.flags |= FIEMAP_EXTENT_LAST;
+		ret = bch2_fill_extent(c, info, &prev);
 	}
 
 	bch2_trans_put(trans);
-	bch2_bkey_buf_exit(&cur, c);
-	bch2_bkey_buf_exit(&prev, c);
+	bch2_bkey_buf_exit(&cur.kbuf, c);
+	bch2_bkey_buf_exit(&prev.kbuf, c);
 	return ret < 0 ? ret : 0;
 }
 
-- 
2.44.0



* [PATCH v4 4/4] bcachefs: add fiemap delalloc extent detection
  2024-04-19 18:15 [PATCH v4 0/4] bcachefs: fiemap delalloc support and cleanup Brian Foster
                   ` (2 preceding siblings ...)
  2024-04-19 18:15 ` [PATCH v4 3/4] bcachefs: refactor fiemap processing into extent helper and struct Brian Foster
@ 2024-04-19 18:15 ` Brian Foster
  3 siblings, 0 replies; 5+ messages in thread
From: Brian Foster @ 2024-04-19 18:15 UTC (permalink / raw)
  To: linux-bcachefs

bcachefs currently populates fiemap data from the extents btree.
This works correctly when the fiemap sync flag is provided, but if
not, it skips all delalloc extents that have not yet been flushed.
This is because delalloc extents from buffered writes are first
stored as reservation in the pagecache, and only become resident in
the extents btree after writeback completes.

Update the fiemap implementation to process holes between extents by
scanning pagecache for data, via seek data/hole. If a valid data
range is found over a hole in the extent btree, fake up an extent
key and flag the extent as delalloc for reporting to userspace.

Note that this does not necessarily change behavior for the case
where there is dirty pagecache over already written extents: when in
COW mode, writeback will allocate new blocks for the underlying
ranges. The existing behavior is consistent with btrfs, and it is
recommended to use the sync flag for the most up-to-date extent
state from fiemap.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/bcachefs/fs.c | 119 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 112 insertions(+), 7 deletions(-)

diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index 5c6af360a61f..b5f95a929bb2 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -989,6 +989,87 @@ static int bch2_fiemap_extent(struct btree_trans *trans,
 	return 0;
 }
 
+/*
+ * Scan a range of an inode for data in pagecache.
+ *
+ * Intended to be retryable, so don't modify the output params until success is
+ * imminent.
+ */
+static int
+bch2_fiemap_hole_pagecache(struct inode *vinode, u64 *start, u64 *end,
+			   bool nonblock)
+{
+	loff_t	dstart, dend;
+
+	dstart = bch2_seek_pagecache_data(vinode, *start, *end, 0, nonblock);
+	if (dstart < 0)
+		return dstart;
+	if (dstart >= *end)
+		return -ENOENT;
+
+	dend = bch2_seek_pagecache_hole(vinode, dstart, *end, 0, nonblock);
+	if (dend < 0)
+		return dend;
+
+	*start = dstart;
+	*end = dend;
+	return 0;
+}
+
+/*
+ * Scan a range of pagecache that corresponds to a file mapping hole in the
+ * extent btree. If data is found, fake up an extent key so it looks like a
+ * delalloc extent to the rest of the fiemap processing code.
+ *
+ * Returns 0 if cached data was found, -ENOENT if not.
+ */
+static int
+bch2_fiemap_hole(struct btree_trans *trans, struct inode *vinode, u64 start,
+		 u64 end, struct bch_fiemap_extent *cur)
+{
+	struct bch_fs		*c = vinode->i_sb->s_fs_info;
+	struct bch_inode_info	*ei = to_bch_ei(vinode);
+	struct bkey_i_extent	*delextent;
+	struct bch_extent_ptr	ptr = {};
+	loff_t			dstart = start, dend = end;
+	int			ret;
+
+	/*
+	 * We hold btree locks here so we cannot block on folio locks without
+	 * dropping trans locks first. Run a nonblocking scan for the common
+	 * case of no folios over holes and fall back on failure.
+	 *
+	 * Note that dropping locks like this is technically racy against
+	 * writeback inserting to the extent tree, but a non-sync fiemap scan is
+	 * fundamentally racy with writeback anyways. Therefore, just report the
+	 * range as delalloc regardless of whether we have to cycle trans locks.
+	 */
+	ret = bch2_fiemap_hole_pagecache(vinode, &dstart, &dend, true);
+	if (ret == -EAGAIN) {
+		/* open coded drop_locks_do() to relock even on error */
+		bch2_trans_unlock(trans);
+		ret = bch2_fiemap_hole_pagecache(vinode, &dstart, &dend, false);
+		bch2_trans_relock(trans);
+	}
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Create a fake extent key in the buffer. We have to add a dummy extent
+	 * pointer for the fill code to add an extent entry. It's explicitly
+	 * zeroed to reflect delayed allocation (i.e. phys offset 0).
+	 */
+	bch2_bkey_buf_realloc(&cur->kbuf, c, sizeof(*delextent) / sizeof(u64));
+	delextent = bkey_extent_init(cur->kbuf.k);
+	delextent->k.p = POS(ei->v.i_ino, dstart >> 9);
+	bch2_key_resize(&delextent->k, (dend - dstart) >> 9);
+	bch2_bkey_append_ptr(&delextent->k_i, ptr);
+
+	cur->flags = FIEMAP_EXTENT_DELALLOC;
+
+	return 0;
+}
+
 static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 		       u64 start, u64 len)
 {
@@ -1028,16 +1109,40 @@ static int bch2_fiemap(struct inode *vinode, struct fiemap_extent_info *info,
 	while (!(ret = btree_trans_too_many_iters(trans)) &&
 	       (k = bch2_btree_iter_peek_upto(&iter, end)).k &&
 	       !(ret = bkey_err(k))) {
+		bool have_delalloc = false;
 
-		if (!bkey_extent_is_data(k.k) &&
-		    k.k->type != KEY_TYPE_reservation) {
-			bch2_btree_iter_advance(&iter);
-			continue;
+		/*
+		 * If a hole exists before the start of the extent key, scan the
+		 * range for pagecache data that might be pending writeback and
+		 * thus not yet exist in the extent tree.
+		 */
+		if (iter.pos.offset > start) {
+			ret = bch2_fiemap_hole(trans, vinode, start << 9,
+					       iter.pos.offset << 9, &cur);
+			if (!ret)
+				have_delalloc = true;
+			else if (ret != -ENOENT)
+				break;
 		}
 
-		ret = bch2_fiemap_extent(trans, &iter, k, &cur);
-		if (ret)
-			break;
+		 /* process the current key if there's no delalloc to report */
+		if (!have_delalloc) {
+			if (!bkey_extent_is_data(k.k) &&
+			    k.k->type != KEY_TYPE_reservation) {
+				start = bkey_start_offset(k.k) + k.k->size;
+				bch2_btree_iter_advance(&iter);
+				continue;
+			}
+
+			ret = bch2_fiemap_extent(trans, &iter, k, &cur);
+			if (ret)
+				break;
+		}
+
+		/*
+		 * Store the current extent in prev so we can flag the last
+		 * extent on the way out.
+		 */
 		bch2_bkey_buf_realloc(&prev.kbuf, c, cur.kbuf.k->k.u64s);
 		start = cur.kbuf.k->k.p.offset;
 
-- 
2.44.0


