[PATCH RFC] bcachefs: use inode as write point index instead of task

* [PATCH RFC] bcachefs: use inode as write point index instead of task
@ 2022-12-12 19:06 Brian Foster
  2022-12-13 18:38 ` Kent Overstreet
  0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2022-12-12 19:06 UTC (permalink / raw)
  To: linux-bcachefs

Use the pointer to the current in-core inode object as the hash of
the write point associated with block allocation for buffered
writes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi bcachefs folks,

I'm posting this RFC more as a discussion point than a direct proposal.
I've been poking around a bit at bcachefs just to grok some of the
basics and whatnot and ended up digging down into the block allocation
code a bit. From there I was trying to make sense of the bucket and
write point abstractions, and then noticed that the write point
structure used to track allocation/write contexts is hashed by task.
IIUC, that means a single task that might be writing to multiple files
will do so via a single write point, or conversely multiple tasks that
might be writing to a single file would do so with multiple write points
(one associated with each task). In turn, that means the allocation
layout of a particular file might be a characteristic of the set of
tasks that initially wrote the file as opposed to something more
traditional based on inode locality.

In thinking about this a bit more, it seems logical that this derives
from bcache because it seems like a perfectly good cache allocation
policy. I suppose it might make sense in certain filesystem workloads
where per-file contiguity might no longer necessarily be the most
important thing in the world, but I'm curious if that is intentional for
bcachefs (the approach is documented, after all) and if there are any
particular workloads or use cases in mind for that sort of approach?

Even if it makes sense in some cases, I wonder whether it's the ideal
approach across the board. For example, consider how the task is tracked
through the inode from buffered write to writeback contexts so the
latter can locate the write point for the last task that wrote to the
file. Even if multiple tasks perform buffered writes to a file, we only
really use the last task that wrote to the file before the current
writeback cycle. Absent frequent writebacks, wouldn't it make some sense
to just associate the write point with the current inode pointer and use
that to minimize per-inode fragmentation?

FWIW, I was able to run some oddball workloads that produce excessive
fragmentation by forcing frequent writeback activity, but my tests so
far are ad hoc and moreso just playing around to help understand
behavior. To be fair, more traditional filesystems don't necessarily
handle the same sort of tests terribly well either. What I found more
interesting is that with the small tweak from this RFC, the allocation
layout from bcachefs went from notably worse to notably better in that
kind of pathological workload. Thoughts?

Brian

 fs/bcachefs/fs-io.c | 5 +----
 fs/bcachefs/fs.h    | 1 -
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/bcachefs/fs-io.c b/fs/bcachefs/fs-io.c
index e7ebb01b4d09..f0cb32f2c437 100644
--- a/fs/bcachefs/fs-io.c
+++ b/fs/bcachefs/fs-io.c
@@ -1389,7 +1389,7 @@ static void bch2_writepage_io_alloc(struct bch_fs *c,
 	op->target		= w->opts.foreground_target;
 	op->nr_replicas		= nr_replicas;
 	op->res.nr_replicas	= nr_replicas;
-	op->write_point		= writepoint_hashed(inode->ei_last_dirtied);
+	op->write_point		= writepoint_hashed((unsigned long) inode);
 	op->subvol		= inode->ei_subvol;
 	op->pos			= POS(inode->v.i_ino, sector);
 	op->end_io		= bch2_writepage_io_done;
@@ -1669,8 +1669,6 @@ int bch2_write_end(struct file *file, struct address_space *mapping,
 			SetPageUptodate(page);
 
 		bch2_set_page_dirty(c, inode, page, res, offset, copied);
-
-		inode->ei_last_dirtied = (unsigned long) current;
 	}
 
 	unlock_page(page);
@@ -1823,7 +1821,6 @@ static int __bch2_buffered_write(struct bch_inode_info *inode,
 	}
 
 	nr_pages_copied = DIV_ROUND_UP(offset + copied, PAGE_SIZE);
-	inode->ei_last_dirtied = (unsigned long) current;
 out:
 	for (i = nr_pages_copied; i < nr_pages; i++) {
 		unlock_page(pages[i]);
diff --git a/fs/bcachefs/fs.h b/fs/bcachefs/fs.h
index 6b91bbe91116..c963994a104a 100644
--- a/fs/bcachefs/fs.h
+++ b/fs/bcachefs/fs.h
@@ -17,7 +17,6 @@ struct bch_inode_info {
 
 	struct mutex		ei_update_lock;
 	u64			ei_quota_reserved;
-	unsigned long		ei_last_dirtied;
 
 	two_state_lock_t	ei_pagecache_lock;
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread