All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC] bcachefs: use inode as write point index instead of task
@ 2022-12-12 19:06 Brian Foster
  2022-12-13 18:38 ` Kent Overstreet
  0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2022-12-12 19:06 UTC (permalink / raw)
  To: linux-bcachefs

Use the pointer to the current in-core inode object as the hash of
the write point associated with block allocation for buffered
writes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi bcachefs folks,

I'm posting this RFC more as a discussion point than a direct proposal.
I've been poking around a bit at bcachefs just to grok some of the
basics and whatnot and ended up digging down into the block allocation
code a bit. From there I was trying to make sense of the bucket and
write point abstractions, and then noticed that the write point
structure used to track allocation/write contexts is hashed by task.
IIUC, that means a single task that might be writing to multiple files
will do so via a single write point, or conversely multiple tasks that
might be writing to a single file would do so with multiple write points
(one associated with each task). In turn, that means the allocation
layout of a particular file might be a characteristic of the set of
tasks that initially wrote the file as opposed to something more
traditional based on inode locality.

In thinking about this a bit more, it seems logical that this derives
from bcache because it seems like a perfectly good cache allocation
policy. I suppose it might make sense in certain filesystem workloads
where per-file contiguity might no longer necessarily be the most
important thing in the world, but I'm curious if that is intentional for
bcachefs (the approach is documented, after all) and if there are any
particular workloads or use cases in mind for that sort of approach?

Even if it makes sense in some cases, I wonder whether it's the ideal
approach across the board. For example, consider how the task is tracked
through the inode from buffered write to writeback contexts so the
latter can locate the write point for the last task that wrote to the
file. Even if multiple tasks perform buffered writes to a file, we only
really use the last task that wrote to the file before the current
writeback cycle. Absent frequent writebacks, wouldn't it make some sense
to just associate the write point with the current inode pointer and use
that to minimize per-inode fragmentation?

FWIW, I was able to run some oddball workloads that produce excessive
fragmentation by forcing frequent writeback activity, but my tests so
far are ad hoc and moreso just playing around to help understand
behavior. To be fair, more traditional filesystems don't necessarily
handle the same sort of tests terribly well either. What I found more
interesting is that with the small tweak from this RFC, the allocation
layout from bcachefs went from notably worse to notably better in that
kind of pathological workload. Thoughts?

Brian

 fs/bcachefs/fs-io.c | 5 +----
 fs/bcachefs/fs.h    | 1 -
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/bcachefs/fs-io.c b/fs/bcachefs/fs-io.c
index e7ebb01b4d09..f0cb32f2c437 100644
--- a/fs/bcachefs/fs-io.c
+++ b/fs/bcachefs/fs-io.c
@@ -1389,7 +1389,7 @@ static void bch2_writepage_io_alloc(struct bch_fs *c,
 	op->target		= w->opts.foreground_target;
 	op->nr_replicas		= nr_replicas;
 	op->res.nr_replicas	= nr_replicas;
-	op->write_point		= writepoint_hashed(inode->ei_last_dirtied);
+	op->write_point		= writepoint_hashed((unsigned long) inode);
 	op->subvol		= inode->ei_subvol;
 	op->pos			= POS(inode->v.i_ino, sector);
 	op->end_io		= bch2_writepage_io_done;
@@ -1669,8 +1669,6 @@ int bch2_write_end(struct file *file, struct address_space *mapping,
 			SetPageUptodate(page);
 
 		bch2_set_page_dirty(c, inode, page, res, offset, copied);
-
-		inode->ei_last_dirtied = (unsigned long) current;
 	}
 
 	unlock_page(page);
@@ -1823,7 +1821,6 @@ static int __bch2_buffered_write(struct bch_inode_info *inode,
 	}
 
 	nr_pages_copied = DIV_ROUND_UP(offset + copied, PAGE_SIZE);
-	inode->ei_last_dirtied = (unsigned long) current;
 out:
 	for (i = nr_pages_copied; i < nr_pages; i++) {
 		unlock_page(pages[i]);
diff --git a/fs/bcachefs/fs.h b/fs/bcachefs/fs.h
index 6b91bbe91116..c963994a104a 100644
--- a/fs/bcachefs/fs.h
+++ b/fs/bcachefs/fs.h
@@ -17,7 +17,6 @@ struct bch_inode_info {
 
 	struct mutex		ei_update_lock;
 	u64			ei_quota_reserved;
-	unsigned long		ei_last_dirtied;
 
 	two_state_lock_t	ei_pagecache_lock;
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-12 19:06 [PATCH RFC] bcachefs: use inode as write point index instead of task Brian Foster
@ 2022-12-13 18:38 ` Kent Overstreet
  2022-12-14 17:44   ` Brian Foster
  0 siblings, 1 reply; 17+ messages in thread
From: Kent Overstreet @ 2022-12-13 18:38 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Mon, Dec 12, 2022 at 02:06:02PM -0500, Brian Foster wrote:
> Use the pointer to the current in-core inode object as the hash of
> the write point associated with block allocation for buffered
> writes.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> Hi bcachefs folks,
> 
> I'm posting this RFC more as a discussion point than a direct proposal.
> I've been poking around a bit at bcachefs just to grok some of the
> basics and whatnot and ended up digging down into the block allocation
> code a bit. From there I was trying to make sense of the bucket and
> write point abstractions, and then noticed that the write point
> structure used to track allocation/write contexts is hashed by task.
> IIUC, that means a single task that might be writing to multiple files
> will do so via a single write point, or conversely multiple tasks that
> might be writing to a single file would do so with multiple write points
> (one associated with each task). In turn, that means the allocation
> layout of a particular file might be a characteristic of the set of
> tasks that initially wrote the file as opposed to something more
> traditional based on inode locality.
> 
> In thinking about this a bit more, it seems logical that this derives
> from bcache because it seems like a perfectly good cache allocation
> policy. I suppose it might make sense in certain filesystem workloads
> where per-file contiguity might no longer necessarily be the most
> important thing in the world, but I'm curious if that is intentional for
> bcachefs (the approach is documented, after all) and if there are any
> particular workloads or use cases in mind for that sort of approach?
> 
> Even if it makes sense in some cases, I wonder whether it's the ideal
> approach across the board. For example, consider how the task is tracked
> through the inode from buffered write to writeback contexts so the
> latter can locate the write point for the last task that wrote to the
> file. Even if multiple tasks perform buffered writes to a file, we only
> really use the last task that wrote to the file before the current
> writeback cycle. Absent frequent writebacks, wouldn't it make some sense
> to just associate the write point with the current inode pointer and use
> that to minimize per-inode fragmentation?

Fundamentally, what we're doing is trying to preserve locality, the idea is that
data written by the same process is likely related and should be written out
together.

Consider a cp -r of many small files, or any sort of process that's writing out
to many small files. Using the pid for the writepoint preserves locality, using
the inode number would not.

> FWIW, I was able to run some oddball workloads that produce excessive
> fragmentation by forcing frequent writeback activity, but my tests so
> far are ad hoc and moreso just playing around to help understand
> behavior. To be fair, more traditional filesystems don't necessarily
> handle the same sort of tests terribly well either. What I found more
> interesting is that with the small tweak from this RFC, the allocation
> layout from bcachefs went from notably worse to notably better in that
> kind of pathological workload. Thoughts?

Could you describe more what you were seeing and what the changes were? I think
there is room for improvement here, but we're going to have to carefully think
about and describe what we're trying to acheive.

It's not clear to me how this would have an effect at all in the test you're
describing; forcing frequent writebacks is going to hurt in a COW filesystem,
since every write is an allocation - need more information here.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-13 18:38 ` Kent Overstreet
@ 2022-12-14 17:44   ` Brian Foster
  2022-12-16  0:09     ` Kent Overstreet
  0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2022-12-14 17:44 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Tue, Dec 13, 2022 at 01:38:33PM -0500, Kent Overstreet wrote:
> On Mon, Dec 12, 2022 at 02:06:02PM -0500, Brian Foster wrote:
> > Use the pointer to the current in-core inode object as the hash of
> > the write point associated with block allocation for buffered
> > writes.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > Hi bcachefs folks,
> > 
> > I'm posting this RFC more as a discussion point than a direct proposal.
> > I've been poking around a bit at bcachefs just to grok some of the
> > basics and whatnot and ended up digging down into the block allocation
> > code a bit. From there I was trying to make sense of the bucket and
> > write point abstractions, and then noticed that the write point
> > structure used to track allocation/write contexts is hashed by task.
> > IIUC, that means a single task that might be writing to multiple files
> > will do so via a single write point, or conversely multiple tasks that
> > might be writing to a single file would do so with multiple write points
> > (one associated with each task). In turn, that means the allocation
> > layout of a particular file might be a characteristic of the set of
> > tasks that initially wrote the file as opposed to something more
> > traditional based on inode locality.
> > 
> > In thinking about this a bit more, it seems logical that this derives
> > from bcache because it seems like a perfectly good cache allocation
> > policy. I suppose it might make sense in certain filesystem workloads
> > where per-file contiguity might no longer necessarily be the most
> > important thing in the world, but I'm curious if that is intentional for
> > bcachefs (the approach is documented, after all) and if there are any
> > particular workloads or use cases in mind for that sort of approach?
> > 
> > Even if it makes sense in some cases, I wonder whether it's the ideal
> > approach across the board. For example, consider how the task is tracked
> > through the inode from buffered write to writeback contexts so the
> > latter can locate the write point for the last task that wrote to the
> > file. Even if multiple tasks perform buffered writes to a file, we only
> > really use the last task that wrote to the file before the current
> > writeback cycle. Absent frequent writebacks, wouldn't it make some sense
> > to just associate the write point with the current inode pointer and use
> > that to minimize per-inode fragmentation?
> 
> Fundamentally, what we're doing is trying to preserve locality, the idea is that
> data written by the same process is likely related and should be written out
> together.
> 
> Consider a cp -r of many small files, or any sort of process that's writing out
> to many small files. Using the pid for the writepoint preserves locality, using
> the inode number would not.
> 

Yeah, that makes sense. I'm more curious about the value and tradeoff of
inter-file vs. intra-file locality and the associated workloads that
might or might not benefit. I suppose I can see value of a single writer
copying in a bunch of small files and the allocations being streamlined
as such. OTOH, what's the cost if separate files don't happen to follow
the same write point like this (i.e. as if cp were run one file at a
time)? Is there some indirect/internal to bcachefs benefit to this
behavior, or are we just trying to batch writes?

What about scenarios where the writing task is less relevant? For
example, perhaps something like a file server ingesting data for
multiple sets of large files via a thread pool or some such.. I'm not
sure I see as much value of a worker task doling out per-task hinted
allocations across a set of the N active files it might be processing..

Note that I'm just probing here; trying to understand the origin of this
heuristic, where it might or might not apply between bcache and
bcachefs, etc. I'm not familiar with bcache and have only just scratched
the surface of bcachefs to this point and this just happened to be the
first higher level question that arose when reading through the
allocation code.

The reason it stood out to me is that I'm not aware of fs' going through
major pains to preserve per-task locality across inodes in this sort of
way (XFS, for example, may deliberately spread files over the disk if
they happen to span directories, perhaps for different reasons). OTOH,
we do put effort into reducing per-file fragmentation in scenarios like
concurrent sustained buffered writes or random write/alloc workloads to
sparse files, and then usually hear from users one way or another when
those optimizations aren't effective enough. ;)

> > FWIW, I was able to run some oddball workloads that produce excessive
> > fragmentation by forcing frequent writeback activity, but my tests so
> > far are ad hoc and moreso just playing around to help understand
> > behavior. To be fair, more traditional filesystems don't necessarily
> > handle the same sort of tests terribly well either. What I found more
> > interesting is that with the small tweak from this RFC, the allocation
> > layout from bcachefs went from notably worse to notably better in that
> > kind of pathological workload. Thoughts?
> 
> Could you describe more what you were seeing and what the changes were? I think
> there is room for improvement here, but we're going to have to carefully think
> about and describe what we're trying to acheive.
> 

Eh, I'd have to go back at play around with it to provide specifics.. I
suspect I was doing something like multiple, concurrent, appending
writes to a single file from different tasks, also concurrent with a set
of sync_file_range() workers forcing frequent writeback activity, and
then seeing a notably large number of extents for the relative size of
the file. That turned around completely with the tweak in this patch.

This test wasn't intended to reflect any sort of sane workload. Rather,
to just try and confirm assumptions made from reading the code. I've not
actually tested anything like the more practical example mentioned
above, and I wouldn't expect it to result in anything nearly as
pathological as this test.

Did you have specific improvements or target use cases in mind in this
area..?

> It's not clear to me how this would have an effect at all in the test you're
> describing; forcing frequent writebacks is going to hurt in a COW filesystem,
> since every write is an allocation - need more information here.
> 

Yeah, I was just trying to wrap my head around the initial allocation
case for the time being..

Brian


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-14 17:44   ` Brian Foster
@ 2022-12-16  0:09     ` Kent Overstreet
  2022-12-16  7:04       ` Kent Overstreet
  2022-12-19 15:27       ` Brian Foster
  0 siblings, 2 replies; 17+ messages in thread
From: Kent Overstreet @ 2022-12-16  0:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Wed, Dec 14, 2022 at 12:44:52PM -0500, Brian Foster wrote:
> On Tue, Dec 13, 2022 at 01:38:33PM -0500, Kent Overstreet wrote:
> > Fundamentally, what we're doing is trying to preserve locality, the idea is that
> > data written by the same process is likely related and should be written out
> > together.
> > 
> > Consider a cp -r of many small files, or any sort of process that's writing out
> > to many small files. Using the pid for the writepoint preserves locality, using
> > the inode number would not.
> > 
> 
> Yeah, that makes sense. I'm more curious about the value and tradeoff of
> inter-file vs. intra-file locality and the associated workloads that
> might or might not benefit. I suppose I can see value of a single writer
> copying in a bunch of small files and the allocations being streamlined
> as such. OTOH, what's the cost if separate files don't happen to follow
> the same write point like this (i.e. as if cp were run one file at a
> time)? Is there some indirect/internal to bcachefs benefit to this
> behavior, or are we just trying to batch writes?

There's two main factors.

Firstly, we've only got so many write points, so we're more likely to end up
using the same write points as another application, and interspersing our data
with theirs.

The big factor is bucket based allocation and copygc: preserving locality is
more important in bcachefs than in other filesystems, because we have to
evacuate a bucket before it can be reused. Keeping related data grouped together
in the same bucket means that it's much more likely to all die (be deleted or
overwritten) at the same time, and copygc won't have to do any work.

(Side note: bucket based allocation and copygc is what will make really good
zoned device supoprt possible, which has been part of the design since forever
and something I want to see finished - I've just been focusing on more core
things).

> What about scenarios where the writing task is less relevant? For
> example, perhaps something like a file server ingesting data for
> multiple sets of large files via a thread pool or some such.. I'm not
> sure I see as much value of a worker task doling out per-task hinted
> allocations across a set of the N active files it might be processing..

There's been talk of plumbing a write point API up to userspace before - SSDs
could make good use of it if they had it.

Or, perhaps we could detect when a task is writing to multiple files at the same
time (not one after the other, as cp would) and switch to writepoint-per-file
then.

> The reason it stood out to me is that I'm not aware of fs' going through
> major pains to preserve per-task locality across inodes in this sort of
> way (XFS, for example, may deliberately spread files over the disk if
> they happen to span directories, perhaps for different reasons). OTOH,
> we do put effort into reducing per-file fragmentation in scenarios like
> concurrent sustained buffered writes or random write/alloc workloads to
> sparse files, and then usually hear from users one way or another when
> those optimizations aren't effective enough. ;)

Maybe you could tell me more? I don't know nearly as much as I'd like to about
allocator strategies in XFS; I've never gone as deep here as I'd like beacuse,
quite honestly, people haven't complained :)

(And the company funding me cares very much about streaming bandwidth on
rotating disk, it's something we've looked at. The bigger issue last time we
looked was that readdir order in bcachefs doesn't match create order or sort
order, because directories are implemented as simple open addressing hash
tables - meaning reading back files in a directory incurred unnecessary seeks on
file boundaries. Another todo list item :)

> Eh, I'd have to go back at play around with it to provide specifics.. I
> suspect I was doing something like multiple, concurrent, appending
> writes to a single file from different tasks, also concurrent with a set
> of sync_file_range() workers forcing frequent writeback activity, and
> then seeing a notably large number of extents for the relative size of
> the file. That turned around completely with the tweak in this patch.
> 
> This test wasn't intended to reflect any sort of sane workload. Rather,
> to just try and confirm assumptions made from reading the code. I've not
> actually tested anything like the more practical example mentioned
> above, and I wouldn't expect it to result in anything nearly as
> pathological as this test.

*nod*

> Did you have specific improvements or target use cases in mind in this
> area..?

I do think we could probably be doing something more than using the pid for the
writepoint, I've just been waiting until we see specific workloads where the
current behaviour falls over or have a specific complaint before designing
something new.

Target use cases - I want bcachefs to be able to handle any mainstream
application XFS can (as well as brtfs/ext4/zfs) - i.e. be a truly general
purpose filesystem. Anything needed to make that happen is on the table, long
term - including potentially switching out the entire bucket-based allocator for
something else, as an option - the codebase is flexible enough that we could do
that.

I doubt it'll come to that though, especially now that we've got nocow mode - my
expectation is that we'll be able to get it to do what we need to with perhaps
minor tweaks to writepoint usage. Always contingent upon more data, of course.

More immediately, I'm trying to set things up so that we've got good visibility
into what's going on can see what needs to be tweaked and tuned later.

One big thing we'll want to watch for is excessive copygc - that'll be an
indication our data layout could've been better. We've got backpointers from
buckets back to extents and inodes back to dirents, which means we could pretty
easily have the copygc tracepoints start telling us the specific filenames it
was moving data for.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-16  0:09     ` Kent Overstreet
@ 2022-12-16  7:04       ` Kent Overstreet
  2022-12-19 15:42         ` Brian Foster
  2022-12-19 15:27       ` Brian Foster
  1 sibling, 1 reply; 17+ messages in thread
From: Kent Overstreet @ 2022-12-16  7:04 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Thu, Dec 15, 2022 at 07:09:58PM -0500, Kent Overstreet wrote:
> I do think we could probably be doing something more than using the pid for the
> writepoint, I've just been waiting until we see specific workloads where the
> current behaviour falls over or have a specific complaint before designing
> something new.

Random late night thoughts:

Say we introduce a new object, 'writepoint_handle' or somesuch.

Allocate them when opening a file for write, close them when close the file.

Then we'd be explicitly picking which writepoint to use when allocating the
writepoint_handle; it would be easy to add logic for "if there's a writepoint
which was last used by this process and doesn't currently have any handles
pointing to it, use that".

One of the things that needs to be considered is - what do we do when there's
more writepoint_handles than writepoints?

bcache has some logic for this by tracking when a writepoint was used, and if we
don't find a writepoint that matches up with the IO being issued - pick the
oldest one off an LRU queue. Was dropped in bcachefs because the straight hash
table seemed to work just as well and was faster - or maybe I'm thinking of the
sequential bypass data structure?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-16  0:09     ` Kent Overstreet
  2022-12-16  7:04       ` Kent Overstreet
@ 2022-12-19 15:27       ` Brian Foster
  2022-12-20  1:02         ` Kent Overstreet
  1 sibling, 1 reply; 17+ messages in thread
From: Brian Foster @ 2022-12-19 15:27 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Thu, Dec 15, 2022 at 07:09:58PM -0500, Kent Overstreet wrote:
> On Wed, Dec 14, 2022 at 12:44:52PM -0500, Brian Foster wrote:
> > On Tue, Dec 13, 2022 at 01:38:33PM -0500, Kent Overstreet wrote:
> > > Fundamentally, what we're doing is trying to preserve locality, the idea is that
> > > data written by the same process is likely related and should be written out
> > > together.
> > > 
> > > Consider a cp -r of many small files, or any sort of process that's writing out
> > > to many small files. Using the pid for the writepoint preserves locality, using
> > > the inode number would not.
> > > 
> > 
> > Yeah, that makes sense. I'm more curious about the value and tradeoff of
> > inter-file vs. intra-file locality and the associated workloads that
> > might or might not benefit. I suppose I can see value of a single writer
> > copying in a bunch of small files and the allocations being streamlined
> > as such. OTOH, what's the cost if separate files don't happen to follow
> > the same write point like this (i.e. as if cp were run one file at a
> > time)? Is there some indirect/internal to bcachefs benefit to this
> > behavior, or are we just trying to batch writes?
> 
> There's two main factors.
> 
> Firstly, we've only got so many write points, so we're more likely to end up
> using the same write points as another application, and interspersing our data
> with theirs.
> 
> The big factor is bucket based allocation and copygc: preserving locality is
> more important in bcachefs than in other filesystems, because we have to
> evacuate a bucket before it can be reused. Keeping related data grouped together
> in the same bucket means that it's much more likely to all die (be deleted or
> overwritten) at the same time, and copygc won't have to do any work.
> 

OK, thanks. This is context I didn't have and gives me a bit more to dig
into.

> (Side note: bucket based allocation and copygc is what will make really good
> zoned device supoprt possible, which has been part of the design since forever
> and something I want to see finished - I've just been focusing on more core
> things).
> 
> > What about scenarios where the writing task is less relevant? For
> > example, perhaps something like a file server ingesting data for
> > multiple sets of large files via a thread pool or some such.. I'm not
> > sure I see as much value of a worker task doling out per-task hinted
> > allocations across a set of the N active files it might be processing..
> 
> There's been talk of plumbing a write point API up to userspace before - SSDs
> could make good use of it if they had it.
> 
> Or, perhaps we could detect when a task is writing to multiple files at the same
> time (not one after the other, as cp would) and switch to writepoint-per-file
> then.
> 

That's more along the lines of the direction I was expecting this to go
(if anywhere)...

> > The reason it stood out to me is that I'm not aware of fs' going through
> > major pains to preserve per-task locality across inodes in this sort of
> > way (XFS, for example, may deliberately spread files over the disk if
> > they happen to span directories, perhaps for different reasons). OTOH,
> > we do put effort into reducing per-file fragmentation in scenarios like
> > concurrent sustained buffered writes or random write/alloc workloads to
> > sparse files, and then usually hear from users one way or another when
> > those optimizations aren't effective enough. ;)
> 
> Maybe you could tell me more? I don't know nearly as much as I'd like to about
> allocator strategies in XFS; I've never gone as deep here as I'd like beacuse,
> quite honestly, people haven't complained :)
> 

Yeah.. I suspect typical buffered write/writeback behavior turns this
more into a potential refinement than a bug or anything.

A couple of the more common optimizations XFS uses are speculative
preallocation and extent size hints. The former is designed to mitigate
fragmentation, particularly in the case of concurrent sustained writes.
Basically it will selectively increase the size of appending delalloc
reservations beyond current eof in anticipation of further writes. In
the meantime, writeback will attempt to allocate maximal sized physical
extents for contiguous delalloc ranges. Finally, the allocation itself
will start off with a simple hint based on the physical location of the
inode. This helps ensure extents are eventually maximally sized whenever
sufficient contiguous free extents are available and similarly ensures
as related inodes are removed, contiguous extents are freed together.
Excess/unused prealloc blocks are eventually reclaimed in the background
or as needed.

Extent size hints are more for random write/allocation scenarios and
must be set by the user. For example, consider a sparse vdisk image
seeing random small writes all over the place. If we allocate single
blocks at a time, fragmentation and the extent count can eventually
explode out of control. An extent size hint of 1MB or so ensures every
new allocation is sized/aligned as such and so helps mitigate that sort
of problem as more of the file is allocated.

Of course XFS is fundamentally different in that it's not a COW fs, so
might have different concerns. It supports reflinks, but that's a
relatively recent feature compared to the allocation heuristics and not
something they were designed around or significantly updated for (since
COW is not default behavior, although I believe an always_cow mode does
exist).

> (And the company funding me cares very much about streaming bandwidth on
> rotating disk, it's something we've looked at. The bigger issue last time we
> looked was that readdir order in bcachefs doesn't match create order or sort
> order, because directories are implemented as simple open addressing hash
> tables - meaning reading back files in a directory incurred unnecessary seeks on
> file boundaries. Another todo list item :)
> 
> > Eh, I'd have to go back at play around with it to provide specifics.. I
> > suspect I was doing something like multiple, concurrent, appending
> > writes to a single file from different tasks, also concurrent with a set
> > of sync_file_range() workers forcing frequent writeback activity, and
> > then seeing a notably large number of extents for the relative size of
> > the file. That turned around completely with the tweak in this patch.
> > 
> > This test wasn't intended to reflect any sort of sane workload. Rather,
> > to just try and confirm assumptions made from reading the code. I've not
> > actually tested anything like the more practical example mentioned
> > above, and I wouldn't expect it to result in anything nearly as
> > pathological as this test.
> 
> *nod*
> 
> > Did you have specific improvements or target use cases in mind in this
> > area..?
> 
> I do think we could probably be doing something more than using the pid for the
> writepoint, I've just been waiting until we see specific workloads where the
> current behaviour falls over or have a specific complaint before designing
> something new.
> 

Ok. Based on the above, it kind of sounds like a worse case scenario
might be something like N files allocated by the same task in such a way
that each bucket ends up split between the N files, and then some number
of files end up removed. Rinse and repeat that sort of thing across new
sets of files and then presumably we'd have increasing amount of free
space in partially used buckets that cannot be allocated..?

Is copygc responsible for cleaning things up in such a case in order to
create more usable free space (hence the excessive copygc comment
below)?

> Target use cases - I want bcachefs to be able to handle any mainstream
> application XFS can (as well as brtfs/ext4/zfs) - i.e. be a truly general
> purpose filesystem. Anything needed to make that happen is on the table, long
> term - including potentially switching out the entire bucket-based allocator for
> something else, as an option - the codebase is flexible enough that we could do
> that.
> 

Good to know.

> I doubt it'll come to that though, especially now that we've got nocow mode - my
> expectation is that we'll be able to get it to do what we need to with perhaps
> minor tweaks to writepoint usage. Always contingent upon more data, of course.
> 
> More immediately, I'm trying to set things up so that we've got good visibility
> into what's going on can see what needs to be tweaked and tuned later.
> 
> One big thing we'll want to watch for is excessive copygc - that'll be an
> indication our data layout could've been better. We've got backpointers from
> buckets back to extents and inodes back to dirents, which means we could pretty
> easily have the copygc tracepoints start telling us the specific filenames it
> was moving data for.
> 

Hmm.. Ok, that gives me another area to look into re: copygc. ;) Thanks
for all of the feedback and context..

Brian


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-16  7:04       ` Kent Overstreet
@ 2022-12-19 15:42         ` Brian Foster
  2022-12-20  1:56           ` Kent Overstreet
  0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2022-12-19 15:42 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Fri, Dec 16, 2022 at 02:04:45AM -0500, Kent Overstreet wrote:
> On Thu, Dec 15, 2022 at 07:09:58PM -0500, Kent Overstreet wrote:
> > I do think we could probably be doing something more than using the pid for the
> > writepoint, I've just been waiting until we see specific workloads where the
> > current behaviour falls over or have a specific complaint before designing
> > something new.
> 
> Random late night thoughts:
> 
> Say we introduce a new object, 'writepoint_handle' or somesuch.
> 
> Allocate them when opening a file for write, close them when close the file.
> 
> Then we'd be explicitly picking which writepoint to use when allocating the
> writepoint_handle; it would be easy to add logic for "if there's a writepoint
> which was last used by this process and doesn't currently have any handles
> pointing to it, use that".
> 

Ok, but if we alloc the handle at open (or first write or whatever),
we'd still need to potentially keep it around after ->release() (i.e.
userspace close()) while the mapping is dirty and thus still needs to be
written back, right? 

If so, perhaps this would need some additional state to track an
"active" writepoint, explicitly defined as a "writepoint with currently
open files" as opposed to simply a handle pointer? IOW, if the task is
no longer writing to the previous file, it's probably Ok to reuse that
writepoint even though the handle might still have a reference..?

But generally I think I get the idea: preserve the current ability for a
single sequential writer to use the same writepoint across N files, but
fall back to a separate writepoint where we otherwise detect multi-file
activity. I think that makes sense, though I'd probably have to think a
bit more about an explicit open() -> close() handle lifecycle and
whether that's robust enough for fileserver like use cases. I.e., I'd be
a little concerned about whether that workload might make inter-spersed
sub-file writes look a bit too much like the single user open -> write
-> close -> repeat use case..

> One of the things that needs to be considered is - what do we do when there's
> more writepoint_handles than writepoints?
> 

Does bcachefs have to deal with something like that today? For example
if there is some max number of writepoints, what happens if there might
be some greater number of tasks doing allocations at the same time?

Brian

> bcache has some logic for this by tracking when a writepoint was used, and if we
> don't find a writepoint that matches up with the IO being issued - pick the
> oldest one off an LRU queue. Was dropped in bcachefs because the straight hash
> table seemed to work just as well and was faster - or maybe I'm thinking of the
> sequential bypass data structure?
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-19 15:27       ` Brian Foster
@ 2022-12-20  1:02         ` Kent Overstreet
  2022-12-22 14:03           ` Brian Foster
  0 siblings, 1 reply; 17+ messages in thread
From: Kent Overstreet @ 2022-12-20  1:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Mon, Dec 19, 2022 at 10:27:23AM -0500, Brian Foster wrote:
> A couple of the more common optimizations XFS uses are speculative
> preallocation and extent size hints. The former is designed to mitigate
> fragmentation, particularly in the case of concurrent sustained writes.
> Basically it will selectively increase the size of appending delalloc
> reservations beyond current eof in anticipation of further writes. In
> the meantime, writeback will attempt to allocate maximal sized physical
> extents for contiguous delalloc ranges. Finally, the allocation itself
> will start off with a simple hint based on the physical location of the
> inode. This helps ensure extents are eventually maximally sized whenever
> sufficient contiguous free extents are available and similarly ensures
> as related inodes are removed, contiguous extents are freed together.
> Excess/unused prealloc blocks are eventually reclaimed in the background
> or as needed.
> 
> Extent size hints are more for random write/allocation scenarios and
> must be set by the user. For example, consider a sparse vdisk image
> seeing random small writes all over the place. If we allocate single
> blocks at a time, fragmentation and the extent count can eventually
> explode out of control. An extent size hint of 1MB or so ensures every
> new allocation is sized/aligned as such and so helps mitigate that sort
> of problem as more of the file is allocated.
> 
> Of course XFS is fundamentally different in that it's not a COW fs, so
> might have different concerns. It supports reflinks, but that's a
> relatively recent feature compared to the allocation heuristics and not
> something they were designed around or significantly updated for (since
> COW is not default behavior, although I believe an always_cow mode does
> exist).

*nod* Yeah, I've been wondering how much this stuff makes sense in the context
of a COW filesystem.

But we do have nocow mode, complete with unwritten extents. If we need to go the
delalloc route, I think the existing allocator design should be able to support
that (we can pin space purely as an in memory operation, but we do have a fixed
number of those so we have to be careful about introducing deadlocks).

It sounds like the optimizations XFS is doing are trying to ensure that writes
remain contiguous on disk even when buffered writeback isn't batching them up as
much as we'd like? Is that something we still feel is important?
Pagecache/system memory size keeps going up but seek times do not (and go down
in the case of flash); it's not clear to me that this is still important today.

> Ok. Based on the above, it kind of sounds like a worse case scenario
> might be something like N files allocated by the same task in such a way
> that each bucket ends up split between the N files, and then some number
> of files end up removed. Rinse and repeat that sort of thing across new
> sets of files and then presumably we'd have increasing amount of free
> space in partially used buckets that cannot be allocated..?

Yep, that would do it.

> 
> Is copygc responsible for cleaning things up in such a case in order to
> create more usable free space (hence the excessive copygc comment
> below)?

Correct. Copygc finds buckets that are mostly but not completely empty and
evacuates them - writes the data in them to new buckets.

Copygc doesn't do any file-level defragmentation, but now that we have
backpointers it could.

> Hmm.. Ok, that gives me another area to look into re: copygc. ;) Thanks
> for all of the feedback and context..

Feel free to hit me up on IRC as you're looking at code. I'm also currently
working on the copygc code - we have a persistent fragmentation index about to
land, which will be a drastic improvement to copygc scalability. Not relevant to
what you're looking at, but the code is at least fresh in my mind :)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-19 15:42         ` Brian Foster
@ 2022-12-20  1:56           ` Kent Overstreet
  2022-12-28 22:24             ` Eric Wheeler
  0 siblings, 1 reply; 17+ messages in thread
From: Kent Overstreet @ 2022-12-20  1:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Mon, Dec 19, 2022 at 10:42:09AM -0500, Brian Foster wrote:
> On Fri, Dec 16, 2022 at 02:04:45AM -0500, Kent Overstreet wrote:
> > On Thu, Dec 15, 2022 at 07:09:58PM -0500, Kent Overstreet wrote:
> > > I do think we could probably be doing something more than using the pid for the
> > > writepoint, I've just been waiting until we see specific workloads where the
> > > current behaviour falls over or have a specific complaint before designing
> > > something new.
> > 
> > Random late night thoughts:
> > 
> > Say we introduce a new object, 'writepoint_handle' or somesuch.
> > 
> > Allocate them when opening a file for write, close them when close the file.
> > 
> > Then we'd be explicitly picking which writepoint to use when allocating the
> > writepoint_handle; it would be easy to add logic for "if there's a writepoint
> > which was last used by this process and doesn't currently have any handles
> > pointing to it, use that".
> > 
> 
> Ok, but if we alloc the handle at open (or first write or whatever),
> we'd still need to potentially keep it around after ->release() (i.e.
> userspace close()) while the mapping is dirty and thus still needs to be
> written back, right? 

Yeah, very much just spitballing here, it's not a fully formed thought :)

> If so, perhaps this would need some additional state to track an
> "active" writepoint, explicitly defined as a "writepoint with currently
> open files" as opposed to simply a handle pointer? IOW, if the task is
> no longer writing to the previous file, it's probably Ok to reuse that
> writepoint even though the handle might still have a reference..?

Yeah, I'm envisioning explicitly allocating writepoints instead of just using a
hash function to pick them, and then when we allocate a writepoint we'd try to
pick one that isn't currently being used.

> But generally I think I get the idea: preserve the current ability for a
> single sequential writer to use the same writepoint across N files, but
> fall back to a separate writepoint where we otherwise detect multi-file
> activity. I think that makes sense, though I'd probably have to think a
> bit more about an explicit open() -> close() handle lifecycle and
> whether that's robust enough for fileserver like use cases. I.e., I'd be
> a little concerned about whether that workload might make inter-spersed
> sub-file writes look a bit too much like the single user open -> write
> -> close -> repeat use case..

If an application closes a file, perhaps that should be an explicit hint that we
don't expect more writes to it...?

> > One of the things that needs to be considered is - what do we do when there's
> > more writepoint_handles than writepoints?
> > 
> 
> Does bcachefs have to deal with something like that today? For example
> if there is some max number of writepoints, what happens if there might
> be some greater number of tasks doing allocations at the same time?

There's a hardcoded number of writepoints (32, IIRC?) - if we have more tasks
doing writes than writepoints then writes are going to end up interspresed.

We have to be careful about increasing this or making it dynamic, since it's
inherently pinning other resources - although when the filesystem is very nearly
full there's code to steal from other writepoints, so we're not deadlocking
because all our free space is stranded on other writepoints.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-20  1:02         ` Kent Overstreet
@ 2022-12-22 14:03           ` Brian Foster
  2022-12-23  4:36             ` Kent Overstreet
  0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2022-12-22 14:03 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Mon, Dec 19, 2022 at 08:02:16PM -0500, Kent Overstreet wrote:
> On Mon, Dec 19, 2022 at 10:27:23AM -0500, Brian Foster wrote:
> > A couple of the more common optimizations XFS uses are speculative
> > preallocation and extent size hints. The former is designed to mitigate
> > fragmentation, particularly in the case of concurrent sustained writes.
> > Basically it will selectively increase the size of appending delalloc
> > reservations beyond current eof in anticipation of further writes. In
> > the meantime, writeback will attempt to allocate maximal sized physical
> > extents for contiguous delalloc ranges. Finally, the allocation itself
> > will start off with a simple hint based on the physical location of the
> > inode. This helps ensure extents are eventually maximally sized whenever
> > sufficient contiguous free extents are available and similarly ensures
> > as related inodes are removed, contiguous extents are freed together.
> > Excess/unused prealloc blocks are eventually reclaimed in the background
> > or as needed.
> > 
> > Extent size hints are more for random write/allocation scenarios and
> > must be set by the user. For example, consider a sparse vdisk image
> > seeing random small writes all over the place. If we allocate single
> > blocks at a time, fragmentation and the extent count can eventually
> > explode out of control. An extent size hint of 1MB or so ensures every
> > new allocation is sized/aligned as such and so helps mitigate that sort
> > of problem as more of the file is allocated.
> > 
> > Of course XFS is fundamentally different in that it's not a COW fs, so
> > might have different concerns. It supports reflinks, but that's a
> > relatively recent feature compared to the allocation heuristics and not
> > something they were designed around or significantly updated for (since
> > COW is not default behavior, although I believe an always_cow mode does
> > exist).
> 
> *nod* Yeah, I've been wondering how much this stuff makes sense in the context
> of a COW filesystem.
> 
> But we do have nocow mode, complete with unwritten extents. If we need to go the
> delalloc route, I think the existing allocator design should be able to support
> that (we can pin space purely as an in memory operation, but we do have a fixed
> number of those so we have to be careful about introducing deadlocks).
> 
> It sounds like the optimizations XFS is doing are trying to ensure that writes
> remain contiguous on disk even when buffered writeback isn't batching them up as
> much as we'd like? Is that something we still feel is important?
> Pagecache/system memory size keeps going up but seek times do not (and go down
> in the case of flash); it's not clear to me that this is still important today.
> 

That's a good question and I don't really know the answer. I suspect
there is more to it than the fundamental principles of hardware and
related improvements. In practice these sorts of things still improve fs
efficiency, performance, scalability, aging (perhaps under less than
ideal hardware/workload/resource conditions), etc. Given the relative
low cost (in terms of complexity) of the implementation, they certainly
aren't things I see going away from fs' like XFS any time soon. The
underlying concepts may just not be as generically relevant (i.e. useful
across different fs implementations) as perhaps they might have been in
the past.

When you think about it, it is kind of amusing to see things like the fs
attempt to create as large/contiguous mappings as possible, only for
writeback to subsequently have to explicitly break them up into smaller
I/O requests because otherwise the massive amount of in-core metadata
status updates that result (i.e. clearing per-page writeback state)
leads to excessive completion latency. ;) OTOH, if that eventually leads
to more use of things like large folios, then perhaps that's an overall
win.

Anyways, I just bring these things up here for reference and discussion
purposes..

> > Ok. Based on the above, it kind of sounds like a worse case scenario
> > might be something like N files allocated by the same task in such a way
> > that each bucket ends up split between the N files, and then some number
> > of files end up removed. Rinse and repeat that sort of thing across new
> > sets of files and then presumably we'd have increasing amount of free
> > space in partially used buckets that cannot be allocated..?
> 
> Yep, that would do it.
> 

Ok.

> > 
> > Is copygc responsible for cleaning things up in such a case in order to
> > create more usable free space (hence the excessive copygc comment
> > below)?
> 
> Correct. Copygc finds buckets that are mostly but not completely empty and
> evacuates them - writes the data in them to new buckets.
> 
> Copygc doesn't do any file-level defragmentation, but now that we have
> backpointers it could.
> 

Cool.

> > Hmm.. Ok, that gives me another area to look into re: copygc. ;) Thanks
> > for all of the feedback and context..
> 
> Feel free to hit me up on IRC as you're looking at code. I'm also currently
> working on the copygc code - we have a persistent fragmentation index about to
> land, which will be a drastic improvement to copygc scalability. Not relevant to
> what you're looking at, but the code is at least fresh in my mind :)
> 

Thanks. Appreciate the feedback here and in the other subthread. I'm
currently mostly trying to grok core concepts and map them to areas of
code and such, but will undoubtedly have more questions once I get more
into details..

Brian


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-22 14:03           ` Brian Foster
@ 2022-12-23  4:36             ` Kent Overstreet
  2022-12-23 11:49               ` Brian Foster
  0 siblings, 1 reply; 17+ messages in thread
From: Kent Overstreet @ 2022-12-23  4:36 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Thu, Dec 22, 2022 at 09:03:22AM -0500, Brian Foster wrote:
> On Mon, Dec 19, 2022 at 08:02:16PM -0500, Kent Overstreet wrote:
> > On Mon, Dec 19, 2022 at 10:27:23AM -0500, Brian Foster wrote:
> > > A couple of the more common optimizations XFS uses are speculative
> > > preallocation and extent size hints. The former is designed to mitigate
> > > fragmentation, particularly in the case of concurrent sustained writes.
> > > Basically it will selectively increase the size of appending delalloc
> > > reservations beyond current eof in anticipation of further writes. In
> > > the meantime, writeback will attempt to allocate maximal sized physical
> > > extents for contiguous delalloc ranges. Finally, the allocation itself
> > > will start off with a simple hint based on the physical location of the
> > > inode. This helps ensure extents are eventually maximally sized whenever
> > > sufficient contiguous free extents are available and similarly ensures
> > > as related inodes are removed, contiguous extents are freed together.
> > > Excess/unused prealloc blocks are eventually reclaimed in the background
> > > or as needed.
> > > 
> > > Extent size hints are more for random write/allocation scenarios and
> > > must be set by the user. For example, consider a sparse vdisk image
> > > seeing random small writes all over the place. If we allocate single
> > > blocks at a time, fragmentation and the extent count can eventually
> > > explode out of control. An extent size hint of 1MB or so ensures every
> > > new allocation is sized/aligned as such and so helps mitigate that sort
> > > of problem as more of the file is allocated.
> > > 
> > > Of course XFS is fundamentally different in that it's not a COW fs, so
> > > might have different concerns. It supports reflinks, but that's a
> > > relatively recent feature compared to the allocation heuristics and not
> > > something they were designed around or significantly updated for (since
> > > COW is not default behavior, although I believe an always_cow mode does
> > > exist).
> > 
> > *nod* Yeah, I've been wondering how much this stuff makes sense in the context
> > of a COW filesystem.
> > 
> > But we do have nocow mode, complete with unwritten extents. If we need to go the
> > delalloc route, I think the existing allocator design should be able to support
> > that (we can pin space purely as an in memory operation, but we do have a fixed
> > number of those so we have to be careful about introducing deadlocks).
> > 
> > It sounds like the optimizations XFS is doing are trying to ensure that writes
> > remain contiguous on disk even when buffered writeback isn't batching them up as
> > much as we'd like? Is that something we still feel is important?
> > Pagecache/system memory size keeps going up but seek times do not (and go down
> > in the case of flash); it's not clear to me that this is still important today.
> > 
> 
> That's a good question and I don't really know the answer. I suspect
> there is more to it than the fundamental principles of hardware and
> related improvements. In practice these sorts of things still improve fs
> efficiency, performance, scalability, aging (perhaps under less than
> ideal hardware/workload/resource conditions), etc. Given the relative
> low cost (in terms of complexity) of the implementation, they certainly
> aren't things I see going away from fs' like XFS any time soon. The
> underlying concepts may just not be as generically relevant (i.e. useful
> across different fs implementations) as perhaps they might have been in
> the past.

Certainly no pressing reason to drop that code from XFS, but bcachefs is a clean
slate (and COW introduces different challenges) so I have to think about things
differently.

> 
> When you think about it, it is kind of amusing to see things like the fs
> attempt to create as large/contiguous mappings as possible, only for
> writeback to subsequently have to explicitly break them up into smaller
> I/O requests because otherwise the massive amount of in-core metadata
> status updates that result (i.e. clearing per-page writeback state)
> leads to excessive completion latency. ;) OTOH, if that eventually leads
> to more use of things like large folios, then perhaps that's an overall
> win.

Hmm? That sounds like an odd way to describe things.

Folios are certainly long overdue and are going to be a big help, but more in
the buffered IO paths than writeback I expect. Writeback can and does aggregate
adjacent pages into the same IO; in bcachefs this is right now limited to 2 MB
in practice because we represent the IO from the very start as a bio, but large
folios + multipage bvecs should finally get us past that limit.

IOW, once I do the large folio conversion writeback ought to be generating
bucket sized extents - when data checksums are off, as they'll otherwise limit
extent size (and that restriction will be lifted once we get extents with block
granular/variable granularity checksums).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-23  4:36             ` Kent Overstreet
@ 2022-12-23 11:49               ` Brian Foster
  2022-12-23 18:02                 ` Kent Overstreet
  0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2022-12-23 11:49 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Thu, Dec 22, 2022 at 11:36:18PM -0500, Kent Overstreet wrote:
> On Thu, Dec 22, 2022 at 09:03:22AM -0500, Brian Foster wrote:
> > On Mon, Dec 19, 2022 at 08:02:16PM -0500, Kent Overstreet wrote:
> > > On Mon, Dec 19, 2022 at 10:27:23AM -0500, Brian Foster wrote:
> > > > A couple of the more common optimizations XFS uses are speculative
> > > > preallocation and extent size hints. The former is designed to mitigate
> > > > fragmentation, particularly in the case of concurrent sustained writes.
> > > > Basically it will selectively increase the size of appending delalloc
> > > > reservations beyond current eof in anticipation of further writes. In
> > > > the meantime, writeback will attempt to allocate maximal sized physical
> > > > extents for contiguous delalloc ranges. Finally, the allocation itself
> > > > will start off with a simple hint based on the physical location of the
> > > > inode. This helps ensure extents are eventually maximally sized whenever
> > > > sufficient contiguous free extents are available and similarly ensures
> > > > as related inodes are removed, contiguous extents are freed together.
> > > > Excess/unused prealloc blocks are eventually reclaimed in the background
> > > > or as needed.
> > > > 
> > > > Extent size hints are more for random write/allocation scenarios and
> > > > must be set by the user. For example, consider a sparse vdisk image
> > > > seeing random small writes all over the place. If we allocate single
> > > > blocks at a time, fragmentation and the extent count can eventually
> > > > explode out of control. An extent size hint of 1MB or so ensures every
> > > > new allocation is sized/aligned as such and so helps mitigate that sort
> > > > of problem as more of the file is allocated.
> > > > 
> > > > Of course XFS is fundamentally different in that it's not a COW fs, so
> > > > might have different concerns. It supports reflinks, but that's a
> > > > relatively recent feature compared to the allocation heuristics and not
> > > > something they were designed around or significantly updated for (since
> > > > COW is not default behavior, although I believe an always_cow mode does
> > > > exist).
> > > 
> > > *nod* Yeah, I've been wondering how much this stuff makes sense in the context
> > > of a COW filesystem.
> > > 
> > > But we do have nocow mode, complete with unwritten extents. If we need to go the
> > > delalloc route, I think the existing allocator design should be able to support
> > > that (we can pin space purely as an in memory operation, but we do have a fixed
> > > number of those so we have to be careful about introducing deadlocks).
> > > 
> > > It sounds like the optimizations XFS is doing are trying to ensure that writes
> > > remain contiguous on disk even when buffered writeback isn't batching them up as
> > > much as we'd like? Is that something we still feel is important?
> > > Pagecache/system memory size keeps going up but seek times do not (and go down
> > > in the case of flash); it's not clear to me that this is still important today.
> > > 
> > 
> > That's a good question and I don't really know the answer. I suspect
> > there is more to it than the fundamental principles of hardware and
> > related improvements. In practice these sorts of things still improve fs
> > efficiency, performance, scalability, aging (perhaps under less than
> > ideal hardware/workload/resource conditions), etc. Given the relative
> > low cost (in terms of complexity) of the implementation, they certainly
> > aren't things I see going away from fs' like XFS any time soon. The
> > underlying concepts may just not be as generically relevant (i.e. useful
> > across different fs implementations) as perhaps they might have been in
> > the past.
> 
> Certainly no pressing reason to drop that code from XFS, but bcachefs is a clean
> slate (and COW introduces different challenges) so I have to think about things
> differently.
> 
> > 
> > When you think about it, it is kind of amusing to see things like the fs
> > attempt to create as large/contiguous mappings as possible, only for
> > writeback to subsequently have to explicitly break them up into smaller
> > I/O requests because otherwise the massive amount of in-core metadata
> > status updates that result (i.e. clearing per-page writeback state)
> > leads to excessive completion latency. ;) OTOH, if that eventually leads
> > to more use of things like large folios, then perhaps that's an overall
> > win.
> 
> Hmm? That sounds like an odd way to describe things.
> 

IIRC that's pretty much the current behavior with XFS and iomap. XFS
aggressively allocates large contiguous extents, writeback thus
constructs large enough bio chains in the iomap ioend that completion
processing can produce soft lockups just dealing with pages involved
with the bio chain. Therefore the ioend became capped to a max number of
chained bios. Of course writeback still carries on submitting the same
amount of overall I/O either way, just made up of more ioends with
smaller bio chains. But I suspect as folio size is able to increase,
we'll be back to constructing larger I/Os based on fewer folios and thus
with less processing overhead per submission (in scenarios where
multipage bvecs doesn't already do so, at least).

Brian

> Folios are certainly long overdue and are going to be a big help, but more in
> the buffered IO paths than writeback I expect. Writeback can and does aggregate
> adjacent pages into the same IO; in bcachefs this is right now limited to 2 MB
> in practice because we represent the IO from the very start as a bio, but large
> folios + multipage bvecs should finally get us past that limit.
> 
> IOW, once I do the large folio conversion writeback ought to be generating
> bucket sized extents - when data checksums are off, as they'll otherwise limit
> extent size (and that restriction will be lifted once we get extents with block
> granular/variable granularity checksums).
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-23 11:49               ` Brian Foster
@ 2022-12-23 18:02                 ` Kent Overstreet
  0 siblings, 0 replies; 17+ messages in thread
From: Kent Overstreet @ 2022-12-23 18:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Fri, Dec 23, 2022 at 06:49:05AM -0500, Brian Foster wrote:
> IIRC that's pretty much the current behavior with XFS and iomap. XFS
> aggressively allocates large contiguous extents, writeback thus
> constructs large enough bio chains in the iomap ioend that completion
> processing can produce soft lockups just dealing with pages involved
> with the bio chain. Therefore the ioend became capped to a max number of
> chained bios. Of course writeback still carries on submitting the same
> amount of overall I/O either way, just made up of more ioends with
> smaller bio chains. But I suspect as folio size is able to increase,
> we'll be back to constructing larger I/Os based on fewer folios and thus
> with less processing overhead per submission (in scenarios where
> multipage bvecs doesn't already do so, at least).

Multipage bvecs just move segment merging up to bio_add_page() and allow more
pages to be stuffed into a bio when they're contiguous in memory - they don't
help with completion processing. Large folios will.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-20  1:56           ` Kent Overstreet
@ 2022-12-28 22:24             ` Eric Wheeler
  2022-12-29 20:59               ` Kent Overstreet
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wheeler @ 2022-12-28 22:24 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Brian Foster, linux-bcachefs

On Mon, 19 Dec 2022, Kent Overstreet wrote:
> On Mon, Dec 19, 2022 at 10:42:09AM -0500, Brian Foster wrote:
> > On Fri, Dec 16, 2022 at 02:04:45AM -0500, Kent Overstreet wrote:
> > > On Thu, Dec 15, 2022 at 07:09:58PM -0500, Kent Overstreet wrote:
> > > > I do think we could probably be doing something more than using the pid for the
> > > > writepoint, I've just been waiting until we see specific workloads where the
> > > > current behaviour falls over or have a specific complaint before designing
> > > > something new.
> > > 
> > > Random late night thoughts:
> > > 
> > > Say we introduce a new object, 'writepoint_handle' or somesuch.
> > > 
> > > Allocate them when opening a file for write, close them when close the file.
> > > 
> > > Then we'd be explicitly picking which writepoint to use when allocating the
> > > writepoint_handle; it would be easy to add logic for "if there's a writepoint
> > > which was last used by this process and doesn't currently have any handles
> > > pointing to it, use that".
> > > 
> > 
> > Ok, but if we alloc the handle at open (or first write or whatever),
> > we'd still need to potentially keep it around after ->release() (i.e.
> > userspace close()) while the mapping is dirty and thus still needs to be
> > written back, right? 
> 
> Yeah, very much just spitballing here, it's not a fully formed thought :)
> 
> > If so, perhaps this would need some additional state to track an
> > "active" writepoint, explicitly defined as a "writepoint with currently
> > open files" as opposed to simply a handle pointer? IOW, if the task is
> > no longer writing to the previous file, it's probably Ok to reuse that
> > writepoint even though the handle might still have a reference..?
> 
> Yeah, I'm envisioning explicitly allocating writepoints instead of just using a
> hash function to pick them, and then when we allocate a writepoint we'd try to
> pick one that isn't currently being used.
> 
> > But generally I think I get the idea: preserve the current ability for a
> > single sequential writer to use the same writepoint across N files, but
> > fall back to a separate writepoint where we otherwise detect multi-file
> > activity. I think that makes sense, though I'd probably have to think a
> > bit more about an explicit open() -> close() handle lifecycle and
> > whether that's robust enough for fileserver like use cases. I.e., I'd be
> > a little concerned about whether that workload might make inter-spersed
> > sub-file writes look a bit too much like the single user open -> write
> > -> close -> repeat use case..
> 
> If an application closes a file, perhaps that should be an explicit hint that we
> don't expect more writes to it...?
> 
> > > One of the things that needs to be considered is - what do we do when there's
> > > more writepoint_handles than writepoints?
> > > 
> > 
> > Does bcachefs have to deal with something like that today? For example
> > if there is some max number of writepoints, what happens if there might
> > be some greater number of tasks doing allocations at the same time?
> 
> There's a hardcoded number of writepoints (32, IIRC?) - if we have more tasks
> doing writes than writepoints then writes are going to end up interspresed.
> 
> We have to be careful about increasing this or making it dynamic, since it's
> inherently pinning other resources - although when the filesystem is very nearly
> full there's code to steal from other writepoints, so we're not deadlocking
> because all our free space is stranded on other writepoints.

This reminds me of an LWN article I read many years ago about how SD cards 
handle "open segments" in their FTL:
	https://lwn.net/Articles/428584/

	"Restrictions on open segments: One major difference between
	the various [sdcard] manufacturers is how many segments they
	can write to at any given time.  Starting to write a segment
	requires another physical segment, or two in case of a data
	logging algorithm, to be reserved, and requires some RAM on the
	embedded microcontroller to maintain the segment. Writing to a
	new segment will cause garbage collection on a previously open
	segment. That can lead to thrashing as the drive must repeatedly
	switch open segments"

SD cards are often used on today's low-end embedded ARM hardware for 
during prototyping (such as Raspberry Pi and others), so this may still be 
relevant today.  If bcachefs is capable of operating with very few 
writepoints (ie, if max writepoints were configurable), then it may 
perform much better on low-end SD cards.

Making the bcachefs bucket size equal to the segment size would help a 
lot, I'm sure, but this data is rarely published by manufacturers.  
However, it can be probed:
	https://git.linaro.org/people/arnd/flashbench.git/about/

It might be useful to make the writepoints limit configurable at mkfs 
time.  mkfs could even probe the storage volume to find out when 
increasing the number of possible writepoints causes write latency to 
spike.  The flash "segment size" could be probed similarly and 
auto-configure some of the bcachefs tunables.

Anyway, food for thought.

--
Eric Wheeler


 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-28 22:24             ` Eric Wheeler
@ 2022-12-29 20:59               ` Kent Overstreet
  2022-12-29 22:26                 ` Eric Wheeler
  0 siblings, 1 reply; 17+ messages in thread
From: Kent Overstreet @ 2022-12-29 20:59 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Brian Foster, linux-bcachefs

On Wed, Dec 28, 2022 at 02:24:45PM -0800, Eric Wheeler wrote:
 
> This reminds me of an LWN article I read many years ago about how SD cards 
> handle "open segments" in their FTL:
> 	https://lwn.net/Articles/428584/
> 
> 	"Restrictions on open segments: One major difference between
> 	the various [sdcard] manufacturers is how many segments they
> 	can write to at any given time.  Starting to write a segment
> 	requires another physical segment, or two in case of a data
> 	logging algorithm, to be reserved, and requires some RAM on the
> 	embedded microcontroller to maintain the segment. Writing to a
> 	new segment will cause garbage collection on a previously open
> 	segment. That can lead to thrashing as the drive must repeatedly
> 	switch open segments"
> 
> SD cards are often used on today's low-end embedded ARM hardware for 
> during prototyping (such as Raspberry Pi and others), so this may still be 
> relevant today.  If bcachefs is capable of operating with very few 
> writepoints (ie, if max writepoints were configurable), then it may 
> perform much better on low-end SD cards.

It is. When the filesystem is close to full we effectively operate with fewer
writepoints to avoid stranded free space, adding a config option would be pretty
trivial.

> Making the bcachefs bucket size equal to the segment size would help a 
> lot, I'm sure, but this data is rarely published by manufacturers.  
> However, it can be probed:
> 	https://git.linaro.org/people/arnd/flashbench.git/about/
> 
> It might be useful to make the writepoints limit configurable at mkfs 
> time.  mkfs could even probe the storage volume to find out when 
> increasing the number of possible writepoints causes write latency to 
> spike.  The flash "segment size" could be probed similarly and 
> auto-configure some of the bcachefs tunables.

More interesting to me would be finishing ZNS support - that's going to be
considerably more robust than any probing, any time there's an FTL underneath we
can't know that data is being laid out the way we want it to. Perhaps SD card
FTLs are simple enough that this could be made to work, but we'd be at the mercy
of opaque implementation details.

But if we get ZNS support finished, perhaps we could get SD card manufacturers
to start exposing that interface or something equivalent, and that should get us
not only better performance but potentially better reliability too - we'd be
able to implement proper wear leveling, something SD cards at least historically
haven't generally done.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-29 20:59               ` Kent Overstreet
@ 2022-12-29 22:26                 ` Eric Wheeler
  2022-12-30  3:18                   ` Kent Overstreet
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wheeler @ 2022-12-29 22:26 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Brian Foster, linux-bcachefs

On Thu, 29 Dec 2022, Kent Overstreet wrote:
> On Wed, Dec 28, 2022 at 02:24:45PM -0800, Eric Wheeler wrote:
>  
> > This reminds me of an LWN article I read many years ago about how SD cards 
> > handle "open segments" in their FTL:
> > 	https://lwn.net/Articles/428584/
> > 
> > 	"Restrictions on open segments: One major difference between
> > 	the various [sdcard] manufacturers is how many segments they
> > 	can write to at any given time.  Starting to write a segment
> > 	requires another physical segment, or two in case of a data
> > 	logging algorithm, to be reserved, and requires some RAM on the
> > 	embedded microcontroller to maintain the segment. Writing to a
> > 	new segment will cause garbage collection on a previously open
> > 	segment. That can lead to thrashing as the drive must repeatedly
> > 	switch open segments"
> > 
> > SD cards are often used on today's low-end embedded ARM hardware for 
> > during prototyping (such as Raspberry Pi and others), so this may still be 
> > relevant today.  If bcachefs is capable of operating with very few 
> > writepoints (ie, if max writepoints were configurable), then it may 
> > perform much better on low-end SD cards.
> 
> It is. When the filesystem is close to full we effectively operate with fewer
> writepoints to avoid stranded free space, adding a config option would be pretty
> trivial.
> 
> > Making the bcachefs bucket size equal to the segment size would help a 
> > lot, I'm sure, but this data is rarely published by manufacturers.  
> > However, it can be probed:
> > 	https://git.linaro.org/people/arnd/flashbench.git/about/
> > 
> > It might be useful to make the writepoints limit configurable at mkfs 
> > time.  mkfs could even probe the storage volume to find out when 
> > increasing the number of possible writepoints causes write latency to 
> > spike.  The flash "segment size" could be probed similarly and 
> > auto-configure some of the bcachefs tunables.
> 
> More interesting to me would be finishing ZNS support - that's going to be
> considerably more robust than any probing, any time there's an FTL underneath we
> can't know that data is being laid out the way we want it to. Perhaps SD card
> FTLs are simple enough that this could be made to work, but we'd be at the mercy
> of opaque implementation details.
> 
> But if we get ZNS support finished, perhaps we could get SD card manufacturers
> to start exposing that interface or something equivalent, and that should get us
> not only better performance but potentially better reliability too - we'd be
> able to implement proper wear leveling, something SD cards at least historically
> haven't generally done.

That would be great!

--
Eric Wheeler



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
  2022-12-29 22:26                 ` Eric Wheeler
@ 2022-12-30  3:18                   ` Kent Overstreet
  0 siblings, 0 replies; 17+ messages in thread
From: Kent Overstreet @ 2022-12-30  3:18 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Brian Foster, linux-bcachefs

On Thu, Dec 29, 2022 at 02:26:03PM -0800, Eric Wheeler wrote:
> That would be great!

Certainly would be. Going to be awhile with current resources, though.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-12-30  3:19 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-12 19:06 [PATCH RFC] bcachefs: use inode as write point index instead of task Brian Foster
2022-12-13 18:38 ` Kent Overstreet
2022-12-14 17:44   ` Brian Foster
2022-12-16  0:09     ` Kent Overstreet
2022-12-16  7:04       ` Kent Overstreet
2022-12-19 15:42         ` Brian Foster
2022-12-20  1:56           ` Kent Overstreet
2022-12-28 22:24             ` Eric Wheeler
2022-12-29 20:59               ` Kent Overstreet
2022-12-29 22:26                 ` Eric Wheeler
2022-12-30  3:18                   ` Kent Overstreet
2022-12-19 15:27       ` Brian Foster
2022-12-20  1:02         ` Kent Overstreet
2022-12-22 14:03           ` Brian Foster
2022-12-23  4:36             ` Kent Overstreet
2022-12-23 11:49               ` Brian Foster
2022-12-23 18:02                 ` Kent Overstreet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.