linux-bcachefs.vger.kernel.org archive mirror
* [LSF/MM/BPF TOPIC] bcachefs
@ 2021-06-02 23:07 Kent Overstreet
  2021-06-30 18:18 ` Dan Robertson
  0 siblings, 1 reply; 15+ messages in thread
From: Kent Overstreet @ 2021-06-02 23:07 UTC (permalink / raw)
  To: lsf-pc, linux-fsdevel, linux-bcachefs

bcachefs is coming along nicely :) I'd like to give a talk about where things
are at, and get comments and feedback on the roadmap.

The short of it is, things are stabilizing nicely and snapshots are right around
the corner - snapshots are largely complete and from initial
testing/benchmarking it's looking really nice, better than I anticipated, and
I'm hoping to (finally!) merge bcachefs upstream sometime after snapshots are
merged.

I've recently been working on the roadmap; it goes into some detail about the
work that still needs to be done:

https://bcachefs.org/Roadmap/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] bcachefs
  2021-06-02 23:07 [LSF/MM/BPF TOPIC] bcachefs Kent Overstreet
@ 2021-06-30 18:18 ` Dan Robertson
  0 siblings, 0 replies; 15+ messages in thread
From: Dan Robertson @ 2021-06-30 18:18 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: lsf-pc, linux-fsdevel, linux-bcachefs

[-- Attachment #1: Type: text/plain, Size: 846 bytes --]

On Wed, Jun 02, 2021 at 07:07:53PM -0400, Kent Overstreet wrote:
> bcachefs is coming along nicely :) I'd like to give a talk about where things
> are at, and get comments and feedback on the roadmap.
>
> The short of it is, things are stabilizing nicely and snapshots are right around
> the corner - snapshots are largely complete and from initial
> testing/benchmarking it's looking really nice, better than I anticipated, and
> I'm hoping to (finally!) merge bcachefs upstream sometime after snapshots are
> merged.
>
> I've recently been working on the roadmap, it goes into some detail about work
> that still needs to be done:
>
> https://bcachefs.org/Roadmap/

As someone still new to the bcachefs code, I'd be very interested in this
topic - the testing roadmap as well as the feature roadmap.

Cheers,

 - Dan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] bcachefs
  2024-01-03 17:52       ` Kent Overstreet
  2024-01-03 19:22         ` Carl E. Thompson
@ 2024-01-04  7:43         ` Viacheslav Dubeyko
  1 sibling, 0 replies; 15+ messages in thread
From: Viacheslav Dubeyko @ 2024-01-04  7:43 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: lsf-pc, linux-bcachefs, linux-fsdevel



> On Jan 3, 2024, at 8:52 PM, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> On Wed, Jan 03, 2024 at 10:39:50AM +0300, Viacheslav Dubeyko wrote:
>> 
>> 
>>> On Jan 2, 2024, at 7:05 PM, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>>> 
>>> On Tue, Jan 02, 2024 at 11:02:59AM +0300, Viacheslav Dubeyko wrote:
>>>> 
>>>> 
>>>>> On Jan 2, 2024, at 1:56 AM, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>>>>> 
>>>>> LSF topic: bcachefs status & roadmap
>>>>> 
>>>> 
>>>> <skipped>
>>>> 
>>>>> 
>>>>> A delayed allocation for btree nodes mode is coming, which is the main
>>>>> piece needed for ZNS support
>>>>> 
>>>> 
>>>> I could miss some emails. But have you shared the vision of ZNS support
>>>> architecture for the case of bcachefs already? It will be interesting to hear
>>>> the high-level concept.
>>> 
>>> There's not a whole lot to it. bcache/bcachefs allocation is already
>>> bucket based, where the model is that we allocate a bucket, then write
>>> to it sequentially and never overwrite until the whole bucket is reused.
>>> 
>>> The main exception has been btree nodes, which are log structured and
>>> typically smaller than a bucket; that doesn't break the "no overwrites"
>>> property ZNS wants, but it does mean writes within a bucket aren't
>>> happening sequentially.
>>> 
>>> So I'm adding a mode where every time we do a btree node write we write
>>> out the whole node to a new location, instead of appending at an
>>> existing location. It won't be as efficient for random updates across a
>>> large working set, but in practice that doesn't happen too much; average
>>> btree write size has always been quite high on any filesystem I've
>>> looked at.
>>> 
>>> Aside from that, it's mostly just plumbing and integration; bcachefs on
>>> ZNS will work pretty much just the same as bcachefs on regular block devices.
>> 
>> I assume that you are aware about limited number of open/active zones
>> on ZNS device. It means that you can open for write operations
>> only N zones simultaneously (for example, 14 zones for the case of WDC
>> ZNS device). Can bcachefs survive with such limitation? Can you limit the number
>> of buckets for write operations?
> 
> Yes, open/active zones correspond to write points in the bcachefs
> allocator. The default number of write points is 32 for user writes plus
> a few for internal ones, but it's not a problem to run with fewer.
> 

Frankly speaking, the 14 open/active zone limit is an extreme case. Samsung
allows a larger number of open/active zones in its ZNS SSDs, and even WDC has
promised to increase this number. But what is the minimum number of write
points that still lets bcachefs work and survive in an environment with a
limited number of open/active zones?

So, does going from the default of 32 write points to a smaller number require
changes to the file system driver logic (or maybe even the on-disk layout)?
Or is it a configurable file system parameter? Is it an internal configuration
parameter, or can the end user configure the number of write points?

I see from the documentation that the expected bucket size is 128KB - 8MB.
Can bcachefs digest a 256MB - 2GB bucket size without any modifications, or
could it require some changes to the logic (or even the on-disk layout)? That's
a pretty significant bucket size change for my taste.

Thanks,
Slava.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] bcachefs
  2024-01-03 19:22         ` Carl E. Thompson
@ 2024-01-03 22:26           ` Kent Overstreet
  0 siblings, 0 replies; 15+ messages in thread
From: Kent Overstreet @ 2024-01-03 22:26 UTC (permalink / raw)
  To: Carl E. Thompson
  Cc: Viacheslav Dubeyko, lsf-pc, linux-bcachefs, linux-fsdevel

On Wed, Jan 03, 2024 at 11:22:28AM -0800, Carl E. Thompson wrote:
> 
> > On 2024-01-03 9:52 AM PST Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > ...
> 
> > > Could ZNS model affects a GC operations? Or, oppositely, ZNS model can
> > > help to manage GC operations more efficiently?
> > 
> > The ZNS model only adds restrictions on top of a regular block device,
> > so no it's not _helpful_ for our GC operations.
> 
> > ...
> 
> Could he be talking about the combination of bcachefs and internal
> drive garbage collection rather than only bcachefs garbage collection
> individually? I think the idea with many (most?) ZNS flash drives is
> that they don't have internal garbage collection at all and that the
> drive's erase/write cycles are more directly controlled / managed by
> the filesystem and OS block driver. I think the idea is supposed to be
> that the OS's drivers can manage garbage collection more efficiently
> that any generic drive firmware could. So the ZNS model is not just
> adding restrictions to a regular block devices, it's also shifting the
> responsibility for the drive's **internal** garbage collection to the
> OS drivers which is supposed to improve efficiency.
> 
> Or I could be completely wrong because this is not an area of
> expertise for me.

Yeah nothing really changes for bcachefs, GC-wise. We already have to
have copygc, and it works the same with ZNS as without.

The only difference is that with SMR hard drives, buckets are a lot
bigger than you'd otherwise pick, but how much that affects you is
entirely dependent on your workload (random overwrites or no).
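
To spell out what copygc does (toy sketch below, purely illustrative - not the
actual implementation): find the buckets with the least live data, rewrite
that live data at the current write point, and the emptied buckets can then be
erased and reused. The idea is the same whether the bucket is a flash erase
unit, an SMR zone or a ZNS zone.

/* Toy copygc victim selection (illustrative only, not bcachefs code). */
#include <stdio.h>

#define NR_BUCKETS	8

struct bucket {
	unsigned live_sectors;	/* sectors still referenced by the filesystem */
};

/* Pick the non-empty bucket with the least live data to evacuate next. */
static int copygc_pick_bucket(struct bucket *b, unsigned nr)
{
	int best = -1;

	for (unsigned i = 0; i < nr; i++)
		if (b[i].live_sectors &&
		    (best < 0 || b[i].live_sectors < b[best].live_sectors))
			best = (int) i;
	return best;
}

int main(void)
{
	struct bucket buckets[NR_BUCKETS] = {
		{ 128 }, { 7 }, { 0 }, { 64 }, { 12 }, { 128 }, { 3 }, { 90 },
	};
	int victim = copygc_pick_bucket(buckets, NR_BUCKETS);

	/* "Evacuate": the live data gets rewritten at a write point elsewhere,
	 * after which the whole bucket/zone can be reused. */
	printf("evacuating bucket %d (%u live sectors)\n",
	       victim, buckets[victim].live_sectors);
	return 0;
}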

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] bcachefs
  2024-01-03 17:52       ` Kent Overstreet
@ 2024-01-03 19:22         ` Carl E. Thompson
  2024-01-03 22:26           ` Kent Overstreet
  2024-01-04  7:43         ` Viacheslav Dubeyko
  1 sibling, 1 reply; 15+ messages in thread
From: Carl E. Thompson @ 2024-01-03 19:22 UTC (permalink / raw)
  To: Kent Overstreet, Viacheslav Dubeyko; +Cc: lsf-pc, linux-bcachefs, linux-fsdevel


> On 2024-01-03 9:52 AM PST Kent Overstreet <kent.overstreet@linux.dev> wrote:

> ...

> > Could ZNS model affects a GC operations? Or, oppositely, ZNS model can
> > help to manage GC operations more efficiently?
> 
> The ZNS model only adds restrictions on top of a regular block device,
> so no it's not _helpful_ for our GC operations.

> ...

Could he be talking about the combination of bcachefs and internal drive garbage collection rather than only bcachefs garbage collection individually? I think the idea with many (most?) ZNS flash drives is that they don't have internal garbage collection at all and that the drive's erase/write cycles are more directly controlled / managed by the filesystem and OS block driver. I think the idea is supposed to be that the OS's drivers can manage garbage collection more efficiently than any generic drive firmware could. So the ZNS model is not just adding restrictions to a regular block device, it's also shifting the responsibility for the drive's **internal** garbage collection to the OS drivers, which is supposed to improve efficiency.

Or I could be completely wrong because this is not an area of expertise for me.

Carl

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] bcachefs
  2024-01-03  7:39     ` Viacheslav Dubeyko
@ 2024-01-03 17:52       ` Kent Overstreet
  2024-01-03 19:22         ` Carl E. Thompson
  2024-01-04  7:43         ` Viacheslav Dubeyko
  0 siblings, 2 replies; 15+ messages in thread
From: Kent Overstreet @ 2024-01-03 17:52 UTC (permalink / raw)
  To: Viacheslav Dubeyko; +Cc: lsf-pc, linux-bcachefs, linux-fsdevel

On Wed, Jan 03, 2024 at 10:39:50AM +0300, Viacheslav Dubeyko wrote:
> 
> 
> > On Jan 2, 2024, at 7:05 PM, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > 
> > On Tue, Jan 02, 2024 at 11:02:59AM +0300, Viacheslav Dubeyko wrote:
> >> 
> >> 
> >>> On Jan 2, 2024, at 1:56 AM, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >>> 
> >>> LSF topic: bcachefs status & roadmap
> >>> 
> >> 
> >> <skipped>
> >> 
> >>> 
> >>> A delayed allocation for btree nodes mode is coming, which is the main
> >>> piece needed for ZNS support
> >>> 
> >> 
> >> I could miss some emails. But have you shared the vision of ZNS support
> >> architecture for the case of bcachefs already? It will be interesting to hear
> >> the high-level concept.
> > 
> > There's not a whole lot to it. bcache/bcachefs allocation is already
> > bucket based, where the model is that we allocate a bucket, then write
> > to it sequentially and never overwrite until the whole bucket is reused.
> > 
> > The main exception has been btree nodes, which are log structured and
> > typically smaller than a bucket; that doesn't break the "no overwrites"
> > property ZNS wants, but it does mean writes within a bucket aren't
> > happening sequentially.
> > 
> > So I'm adding a mode where every time we do a btree node write we write
> > out the whole node to a new location, instead of appending at an
> > existing location. It won't be as efficient for random updates across a
> > large working set, but in practice that doesn't happen too much; average
> > btree write size has always been quite high on any filesystem I've
> > looked at.
> > 
> > Aside from that, it's mostly just plumbing and integration; bcachefs on
> > ZNS will work pretty much just the same as bcachefs on regular block devices.
> 
> I assume that you are aware about limited number of open/active zones
> on ZNS device. It means that you can open for write operations
> only N zones simultaneously (for example, 14 zones for the case of WDC
> ZNS device). Can bcachefs survive with such limitation? Can you limit the number
> of buckets for write operations?

Yes, open/active zones correspond to write points in the bcachefs
allocator. The default number of write points is 32 for user writes plus
a few for internal ones, but it's not a problem to run with fewer.
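
To make the correspondence concrete, here's a toy sketch (not the actual
allocator code - the constants and the helper are made up for illustration) of
how the number of foreground write points could be clamped to a device's
open/active zone limit:

#include <stdio.h>

#define DEFAULT_USER_WRITE_POINTS	32	/* default foreground write points */
#define INTERNAL_WRITE_POINTS		4	/* copygc, rebalance, btree, ... */

static unsigned nr_user_write_points(unsigned max_active_zones)
{
	/* Non-zoned devices report no limit: keep the default. */
	if (!max_active_zones)
		return DEFAULT_USER_WRITE_POINTS;

	/* Reserve a few active zones for internal writes, give the rest to
	 * user writes, and always keep at least one. */
	if (max_active_zones <= INTERNAL_WRITE_POINTS + 1)
		return 1;

	unsigned user = max_active_zones - INTERNAL_WRITE_POINTS;

	return user < DEFAULT_USER_WRITE_POINTS ? user : DEFAULT_USER_WRITE_POINTS;
}

int main(void)
{
	/* e.g. a drive limited to 14 open/active zones */
	printf("14 zones -> %u user write points\n", nr_user_write_points(14));
	printf("no limit -> %u user write points\n", nr_user_write_points(0));
	return 0;
}

The point is just that as long as the device allows at least a handful of open
zones we can run with fewer write points; nothing structural changes.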

> Another potential issue could be the zone size. WDC ZNS device introduces
> 2GB zone size (with 1GB capacity). Could be the bucket is so huge? And could
> btree model of operations works with such huge zones?

Yes. It'll put more pressure on copying garbage collection, but that's
about it.

> Technically speaking, limitation (14 open/active zones) could be the factor of
> performance degradation. Could such limitation doesn’t effect the bcachefs
> performance?

I'm not sure what performance degradation you speak of, but no, that
won't affect bcachefs. 

> Could ZNS model affects a GC operations? Or, oppositely, ZNS model can
> help to manage GC operations more efficiently?

The ZNS model only adds restrictions on top of a regular block device,
so no it's not _helpful_ for our GC operations.

But: since our existing allocation model maps so well to zones, our
existing GC model won't be hurt either, and doing GC in the filesystem
will naturally have benefits in that we know exactly what data is live
and we have access to the LBA mapping, so we can better avoid fragmentation.

> Do you need in conventional zone? Could bcachefs work without using
> the conventional zone of ZNS device?

Not required, but if zones are all 1GB+ you'd want a small conventional
zone so as to avoid burning two whole zones for the superblock.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] bcachefs
  2024-01-02 16:05   ` Kent Overstreet
@ 2024-01-03  7:39     ` Viacheslav Dubeyko
  2024-01-03 17:52       ` Kent Overstreet
  0 siblings, 1 reply; 15+ messages in thread
From: Viacheslav Dubeyko @ 2024-01-03  7:39 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: lsf-pc, linux-bcachefs, linux-fsdevel



> On Jan 2, 2024, at 7:05 PM, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> On Tue, Jan 02, 2024 at 11:02:59AM +0300, Viacheslav Dubeyko wrote:
>> 
>> 
>>> On Jan 2, 2024, at 1:56 AM, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>>> 
>>> LSF topic: bcachefs status & roadmap
>>> 
>> 
>> <skipped>
>> 
>>> 
>>> A delayed allocation for btree nodes mode is coming, which is the main
>>> piece needed for ZNS support
>>> 
>> 
>> I could miss some emails. But have you shared the vision of ZNS support
>> architecture for the case of bcachefs already? It will be interesting to hear
>> the high-level concept.
> 
> There's not a whole lot to it. bcache/bcachefs allocation is already
> bucket based, where the model is that we allocate a bucket, then write
> to it sequentially and never overwrite until the whole bucket is reused.
> 
> The main exception has been btree nodes, which are log structured and
> typically smaller than a bucket; that doesn't break the "no overwrites"
> property ZNS wants, but it does mean writes within a bucket aren't
> happening sequentially.
> 
> So I'm adding a mode where every time we do a btree node write we write
> out the whole node to a new location, instead of appending at an
> existing location. It won't be as efficient for random updates across a
> large working set, but in practice that doesn't happen too much; average
> btree write size has always been quite high on any filesystem I've
> looked at.
> 
> Aside from that, it's mostly just plumbing and integration; bcachefs on
> ZNS will work pretty much just the same as bcachefs on regular block devices.

I assume you are aware of the limited number of open/active zones on a ZNS
device: only N zones can be open for write operations simultaneously (for
example, 14 zones in the case of a WDC ZNS device). Can bcachefs survive with
such a limitation? Can you limit the number of buckets open for write
operations?

Another potential issue could be the zone size. The WDC ZNS device has a 2GB
zone size (with 1GB capacity). Can a bucket be that huge? And does the btree
model of operations work with such huge zones?

Technically speaking, the limitation (14 open/active zones) could be a factor
in performance degradation. Could such a limitation affect bcachefs
performance?

Does the ZNS model affect GC operations? Or, conversely, can the ZNS model
help manage GC operations more efficiently?

Do you need a conventional zone? Could bcachefs work without using the
conventional zone of a ZNS device?

Thanks,
Slava.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] bcachefs
  2024-01-02  8:02 ` Viacheslav Dubeyko
@ 2024-01-02 16:05   ` Kent Overstreet
  2024-01-03  7:39     ` Viacheslav Dubeyko
  0 siblings, 1 reply; 15+ messages in thread
From: Kent Overstreet @ 2024-01-02 16:05 UTC (permalink / raw)
  To: Viacheslav Dubeyko; +Cc: lsf-pc, linux-bcachefs, linux-fsdevel

On Tue, Jan 02, 2024 at 11:02:59AM +0300, Viacheslav Dubeyko wrote:
> 
> 
> > On Jan 2, 2024, at 1:56 AM, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > 
> > LSF topic: bcachefs status & roadmap
> > 
> 
> <skipped>
> 
> > 
> > A delayed allocation for btree nodes mode is coming, which is the main
> > piece needed for ZNS support
> > 
> 
> I could miss some emails. But have you shared the vision of ZNS support
> architecture for the case of bcachefs already? It will be interesting to hear
> the high-level concept.

There's not a whole lot to it. bcache/bcachefs allocation is already
bucket based, where the model is that we allocate a bucket, then write
to it sequentially and never overwrite until the whole bucket is reused.

The main exception has been btree nodes, which are log structured and
typically smaller than a bucket; that doesn't break the "no overwrites"
property ZNS wants, but it does mean writes within a bucket aren't
happening sequentially.

So I'm adding a mode where every time we do a btree node write we write
out the whole node to a new location, instead of appending at an
existing location. It won't be as efficient for random updates across a
large working set, but in practice that doesn't happen too much; average
btree write size has always been quite high on any filesystem I've
looked at.

Aside from that, it's mostly just plumbing and integration; bcachefs on
ZNS will work pretty much just the same as bcachefs on regular block devices.
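
If it helps, here's a toy model of the difference (purely illustrative, not
the real code): a zone that only accepts writes at its write pointer, the old
append-in-place btree node write, and the rewrite-the-whole-node mode.

#include <stdio.h>
#include <stdbool.h>

#define ZONE_SECTORS	256

struct zone {
	unsigned write_pointer;		/* next sector that may be written */
};

/* A ZNS zone rejects any write that doesn't start at the write pointer. */
static bool zone_write(struct zone *z, unsigned sector, unsigned sectors)
{
	if (sector != z->write_pointer || sector + sectors > ZONE_SECTORS)
		return false;
	z->write_pointer += sectors;
	return true;
}

struct btree_node {
	unsigned start;		/* first sector of the node within the zone */
	unsigned written;	/* sectors written so far (log structured) */
};

/* Old behaviour: append the new bset at the node's existing location. */
static bool btree_node_append(struct zone *z, struct btree_node *b, unsigned sectors)
{
	if (!zone_write(z, b->start + b->written, sectors))
		return false;
	b->written += sectors;
	return true;
}

/* ZNS mode: write the whole node out again at the zone's write pointer. */
static bool btree_node_rewrite(struct zone *z, struct btree_node *b)
{
	unsigned new_start = z->write_pointer;

	if (!zone_write(z, new_start, b->written))
		return false;
	b->start = new_start;
	return true;
}

int main(void)
{
	struct zone z = { 0 };
	struct btree_node a = { .start = 0,  .written = 0 };
	struct btree_node b = { .start = 64, .written = 0 };

	/* Two nodes at fixed locations: the append to 'b' isn't at the zone's
	 * write pointer, so a ZNS zone would reject it. */
	printf("append to a: %d\n", btree_node_append(&z, &a, 8));	/* 1 */
	printf("append to b: %d\n", btree_node_append(&z, &b, 8));	/* 0 */

	/* Rewriting the whole node always lands at the write pointer. */
	printf("rewrite a:   %d\n", btree_node_rewrite(&z, &a));	/* 1 */
	return 0;
}

The cost of the rewrite mode is extra write bandwidth for small random updates
across a large working set, but as above, average btree write size is high
enough in practice that it shouldn't matter much.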

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] bcachefs
  2024-01-01 22:56 Kent Overstreet
@ 2024-01-02  8:02 ` Viacheslav Dubeyko
  2024-01-02 16:05   ` Kent Overstreet
  0 siblings, 1 reply; 15+ messages in thread
From: Viacheslav Dubeyko @ 2024-01-02  8:02 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: lsf-pc, linux-bcachefs, linux-fsdevel



> On Jan 2, 2024, at 1:56 AM, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> LSF topic: bcachefs status & roadmap
> 

<skipped>

> 
> A delayed allocation for btree nodes mode is coming, which is the main
> piece needed for ZNS support
> 

I may have missed some emails, but have you already shared the vision for the
ZNS support architecture in bcachefs? It would be interesting to hear the
high-level concept.

Thanks,
Slava.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [LSF/MM/BPF TOPIC] bcachefs
@ 2024-01-01 22:56 Kent Overstreet
  2024-01-02  8:02 ` Viacheslav Dubeyko
  0 siblings, 1 reply; 15+ messages in thread
From: Kent Overstreet @ 2024-01-01 22:56 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-bcachefs, linux-fsdevel

LSF topic: bcachefs status & roadmap

I'd like to make this as open ended as possible; I'll talk a bit about
current status and work on the technical innards, but I want the focus
to be more on

 - gathering ideas for the roadmap
  - what we can do to make it easier for people to contribute
 - finding out what people are interested in
 - general brainstorming

Status-wise: things are stabilizing nicely (knock on wood, 6.7 is about to
come out and then we really find out how well I've done); CI test failures
are slowly but steadily declining, and user-reported bugs are getting fixed
quickly and easily.

Some performance stuff to investigate still - we're still slower than we
ought to be on certain metadata workloads, haven't looked at why yet;
fsck times can be slow on very large filesystems, and we know what needs
to be addressed there - but users seem to be pretty happy with
scalability past the 100 TB mark.

A good chunk of online fsck is landing in 6.8 - fully functional for the
passes that are supported.

The disk space accounting rewrite is well underway - a huge project, but the
main challenges have been solved. When that's done we'll be able to add
per-snapshot-ID accounting as well as per-compression-type accounting, and
it'll be a lot easier to add new counters.

Some cool forwards compat stuff just landed (the bch_sb_downgrade section; it
tells older versions what to do if they mount a filesystem that a newer
version has touched and that needs fsck before they can use it).
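
Roughly, the idea is a superblock section along these lines (field names and
layout here are purely a sketch for illustration, not the actual on-disk
format): for each newer metadata version, record which recovery/fsck passes an
older version must run when it mounts the filesystem, and which new error
types it should expect and silence.

#include <stdint.h>

struct downgrade_entry {
	uint16_t version;		/* the newer version this entry describes */
	uint64_t recovery_passes;	/* bitmask of passes to run on downgrade */
	uint16_t nr_errors;
	uint16_t errors[8];		/* new error codes to tolerate (fixed size here) */
};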

A delayed allocation mode for btree nodes is coming, which is the main piece
needed for ZNS support.

Further off: some super fancy autotiering stuff is getting designed
(we should eventually be able to track data hotness on an inode:offset
basis, which will let us use more than 2 performance tiers of devices).

Erasure coding will finally be getting stabilized post disk space
accounting rewrite (which solves some annoying corner cases).

Cheers,
Kent

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] bcachefs
  2023-03-07  7:25 ` Matthew Wilcox
@ 2023-03-07  7:59   ` Kent Overstreet
  0 siblings, 0 replies; 15+ messages in thread
From: Kent Overstreet @ 2023-03-07  7:59 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: lsf-pc, linux-fsdevel, linux-bcachefs

On Tue, Mar 07, 2023 at 07:25:15AM +0000, Matthew Wilcox wrote:
> On Wed, Feb 22, 2023 at 02:46:31PM -0500, Kent Overstreet wrote:
> > I'd like to talk more about where things are at, long term goals, and
> > finally upstreaming this beast.
> 
> We don't have any rules about when we decide to upstream a new filesystem.
> There are at least four filesystems considering inclusion right now;
> bcachefs, SSDFS, composefs and nvfs.  Every new filesystem imposes
> certain costs on other developers (eg those who make sweeping API changes,
> *cough*).  I don't think we've ever articulated a clear set of criteria,
> and maybe we can learn from the recent pain of accepting ntfs3 in the
> upstream kernel.

I've been thinking about what's good for the filesystem.

I've been leery about upstreaming too soon, with unfinished important
features or major design work still to be done. When it's upstream, I'll
have to spend a lot more of my time in a maintainer role, and it's going
to be harder to find time for multi-month projects that require deep
focus; like snapshots, or the allocator rewrite, or backpointers, or
right now erasure coding.

And once it's upstream the pressure is going to be to keep things
stable, to not break things that will affect users. Whereas right now,
I've got a testing community that's smaller, more forgiving of temporary
bugs, and people work with me and give me feedback. That feedback is
important and guides what I work on; it's driven a lot of the
scalability work over the past few years. We've got people running it on
100 TB arrays right now (I want to hear from the first person to test it
on a 1 PB array!) - it took a lot of work to get there.

It's still not _quite_ where I want it to be. Snapshots needs a bit more
work - the deletion path needs more torture testing, and I'm adamant on
not shipping without solid working erasure coding. Right now I'm working
on getting the erasure coding copygc torture test passing; the
fundamentals are there but I've probably got another month of work to
get it all polished and thoroughly debugged.

But, you gotta ship someday :) and the feedback from users has
increasingly been "yeah, it's been solid and trouble free", so... maybe
it is finally just about time :)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] bcachefs
  2023-02-22 19:46 Kent Overstreet
  2023-03-07  5:01 ` Darrick J. Wong
@ 2023-03-07  7:25 ` Matthew Wilcox
  2023-03-07  7:59   ` Kent Overstreet
  1 sibling, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2023-03-07  7:25 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: lsf-pc, linux-fsdevel, linux-bcachefs

On Wed, Feb 22, 2023 at 02:46:31PM -0500, Kent Overstreet wrote:
> I'd like to talk more about where things are at, long term goals, and
> finally upstreaming this beast.

We don't have any rules about when we decide to upstream a new filesystem.
There are at least four filesystems considering inclusion right now;
bcachefs, SSDFS, composefs and nvfs.  Every new filesystem imposes
certain costs on other developers (eg those who make sweeping API changes,
*cough*).  I don't think we've ever articulated a clear set of criteria,
and maybe we can learn from the recent pain of accepting ntfs3 in the
upstream kernel.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] bcachefs
  2023-03-07  5:01 ` Darrick J. Wong
@ 2023-03-07  6:12   ` Kent Overstreet
  0 siblings, 0 replies; 15+ messages in thread
From: Kent Overstreet @ 2023-03-07  6:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: lsf-pc, linux-fsdevel, linux-bcachefs

On Mon, Mar 06, 2023 at 09:01:46PM -0800, Darrick J. Wong wrote:
> On Wed, Feb 22, 2023 at 02:46:31PM -0500, Kent Overstreet wrote:
> > Hi, I'd like to give an update on bcachefs progress and talk about
> > upstreaming.
> > 
> > There's been a lot of activity over the past year or so:
> >  - allocator rewrite
> >  - cycle detector for deadlock avoidance
> 
> XFS has rather a lot of locks and no ability to unwind a transaction
> that has already dirtied incore state.  I bet you and I could have some
> very interesting discussions about how to implement robust tx undo in a
> filesystem.

Yeah, the main basic idea in bcachefs is that in any filesystem metadata
operation (say a create, or a chmod), we do it entirely within a
bcachefs btree transaction, and then only update vfs metadata (icache,
dcache) after a successful transaction commit (and with btree locks still
held).

Not exactly the usual way of doing things - lots of the standard utility
code (e.g. posix_acl_chmod(), IIRC) wants to use the VFS inode as the
source of truth and the place to do the mutation. We don't want to do
that; we want to lock the inode in the btree, do the mutation, and then
update the VFS inode after transaction commit.

But the end result is really clean, and a lot of awkwardness in terms of
locking and error paths goes away with this model.
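
A toy model of that ordering (purely illustrative - the real code goes through
the btree transaction/iterator machinery; this just shows which copy of the
state gets mutated when):

#include <stdio.h>
#include <stdbool.h>

struct btree_inode { unsigned mode; };	/* the copy stored in the btree */
struct vfs_inode   { unsigned mode; };	/* the cached in-memory (VFS) copy */

struct trans {
	struct btree_inode staged;	/* mutation staged in the transaction */
	bool should_fail;		/* simulate -ENOSPC, a lock restart, ... */
};

static int trans_commit(struct trans *t, struct btree_inode *disk)
{
	if (t->should_fail)
		return -1;		/* nothing visible has changed */
	*disk = t->staged;		/* the transaction becomes visible atomically */
	return 0;
}

/* chmod-like operation: mutate the btree copy first, the VFS copy second. */
static int toy_chmod(struct btree_inode *disk, struct vfs_inode *vfs,
		     unsigned mode, bool fail)
{
	struct trans t = { .staged = *disk, .should_fail = fail };

	t.staged.mode = mode;	/* mutate inside the transaction, not the VFS inode */

	int ret = trans_commit(&t, disk);
	if (ret)
		return ret;	/* error path: VFS state was never dirtied */

	vfs->mode = disk->mode;	/* only now update the cached VFS inode */
	return 0;
}

int main(void)
{
	struct btree_inode disk = { .mode = 0644 };
	struct vfs_inode   vfs  = { .mode = 0644 };

	toy_chmod(&disk, &vfs, 0600, true);	/* failed commit: nothing changes */
	printf("after failed chmod: disk %o vfs %o\n", disk.mode, vfs.mode);

	toy_chmod(&disk, &vfs, 0600, false);	/* success: btree first, then VFS */
	printf("after good chmod:   disk %o vfs %o\n", disk.mode, vfs.mode);
	return 0;
}

The nice property is that the error path never has to undo anything in the VFS
layer.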

> (Not sure the /rest/ of the lsf crowd are going to care, but I do.)
> 
> >  - backpointers
> >  - test infrastructure!
> 
> <cough> "test dashboard that we can all share" ?

Possibly :) Getting a good CI going where I can just chuck code at it
and get results for the entire test suite (fstests and more) in a timely
manner (~20 minutes) has been huge for my productivity, and I'd really
love for others to be able to make use of this too.

Right now though the test results collector needs to be more scalable -
I should've gone with a database for test results from the start like
you did :) But if that happens, then yeah, I'd love to add more servers
to the cluster and have a big CI cluster for all of us filesystem
developers.

Having 10 or 20 of these 80 core arm64 servers (right now I'm renting 3)
so they all immediately start testing our git repos as we push would be
_huge_.

> >  - starting to integrate rust code (!)
> 
> I'm curious to hear about this topic, because I look at rust, and I look
> at supercomplex filesystem code and wonder how in the world we're ever
> going to port a (VERY SIMPLE) filesystem to Rust.  Now that I'm nearly
> done with online repair for XFS, there's a lot of stupid crap about C
> that I would like to start worrying about less because some other
> language added enough guard rails to avoid the stupid.

God yes, I can't wait to be writing new code in Rust instead of C. I
_like_ C, it's my mother tongue, but I'm ready for something better and
it's going to make for _drastically_ more maintainable code with less
debugging in the future.

Fully writing or rewriting any existing filesystem in Rust might take
another decade, but with bcachefs since we've got the btree/database
API, we've got a pretty small surface we can create a safe Rust wrapper
for, and then there's huge swaths of code that we can immediately start
incrementally converting.

I've already got the safe btree interface started, and I just merged the
first rewrite of existing C code - the 'bcachefs list' debug tool:

https://evilpiepirate.org/git/bcachefs-tools.git/tree/rust-src/src/cmd_list.rs

compare with the old C version:

https://evilpiepirate.org/git/bcachefs-tools.git/tree/cmd_list.c?id=0206d42daf4c4bd3bbcfa15a2bef34319524db49

Baby steps! but it's happening.

Some of the things I'm really fond of:
 - the Ord trait, which makes comparisons much more readable

 - the Display trait, which means we can pass _any object we want_ to
   the Rust equivalent of printf/printk!

 - Error handling! Result and ? are amazing.

 - No more iterator invalidation bugs! After you advance an iterator,
   the borrow checker _will not let you_ try to use the key peek()
   previously returned.

   The next step in the Rust btree wrapper is to teach the borrow
   checker about the semantics of bch2_trans_unlock() and
   bch2_trans_begin(). Good stuff.

I've been writing lots of random things in Rust lately - assorted
tooling, parts of ktest, and I'm just really fond of the language. I
can't emphasize enough how much thoughtfulness has gone into the
language; I haven't seen another language where it seems like they've
stolen every single good idea out there and not a single bad one.
(Idris, perhaps? I need to dig into that one, dependent types are
another thing that's going to be big in the future).

The Rust transition isn't the main thing I'm spending my time on right
now (that would be erasure coding) - but it really is the thing I'm most
excited about.

Fun times...

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [LSF/MM/BPF TOPIC] bcachefs
  2023-02-22 19:46 Kent Overstreet
@ 2023-03-07  5:01 ` Darrick J. Wong
  2023-03-07  6:12   ` Kent Overstreet
  2023-03-07  7:25 ` Matthew Wilcox
  1 sibling, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2023-03-07  5:01 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: lsf-pc, linux-fsdevel, linux-bcachefs

On Wed, Feb 22, 2023 at 02:46:31PM -0500, Kent Overstreet wrote:
> Hi, I'd like to give an update on bcachefs progress and talk about
> upstreaming.
> 
> There's been a lot of activity over the past year or so:
>  - allocator rewrite
>  - cycle detector for deadlock avoidance

XFS has rather a lot of locks and no ability to unwind a transaction
that has already dirtied incore state.  I bet you and I could have some
very interesting discussions about how to implement robust tx undo in a
filesystem.

(Not sure the /rest/ of the lsf crowd are going to care, but I do.)

>  - backpointers
>  - test infrastructure!

<cough> "test dashboard that we can all share" ?

>  - starting to integrate rust code (!)

I'm curious to hear about this topic, because I look at rust, and I look
at supercomplex filesystem code and wonder how in the world we're ever
going to port a (VERY SIMPLE) filesystem to Rust.  Now that I'm nearly
done with online repair for XFS, there's a lot of stupid crap about C
that I would like to start worrying about less because some other
language added enough guard rails to avoid the stupid.

>  - lots more bug squashing, scalability work, debug tooling improvements
> 
> I'd like to talk more about where things are at, long term goals, and
> finally upstreaming this beast.

Go for it, I say.

--D

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [LSF/MM/BPF TOPIC] bcachefs
@ 2023-02-22 19:46 Kent Overstreet
  2023-03-07  5:01 ` Darrick J. Wong
  2023-03-07  7:25 ` Matthew Wilcox
  0 siblings, 2 replies; 15+ messages in thread
From: Kent Overstreet @ 2023-02-22 19:46 UTC (permalink / raw)
  To: lsf-pc, linux-fsdevel, linux-bcachefs

Hi, I'd like to give an update on bcachefs progress and talk about
upstreaming.

There's been a lot of activity over the past year or so:
 - allocator rewrite
 - cycle detector for deadlock avoidance
 - backpointers
 - test infrastructure!
 - starting to integrate rust code (!)
 - lots more bug squashing, scalability work, debug tooling improvements

I'd like to talk more about where things are at, long term goals, and
finally upstreaming this beast.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2024-01-04  7:43 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-02 23:07 [LSF/MM/BPF TOPIC] bcachefs Kent Overstreet
2021-06-30 18:18 ` Dan Robertson
2023-02-22 19:46 Kent Overstreet
2023-03-07  5:01 ` Darrick J. Wong
2023-03-07  6:12   ` Kent Overstreet
2023-03-07  7:25 ` Matthew Wilcox
2023-03-07  7:59   ` Kent Overstreet
2024-01-01 22:56 Kent Overstreet
2024-01-02  8:02 ` Viacheslav Dubeyko
2024-01-02 16:05   ` Kent Overstreet
2024-01-03  7:39     ` Viacheslav Dubeyko
2024-01-03 17:52       ` Kent Overstreet
2024-01-03 19:22         ` Carl E. Thompson
2024-01-03 22:26           ` Kent Overstreet
2024-01-04  7:43         ` Viacheslav Dubeyko
