From: Kent Overstreet <>
Subject: bcachefs status update
Date: Fri, 28 Oct 2022 15:23:45 -0400
Message-ID: <Y1wsQVSQ6ipWyFlX@moria.home.lan>

New allocator

The old allocator, fundamentally unchanged from bcache, had been thoroughly
outgrown and was in need of a rewrite. It kept allocation information pinned in
memory and it did periodic full scans of that alloc info to find free buckets,
which had become a real scalability issue. It was also problematic to debug,
being based around purely in-memory state with tricky locking and tricky state
transitions.

That's all been completely rewritten and replaced with new algorithms based on
new persistent data structures. The new code is simpler, vastly more scalable
(and we had users with 50 TB filesystems before), and since all state
transitions show up in the journal it's been much easier to debug.
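
To illustrate the difference in shape only (this is not the bcachefs code): the
old approach amounts to walking every bucket's in-memory state to find free
ones, while an index of free buckets that is updated on each state transition
lets the allocator pick a bucket without a scan and leaves a log entry for
every transition. All names below are hypothetical; a heap stands in for the
persistent index and a Python list stands in for the journal.

```python
import heapq


def scan_for_free(bucket_states):
    """Old shape: walk every bucket to find the free ones (O(total buckets))."""
    return [b for b, state in enumerate(bucket_states) if state == "free"]


class IndexedAllocator:
    """New shape (sketch): a free-bucket index updated on every transition."""

    def __init__(self, nr_buckets):
        # A sorted list is already a valid min-heap.
        self.free = list(range(nr_buckets))
        # Stand-in for the journal: every state transition is recorded here.
        self.journal = []

    def alloc(self):
        # Pop the lowest-numbered free bucket without scanning all buckets.
        bucket = heapq.heappop(self.free)
        self.journal.append((bucket, "free", "allocated"))
        return bucket

    def release(self, bucket):
        heapq.heappush(self.free, bucket)
        self.journal.append((bucket, "allocated", "free"))
```

Because each transition lands in the log, a bucket that ends up in an
unexpected state can be traced back through `journal` instead of being
reconstructed from in-memory state.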

Backpointers

Backpointers enable doing a lookup from physical device:lba to the extent
that owns that lba. There's a number of reasons filesystems might want them, but
in our context we specifically needed them to improve copygc scalability.
Before, copygc had to periodically walk the entire extents + reflink btrees; now
it just picks the next-most-empty bucket and moves all the extents it contains.
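
A minimal sketch of the idea, assuming a simplified model in which extents
never straddle a bucket boundary and only buckets with live data are tracked.
The class, method, and field names are hypothetical and do not reflect the
on-disk format; the point is that with a reverse index, picking a copygc
target and listing its extents needs no walk of the extents btree.

```python
from collections import defaultdict


class BackpointerIndex:
    """Maps (device, lba) back to the owning extent, grouped by bucket."""

    def __init__(self, bucket_size):
        self.bucket_size = bucket_size
        # (dev, bucket) -> {start_lba: (extent_id, sectors)}
        self.by_bucket = defaultdict(dict)

    def add(self, dev, lba, extent_id, sectors):
        self.by_bucket[(dev, lba // self.bucket_size)][lba] = (extent_id, sectors)

    def lookup(self, dev, lba):
        """Return the extent that owns this lba, or None."""
        bucket = self.by_bucket.get((dev, lba // self.bucket_size), {})
        for start, (extent_id, sectors) in bucket.items():
            if start <= lba < start + sectors:
                return extent_id
        return None

    def live_sectors(self, bucket):
        return sum(sectors for _, sectors in self.by_bucket[bucket].values())

    def next_copygc_target(self):
        """Pick the bucket with the least live data and list the extents to move."""
        if not self.by_bucket:
            return None
        bucket = min(self.by_bucket, key=self.live_sectors)
        extents = [extent_id for extent_id, _ in self.by_bucket[bucket].values()]
        return bucket, extents
```

For example, with 128-sector buckets, one extent of 100 sectors in bucket 0
and two 8-sector extents in bucket 1, `next_copygc_target()` selects bucket 1
and returns just those two extents.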

With backpointers done we're now largely done with scalability work - excepting
online fsck. We also still need to add a rebalance_work btree to fix rebalance
scanning, but that's not as serious since we never depend on rebalance for
allocations to make forward progress, and this'll be a relatively small chunk of
work.

Snapshots largely stabilized

Quotas don't yet work with snapshots, but aside from this all tests are passing.
There's still a few minor bugs we're trying to get reproduced, but nothing that
should affect system availability (exception: snapshot delete path still sucks,
the code to tear down the pagecache should be improved).

Erasure coding (RAID 5/6) getting close to usable

The codepath for deciding when to create new stripes or update an existing,
partially-empty stripe still needs to be improved.

Background: bcachefs avoids the write hole problem in other RAID
implementations (by being COW), and it doesn't fragment writes like ZFS does;
stripes are big (which we want for avoiding fragmentation), and they're created
on demand out of physical buckets on different devices.

Foreground writes are initially replicated, then when we accumulate a full
stripe we write out p+q blocks and update extents in the stripe with the stripe
information and drop the now unneeded replicas. The buckets containing the
extra replicas will get reused right away, so if we don't have to send a cache
flush before they're overwritten with something else they only cost bus
bandwidth.
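
The write flow can be sketched as follows. This is a simplified illustration,
not the bcachefs implementation: only the P block (plain XOR parity) is
computed, the Q block is omitted because it needs Galois-field arithmetic, and
`StripeBuilder`, `xor_blocks`, and `reconstruct` are hypothetical names.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length blocks byte by byte (the P parity of a stripe)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class StripeBuilder:
    def __init__(self, data_blocks_per_stripe):
        self.width = data_blocks_per_stripe
        self.pending = []   # blocks written but not yet covered by parity
        self.replicas = []  # extra copies kept until the stripe is complete

    def write(self, block):
        """Foreground write: keep a replica until parity protects the block."""
        self.pending.append(block)
        self.replicas.append(block)
        if len(self.pending) < self.width:
            return None
        # Full stripe accumulated: compute parity, then drop the replicas.
        stripe = {"data": self.pending, "p": xor_blocks(self.pending)}
        self.pending, self.replicas = [], []
        return stripe


def reconstruct(stripe, lost_index):
    """Rebuild one lost data block from the surviving blocks plus P."""
    survivors = [b for i, b in enumerate(stripe["data"]) if i != lost_index]
    return xor_blocks(survivors + [stripe["p"]])
```

Until the stripe fills, redundancy comes from the replicas; afterwards it comes
from parity, which is why the replicas can be dropped at that point.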

As data gets overwritten, or moved by copygc, we'll end up with some of the
buckets in a stripe becoming empty. The stripe creation path has the ability to
create a new stripe using the buckets that still contain live data in an
existing stripe - but it's more efficient in terms of IO to create new stripes
if the data in an existing stripe is going to die at some point in the future -
OTOH it requires more disk space.

Once this logic is figured out, erasure coding should be pretty close to ready.

Lock cycle detector

We'd outgrown the old deadlock avoidance strategy: before blocking on a lock
we'd check for lock ordering violation, and if necessary issue a transaction
restart which would re-traverse iterators in the correct order. The lock
ordering rules had become too complicated and this was getting us too many
transaction restarts, so I stole the standard technique from databases, which is
now working beautifully - transaction restarts due to deadlock are a tiny
fraction of what they were before.
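
One common form of the database technique is a waits-for graph: before a
transaction blocks on a lock, follow the chain of "who holds it, and what are
they waiting on" and only restart if that chain leads back to the requester.
The sketch below assumes exclusive locks only and at most one pending lock per
transaction, and it does not wake waiters on release; the names are
hypothetical and it is not the bcachefs implementation.

```python
class WaitsForGraph:
    def __init__(self):
        self.holder = {}   # lock -> transaction holding it
        self.waiting = {}  # transaction -> lock it is blocked on

    def acquire(self, txn, lock):
        """Return 'granted', 'blocked', or 'restart' (blocking would deadlock)."""
        owner = self.holder.get(lock)
        if owner is None or owner == txn:
            self.holder[lock] = txn
            self.waiting.pop(txn, None)
            return "granted"
        if self._would_cycle(txn, owner):
            return "restart"
        self.waiting[txn] = lock
        return "blocked"

    def _would_cycle(self, txn, owner):
        # Follow owner -> lock it waits on -> that lock's owner, and so on.
        seen = set()
        while owner is not None and owner not in seen:
            if owner == txn:
                return True
            seen.add(owner)
            lock = self.waiting.get(owner)
            owner = self.holder.get(lock) if lock is not None else None
        return False

    def release(self, txn, lock):
        if self.holder.get(lock) == txn:
            del self.holder[lock]
```

Unlike a fixed lock-ordering rule, this only forces a restart when a real cycle
would form: if t1 holds A and waits on B while t2 holds B, then t2 asking for A
gets 'restart', but an unrelated t3 asking for A simply blocks.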

Performance work

We now have _lots_ of persistent counters, including for transaction restarts
(of every type), and every slowpath - and the automated tests now check these
counters at the end of every test, and fail the test if any of them were too
high.
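
A check of this kind could look roughly like the following. The counter names
and limits in the example are made up for illustration, and how the real test
suite reads the counters is not shown here.

```python
def check_counters(counters, limits):
    """Return the counters that exceeded their limit as {name: (value, limit)}."""
    return {
        name: (counters.get(name, 0), limit)
        for name, limit in limits.items()
        if counters.get(name, 0) > limit
    }


def assert_counters_ok(counters, limits):
    """Fail the test run if any counter is over its limit."""
    over = check_counters(counters, limits)
    if over:
        details = ", ".join(
            f"{name}={value} (limit {limit})"
            for name, (value, limit) in sorted(over.items())
        )
        raise AssertionError(f"counters over limit: {details}")


# Hypothetical usage at the end of a test:
limits = {"trans_restart_deadlock": 10, "some_slowpath": 0}
assert_counters_ok({"trans_restart_deadlock": 3, "some_slowpath": 0}, limits)
```

Running the check at the end of every test turns a performance regression into
a test failure rather than something noticed later in benchmarks.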

And, thanks to a lot of miscellaneous performance work, our 4k O_DIRECT random
write performance is now > 50% better than it was a few months ago.

Automated test infrastructure!

I finally built the CI I always wanted: you point it at a git branch and it runs
tests and gives you the results in a git log view. See it in action here:

Since then I've been putting a ton of time into grinding through bugs, and the
new test dashboard has been amazing for tracking down regressions.

There's not much more work planned prior to upstreaming: on-disk format changes
have slowed down considerably. I just revved the on-disk format version to
introduce a new inode format (which doesn't varint encode i_sectors or i_size,
which makes the data write path faster), and I'm going to _attempt_ to expand
the u64s field of struct bkey from one byte to two, but other than that -
nothing big expected for a while.
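
As a rough illustration of why fixed-width fields can be cheaper on a hot path:
a varint's length depends on its value, so updating it can mean re-encoding,
while a fixed-width field can be overwritten in place at a known offset. The
sketch uses generic LEB128 as a stand-in varint, which is an assumption and not
the bcachefs encoding, and the two-field layout is hypothetical.

```python
import struct


def encode_varint(value):
    """Generic LEB128 varint: 7 payload bits per byte, high bit = 'more follows'."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, offset=0):
    """Decode one varint; return (value, offset just past it)."""
    result = shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


def encode_fixed(i_size, i_sectors):
    """Two little-endian 64-bit fields at fixed offsets 0 and 8."""
    return struct.pack("<QQ", i_size, i_sectors)


def update_i_sectors_fixed(buf, i_sectors):
    """Overwrite 8 bytes at a known offset; nothing else needs re-encoding."""
    struct.pack_into("<Q", buf, 8, i_sectors)
```

With the varint form, a value growing from 127 to 128 changes the encoded
length from one byte to two, so anything stored after it has to move; the
fixed-width form trades a few bytes of space for a constant-cost update.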

I also just gave a talk to RH staff - lots of good stuff in there:

Cheers, and thanks for reading
