content-addressed storage?

From: Wout Mertens @ 2022-04-25  5:37 UTC
  To: linux-bcachefs

Hi,

I'm idly wondering if bcachefs could support lookups and reuse of
variable-size chunks based on their hash checksum, thus allowing
deduplication.

So, instead of using a user-level storage layer like
https://github.com/systemd/casync, it would be a layer deeper, inside
bcachefs.

Would that be more efficient in memory, CPU and disk usage, or would
it make little difference beyond convenience?

As a quick overview, casync splits files into chunks. Chunk boundaries
are chosen from the content itself using a rolling hash, so an
insertion or deletion changes only the chunks it touches and not the
chunks following them. The chunks are then compressed and hashed, and
stored under their hash.

So a file stored this way is a list of variable-size chunks, each
identified by its hash. If two files are similar, they will
automatically share a lot of their chunks.
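
To make the chunking idea concrete, here is a minimal Python sketch. It
is not casync's actual algorithm or on-disk format: the gear-style
rolling hash, the size limits, the use of zlib and SHA-256, the choice
to hash the uncompressed chunk, and the names split_chunks, store_file
and load_file are all assumptions made for illustration. The store is
just a dict keyed by chunk hash.

```python
import hashlib
import random
import zlib

# Fixed pseudo-random table for a gear-style rolling hash (an assumption,
# chosen for brevity; casync uses its own rolling hash).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def split_chunks(data, mask=0x3FF, min_size=64, max_size=8192):
    """Cut data where the rolling hash matches mask, so boundaries
    depend on nearby content rather than on absolute offsets."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def store_file(data, store):
    """Store each chunk compressed under its hash; return the file's
    chunk list. Hashing the uncompressed bytes is an assumption."""
    index = []
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:  # already-known chunks are reused
            store[digest] = zlib.compress(chunk)
        index.append(digest)
    return index


def load_file(index, store):
    """Rebuild a file from its chunk list."""
    return b"".join(zlib.decompress(store[d]) for d in index)
```

Storing two similar inputs into the same dict and comparing
len(store) with the combined length of the two chunk lists shows how
many chunks were shared.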

A straightforward user-level implementation would result in a single
directory with millions or billions of small files. Is that something
bcachefs can handle?

Inside bcachefs I'd instead imagine a btree with all the chunks as
extents, keyed by their hash, and files somehow being able to
arbitrarily point at those extents.
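
As a toy model of that idea (and nothing more: this says nothing about
how bcachefs btrees or extents actually work), the sketch below uses a
dict as a stand-in for the hash-keyed btree and represents a file as a
sorted list of offsets pointing at chunk hashes. ChunkIndex, File, and
the reference counting are hypothetical names and choices for
illustration.

```python
import bisect
import hashlib


class ChunkIndex:
    """Stand-in for a hash-keyed btree: digest -> [refcount, data]."""

    def __init__(self):
        self.chunks = {}

    def put(self, data):
        key = hashlib.sha256(data).digest()
        entry = self.chunks.setdefault(key, [0, data])
        entry[0] += 1  # one more file extent points at this chunk
        return key

    def drop(self, key):
        entry = self.chunks[key]
        entry[0] -= 1
        if entry[0] == 0:  # last reference gone, chunk can be freed
            del self.chunks[key]


class File:
    """A file as a list of (offset -> chunk hash) extents."""

    def __init__(self, index, pieces):
        self.index = index
        self.offsets = []
        self.keys = []
        pos = 0
        for piece in pieces:
            self.offsets.append(pos)
            self.keys.append(index.put(piece))
            pos += len(piece)
        self.size = pos

    def read(self, offset, length):
        end = min(offset + length, self.size)
        out = bytearray()
        # Find the extent containing the starting offset.
        i = bisect.bisect_right(self.offsets, offset) - 1
        while offset < end:
            data = self.index.chunks[self.keys[i]][1]
            start = offset - self.offsets[i]
            take = min(len(data) - start, end - offset)
            out += data[start:start + take]
            offset += take
            i += 1
        return bytes(out)

    def delete(self):
        for key in self.keys:
            self.index.drop(key)
        self.keys, self.offsets, self.size = [], [], 0
```

Two File objects built over the same ChunkIndex share any identical
pieces, and deleting one file only frees the chunks that no other file
still references.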

Thoughts? Just musing.

Wout.
