From: Wout Mertens <wout.mertens@gmail.com>
To: linux-bcachefs@vger.kernel.org
Subject: content-addressed storage?
Date: Mon, 25 Apr 2022 07:37:51 +0200
Message-ID: <CAO3V83+SSfYXLj18NvkVzJ68K3LxdPQnG0U-Zhnq+ZQLp5p0wA@mail.gmail.com>

Hi,

I'm idly wondering if bcachefs could support lookups and reuse of
variable-size chunks based on their hash checksum, thus allowing
deduplication.

So, instead of using a user-level storage layer like
https://github.com/systemd/casync, it would be a layer deeper, inside
bcachefs.

Would that be more efficient in memory, CPU, and disk usage, or would
it make little difference beyond convenience?

As a quick overview, casync splits files into chunks. Chunk boundaries
are chosen from the content using a rolling hash, so an insertion or
deletion changes only the affected chunks and not the ones following
them. The chunks are then compressed, hashed, and stored under their
hash.

So files stored this way are lists of variable-size hashed chunks. If
two files are similar, they will automatically share many of the
chunks.
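
To make the chunking idea concrete, here is a rough Python sketch of
content-defined chunking. It is not casync's actual algorithm or
parameters: the gear-style rolling hash, the GEAR table, WINDOW, MASK
and split_chunks() are all made up for illustration, and the average
chunk size is tiny so the effect is visible on small inputs.

```python
import hashlib

# Toy parameters (assumptions, not casync's): with a 32-bit hash that is
# shifted left once per byte, only the last 32 bytes influence the value.
WINDOW = 32
MASK = 0x3F << 26  # cut when the top 6 bits are zero (~64-byte average)

# Fixed pseudo-random table mapping each byte value to a 32-bit number.
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big")
        for b in range(256)]


def split_chunks(data: bytes) -> list[bytes]:
    """Split data at positions chosen by a rolling hash of recent bytes."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Older bytes are shifted out of the 32-bit value over time.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


# Inserting bytes at the front only disturbs the chunks near the edit.
original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
edited = b"XYZ" + original
shared = set(split_chunks(original)) & set(split_chunks(edited))
print(len(shared), "chunks shared between original and edited data")
```

Because a cut depends only on the last WINDOW bytes, every chunk of the
original that starts at least WINDOW bytes in also appears unchanged in
the edited data. A real implementation would also enforce minimum and
maximum chunk sizes.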

A straightforward implementation would result in a single directory
with millions/billions of small files. Is that something bcachefs can
handle?

Inside bcachefs I'd instead imagine a btree with all the chunks as
extents, keyed by their hash, and files somehow being able to
arbitrarily point at those extents.
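
As a userspace analogy for what I mean (not a description of any
existing bcachefs structure), the sketch below uses a Python dict as a
stand-in for a btree keyed by chunk hash, and represents each file as a
list of those keys. ChunkStore and its methods are hypothetical names,
and hashing the uncompressed chunk with SHA-256 is an assumption.

```python
import hashlib
import zlib


class ChunkStore:
    """Toy content-addressed store: hash -> compressed chunk."""

    def __init__(self):
        self.chunks = {}  # stand-in for a btree keyed by chunk hash

    def put_chunk(self, chunk: bytes) -> str:
        # Assumption: the key is the SHA-256 of the uncompressed chunk.
        key = hashlib.sha256(chunk).hexdigest()
        if key not in self.chunks:  # identical chunks are stored once
            self.chunks[key] = zlib.compress(chunk)
        return key

    def put_file(self, chunks: list[bytes]) -> list[str]:
        """Store a file given as chunks; return its list of chunk keys."""
        return [self.put_chunk(c) for c in chunks]

    def read_file(self, index: list[str]) -> bytes:
        """Reassemble a file from its list of chunk keys."""
        return b"".join(zlib.decompress(self.chunks[k]) for k in index)


store = ChunkStore()
file_a = store.put_file([b"header", b"shared body", b"footer A"])
file_b = store.put_file([b"header", b"shared body", b"footer B"])
print(len(store.chunks), "unique chunks stored for 6 chunk references")
```

The two files reference six chunks in total, but the two chunks they
have in common are stored only once.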

Thoughts? Just musing.

Wout.
