All of lore.kernel.org
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: "René Scharfe" <l.s.r@web.de>
Cc: Taylor Blau <me@ttaylorr.com>,
	Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, szeder.dev@gmail.com,
	Derrick Stolee <derrickstolee@github.com>
Subject: Re: [PATCH 00/15] Refactor chunk-format into an API
Date: Wed, 9 Dec 2020 19:50:24 -0500	[thread overview]
Message-ID: <X9Fw0Pzn5wsy3wt0@nand.local> (raw)
In-Reply-To: <4696bd93-9406-0abd-25ec-a739665a24d5@web.de>

On Wed, Dec 09, 2020 at 06:13:18PM +0100, René Scharfe wrote:
> I'm not too familiar with the chunk producers and consumers, so I can
> only offer some high-level observations.  And I don't have to use the
> API, so go wild! ;)  I was just triggered by the appearance of two
> working pieces of code being replaced by two slightly different pieces
> of code plus a third one on top.

:-).

> > With regards to the "YAGNI" comment... I do have thoughts about
> > extending the reachability bitmap format to use chunks (of course, this
> > would break compatibility with JGit, and it isn't something that I plan
> > to do in the short-term, or even necessarily in the future).
> >
> > In any event, I'm sure that this won't be these two won't be the last
> > chunk-based formats that we have in Git.
>
> OK, so perhaps we can do better before this scheme is copied.  The write
> side is complicated by the fact that the table of contents (TOC) is
> written first, followed by the actual chunks.  That requires two passes
> over the data.

"Two passes" meaning that we have to both compute the size of and then
write the data? This is relatively cheap to do, at least so I think.

For e.g., the OIDLOOKUP commit-graph chunk is just the_hash_algo->hashsz
* commits->nr bytes wide, so that can be done in constant time. A more
heavyweight case might be for e.g., the Bloom data section, where Bloom
filters have to first be computed, their lengths accounted for, and
_then_ written when we eventually get to writing that chunk.

This happens in compute_bloom_filters(); and write_chunk_bloom_indexes()
+ write_chunk_bloom_data(), respectively. Those Bloom filters are all
stored in a commit slab until they are written, so these "two passes"
are just paid for in memory.

> The ZIP format solved a similar issue by placing the TOC at the end,
> which allows for one-pass streaming.  Another way to achieve that would
> be to put the TOC in a separate file, like we do for .pack and .idx
> files.  This way you could have a single write function for chunks, and
> writers would just be a single sequence of calls for the different
> types.

Interesting. I'm not opposed to changing any of these formats (and maybe
there is some compelling argument for doing so, I am not sure) but I
think that unifying the implementation for reading / writing the chunk
format _before_ changing it is a postive step.

> But seeing that the read side just loads all of the chunks anyway
> (skipping unknown IDs) I wonder why we need a TOC at all.  That would
> only be useful if callers were trying to read just some small subset
> of the whole file.  A collection of chunks for easy dumping and loading
> could be serialized by writing just a small header for each chunk
> containing its type and size followed by its payload.

AFAIK, we do use the table of contents to locate where the chunks are so
that we can for e.g., set up the commit_graph structure's pointers to
point at each chunk appropriately.

> René

Thanks,
Taylor

  reply	other threads:[~2020-12-10  0:51 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 01/15] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 02/15] chunk-format: add API for writing table of contents Derrick Stolee via GitGitGadget
2020-12-08 17:56   ` Taylor Blau
2020-12-03 16:16 ` [PATCH 03/15] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 04/15] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 05/15] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2020-12-03 21:42   ` Junio C Hamano
2020-12-04 13:39     ` Derrick Stolee
2020-12-08 18:00   ` Taylor Blau
2020-12-03 16:16 ` [PATCH 06/15] midx: add pack_perm " Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 07/15] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 08/15] midx: convert chunk write methods to return int Derrick Stolee via GitGitGadget
2020-12-03 21:50   ` Junio C Hamano
2020-12-04 13:40     ` Derrick Stolee
2020-12-03 16:16 ` [PATCH 09/15] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 10/15] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2020-12-08 18:42   ` Taylor Blau
2020-12-10 14:36     ` Derrick Stolee
2020-12-03 16:16 ` [PATCH 11/15] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2020-12-03 22:00   ` Junio C Hamano
2020-12-08 18:43     ` Taylor Blau
2020-12-03 16:16 ` [PATCH 12/15] chunk-format: create write_chunks() Derrick Stolee via GitGitGadget
2020-12-08 18:45   ` Taylor Blau
2020-12-03 16:16 ` [PATCH 13/15] chunk-format: create chunk reading API Derrick Stolee via GitGitGadget
2020-12-03 22:17   ` Junio C Hamano
2020-12-04 13:47     ` Derrick Stolee
2020-12-04 20:17       ` Junio C Hamano
2020-12-03 22:43   ` Junio C Hamano
2020-12-04 13:45     ` Derrick Stolee
2020-12-03 16:16 ` [PATCH 14/15] commit-graph: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2020-12-07 13:43   ` Derrick Stolee
2020-12-03 16:16 ` [PATCH 15/15] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2020-12-04 12:48 ` [PATCH 00/15] Refactor chunk-format into an API René Scharfe
2020-12-04 13:57   ` Derrick Stolee
2020-12-04 19:42   ` Junio C Hamano
2020-12-08 18:49   ` Taylor Blau
2020-12-09 17:13     ` René Scharfe
2020-12-10  0:50       ` Taylor Blau [this message]
2020-12-10 14:30         ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=X9Fw0Pzn5wsy3wt0@nand.local \
    --to=me@ttaylorr.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=l.s.r@web.de \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.