All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, me@ttaylorr.com, gitster@pobox.com,
	abhishekkumar8222@gmail.com,
	Derrick Stolee <derrickstolee@github.com>
Subject: Re: [PATCH 5/7] commit-graph: document file format v2
Date: Fri, 25 Feb 2022 23:31:59 +0100	[thread overview]
Message-ID: <220225.86a6ee7eid.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <7f9b65bd22551fd7fd5d2f0bf18aee8c25f1db99.1645735117.git.gitgitgadget@gmail.com>


On Thu, Feb 24 2022, Derrick Stolee via GitGitGadget wrote:

> The corrected commit date was first documented in 5a3b130ca (doc: add
> corrected commit date info, 2021-01-16) and it used an optional chunk to
> augment the commit-graph format without modifying the file format
> version.
>
> One major benefit to this approach is that corrected commit dates could
> be written without causing a backwards compatibility issue with Git
> versions that do not understand them. The topological level was still
> available in the CDAT chunk as it was before.
>
> However, this causes a different issue: more data needs to be loaded
> from disk when parsing commits from the commit-graph. In cases where
> there is no significant algorithmic gain from using corrected commit
> dates, commit walks take up to 20% longer because of this extra data.
>
> Create a new file format version for the commit-graph format that
> differs only in the CDAT chunk: it now stores corrected commit date
> offsets. This brings our data back to normal and will demonstrate
> performance gains in almost all cases.
>
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  .../technical/commit-graph-format.txt         | 22 ++++++++++++++-----
>  1 file changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 87971c27dd7..2cb48993314 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -36,7 +36,7 @@ HEADER:
>        The signature is: {'C', 'G', 'P', 'H'}
>  
>    1-byte version number:
> -      Currently, the only valid version is 1.
> +      This version number can be 1 or 2.
>  
>    1-byte Hash Version
>        We infer the hash length (H) from this value:
> @@ -85,13 +85,22 @@ CHUNK DATA:
>        position. If there are more than two parents, the second value
>        has its most-significant bit on and the other bits store an array
>        position into the Extra Edge List chunk.
> -    * The next 8 bytes store the topological level (generation number v1)
> -      of the commit and
> -      the commit time in seconds since EPOCH. The generation number
> -      uses the higher 30 bits of the first 4 bytes, while the commit
> +    * The next 8 bytes store the generation number information of the
> +      commit and the commit time in seconds since EPOCH. The generation
> +      number uses the higher 30 bits of the first 4 bytes, while the commit
>        time uses the 32 bits of the second 4 bytes, along with the lowest
>        2 bits of the lowest byte, storing the 33rd and 34th bit of the
>        commit time.
> +      - If the commit-graph file format is version 1, then the higher 30
> +	bits contain the topological level (generation number v1) for the
> +	commit.
> +      - If the commit-graph file format is version 2, then the higher 30
> +	bits contain the corrected commit date offset (generation number
> +	v2) for the commit, except if the offset cannot be stored within
> +	29 bits. If the offset is too large for 29 bits, then the value
> +	stored here has its most-significant bit on and the other bits
> +	store the position of the corrected commit date in the Generation
> +	Date Overflow chunk.
>  
>    Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
>      * This list of 4-byte values store corrected commit date offsets for the
> @@ -103,6 +112,9 @@ CHUNK DATA:
>      * Generation Data chunk is present only when commit-graph file is written
>        by compatible versions of Git and in case of split commit-graph chains,
>        the topmost layer also has Generation Data chunk.
> +    * This chunk does not exist if the commit-graph file format version is 2,
> +      because the corrected commit date offset data is stored in the Commit
> +      Data chunk.
>  
>    Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
>      * This list of 8-byte values stores the corrected commit date offsets

We talked a while ago now about how we do commit-graph format changes
and this is partially echoing those earlier questions[1] from 2019.

I fully understand why we're writing this amended CDAT chunk in a
different layout. By not having the GDAT side-chunk to look up in the
data is more local, that part of the file is more compact etc.

What I don't understand is why getting those performance improvements
requires the breaking version change & the writing of the incompatible
version number.

I.e. couldn't the differently formatted CDAT chunk be written instead to a new
chunk name (say "2DAT") instead? Per [1] we'd pay a small fixed cost for
a possibly empty chunk (I didn't re-do those numbers), but surely the
performance improvements will be about the same for that miniscule
overhead.

It will give you something you can't have here, which is optional
compatibility with older clients by writing both versions. That'll be a
~2x as large file on disk, but with the page cache & each client version
skipping to the data it needs caching characteristics & data locality
should work out to about the same thing.

Or maybe they won't. I just found it surprising when reviewing this to
not find an answer to why that approach wasn't
considered.

E.g. 76ffbca71a9 (commit-graph: write Bloom filters to commit graph
file, 2020-04-06) is a commit adding such new optional and
backwards-compatible data.

1. https://lore.kernel.org/git/87h8acivkh.fsf@evledraar.gmail.com/

  parent reply	other threads:[~2022-02-25 22:44 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-24 20:38 [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Derrick Stolee via GitGitGadget
2022-02-24 20:38 ` [PATCH 1/7] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
2022-02-24 20:38 ` [PATCH 2/7] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
2022-02-24 22:15   ` Junio C Hamano
2022-02-25 13:51     ` Derrick Stolee
2022-02-25 17:35       ` Junio C Hamano
2022-02-24 20:38 ` [PATCH 3/7] commit-graph: start parsing generation v2 (again) Derrick Stolee via GitGitGadget
2022-02-28 15:18   ` Patrick Steinhardt
2022-02-28 16:23     ` Derrick Stolee
2022-02-28 16:59       ` Patrick Steinhardt
2022-02-28 18:44         ` Derrick Stolee
2022-03-01  9:46           ` Patrick Steinhardt
2022-03-01 10:35             ` Patrick Steinhardt
2022-03-01 14:06               ` Derrick Stolee
2022-03-01 14:53                 ` Patrick Steinhardt
2022-03-01 15:25                   ` Derrick Stolee
2022-03-02 13:57                     ` Patrick Steinhardt
2022-03-02 14:57                       ` Derrick Stolee
2022-03-02 18:15                         ` Junio C Hamano
2022-03-02 18:46                           ` Derrick Stolee
2022-03-02 22:42                             ` Junio C Hamano
2022-03-03 11:19                         ` Patrick Steinhardt
2022-03-03 16:00                           ` Derrick Stolee
2022-03-04 14:03                             ` Derrick Stolee
2022-03-07 10:34                               ` Patrick Steinhardt
2022-03-07 13:45                                 ` Derrick Stolee
2022-03-07 17:22                                   ` Junio C Hamano
2022-03-10 13:58                                   ` Patrick Steinhardt
2022-03-10 17:18                                     ` Derrick Stolee
2022-02-24 20:38 ` [PATCH 4/7] commit-graph: fix generation number v2 overflow values Derrick Stolee via GitGitGadget
2022-02-24 22:35   ` Junio C Hamano
2022-02-25 13:53     ` Derrick Stolee
2022-02-25 17:38       ` Junio C Hamano
2022-02-24 20:38 ` [PATCH 5/7] commit-graph: document file format v2 Derrick Stolee via GitGitGadget
2022-02-24 22:55   ` Junio C Hamano
2022-02-25 22:31   ` Ævar Arnfjörð Bjarmason [this message]
2022-02-28 13:44     ` Derrick Stolee
2022-02-28 14:27       ` Ævar Arnfjörð Bjarmason
2022-02-28 16:39         ` Derrick Stolee
2022-02-28 21:14           ` Ævar Arnfjörð Bjarmason
2022-03-01 14:19             ` Derrick Stolee
2022-03-01 14:29               ` Ævar Arnfjörð Bjarmason
2022-03-01 15:59                 ` Derrick Stolee
2022-02-24 20:38 ` [PATCH 6/7] commit-graph: parse " Derrick Stolee via GitGitGadget
2022-02-24 23:01   ` Junio C Hamano
2022-02-25 13:54     ` Derrick Stolee
2022-02-24 20:38 ` [PATCH 7/7] commit-graph: write " Derrick Stolee via GitGitGadget
2022-02-24 21:42 ` [PATCH 0/7] Commit-graph: Generation Number v2 Fixes, v3 implementation Junio C Hamano
2022-02-24 23:06   ` Junio C Hamano
2022-02-25 13:55     ` Derrick Stolee
2022-02-28 13:53 ` [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes Derrick Stolee via GitGitGadget
2022-02-28 13:53   ` [PATCH v2 1/4] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
2022-02-28 15:22     ` Ævar Arnfjörð Bjarmason
2022-02-28 13:53   ` [PATCH v2 2/4] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
2022-02-28 15:25     ` Ævar Arnfjörð Bjarmason
2022-02-28 13:53   ` [PATCH v2 3/4] commit-graph: start parsing generation v2 (again) Derrick Stolee via GitGitGadget
2022-02-28 15:30     ` Ævar Arnfjörð Bjarmason
2022-02-28 16:43       ` Derrick Stolee
2022-02-28 13:53   ` [PATCH v2 4/4] commit-graph: fix generation number v2 overflow values Derrick Stolee via GitGitGadget
2022-02-28 15:40     ` Ævar Arnfjörð Bjarmason
2022-03-01 17:23   ` [PATCH v2 0/4] Commit-graph: Generation Number v2 Fixes Ævar Arnfjörð Bjarmason
2022-03-01 19:48   ` [PATCH v3 0/5] " Derrick Stolee via GitGitGadget
2022-03-01 19:48     ` [PATCH v3 1/5] test-read-graph: include extra post-parse info Derrick Stolee via GitGitGadget
2022-03-01 19:48     ` [PATCH v3 2/5] t5318: extract helpers to lib-commit-graph.sh Derrick Stolee via GitGitGadget
2022-03-01 19:48     ` [PATCH v3 3/5] commit-graph: fix ordering bug in generation numbers Derrick Stolee via GitGitGadget
2022-03-01 20:13       ` Junio C Hamano
2022-03-01 20:30         ` Junio C Hamano
2022-03-02 14:13           ` Derrick Stolee
2022-03-01 19:48     ` [PATCH v3 4/5] commit-graph: start parsing generation v2 (again) Derrick Stolee via GitGitGadget
2022-03-01 19:48     ` [PATCH v3 5/5] commit-graph: fix generation number v2 overflow values Derrick Stolee via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=220225.86a6ee7eid.gmgdl@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=abhishekkumar8222@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.