From: Derrick Stolee <derrickstolee@github.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, me@ttaylorr.com, gitster@pobox.com,
	abhishekkumar8222@gmail.com
Subject: Re: [PATCH 5/7] commit-graph: document file format v2
Date: Tue, 1 Mar 2022 09:19:06 -0500
Message-ID: <9b52fdd3-64fc-34b1-d4ee-660b4fb73f39@github.com>
In-Reply-To: <220228.86ilsy3a8b.gmgdl@evledraar.gmail.com>

On 2/28/2022 4:14 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Feb 28 2022, Derrick Stolee wrote:
> 
>> On 2/28/2022 9:27 AM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Mon, Feb 28 2022, Derrick Stolee wrote:
>>>
>>>> On 2/25/2022 5:31 PM, Ævar Arnfjörð Bjarmason wrote:
>>
>>>>> Or maybe they won't. I just found it surprising when reviewing this to
>>>>> not find an answer to why that approach wasn't
>>>>> considered.
>>>>
>>>> The point is to create a new format that can be chosen when deployed
>>>> in an environment where older Git versions will not exist (such as
>>>> a Git server). The new version is not chosen by default and instead
>>>> is opt-in through the commitGraph.generationVersion config option.
>>>>
>>>> Perhaps in a year or two we would consider making this the new
>>>> default, but there is no rush to do so.
>>>
>>> Looking into this a bit more I think that in either case this is less of
>>> a big deal after my 43d35618055 (commit-graph write: don't die if the
>>> existing graph is corrupt, 2019-03-25), which came out of some of those
>>> discussions at the time of [1].
>>>
>>> I.e. now a client that only understands version N-1 will warn when
>>> loading it, whereas it's only if a pre-v2.22.0 client (which has that
>>> commit) reads the repository that we'd hard die on it, correct?
>>>
>>> But speaking of hyper-focus. I think that arguably applies to you in
>>> this case when considering the trade-offs of these sorts of format
>>> changes :)
>>>
>>> I.e. you're primarily considering cases of say a git server (presumably
>>> running on GitHub) or another such deployment where it's easy to have
>>> full control over all of your versions "in the wild".
>>
>> I'm thinking of servers, yes, but also 99% of clients who only upgrade
>> (or _maybe_ downgrade to a recent, previous version occasionally).
> 
> *nod*
> 
>>> And thus a three-phase rollout of something like a format change can be
>>> done in a timely and predictable manner.
>>>
>>> But git is used by *a lot* of people in a bunch of different
>>> scenarios. E.g.:
>>>
>>>  * A shared (hopefully read-only) NFS mounted by remote "unmanaged" clients.
>>>  * A tarred-up directory including a .git, which may be transferred to
>>>    a machine with a pre-v2.22.0 version.
>>>
>>> Or even softer cases of failure, such as:
>>>
>>>  * A cronjob causes an alert/incident somewhere because the server 
>>>    operator started writing a new version, but forgot about a set
>>>    of machines that are still on the old version.
>>
>> It is important to continue supporting these cases, and this change does
>> not cause any issues for them.
> 
> The issues in those cases will range from warnings on older versions
> when loading the graph to errors if it's pre-v2.22.0, with the
> performance benefits of v3 out of reach for v2-only clients.
> 
> I think arguably that's OK/worth it, but it's not "not [any] issues", no?

What I mean is that this change does not enable the new graph version
by default, so these users do not have any issues unless someone opts
in to the feature while in this mixed scenario.

>> However, this handful of corner cases should not block progress in the
>> main cases.
> 
> What progress would be blocked?
> 
> I'm only talking about whether we choose to consider a "new graph" to be an:
> 
>     <existing version number>
>     <existing chunk name (old content, possibly empty)>
>     <new chunk name (new content)>
> 
> v.s.:
> 
>     <old/new version number>
>     <existing chunk name old/new (incompatible) content>
> 
> I.e. the "progress" this series is about is in getting the data locality
> with smaller data with the new content.
> 
> But that's also possible to get with a very low amount of fixed-overhead.
> 
> Per the referenced E-Mail an "empty" commit-graph file was ~1k bytes in
> 2019, I haven't re-checked. In terms of wasted space it's minuscule &
> <1/4 of one FS page on Linux.

If you're talking about "empty" data, then you need to have an empty
Commit Data chunk _and_ an empty OID Lookup chunk in order to avoid
breakage. So you'd need duplicate versions of these chunks to go with
the new "Commit Data 2" chunk. Then we need special-casing for all of
this during parsing, which is unnecessary complexity.

Finally, the end result becomes "older versions get slower without
any warning" instead of "older versions get a message about not
understanding the commit-graph file".
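
To make the two outcomes concrete, here is a minimal Python sketch of a
simplified reader (not Git's actual parser). It assumes the documented
layout: an 8-byte header (4-byte signature, then one byte each for
version, hash version, chunk count, and base-graph count) followed by
12-byte chunk table entries. The function name, the reduced set of
known chunks, and the "CDA2" chunk ID mentioned below are hypothetical.

```python
import struct

SIGNATURE = b"CGPH"
KNOWN_VERSIONS = {1}                         # what an "old" reader understands
KNOWN_CHUNKS = {b"OIDF", b"OIDL", b"CDAT"}   # illustrative subset only


def read_chunk_ids(data: bytes):
    """Return (recognized, ignored) chunk IDs; raise on an unknown version."""
    signature, version, _hash_version, num_chunks, _num_base = struct.unpack_from(
        ">4sBBBB", data, 0
    )
    if signature != SIGNATURE:
        raise ValueError("not a commit-graph file")
    if version not in KNOWN_VERSIONS:
        # Version bump: the old reader notices and can say so.
        raise ValueError(f"unsupported commit-graph version {version}")

    recognized, ignored = [], []
    for i in range(num_chunks):
        # Each table entry is a 4-byte chunk ID plus an 8-byte file offset.
        chunk_id, _offset = struct.unpack_from(">4sQ", data, 8 + 12 * i)
        if chunk_id in KNOWN_CHUNKS:
            recognized.append(chunk_id)
        else:
            # New chunk under an old version: skipped without any message.
            ignored.append(chunk_id)
    return recognized, ignored
```

With a version-1 file that carries an extra "CDA2" entry, this reader
returns normally and simply drops the chunk it does not know. With the
version byte set to 2 it raises instead, which is the "message about
not understanding the commit-graph file" case.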
 
> I'm not just trying to rehash the same points, I *think* the version
> bump is just an aesthetic choice & we're not getting any performance
> difference out of that.
> 
> But I'm not sure from the "block progress" etc., so maybe I'm still
> missing something...

The fact that we have a Generation Data chunk instead of having
bumped the file format version number is already a concession to
this concern about backwards compatibility.

With the point above about empty Commit Data Chunks, the only way
to properly conserve backwards compatibility is to have a full
Commit Data Chunk as well as a second copy that contains the new
offsets instead of topological levels. This is wasteful.
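
As a rough estimate of that waste, the format documentation gives the
Commit Data chunk as H + 16 bytes per commit, where H is the hash
length. A minimal sketch follows; the function name is made up for
illustration and any commit count passed to it is hypothetical.

```python
def cdat_bytes(num_commits: int, hash_len: int = 20) -> int:
    """Size in bytes of one Commit Data chunk: (hash_len + 16) per commit."""
    return num_commits * (hash_len + 16)


# A second full copy of the chunk costs this much again.
# hash_len=20 is SHA-1; pass hash_len=32 for SHA-256.
extra = cdat_bytes(1_000_000)
```

For a hypothetical repository with one million commits and SHA-1
hashes, the second copy adds 36,000,000 bytes to the file.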

>>> I think that even if it's less conceptually clean it's worth considering
>>> bending over backwards to be kinder to such use-cases, unless it's really
>>> required for other reasons to break such in-the-wild use-cases.
>>>
>>> Or in this case, if it's thought to be worth it to help reviewers decide
>>> by separating the performance improvement aspect from the changed
>>> interaction between new graphs and older clients.
>>>
>>> As a further nit on the proposed end-state here: Do I understand it
>>> correctly that commitGraph.generationVersion=[1|2] (i.e. on current
>>> "master") will always result in a file that's compatible with older
>>> versions, since the only thing "v2" there controls now is to write the
>>> optional GDAT and GDOV chunks?
>>>
>>> Whereas going from commitGraph.generationVersion=2 to
>>> commitGraph.generationVersion=3 in this series will impact older clients
>>> as noted above, since we're bumping the version (of the file, to 2 if
>>> the config is 3, which as Junio noted is a bit confusing).
>>>
>>> I think if you're set on going down the path of bumping the top-level
>>> version that deserves to be made much clearer in the added
>>> documentation. Right now the only hint to that is a passing mention that
>>> for v3:
>>>
>>>     [it] will be incompatible with some old versions of Git
>>>
>>> Which if we're opting for breaking format changes really should note
>>> some of the caveats above, that pre-v2.22.0 hard-dies, and probably
>>> describe "some old versions of Git" a bit more clearly.
>>>
>>> It actually means once this gets released "the git version that was the
>>> latest one you could download yesterday". Which a reader of the docs
>>> probably won't expect when starting to play with this in mixed-version
>>> environment.
>>>
>>> 1. https://lore.kernel.org/git/87h8acivkh.fsf@evledraar.gmail.com/
>>
>> This documentation could be altered to be specific about versions,
>> but such a specific change makes assumptions of the version that will
>> include it. As of now, the generation number v2 fixes will _probably_
>> get in for 2.36 and the format change would have enough time to cook
>> for 2.37, so I'll update the docs to refer to that version explicitly.
> 
> ...
> 
>> The pre-2.22.0 change might be helpful to mention, but it could also be
>> noise to the reader. We can revisit this when these patches are
>> submitted again in another thread. There's also concern about third-
>> party tools like libgit2. I'd rather draw the line as "tread carefully
>> here" than "here is so much information that a reader might think it
>> is all they need to know".
> 
> In terms of concern about libgit2 or any other implementation (which I
> haven't looked at) isn't "tread carefully" to do it with new chunks if
> possible, which we've done before with BIDX/BDAT, v.s. a version bump we
> haven't done?

New chunks adding new information is part of the design. Changing
the location of existing data is new here.

> I'd think it wouldn't be an issue either way for any reader of the
> format, and libgit2 is more specialized & won't have someone on RHEL6 or
> whatever trying to inspect a random repo.
> 
> It just seems like a win-win to have a performance improvement with
> smooth backwards compatibility v.s. without, if that's possible.

You are right that it is _possible_, but I don't think that the
side-effects are worth it. Those being:

* "Empty CDAT Chunk": Silently slowing down older clients.
* "Duplicate CDAT Chunk": Wasted data.

Finally, I want to reiterate that by making this opt-in, users make
the call about whether or not they are in a scenario where this
compatibility trade-off is acceptable for them. This includes waiting
to see if third-party tools like libgit2 are updated to understand
this version.

Thanks,
-Stolee
