git.vger.kernel.org archive mirror
* Stability of git-archive, breaking (?) the Github universe, and a possible solution
From: Eli Schwartz @ 2023-01-31  0:06 UTC
  To: Git List; +Cc: brian m. carlson

For those who haven't seen, GitHub changed the checksums of all
"source code" artifacts attached to any git repository with tags. This
change is now reverted due to widespread breakage -- and the lack of
advance warning. The technical details of the change appear simple: they
upgraded git.

Probably the main discussion, complete with GitHub employees from this
mailing list responding:

https://github.com/bazel-contrib/SIG-rules-authors/issues/11#issuecomment-1409438954

Consequences of that discussion, attempting to mitigate issues by
warning people that it already happened:

https://github.blog/changelog/2023-01-30-git-archive-checksums-may-change/

And where I first saw it: https://github.com/mesonbuild/wrapdb/pull/884

Historically speaking, git-archive output has been stable, minus a bug
fix or two in rare cases (specifically relating to an inability to
transcribe the contents of the git repo at all, I think?). The other
factor is the compression program used, which is generally GNU gzip:
historically, whatever the system `gzip` command is.

And gzip is a stable format. It's a worn-out, battle-weary format, even
-- it's not the best at compressing, and it's not the best at
decompressing, and "all the cool kids" are working on cooler formats,
such as zstd which does indeed regularly change its byte output between
versions. But the advantage of gzip is that it's good *enough*, and it's
probably *everywhere*, and it's *reliable*.

GNU gzip is reproducible. busybox gzip was fixed to agree with GNU gzip
(this is relevant to the handful of people running software forges on,
say, Alpine Linux):

https://reproducible-builds.org/reports/2019-08/#upstream-news

...

Nevertheless, I've seen the sentiment a few times that git doesn't like
committing to output stability of git-archive, because it isn't
officially documented (but it's not entirely clear what the benefits of
changing are). And yet, git endeavors to do so, in order to prevent
unnecessary breakage of people who embody Hyrum's Law and need that
stability.

Even with the new change to the compressor, git-archive is still
reproducible; it's the internal gzip compressor whose output doesn't
match what the system gzip produced before. (This may be fixable,
possibly by embedding an implementation from busybox or from GNU gzip?
I'm not going to discuss that right now, though I think it's an
interesting avenue of exploration.)
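
To make the distinction above concrete, here is a minimal sketch in
Python (not git's own code). It assumes the tar stream is a fixed input
and shows that the compressed bytes, and therefore the checksum, depend
on compressor settings and on the timestamp stored in the gzip header,
while the decompressed payload does not. The helper names `gzip_bytes`
and `sha256_hex` are made up for illustration, and varying the
compression level is only a stand-in for switching between two gzip
implementations.

```python
import gzip
import hashlib


def gzip_bytes(data: bytes, level: int = 9, mtime: int = 0) -> bytes:
    """Compress data as a gzip stream with an explicit header timestamp."""
    return gzip.compress(data, compresslevel=level, mtime=mtime)


def sha256_hex(blob: bytes) -> str:
    """Hex SHA-256 digest, as a download checksum would be computed."""
    return hashlib.sha256(blob).hexdigest()


if __name__ == "__main__":
    # Stand-in for the (reproducible) tar stream written by `git archive`.
    tar_stream = b"".join(b"file %d contents\n" % i for i in range(2000))

    first = gzip_bytes(tar_stream, level=9, mtime=0)
    again = gzip_bytes(tar_stream, level=9, mtime=0)
    other_level = gzip_bytes(tar_stream, level=1, mtime=0)
    other_mtime = gzip_bytes(tar_stream, level=9, mtime=1)

    # Same input and same settings: byte-identical output.
    print("repeat run identical:", first == again)
    # A different header timestamp changes the bytes and the checksum.
    print("mtime changes checksum:",
          sha256_hex(first) != sha256_hex(other_mtime))
    # A different compressor configuration may change the bytes as well.
    print("level changes bytes:", first != other_level)
    # The payload is the same in every case.
    print("payload identical:",
          gzip.decompress(first) == gzip.decompress(other_level) == tar_stream)
```

The header timestamp is the reason `gzip -n` matters for stable
output: without it, gzip may record a timestamp in the header.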

I've thought about this now and then over the last couple of years,
because I think I have a reasonable compromise that might make everyone
(or at least most people) happy, and now seems like a good time to
mention it.

What does everyone think about offering versioned git-archive outputs?
This could be user-selectable as an option to `git archive`, but the
main goal would be to select a good versioned output format depending on
what is being archived. So:

- first things first, un-default the internal compressor again
- implement a v2 archive format, where the internal compressor is the
  default -- no other changes
- teach git to select an archive format based on the date of the object
  being archived
  - when given a commit/tag ID to archive, check which support frame the
    committer date falls inside
  - for tree IDs, always use the latest format (archiving a tree ID
    already uses the current date as the modification time anyway)
- schedule a date, for the sake of argument, 6 months after the next
  scheduled release date of git version X.Y in which this change goes
  live; bake this into the git sources as a transition date, all commits
  or tags generated after this date fall into the next format support
  frame
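
The selection step in the list above could be sketched roughly as
follows. This is Python for brevity, not a patch against git's C
sources. The transition date, the version numbers, and the names
`V2_TRANSITION`, `TRANSITIONS` and `select_archive_format` are all
hypothetical; treating a commit made exactly at the transition instant
as belonging to the new frame is likewise an arbitrary choice.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical transition date baked into the sources: some months after
# an imagined release. The real date would be chosen by the git project.
V2_TRANSITION = int(datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp())

# Format used for everything older than the first transition date
# (external gzip by default, in the proposal).
BASE_FORMAT = 1

# Later support frames in ascending order: (start timestamp, format version).
# v2 is "internal compressor by default, no other changes".
TRANSITIONS = [(V2_TRANSITION, 2)]


def select_archive_format(committer_time: Optional[int]) -> int:
    """Pick the archive format version for the object being archived.

    committer_time is the committer date of the commit or tag, in seconds
    since the Unix epoch, or None when a bare tree ID is archived (a tree
    carries no date of its own, so the latest format is used).
    """
    if committer_time is None:
        return TRANSITIONS[-1][1] if TRANSITIONS else BASE_FORMAT

    version = BASE_FORMAT
    for start, frame_version in TRANSITIONS:
        if committer_time >= start:
            version = frame_version
    return version


if __name__ == "__main__":
    old_tag = int(datetime(2023, 1, 31, tzinfo=timezone.utc).timestamp())
    new_tag = int(datetime(2024, 6, 1, tzinfo=timezone.utc).timestamp())
    print("old tag ->", select_archive_format(old_tag))
    print("new tag ->", select_archive_format(new_tag))
    print("tree ID ->", select_archive_format(None))
```

Because the choice depends only on the date recorded in the object and
on constants compiled into git, two git versions that both know about a
given frame would pick the same format for the same commit or tag.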


The end result is that for all historic commits or tags, `git archive`
will always produce the same output. This can be documented in the
git-archive manpage: "the produced archive is guaranteed to be
reproducible, unless you override the `tar.<format>.command` or your
system compressor is not reproducible".

For *new* commits or tags, everyone gets the benefit of fascinating,
cool new archive formats with useful improvements at the tar container
level, which is apparently a very desirable feature. The git project no
longer has to worry, at all, about whether users will come to complain
about how their build pipelines suddenly fail with checksum issues. The
git project can simply, fearlessly, go implement innovative new changes
without giving any thought to backwards compatibility.

The only catch is that those new changes apply only to projects which
are still under active development, and which push new commits or tag
new releases after the transition date.

Old states of existing projects (regardless of whether they are still
actively updating) can go have their old and apparently inefficient
archives and don't get cool new stuff. That's fine. They're also
increasingly rarely used, because they are, after all, old -- and most
likely only used for historic archival purposes. If the worst comes to
worst, well, they managed to produce a somehow useful archive with an
older version of git -- nothing will *break* if they don't get the cool
new stuff.

And for the vast majority of new downloads for new stuff, the in-process
compressor saves one fork+exec and is a bit more efficient, I guess?

A note on the transition date: I suggested 6 months after the scheduled
release date, because this gives everyone running a software forge time
to update git itself, and have everything ready, in time to handle the
first wave of commits and tags that naturally occur after the transition
date. And you don't want it to be immediate, because then people will
take days or weeks to deploy, and the archives of the most recent
commits and tags will change once they do.


For the purposes of this thought experiment, we assume that people don't
routinely set the system time to a year in the future. This will only be
done in situations such as, say, testing a git upgrade deployment for a
software forge.

...


"And then no one ever complained about archive checksums changing again."

🤞🙏🥺

-- 
Eli Schwartz

Thread overview: 57+ messages
2023-01-31  0:06 Stability of git-archive, breaking (?) the Github universe, and a possible solution Eli Schwartz
2023-01-31  7:49 ` Ævar Arnfjörð Bjarmason
2023-01-31  9:11   ` Eli Schwartz
2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 1/9] archive & tar config docs: de-duplicate configuration section Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 2/9] git config docs: document "tar.<format>.{command,remote}" Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 3/9] archiver API: make the "flags" in "struct archiver" an enum Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 4/9] archive: omit the shell for built-in "command" filters Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 5/9] archive-tar.c: move internal gzip implementation to a function Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 6/9] archive: use "gzip -cn" for stability, not "git archive gzip" Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 7/9] test-lib.sh: add a lazy GZIP prerequisite Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 8/9] archive tests: test for "gzip -cn" and "git archive gzip" stability Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 9/9] git archive docs: document output non-stability Ævar Arnfjörð Bjarmason
2023-02-02 10:25       ` brian m. carlson
2023-02-02 10:30         ` Ævar Arnfjörð Bjarmason
2023-02-02 16:34         ` Junio C Hamano
2023-02-04 17:46           ` brian m. carlson
2023-02-02 16:17     ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Phillip Wood
2023-02-02 16:40       ` Junio C Hamano
2023-02-03 13:49       ` Ævar Arnfjörð Bjarmason
2023-02-06 14:46         ` Phillip Wood
2023-02-03 15:47       ` Theodore Ts'o
2023-02-02 16:25     ` Junio C Hamano
2023-02-04 18:08       ` René Scharfe
2023-02-05 21:30         ` Ævar Arnfjörð Bjarmason
2023-02-12 17:41           ` René Scharfe
2023-02-02 19:23     ` Raymond E. Pasco
2023-02-03  8:06       ` [PATCH] archive: document output stability concerns Raymond E. Pasco
2023-01-31  9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson
2023-01-31 11:31   ` Ævar Arnfjörð Bjarmason
2023-01-31 15:05   ` Konstantin Ryabitsev
2023-01-31 22:32     ` brian m. carlson
2023-02-01  9:40       ` Ævar Arnfjörð Bjarmason
2023-02-01 11:34         ` demerphq
2023-02-01 12:21           ` Michal Suchánek
2023-02-01 12:48             ` demerphq
2023-02-01 13:43               ` Ævar Arnfjörð Bjarmason
2023-02-01 15:21                 ` demerphq
2023-02-01 18:56                   ` Theodore Ts'o
2023-02-02 21:19                     ` Joey Hess
2023-02-03  4:02                       ` Theodore Ts'o
2023-02-03 13:32                         ` Ævar Arnfjörð Bjarmason
2023-02-01 23:16         ` brian m. carlson
2023-02-01 23:37           ` Junio C Hamano
2023-02-02 23:01             ` brian m. carlson
2023-02-02 23:47               ` rsbecker
2023-02-03 13:18                 ` Ævar Arnfjörð Bjarmason
2023-02-02  0:42           ` Ævar Arnfjörð Bjarmason
2023-02-01 12:17       ` Raymond E. Pasco
2023-01-31 15:56   ` Eli Schwartz
2023-01-31 16:20     ` Konstantin Ryabitsev
2023-01-31 16:34       ` Eli Schwartz
2023-01-31 20:34         ` Konstantin Ryabitsev
2023-01-31 20:45         ` Michal Suchánek
2023-02-01  1:33     ` brian m. carlson
2023-02-01 12:42   ` Ævar Arnfjörð Bjarmason
2023-02-01 23:18     ` brian m. carlson
